“Deepgram nova-3 returned a labeled 21-min call in ~14s; Whisper is banned now”
I record a lot of calls, and for a long time the transcription step was the slow, annoying part of the workflow. I had a Whisper CLI wired up, and every call meant kicking the job off and walking away, because on a real recording it ran for minutes. Worse, the output was one undifferentiated wall of text, with no idea who said what, so I'd end up scrubbing back through the audio to untangle the two voices before the transcript was even usable. For a tool whose whole point is saving me from re-listening to the recording, that was exactly backwards.

I switched the whole thing over to Deepgram's nova-3 model with diarization turned on, and it changed how I work. What sold me was a concrete test, not a benchmark in some blog post. I took one real call (a single 21-minute m4a sitting in my Downloads) and ran it through nova-3. It came back in about 14 seconds. Not "fast for a transcription job," but genuinely 14 seconds for 21 minutes of audio, and it returned clean speaker-labeled JSON with the two people on the call separated correctly. I didn't have to fix the speaker boundaries by hand or guess at the turn changes; the diarization just got it right on the first pass, which is the part I'd assumed I'd always have to babysit.

For comparison I ran the exact same file through Whisper. It was roughly 50x slower on identical input, and after all that waiting it still handed me zero speaker labels, so I'd have been right back to manually figuring out who said what. The trade wasn't even close: Deepgram was both dramatically faster and produced the one piece of structure I actually needed. There is no version of that comparison where the slow tool wins, and I genuinely tried to be fair to the incumbent before I ripped it out.

That result is now codified, not a one-off impression I'm running on vibes.
I baked Deepgram nova-3 into a transcribe-call skill that my agent uses as the default for any diarized call transcription, and I explicitly BANNED Whisper for call audio inside it. The reasoning is written right into the skill so future-me can't forget it: Whisper produces no speaker labels and ran about 50x slower on the same recording, so there is no scenario where reaching for it on a call still makes sense. Encoding a vendor choice that hard into my own tooling is something I only do when a tool has genuinely earned it, and nova-3 earned it on the first file I threw at it.

What I appreciate as a builder is that nova-3 turned transcription from a step I dreaded into one I don't think about. I drop an m4a in, the script calls Deepgram, and a labeled transcript comes back before I've even switched windows. The speaker-labeled JSON is the part that keeps paying off downstream: it gives me discrete utterances I can post-process, relabel from a generic "Speaker 0" to a real name once I've identified the voices, and summarize, all without first reconstructing who was talking. Because the speaker structure is carried in the JSON itself, the raw output stays a clean source of truth even when I rewrite the readable transcript on top of it. The low latency is the quiet part that matters most: there's no "start it and go get coffee" anymore, no context switch, the result is simply there by the time I look back at the window.

The other thing worth saying is how little babysitting the whole path needs. A 21-minute call returning in 14 seconds means I can transcribe right after hanging up and have a labeled record while the conversation is still fresh, instead of batching jobs overnight the way Whisper's runtime pushed me toward. For someone who lives in this loop every week, that latency difference compounds into real time saved.

If you're transcribing calls and you need to know who said what, this is the bar.
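The relabeling step described above (a generic "Speaker 0" becoming a real name) can be sketched roughly as follows. This is an illustration, not the code from the transcribe-call skill: it assumes the response carries a `results.utterances` list whose entries have `speaker` and `transcript` fields, which is the shape Deepgram returns when both diarization and utterances are requested. The `label_turns` helper, the sample payload, and the name "Alice" are all made up for the example.

```python
from typing import Dict, List


def label_turns(response: dict, names: Dict[int, str]) -> List[str]:
    """Merge consecutive utterances from the same speaker and apply real names.

    Speakers without an entry in `names` keep the generic "Speaker N" label.
    """
    turns: List[list] = []  # each entry is [speaker_id, accumulated_text]
    for utterance in response["results"]["utterances"]:
        speaker = utterance["speaker"]
        text = utterance["transcript"].strip()
        if turns and turns[-1][0] == speaker:
            # Same person is still talking: extend the current turn.
            turns[-1][1] += " " + text
        else:
            turns.append([speaker, text])

    lines = []
    for speaker, text in turns:
        label = names.get(speaker, f"Speaker {speaker}")
        lines.append(f"{label}: {text}")
    return lines


# Hand-written sample in the assumed response shape (not real call data).
sample_response = {
    "results": {
        "utterances": [
            {"speaker": 0, "transcript": "Hi, thanks for joining."},
            {"speaker": 0, "transcript": "Can you hear me?"},
            {"speaker": 1, "transcript": "Yes, loud and clear."},
        ]
    }
}

for line in label_turns(sample_response, {0: "Alice"}):
    print(line)
```

Because the function only reads the raw JSON and returns new strings, the original response stays untouched as the source of truth, and the name mapping can be filled in later once the voices have been identified.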
nova-3 with diarization is my default now, and the head-to-head against Whisper on a real file is exactly why. Five stars, and I don't hand those out for tools I've only kicked the tires on.
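For anyone who wants to try the same setup, here is a minimal sketch of what a diarized nova-3 request could look like using only the Python standard library. It is not the script from this review. It assumes Deepgram's pre-recorded endpoint at `https://api.deepgram.com/v1/listen` with the `model`, `diarize`, and `utterances` query parameters and `Token` authorization; the `audio/mp4` content type for an m4a file, the helper names, and the `DEEPGRAM_API_KEY` environment variable are assumptions made for the example.

```python
import json
import os
import urllib.parse
import urllib.request

DEEPGRAM_URL = "https://api.deepgram.com/v1/listen"


def build_request(audio: bytes, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for a diarized nova-3 transcript."""
    query = urllib.parse.urlencode({
        "model": "nova-3",     # model named in the review
        "diarize": "true",     # ask for speaker labels
        "utterances": "true",  # ask for per-utterance segments
    })
    return urllib.request.Request(
        f"{DEEPGRAM_URL}?{query}",
        data=audio,
        headers={
            "Authorization": f"Token {api_key}",
            "Content-Type": "audio/mp4",  # assumed MIME type for .m4a
        },
        method="POST",
    )


def transcribe(path: str) -> dict:
    """Send one audio file and return the parsed JSON response.

    Requires network access and a key in DEEPGRAM_API_KEY (hypothetical name).
    """
    with open(path, "rb") as audio_file:
        request = build_request(audio_file.read(), os.environ["DEEPGRAM_API_KEY"])
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `transcribe("call.m4a")` with a hypothetical file name would return the raw JSON; keeping request construction in its own function makes it possible to check the URL and headers without touching the network.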