Deepgram is the speech-to-text platform optimised for real-time applications: agent-assist surfaces during live customer calls, voice search inside ecommerce apps, real-time meeting transcription, live captioning. Where AssemblyAI leans into post-call audio intelligence, Deepgram leans into low-latency streaming, and its Nova-2 model is consistently cited as the fastest commercially available real-time transcription model at production accuracy.
What it actually does for ecommerce sellers
Deepgram is an API platform that converts speech to text, in real time or batch, with industry-leading latency on the streaming path. The Nova family of models (Nova-2 is the production default in 2026) supports 30+ languages and offers accent-resilient transcription, custom vocabulary for product names and brand terms, and speaker diarisation. The platform’s real-time mode delivers transcripts with sub-300ms latency under typical conditions, which is the difference between a usable live agent-assist surface and a frustrating one.
For ecommerce specifically, Deepgram tends to slot in behind the scenes rather than being a tool teams interact with directly. It powers live transcript display inside customer support helpdesks (so agents can read what customers are saying during a call), voice search inside Shopify apps and BigCommerce extensions, automatic captioning on product video content, and the speech recognition layer beneath voice agent platforms like Vapi. The 2024-2025 platform additions include the Aura text-to-speech model (Deepgram now ships full voice AI infrastructure, not just STT), an LLM-powered summarisation layer for batch transcripts, and tighter integration with the AI agent platforms that have emerged this cycle.
Best for
- Customer support tooling — live transcripts surfaced inside helpdesk interfaces during phone calls, where latency must stay under perceptual thresholds.
- Voice search and voice navigation — embedded in ecommerce apps where users search by voice on mobile.
- Live captioning — events, product launches, accessibility-mandated captioning of live video content.
- Voice agent infrastructure — STT layer beneath custom voice agents built on Vapi or similar platforms, where every millisecond of latency reduces conversation quality.
It is not the right tool for shops that want a finished product (it’s an API, not a service), nor for batch or post-call audio analytics, where AssemblyAI’s more comprehensive audio intelligence layer makes it the better fit.
Pricing breakdown
Deepgram prices per audio minute. The Nova-2 model costs around £0.0035 per minute for batch transcription and £0.0042 per minute for real-time streaming. For a shop processing 1,000 audio hours monthly (60,000 minutes), expect roughly £210 a month at the batch rate and £252 at the streaming rate. Add-on features (diarisation, custom vocabulary, summarisation) are priced on top.
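To rerun that estimate for your own volume, the arithmetic is simple enough to sketch in a few lines of Python. The rates below are the per-minute figures quoted in this review, not a live price list, and the function and constant names are ours for illustration.

```python
MINUTES_PER_HOUR = 60

# Per-minute rates quoted in this review, in GBP. Illustrative only;
# check the current price list before budgeting.
BATCH_RATE = 0.0035
STREAMING_RATE = 0.0042


def monthly_cost(audio_hours, rate_per_minute):
    """Return the monthly transcription cost for a volume and a per-minute rate."""
    return audio_hours * MINUTES_PER_HOUR * rate_per_minute


if __name__ == "__main__":
    for label, rate in (("batch", BATCH_RATE), ("streaming", STREAMING_RATE)):
        print(f"{label}: £{monthly_cost(1000, rate):.2f}")
```

Add-on features are not included here, so treat the result as a floor rather than a full quote.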
The pricing model is consumption-based with committed-usage discounts above £500/month. There’s a free tier ($200 in credit on signup) that’s enough to validate accuracy and latency against your specific use case before committing. The economics are favourable compared to AssemblyAI for high-volume use cases, particularly real-time streaming, but slightly less generous on the audio intelligence side.
Where it falls short
The biggest weakness is the same as AssemblyAI’s: it’s a developer platform, not a finished product. Operations teams without engineering capacity can’t deploy Deepgram directly; they need a third-party tool that wraps the API in a workflow. For shops that want a self-service experience, look at the helpdesks (Gorgias, Zendesk) and contact-centre platforms that have integrated Deepgram-class STT under the hood rather than buying the API directly.
The audio intelligence layer is less developed than AssemblyAI’s. Deepgram’s strengths are accuracy and latency on transcription itself; the surrounding analytics (sentiment, topic detection, summarisation) are functional but more limited. For workflows where rich post-call insight matters more than raw transcription speed, AssemblyAI is the better fit.
Custom vocabulary support is solid for brand names and common product terms but degrades on heavily technical or industry-specific jargon. Shops with niche terminology should test thoroughly with their actual audio before committing — accuracy on standard ecommerce terms is excellent, accuracy on industry-specific shorthand can be inconsistent.
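One rough way to run that test is to measure how often your key terms survive transcription. The sketch below compares a human-checked reference transcript with the transcript returned by the STT service and reports the share of term occurrences that were recognised. It makes no API calls; the function name, the term list and the sample sentences are all hypothetical, and simple whole-word matching is an assumption that will miss inflected or hyphenated variants.

```python
import re


def term_recall(terms, reference, hypothesis):
    """Share of term occurrences in the reference that also appear in the hypothesis."""

    def count(term, text):
        # Case-insensitive whole-word (or whole-phrase) match.
        pattern = r"\b" + re.escape(term.lower()) + r"\b"
        return len(re.findall(pattern, text.lower()))

    expected = found = 0
    for term in terms:
        in_reference = count(term, reference)
        expected += in_reference
        # Never credit more hits than the reference actually contains.
        found += min(in_reference, count(term, hypothesis))
    return found / expected if expected else 1.0


if __name__ == "__main__":
    terms = ["AeroPress", "burr grinder"]  # hypothetical product vocabulary
    reference = "The AeroPress and the burr grinder, plus a second AeroPress."
    hypothesis = "The arrow press and the burr grinder plus a second aeropress."
    print(f"term recall: {term_recall(terms, reference, hypothesis):.2f}")
```

Run it over a few dozen real calls, once with custom vocabulary enabled and once without, and the difference shows whether the feature is pulling its weight on your terminology.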
Finally, the platform’s real-time streaming, while industry-leading on latency, is sensitive to network conditions. Deployments that route audio over the public internet from constrained networks (mobile-only customers, international audio) sometimes see latency spikes that make agent-assist less useful. Production deployments should design for graceful degradation when streaming latency drifts above acceptable thresholds.
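What graceful degradation looks like depends on the product, but the core is usually a small guard that watches recent latency and flips the interface into a fallback mode (for example, hiding the live transcript and showing a "transcript delayed" notice). The sketch below is one minimal way to do that, assuming you already measure per-transcript latency in milliseconds yourself. The class name, window size and the 500ms/300ms thresholds are illustrative assumptions, not Deepgram settings.

```python
from collections import deque
from statistics import median


class LatencyGuard:
    """Tracks recent transcript latencies and decides whether to show live text."""

    def __init__(self, window=20, degrade_ms=500.0, recover_ms=300.0):
        self.samples = deque(maxlen=window)  # most recent latency samples only
        self.degrade_ms = degrade_ms  # switch to fallback above this median
        self.recover_ms = recover_ms  # switch back only below this median
        self.degraded = False

    def record(self, latency_ms):
        """Add one latency sample and return the current mode."""
        self.samples.append(latency_ms)
        typical = median(self.samples)
        if not self.degraded and typical > self.degrade_ms:
            self.degraded = True
        elif self.degraded and typical < self.recover_ms:
            self.degraded = False
        return "degraded" if self.degraded else "live"


if __name__ == "__main__":
    guard = LatencyGuard(window=5)
    for latency in (200, 220, 250, 800, 800, 800, 200, 200, 200):
        print(latency, guard.record(latency))
```

Using the median over a window means a single slow packet does not trigger the fallback, and the gap between the two thresholds stops the interface flickering between modes when latency hovers near the limit.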
Compared to AssemblyAI
The split mirrors what we covered in the AssemblyAI review: Deepgram for real-time streaming and the lowest latency, AssemblyAI for batch processing and rich audio intelligence. Deepgram’s Nova-2 typically beats AssemblyAI’s standard model on streaming latency by a meaningful margin; AssemblyAI’s audio intelligence layer goes deeper than Deepgram’s analytics. Many serious operations use both — Deepgram for the live agent-assist surface, AssemblyAI for the weekly post-call analysis. The decision rule: if latency under 500ms matters, Deepgram; if rich insight on completed audio matters, AssemblyAI.
Our take
For ecommerce shops building real-time voice features into their stack — live transcripts in support tooling, voice search, voice agents that need crisp conversation feel — Deepgram is the strongest STT platform on the market in 2026. The Nova-2 latency genuinely matters: agent-assist surfaces with 200ms transcription feel usable, surfaces with 800ms transcription feel broken. The barrier is real (it’s an API, not a product) but for shops with engineering or with vendors who’ve built on top of Deepgram, the speech-to-text foundation is rock solid. For batch audio analytics where rich post-call insight matters more than streaming latency, AssemblyAI is the better choice. Most production deployments end up running both, with each platform owning the use case it’s optimised for rather than asking either to handle the full audio stack.
FAQ
How fast is Deepgram’s real-time transcription?
Sub-300ms latency under typical conditions for the Nova-2 model in real-time streaming mode. Latency is sensitive to network conditions and audio quality; mobile-network audio over constrained connections can occasionally see spikes above 500ms. Most production deployments design for graceful degradation when latency drifts.
Does Deepgram support British English well?
Yes, British English is a primary supported language, and accuracy on standard accents is comparable to US English. Accuracy drops on heavy regional dialects; the Nova-2 model improves accent resilience but does not fully close the gap on broad regional accents.
Does Deepgram do text-to-speech as well as speech-to-text?
Yes — the Aura model added in 2024 provides text-to-speech alongside the established STT capability. For ecommerce shops building voice features, this means a single vendor relationship covers both directions of voice AI rather than mixing Deepgram for STT with ElevenLabs or PlayHT for TTS. Aura’s voice quality is solid but not class-leading; ElevenLabs still wins on premium voice quality, but Aura is more than acceptable for most production use cases.