When I was choosing an ASR engine for Echo, I spent two weeks benchmarking. Whisper is the default choice for most indie voice apps. But there's a newer contender from ByteDance that deserves attention: Seed-ASR 2.0 (marketed as "Doubao").
Here's what I found.
| Metric | Whisper Large-v3 | Seed-ASR 2.0 |
|---|---|---|
| Parameters | 1.55B | Not disclosed (MoE) |
| Languages | 99 languages | 13 languages + dialects |
| Chinese WER (AIShell) | ~5.5% | ~3.2% |
| English WER (LibriSpeech) | ~3.0% | ~3.5% |
| Mixed CN-EN | Decent | Excellent |
| Streaming (live) | ❌ Batch only | ✅ 300-400ms first partial |
| Context/keywords | Generic | +20% keyword recall (hot words API) |
| Open source | ✅ Yes | ❌ API only |
| Pricing | ~$0.006/min (~$0.36/hr, OpenAI API) | ~$0.013/hr (Volcano) |
| Can run on-device | ✅ Yes (500MB Small) | ❌ Cloud only |
Chinese accuracy is the biggest gap. Whisper was trained on internet-scraped audio with a heavy English bias. Seed-ASR was trained by ByteDance — the company behind Douyin (Chinese TikTok) — on massive amounts of native Chinese speech data. For Mandarin, Cantonese, and regional dialects (Sichuanese, Shanghainese), Seed wins decisively.
Whisper is a batch model; it was never designed for real-time transcription. You send an audio chunk and get a result back. For live voice input, that means waiting 500ms-2s after you stop speaking before any text appears.
Seed-ASR has a dedicated streaming mode (volc.seedasr.sauc.duration) with ~300-400ms first-character latency. Text appears as you speak, like captions. For voice input UX, this is a completely different feel.
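To feel the difference, here's a toy simulation of the two UX models. No real ASR or network is involved; the per-word partials and timings are illustrative, not measured:

```python
import asyncio

async def batch_transcribe(utterance: str, delay: float = 0.5) -> str:
    """Whisper-style batch: nothing is shown until the full result returns."""
    await asyncio.sleep(delay)  # user watches a blank box
    return utterance

async def streaming_partials(utterance: str, delay: float = 0.05):
    """Seed-ASR-style streaming: a growing partial transcript per chunk."""
    text = ""
    for word in utterance.split():
        await asyncio.sleep(delay)  # per-chunk latency
        text = (text + " " + word).strip()
        yield text  # render immediately, caption-style

async def collect(utterance: str) -> list[str]:
    return [p async for p in streaming_partials(utterance)]

phrase = "text appears as you speak"
partials = asyncio.run(collect(phrase))
final = asyncio.run(batch_transcribe(phrase))
print(partials[0])            # first partial arrives after ~50ms, not ~500ms
print(partials[-1] == final)  # both converge to the same transcript
```

The point of the sketch: with streaming, the first partial reaches the screen an order of magnitude sooner, even though the final transcript is identical.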
When I say "帮我发个 email 给 team" ("help me send an email to the team"), Whisper often transcribes the English words phonetically into Chinese characters (发个伊妹儿给替姆, i.e. "email" and "team" rendered as sound-alike hanzi). Seed-ASR preserves the English words as English. For bilingual users, this alone is a dealbreaker.
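A cheap way to regression-test this behavior in your own pipeline is to assert that known English terms survive transcription verbatim. A sketch (`preserves_english` is a name I made up, not part of either API):

```python
def preserves_english(transcript: str, expected_terms: list[str]) -> bool:
    """True if every expected English term appears verbatim in the transcript."""
    lowered = transcript.lower()
    return all(term.lower() in lowered for term in expected_terms)

# Seed-ASR-style output keeps the English words intact:
print(preserves_english("帮我发个 email 给 team", ["email", "team"]))   # True
# Phonetic hanzi rendering drops them:
print(preserves_english("帮我发个伊妹儿给替姆", ["email", "team"]))      # False
```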
OpenAI charges $0.006 per minute for Whisper API (~$0.36/hour). Volcano charges about ~$0.013/hour for Seed-ASR 2.0 — ~27x cheaper per hour. This matters if you're running a voice app at scale.
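The arithmetic behind that claim, with a made-up free-tier usage number to show why it matters at scale (the 10,000-users figure is hypothetical, not Echo's actual traffic):

```python
WHISPER_PER_MIN = 0.006   # OpenAI Whisper API, USD per audio minute
SEED_PER_HOUR = 0.013     # quoted Volcano rate, USD per audio hour

whisper_per_hour = WHISPER_PER_MIN * 60           # 0.36 USD/hour
ratio = whisper_per_hour / SEED_PER_HOUR          # ~27.7x
print(f"{whisper_per_hour:.2f} vs {SEED_PER_HOUR} USD/hr ({ratio:.1f}x)")

# Hypothetical scale: 10,000 free-tier users x 30 min of audio per month
hours = 10_000 * 0.5
print(f"Whisper: ${whisper_per_hour * hours:,.0f}/mo  "
      f"Seed: ${SEED_PER_HOUR * hours:,.0f}/mo")
```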
For clean English speech, Whisper Large-v3 still edges out Seed-ASR 2.0 slightly. Not by much, but if your users are 100% English, Whisper is a safe choice.
Whisper handles 99 languages. Seed-ASR supports 13. If you need Arabic, Hindi, or Swahili, Whisper is your only option.
Whisper models are downloadable and MIT-licensed. You can run them on-device via CoreML (Small = 500MB) or your own servers. Seed-ASR is API-only.
If you need offline voice recognition (privacy-sensitive apps, field work, flights), Whisper runs locally. Seed-ASR requires a network connection.
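To make the tradeoffs above concrete, here's a hypothetical engine picker that encodes the rules from this comparison. The function name, flags, and the language subset are mine, not from either API:

```python
def pick_engine(needs_offline: bool, needs_streaming: bool,
                languages: set[str]) -> str:
    """Encode the comparison: offline or long-tail languages -> Whisper;
    streaming or Chinese/bilingual -> Seed-ASR 2.0."""
    SEED_LANGS = {"zh", "en", "yue"}  # illustrative subset of Seed's 13
    if needs_offline or not languages.issubset(SEED_LANGS):
        return "whisper"  # only Whisper runs on-device / covers 99 languages
    if needs_streaming or "zh" in languages:
        return "seed-asr-2.0"  # streaming UX + Chinese accuracy
    return "whisper"  # clean English, no streaming: Whisper still edges ahead

print(pick_engine(False, True, {"zh", "en"}))  # seed-asr-2.0
print(pick_engine(True, False, {"en"}))        # whisper
print(pick_engine(False, False, {"ar"}))       # whisper (language coverage)
```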
Our users are heavily Chinese-English bilingual. Streaming UX is a first-class priority (text should appear as you speak). Cost matters because we offer a free tier.
Seed-ASR 2.0 wins on all three axes for our audience.
That said, we keep BigASR 1.0 as an admin-toggleable fallback for A/B testing, and Whisper remains on the roadmap for an on-device fallback that Pro users can enable for offline privacy.
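An admin-toggleable fallback like that is easiest when every engine sits behind one interface. A minimal sketch, with stub engines and invented class names standing in for the real clients:

```python
from typing import Protocol

class ASREngine(Protocol):
    def transcribe(self, audio: bytes) -> str: ...

class EngineRouter:
    """Route to the primary engine unless an admin flips the fallback flag."""

    def __init__(self, primary: ASREngine, fallback: ASREngine) -> None:
        self.primary = primary
        self.fallback = fallback
        self.use_fallback = False  # flipped from an admin panel for A/B tests

    def transcribe(self, audio: bytes) -> str:
        engine = self.fallback if self.use_fallback else self.primary
        return engine.transcribe(audio)

class Stub:
    """Placeholder for a real engine client."""
    def __init__(self, name: str) -> None:
        self.name = name
    def transcribe(self, audio: bytes) -> str:
        return f"[{self.name}] ..."

router = EngineRouter(Stub("seed-asr-2.0"), Stub("bigasr-1.0"))
print(router.transcribe(b""))  # [seed-asr-2.0] ...
router.use_fallback = True
print(router.transcribe(b""))  # [bigasr-1.0] ...
```

The same shape later accommodates an on-device Whisper engine as a third implementation of the protocol.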
Our current architecture:
- `volc.seedasr.sauc.duration` — streaming, ~300ms first partial
- `volc.seedasr.auc` — high-accuracy batch (submit/query)

If you're building a voice app and need to pick one ASR today:
Whisper is still the gold standard for accuracy on clean English. Seed-ASR wins on every dimension that matters for real-world multilingual voice input apps in 2026.
Curious how Seed-ASR feels in practice? Echo uses it by default. Free to download.
Download Echo (iOS) →

Questions? DM me on X @EchoVoiceApp.
— Xiang, solo maker of Echo