
Seed-ASR 2.0 vs Whisper: Which Wins for Voice Input?

By Xiang · April 16, 2026 · 6 min read

When I was choosing an ASR engine for Echo, I spent two weeks benchmarking. Whisper is the default choice for most indie voice apps. But there's a newer contender from ByteDance that deserves attention: Seed-ASR 2.0 (marketed as "Doubao").

Here's what I found.

The Headline Numbers

| Metric | Whisper Large-v3 | Seed-ASR 2.0 |
|---|---|---|
| Parameters | 1.55B | Not disclosed (MoE) |
| Languages | 99 languages | 13 languages + dialects |
| Chinese WER (AIShell) | ~5.5% | ~3.2% |
| English WER (LibriSpeech) | ~3.0% | ~3.5% |
| Mixed CN-EN | Decent | Excellent |
| Streaming (live) | ❌ Batch only | ✅ 300-400ms first partial |
| Context/keywords | Generic | +20% keyword recall (hot words API) |
| Open source | ✅ Yes | ❌ API only |
| Pricing | ~$0.006/min (OpenAI API) | ~$0.013/hour (Volcano) |
| Can run on-device | ✅ Yes (500MB Small) | ❌ Cloud only |

Where Seed-ASR 2.0 Wins

1. Chinese Accuracy

This is the biggest gap. Whisper was trained on internet-scraped audio with heavy English bias. Seed-ASR was trained by ByteDance — the company behind Douyin (Chinese TikTok) — with massive native Chinese speech data. For Mandarin, Cantonese, and regional dialects (Sichuan, Shanghainese), Seed wins decisively.

2. Streaming Latency

Whisper is a batch model. It was never designed for real-time transcription: you send a complete audio chunk, you get a complete result back. For live voice input, this means you wait 500ms-2s after you stop speaking before seeing any text.

Seed-ASR has a dedicated streaming mode (volc.seedasr.sauc.duration) with ~300-400ms first-character latency. Text appears as you speak, like captions. For voice input UX, this is a completely different feel.
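Here's roughly what consuming that stream looks like from a client. This is a minimal sketch only: the token passing, the "end" event, and the "text"/"final" fields are illustrative assumptions, not Volcano's documented wire format.

```python
# Minimal streaming-client sketch. The URL scheme, token passing, and JSON
# fields ("text", "final", the "end" event) are hypothetical placeholders --
# consult Volcano's SeedASR docs for the real protocol.
import asyncio
import json

import websockets  # pip install websockets


async def stream_transcribe(pcm_chunks, url, token):
    """Send raw PCM chunks; print partial transcripts as they arrive."""
    async with websockets.connect(f"{url}?token={token}") as ws:

        async def sender():
            for chunk in pcm_chunks:                     # e.g. 100 ms per chunk
                await ws.send(chunk)                     # binary audio frame
            await ws.send(json.dumps({"event": "end"}))  # hypothetical end marker

        async def receiver():
            async for message in ws:
                result = json.loads(message)
                print(result.get("text", ""))  # first partial in ~300-400 ms
                if result.get("final"):
                    break

        await asyncio.gather(sender(), receiver())
```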

3. Mixed-Language Code-Switching

When I say "帮我发个 email 给 team" ("help me send an email to the team"), Whisper often transcribes the English words phonetically in Chinese characters (发个伊妹儿给替姆, i.e. "email" and "team" rendered as Chinese syllables). Seed-ASR preserves English words as English. For bilingual users, this alone can be a dealbreaker.

4. Pricing

OpenAI charges $0.006 per minute for the Whisper API (~$0.36/hour). Volcano charges roughly $0.013/hour for Seed-ASR 2.0, about 27x cheaper per hour. This matters if you're running a voice app at scale.
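The arithmetic, using the list prices quoted above (both subject to change):

```python
# Back-of-envelope cost comparison from the list prices above.
whisper_per_min = 0.006                  # USD/min, OpenAI Whisper API
whisper_per_hour = whisper_per_min * 60  # $0.36/hour
seed_per_hour = 0.013                    # USD/hour, Volcano Seed-ASR 2.0

print(f"Whisper: ${whisper_per_hour:.2f}/hr vs Seed: ${seed_per_hour:.3f}/hr")
print(f"Ratio: ~{whisper_per_hour / seed_per_hour:.1f}x")  # ~27.7x
```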

Where Whisper Wins

1. English Accuracy

For clean English speech, Whisper Large-v3 still edges out Seed-ASR 2.0 slightly. Not by much, but if your users are 100% English, Whisper is a safe choice.

2. Language Breadth

Whisper handles 99 languages. Seed-ASR supports 13. If you need Arabic, Hindi, or Swahili, Whisper is your only option.

3. Open Source

Whisper models are downloadable and MIT-licensed. You can run them on-device via CoreML (Small = 500MB) or your own servers. Seed-ASR is API-only.
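To see what that buys you in practice, here's local transcription with the open-source package in a few lines; the filename is just an example:

```python
# Local Whisper via the open-source package: pip install openai-whisper
# (needs ffmpeg on PATH; "small" downloads ~500 MB of weights on first run).
import whisper

model = whisper.load_model("small")
result = model.transcribe("memo.wav")  # any ffmpeg-readable audio file
print(result["text"])
```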

4. Offline Capability

If you need offline voice recognition (privacy-sensitive apps, field work, flights), Whisper runs locally. Seed-ASR requires a network connection.

Why Echo Uses Seed-ASR 2.0 (with BigASR fallback)

Our users are heavily Chinese-English bilingual. Streaming UX is a first-class priority (text should appear as you speak). Cost matters because we offer a free tier.

Seed-ASR 2.0 wins on all three axes for our audience.

That said, we keep BigASR 1.0 as an admin-toggleable fallback for A/B testing, and Whisper remains on the roadmap for an on-device fallback that Pro users can enable for offline privacy.

Our current architecture:
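In rough pseudocode (a simplified sketch; the function and setting names are illustrative, not our actual code):

```python
# Simplified routing sketch -- names are illustrative, not Echo's real code.
def transcribe(audio, settings):
    if settings.on_device_mode:        # roadmap: offline Whisper for Pro users
        return local_whisper(audio)
    if settings.use_bigasr_fallback:   # admin toggle, used for A/B tests
        return bigasr_v1(audio)
    return seed_asr_v2_stream(audio)   # default: streaming Seed-ASR 2.0
```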

The Honest Verdict

If you're building a voice app and need to pick one ASR today:

Whisper is still the gold standard for accuracy on clean English, and the only option if you need offline use or long-tail languages. Seed-ASR wins on every dimension that matters for real-world multilingual voice input apps in 2026: Chinese accuracy, streaming latency, code-switching, and cost.

Try Echo

Curious how Seed-ASR feels in practice? Echo uses it by default. Free to download.

Download Echo (iOS) →

Questions? DM me on X @EchoVoiceApp.

— Xiang, solo maker of Echo