Lesson 6 · 10 min
Audio — ASR and TTS in production
Whisper-class speech-to-text is good enough to ship, and so is high-quality text-to-speech. The decisions that actually matter: streaming, diarization, and the ethics of voice cloning.
ASR is mostly solved
Whisper, AssemblyAI, Deepgram, and several open-source alternatives reach word error rates (WER) under 5% on clean English speech and perform acceptably in 50+ other languages. In 2026 the interesting choices are operational rather than about model quality:
- Streaming vs. batch: streaming for live transcription (meeting tools, support agent assist); batch for archives (podcast indexing, recorded calls).
- Diarization: who said what. Critical for support-call transcripts; nice-to-have for monologues.
- Forced alignment: aligning the transcript to the audio to get word-level timestamps. Required for video captioning, optional otherwise.
- PII redaction: built in for some providers (Deepgram), bring-your-own for others.
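Diarization and word-level timestamps are often produced separately and then joined, so it helps to see how the join works. The sketch below is a minimal illustration, not any provider's API: the `Word` and `Turn` types, the function names, and the sample data are all hypothetical. It assumes you already have word timestamps from the ASR step and speaker turns from a diarizer, and it assigns each word to the turn it overlaps most.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    start: float  # seconds from the start of the audio
    end: float


@dataclass
class Turn:
    speaker: str
    start: float
    end: float


def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def label_words(words, turns):
    """Pair each word with the speaker whose turn it overlaps most."""
    labeled = []
    for w in words:
        best = max(
            turns,
            key=lambda t: overlap(w.start, w.end, t.start, t.end),
            default=None,
        )
        if best is None or overlap(w.start, w.end, best.start, best.end) == 0.0:
            # The word falls in a gap the diarizer did not cover.
            labeled.append(("unknown", w))
        else:
            labeled.append((best.speaker, w))
    return labeled


def to_utterances(labeled):
    """Merge consecutive words from the same speaker into (speaker, text) pairs."""
    groups = []
    for speaker, w in labeled:
        if groups and groups[-1][0] == speaker:
            groups[-1][1].append(w.text)
        else:
            groups.append((speaker, [w.text]))
    return [(speaker, " ".join(texts)) for speaker, texts in groups]


# Hypothetical sample data standing in for real ASR and diarizer output.
words = [
    Word("Hi", 0.0, 0.3), Word("there", 0.3, 0.6),
    Word("Hello", 1.0, 1.4),
    Word("I", 2.0, 2.1), Word("need", 2.1, 2.4), Word("help", 2.4, 2.8),
]
turns = [
    Turn("customer", 0.0, 0.8),
    Turn("agent", 0.9, 1.5),
    Turn("customer", 1.9, 3.0),
]

transcript = to_utterances(label_words(words, turns))
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Assigning by maximum overlap is one simple policy. Overlapping speech and words that straddle a turn boundary usually need more care in real call audio, and some providers return speaker labels on each word so that no join is needed.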