Voice-mode in 2025: lessons from a year of production deployments

Speech-to-speech models are everywhere now. The teams that succeeded with them share three patterns; the teams that failed share three others.

What worked

  1. Constrained domains. Voice agents that handle 3-5 specific tasks (booking, order status, billing questions) ship and work. Open-ended voice agents are still rough.
  2. Fast interruption handling. The user can cut off the bot mid-sentence and the bot recovers. This is the biggest UX upgrade since 2024 (a minimal sketch follows this list).
  3. Hybrid escalation. "Let me transfer you to a human" has to work seamlessly when the agent is stuck. The fallback path is the product.
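
The mechanics behind item 2 come down to one move: the moment voice activity detection (VAD) reports user speech, cancel the bot's outgoing audio and truncate the recorded assistant turn to what was actually played. A minimal sketch in Python; `playback_task`, `history`, `ms_played`, and the chars-per-second heuristic are all assumptions, not any vendor's API:

```python
import asyncio

def ms_to_chars(ms_played: int, chars_per_sec: int = 15) -> int:
    """Rough speech-rate heuristic (an assumption; tune per TTS voice)."""
    return ms_played * chars_per_sec // 1000

async def on_user_speech_started(playback_task: asyncio.Task,
                                 history: list[dict],
                                 ms_played: int) -> None:
    """Fires on a VAD 'user started speaking' event while the bot is talking."""
    if not playback_task.done():
        playback_task.cancel()  # stop outgoing TTS audio within one frame
    # Truncate the assistant turn to roughly what the user actually heard,
    # so the conversation history matches the audio, not the full script.
    history[-1]["content"] = history[-1]["content"][: ms_to_chars(ms_played)]
```

The truncation step matters as much as the cancel: if the history claims the user heard the full sentence, the model's next turn will reference things that were never said aloud.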

What failed

  1. Open-ended assistants. Without constraints, voice agents wander, and users lose patience by turn 4.
  2. Latency >800ms. Round-trip latency matters far more in voice than in text chat. Above 800ms the conversation feels off and users hang up (a measurement sketch follows this list).
  3. Pretending to be human. Users figure it out. Teams that disclose up front, "I'm an AI assistant; I can transfer you to a human," see higher CSAT than those that don't.
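
Enforcing that budget starts with measuring the right interval: end of user speech to first bot audio byte, not request to response. A minimal sketch, assuming your pipeline exposes VAD and TTS events to hook the two marks into (the class and method names are mine, not a vendor's):

```python
import time

LATENCY_BUDGET_MS = 800  # the hang-up threshold from the field reports above

class TurnTimer:
    """Measures end-of-user-speech to first-bot-audio latency."""

    def __init__(self) -> None:
        self._t0: float | None = None

    def user_stopped_speaking(self) -> None:
        # Mark the moment VAD reports end of user speech.
        self._t0 = time.monotonic()

    def first_bot_audio(self) -> float:
        # Mark the first outgoing TTS audio byte; returns latency in ms.
        assert self._t0 is not None, "user_stopped_speaking() was never called"
        elapsed_ms = (time.monotonic() - self._t0) * 1000
        if elapsed_ms > LATENCY_BUDGET_MS:
            # Log rather than fail: one slow turn is a metric,
            # a pattern of them is a product problem.
            print(f"slow turn: {elapsed_ms:.0f}ms > {LATENCY_BUDGET_MS}ms budget")
        return elapsed_ms
```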

Stack consensus

Most production deployments converged on the same stack (the timeout and escalation pieces are sketched below):

  - realtime API (OpenAI / Anthropic / Hume)
  - small-talk filler tokens
  - async tool calls during pauses
  - a hard 3-second timeout on tool calls
  - escalation to a human after N failed turns
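
The last two items compose naturally: wrap every tool call in the hard cap and count consecutive failures toward the handoff. A minimal sketch using asyncio.wait_for; MAX_FAILED_TURNS and the returned action labels are assumptions standing in for whatever your dialogue manager expects:

```python
import asyncio
from typing import Any, Awaitable

TOOL_TIMEOUT_S = 3.0    # the hard cap described above
MAX_FAILED_TURNS = 3    # "N" from above; an assumption, tune per domain

async def run_tool(tool_call: Awaitable[Any]) -> Any | None:
    """Run a tool call under the hard timeout; None signals a timed-out call."""
    try:
        return await asyncio.wait_for(tool_call, timeout=TOOL_TIMEOUT_S)
    except asyncio.TimeoutError:
        return None

async def handle_turn(state: dict, tool_call: Awaitable[Any]) -> str:
    """One turn of the failure-counting loop; returns a next-action label."""
    result = await run_tool(tool_call)
    if result is None:
        state["failed_turns"] = state.get("failed_turns", 0) + 1
        if state["failed_turns"] >= MAX_FAILED_TURNS:
            return "transfer_to_human"      # the fallback path is the product
        return "acknowledge_and_retry"
    state["failed_turns"] = 0               # success resets the counter
    return "respond_with_result"
```

Resetting the counter on success is deliberate: counting total rather than consecutive failures would punish long conversations.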
