Pro Voice Agents

Production voice AI architecture for building real-time conversational agents — covering latency optimization, speech-to-text, text-to-speech, turn-taking, and the two core architecture patterns.

When to Use

Build voice agents when:

  • Need real-time voice conversations (customer support, virtual assistants)
  • Phone-based interactions requiring natural turn-taking
  • Applications where voice is the primary interface
  • Replacing IVR systems with intelligent voice bots

Use text-based interfaces when:

  • Network conditions are too unreliable to sustain real-time audio
  • Users prefer typing (chat support)
  • Complex data input (forms, code)

Quick Start

Architecture 1: Speech-to-Speech (Lowest Latency)

# End-to-end model: audio in → audio out
# Latency: 200-500ms total
from voice_agent import SpeechToSpeechAgent

agent = SpeechToSpeechAgent(
    model="gpt-4o-realtime",
    voice="alloy",
    turn_detection="server_vad",  # Voice Activity Detection
    temperature=0.8,
)

# WebSocket connection
async def handle_audio_stream(websocket):
    async for audio_chunk in websocket:
        response_audio = await agent.process(audio_chunk)
        await websocket.send(response_audio)

Architecture 2: Pipeline (Most Flexible)

# STT → LLM → TTS pipeline
# Latency: 500-1500ms total
from voice_agent import PipelineAgent

agent = PipelineAgent(
    stt_model="whisper-large-v3",          # Speech-to-text
    llm_model="claude-sonnet-4-20250514",  # Language model
    tts_model="elevenlabs-turbo-v2.5",     # Text-to-speech
    streaming=True,                        # Stream TTS as LLM generates
)

async def handle_call(audio_stream):
    transcript = await agent.stt.transcribe(audio_stream)
    response_text = await agent.llm.generate(transcript)
    response_audio = await agent.tts.synthesize(response_text)
    return response_audio

Core Concepts

Latency Budget

Component          Target      Typical
Speech-to-Text     < 200ms     100-300ms
LLM Processing     < 500ms     200-1000ms
Text-to-Speech     < 200ms     100-500ms
Network            < 100ms     20-100ms
Total Pipeline     < 1000ms    500-1500ms
Speech-to-Speech   < 500ms     200-500ms
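As a sanity check on the numbers above, measured per-component latencies can be compared against the pipeline targets. The names and thresholds below mirror the table; `check_budget` itself is a hypothetical helper for illustration, not part of the voice_agent library.

```python
# Per-component targets from the latency budget table (pipeline architecture).
PIPELINE_TARGETS_MS = {"stt": 200, "llm": 500, "tts": 200, "network": 100}

def check_budget(measured_ms):
    """Return (total_ms, components_over_target) for one measured turn."""
    total = sum(measured_ms.values())
    over = [name for name, ms in measured_ms.items()
            if ms > PIPELINE_TARGETS_MS[name]]
    return total, over

total, over = check_budget({"stt": 150, "llm": 700, "tts": 180, "network": 40})
# total == 1070 (over the < 1000ms pipeline budget); over == ["llm"]
```

In this example the turn blows the budget entirely because of the LLM stage, which is where profiling effort should go first.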

Architecture Comparison

Feature            Speech-to-Speech        Pipeline
Latency            200-500ms               500-1500ms
Flexibility        Limited                 High
Voice quality      Native                  Customizable
Language support   English-focused         Multilingual
Cost               Higher (single model)   Modular pricing
Customization      Low                     High
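The comparison above can be condensed into a toy decision rule: pipeline wins whenever you need custom models, voices, or broad language support; speech-to-speech wins when latency dominates. `choose_architecture` is a hypothetical helper encoding that heuristic, not a library API.

```python
def choose_architecture(max_latency_ms, needs_custom_models, multilingual):
    """Pick an architecture from the comparison table's trade-offs."""
    if needs_custom_models or multilingual:
        return "pipeline"          # flexibility and language support win
    if max_latency_ms <= 500:
        return "speech-to-speech"  # only architecture that can hit <= 500ms
    return "pipeline"              # latency is relaxed; keep modularity

choose_architecture(400, needs_custom_models=False, multilingual=False)
# → "speech-to-speech"
```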

Turn-Taking

# Voice Activity Detection (VAD) settings
vad_config = {
    "type": "server_vad",
    "threshold": 0.5,            # Sensitivity (0-1)
    "prefix_padding_ms": 300,    # Audio before speech detected
    "silence_duration_ms": 500,  # Silence before turn ends
}

# Interruption handling
agent.on_interruption = "cancel_and_listen"  # Stop speaking, listen
# Options: "cancel_and_listen", "ignore", "queue"
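The `cancel_and_listen` behaviour can be sketched with plain asyncio: race agent playback against a "user started speaking" signal from VAD, and cancel playback if the user barges in. Everything below (`play_audio`, the event) is a stand-in for real transport code, not the library's internals.

```python
import asyncio

async def play_audio(chunks):
    """Stand-in for streaming agent audio out to the caller."""
    for chunk in chunks:
        await asyncio.sleep(0.01)  # pretend to send one audio frame

async def speak_with_barge_in(chunks, user_started_speaking):
    """Play audio, but cancel and go back to listening on barge-in."""
    playback = asyncio.create_task(play_audio(chunks))
    vad = asyncio.create_task(user_started_speaking.wait())
    done, _ = await asyncio.wait({playback, vad},
                                 return_when=asyncio.FIRST_COMPLETED)
    if vad in done and not playback.done():
        playback.cancel()  # stop speaking immediately
        return "listening"
    vad.cancel()
    return "finished"

async def demo():
    interrupted = asyncio.Event()
    interrupted.set()  # simulate the user barging in right away
    return await speak_with_barge_in([b"\x00"] * 100, interrupted)

state = asyncio.run(demo())  # → "listening"
```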

Configuration

Parameter              Default                       Description
architecture           "pipeline"                    "speech-to-speech" or "pipeline"
stt_model              "whisper-large-v3"            Speech recognition model
llm_model              "claude-sonnet-4-20250514"    Language model
tts_model              "elevenlabs-turbo-v2.5"       Speech synthesis model
streaming              True                          Stream TTS during LLM generation
vad_threshold          0.5                           Voice detection sensitivity
silence_duration_ms    500                           Silence before turn ends
max_turn_duration_s    30                            Maximum speaking turn

Best Practices

  1. Target < 1 second total latency — users perceive > 1.5s as unresponsive
  2. Stream TTS alongside LLM generation — don't wait for the full response before speaking
  3. Use server-side VAD — more reliable than client-side for turn detection
  4. Handle interruptions gracefully — stop speaking and listen when the user interrupts
  5. Optimize for first-byte latency — the time to first audio is more important than total duration
  6. Add filler words for long processing — "Let me check..." while LLM generates
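Practice 2 above can be sketched as buffering LLM tokens and flushing to TTS at sentence boundaries, so the first sentence is speakable long before the full response exists. `llm_tokens` and `synthesize` are placeholders for real model calls, not actual APIs.

```python
import asyncio

async def llm_tokens():
    """Stand-in for a streaming LLM response."""
    for tok in ["Sure, ", "I can ", "help. ", "One ", "moment..."]:
        yield tok

async def synthesize(text):
    """Stand-in for a TTS call; returns fake audio bytes."""
    return text.encode()

async def stream_response():
    """Flush buffered text to TTS at each sentence boundary."""
    buffer = ""
    async for token in llm_tokens():
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield await synthesize(buffer)  # speak this sentence now
            buffer = ""
    if buffer:                              # flush any trailing text
        yield await synthesize(buffer)

async def main():
    return [chunk async for chunk in stream_response()]

chunks = asyncio.run(main())
# Two audio chunks, one per sentence; the first is playable before
# generation finishes
```

Sentence-level chunking is a common compromise: word-level chunks hurt TTS prosody, while waiting for the full response wastes the latency budget.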

Common Issues

High latency breaks conversation flow: Profile each component separately. Use streaming TTS. Consider speech-to-speech architecture for lowest latency. Pre-warm models to eliminate cold start.
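"Profile each component separately" can be as simple as a timing context manager wrapped around each stage. The `timed` helper below is illustrative, with `time.sleep` standing in for real STT/LLM calls.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed milliseconds

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

with timed("stt"):
    time.sleep(0.01)  # stand-in for agent.stt.transcribe(...)
with timed("llm"):
    time.sleep(0.02)  # stand-in for agent.llm.generate(...)

# timings now holds per-stage milliseconds for this turn
```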

Poor speech recognition accuracy: Use domain-specific STT models. Add keyword boosting for proper nouns. Implement real-time transcription correction.

Unnatural turn-taking: Tune VAD silence duration — too short causes interruptions, too long adds latency. Add backchanneling ("mm-hmm", "I see") during user speech.
