Pro Voice Agents
Production voice AI architecture for building real-time conversational agents — covering latency optimization, speech-to-text, text-to-speech, turn-taking, and the two core architecture patterns.
When to Use
Build voice agents when:
- You need real-time voice conversations (customer support, virtual assistants)
- Phone-based interactions require natural turn-taking
- Voice is the application's primary interface
- You are replacing IVR systems with intelligent voice bots
Use text-based interfaces when:
- Network conditions are too unreliable to sustain real-time audio
- Users prefer typing (chat support)
- Input involves complex structured data (forms, code)
Quick Start
Architecture 1: Speech-to-Speech (Lowest Latency)
```python
# End-to-end model: audio in → audio out
# Latency: 200-500ms total
from voice_agent import SpeechToSpeechAgent

agent = SpeechToSpeechAgent(
    model="gpt-4o-realtime",
    voice="alloy",
    turn_detection="server_vad",  # Voice Activity Detection
    temperature=0.8,
)

# WebSocket connection
async def handle_audio_stream(websocket):
    async for audio_chunk in websocket:
        response_audio = await agent.process(audio_chunk)
        await websocket.send(response_audio)
```
Architecture 2: Pipeline (Most Flexible)
```python
# STT → LLM → TTS pipeline
# Latency: 500-1500ms total
from voice_agent import PipelineAgent

agent = PipelineAgent(
    stt_model="whisper-large-v3",          # Speech-to-text
    llm_model="claude-sonnet-4-20250514",  # Language model
    tts_model="elevenlabs-turbo-v2.5",     # Text-to-speech
    streaming=True,                        # Stream TTS as LLM generates
)

async def handle_call(audio_stream):
    transcript = await agent.stt.transcribe(audio_stream)
    response_text = await agent.llm.generate(transcript)
    response_audio = await agent.tts.synthesize(response_text)
    return response_audio
```
Core Concepts
Latency Budget
| Component | Target | Typical |
|---|---|---|
| Speech-to-Text | < 200ms | 100-300ms |
| LLM Processing | < 500ms | 200-1000ms |
| Text-to-Speech | < 200ms | 100-500ms |
| Network | < 100ms | 20-100ms |
| Total Pipeline | < 1000ms | 500-1500ms |
| Speech-to-Speech | < 500ms | 200-500ms |
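A budget like this is only useful if it is enforced: record each component's measured latency and flag whatever exceeds its target. A minimal sketch (the `LatencyBudget` class and its method names are illustrative, not part of any real SDK):

```python
class LatencyBudget:
    """Tracks per-component latency (ms) against target budgets."""

    def __init__(self, targets):
        self.targets = dict(targets)  # component -> target ms
        self.measured = {}

    def record(self, component, ms):
        self.measured[component] = ms

    def total(self):
        return sum(self.measured.values())

    def over_budget(self):
        """Components whose measured latency exceeds their target."""
        return [c for c, ms in self.measured.items()
                if ms > self.targets.get(c, float("inf"))]


budget = LatencyBudget({"stt": 200, "llm": 500, "tts": 200, "network": 100})
budget.record("stt", 150)
budget.record("llm", 700)   # over its 500ms target
budget.record("tts", 180)
budget.record("network", 60)
print(budget.total())        # 1090
print(budget.over_budget())  # ['llm']
```

Logging this per turn in production makes it obvious which stage to optimize first, rather than guessing from end-to-end timings alone.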
Architecture Comparison
| Feature | Speech-to-Speech | Pipeline |
|---|---|---|
| Latency | 200-500ms | 500-1500ms |
| Flexibility | Limited | High |
| Voice quality | Native | Customizable |
| Language support | English-focused | Multilingual |
| Cost | Higher (single model) | Modular pricing |
| Customization | Low | High |
Turn-Taking
```python
# Voice Activity Detection (VAD) settings
vad_config = {
    "type": "server_vad",
    "threshold": 0.5,            # Sensitivity (0-1)
    "prefix_padding_ms": 300,    # Audio before speech detected
    "silence_duration_ms": 500,  # Silence before turn ends
}

# Interruption handling
agent.on_interruption = "cancel_and_listen"  # Stop speaking, listen
# Options: "cancel_and_listen", "ignore", "queue"
```
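The `silence_duration_ms` tradeoff can be made concrete with a toy end-of-turn detector: a turn ends only after enough consecutive low-energy frames accumulate. This is an illustrative sketch only; production VAD uses trained models, not a raw energy threshold:

```python
class TurnDetector:
    """Toy VAD: declares end-of-turn after silence_duration_ms of
    consecutive frames whose energy falls below threshold."""

    def __init__(self, threshold=0.5, silence_duration_ms=500, frame_ms=20):
        self.threshold = threshold
        self.silence_frames_needed = silence_duration_ms // frame_ms
        self.silent_frames = 0
        self.in_speech = False

    def process_frame(self, energy):
        """Returns 'turn_end' when the silence window is crossed, else None."""
        if energy >= self.threshold:
            self.in_speech = True
            self.silent_frames = 0
            return None
        if self.in_speech:
            self.silent_frames += 1
            if self.silent_frames >= self.silence_frames_needed:
                self.in_speech = False
                self.silent_frames = 0
                return "turn_end"
        return None


# 5 speech frames, then silence: turn ends after 500ms / 20ms = 25 silent frames.
det = TurnDetector(threshold=0.5, silence_duration_ms=500, frame_ms=20)
events = [det.process_frame(e) for e in [0.9] * 5 + [0.1] * 30]
print(events.count("turn_end"))  # 1
```

Shrinking `silence_duration_ms` makes this fire sooner (snappier, but cuts off slow speakers); growing it does the reverse, which is exactly the tuning tradeoff described under Common Issues.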
Configuration
| Parameter | Default | Description |
|---|---|---|
| architecture | "pipeline" | "speech-to-speech" or "pipeline" |
| stt_model | "whisper-large-v3" | Speech recognition model |
| llm_model | "claude-sonnet-4-20250514" | Language model |
| tts_model | "elevenlabs-turbo-v2.5" | Speech synthesis model |
| streaming | True | Stream TTS during LLM generation |
| vad_threshold | 0.5 | Voice detection sensitivity |
| silence_duration_ms | 500 | Silence before turn ends |
| max_turn_duration_s | 30 | Maximum speaking turn |
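One way to keep these defaults honest is a small config builder that merges overrides and rejects misspelled parameter names. A sketch only; `make_config` is a hypothetical helper, not part of any real voice-agent API:

```python
# Defaults taken from the configuration table above.
DEFAULTS = {
    "architecture": "pipeline",
    "stt_model": "whisper-large-v3",
    "llm_model": "claude-sonnet-4-20250514",
    "tts_model": "elevenlabs-turbo-v2.5",
    "streaming": True,
    "vad_threshold": 0.5,
    "silence_duration_ms": 500,
    "max_turn_duration_s": 30,
}

def make_config(**overrides):
    """Merge overrides into the defaults, rejecting unknown parameter names."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    return {**DEFAULTS, **overrides}


config = make_config(vad_threshold=0.7, silence_duration_ms=800)
print(config["vad_threshold"])  # 0.7
print(config["architecture"])   # pipeline
```

Failing fast on an unknown key (e.g. `vad_treshold`) is cheaper than debugging a silently ignored setting during a live call.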
Best Practices
- Target < 1 second total latency — users perceive > 1.5s as unresponsive
- Stream TTS alongside LLM generation — don't wait for the full response before speaking
- Use server-side VAD — more reliable than client-side for turn detection
- Handle interruptions gracefully — stop speaking and listen when the user interrupts
- Optimize for first-byte latency — the time to first audio is more important than total duration
- Add filler words for long processing — "Let me check..." while LLM generates
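Streaming TTS alongside generation usually means chunking the LLM token stream at sentence boundaries and handing each chunk to TTS as soon as it closes, so speech starts before the full response exists. A minimal chunker sketch (the token stream and boundary regex are illustrative):

```python
import re

def sentence_chunks(token_stream):
    """Yield sentence-sized chunks from an LLM token stream so TTS can
    start synthesizing before the full response is generated."""
    buffer = ""
    for token in token_stream:
        buffer += token
        # Flush at sentence boundaries so each chunk is natural to speak.
        while True:
            match = re.search(r"[.!?]\s+", buffer)
            if not match:
                break
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of stream


tokens = ["Hello", ", how", " can I", " help? ", "I can", " check that."]
print(list(sentence_chunks(tokens)))
# ['Hello, how can I help?', 'I can check that.']
```

Each yielded chunk goes straight to the TTS engine, which is what drives time-to-first-audio down even when total generation time is unchanged.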
Common Issues
High latency breaks conversation flow: Profile each component separately. Use streaming TTS. Consider speech-to-speech architecture for lowest latency. Pre-warm models to eliminate cold start.
Poor speech recognition accuracy: Use domain-specific STT models. Add keyword boosting for proper nouns. Implement real-time transcription correction.
Unnatural turn-taking: Tune VAD silence duration — too short causes interruptions, too long adds latency. Add backchanneling ("mm-hmm", "I see") during user speech.