Pro Voice Agents

Production voice AI architecture for building real-time conversational agents — covering latency optimization, speech-to-text, text-to-speech, turn-taking, and the two core architecture patterns.

When to Use

Build voice agents when:

  • Need real-time voice conversations (customer support, virtual assistants)
  • Phone-based interactions requiring natural turn-taking
  • Applications where voice is the primary interface
  • Replacing IVR systems with intelligent voice bots

Use text-based interfaces when:

  • Network conditions are too unreliable to sustain real-time audio
  • Users prefer typing (chat support)
  • Complex data input (forms, code)

Quick Start

Architecture 1: Speech-to-Speech (Lowest Latency)

# End-to-end model: audio in → audio out
# Latency: 200-500ms total
from voice_agent import SpeechToSpeechAgent

agent = SpeechToSpeechAgent(
    model="gpt-4o-realtime",
    voice="alloy",
    turn_detection="server_vad",  # Voice Activity Detection
    temperature=0.8,
)

# WebSocket connection
async def handle_audio_stream(websocket):
    async for audio_chunk in websocket:
        response_audio = await agent.process(audio_chunk)
        await websocket.send(response_audio)

Architecture 2: Pipeline (Most Flexible)

# STT → LLM → TTS pipeline
# Latency: 500-1500ms total
from voice_agent import PipelineAgent

agent = PipelineAgent(
    stt_model="whisper-large-v3",          # Speech-to-text
    llm_model="claude-sonnet-4-20250514",  # Language model
    tts_model="elevenlabs-turbo-v2.5",     # Text-to-speech
    streaming=True,                        # Stream TTS as LLM generates
)

async def handle_call(audio_stream):
    transcript = await agent.stt.transcribe(audio_stream)
    response_text = await agent.llm.generate(transcript)
    response_audio = await agent.tts.synthesize(response_text)
    return response_audio

Core Concepts

Latency Budget

Component          Target      Typical
Speech-to-Text     < 200ms     100-300ms
LLM Processing     < 500ms     200-1000ms
Text-to-Speech     < 200ms     100-500ms
Network            < 100ms     20-100ms
Total Pipeline     < 1000ms    500-1500ms
Speech-to-Speech   < 500ms     200-500ms
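As a sanity check on the numbers above, measured per-component latencies can be compared against the pipeline targets. The names and thresholds below mirror the table; `check_budget` itself is a hypothetical helper for illustration, not part of the voice_agent library.

```python
# Per-component targets from the latency budget table (pipeline architecture).
PIPELINE_TARGETS_MS = {"stt": 200, "llm": 500, "tts": 200, "network": 100}

def check_budget(measured_ms):
    """Return (total_ms, components_over_target) for one measured turn."""
    total = sum(measured_ms.values())
    over = [name for name, ms in measured_ms.items()
            if ms > PIPELINE_TARGETS_MS[name]]
    return total, over

total, over = check_budget({"stt": 150, "llm": 700, "tts": 180, "network": 40})
# total == 1070 (over the < 1000ms pipeline budget); over == ["llm"]
```

In this example the turn blows the budget entirely because of the LLM stage, which is where profiling effort should go first.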

Architecture Comparison

Feature            Speech-to-Speech        Pipeline
Latency            200-500ms               500-1500ms
Flexibility        Limited                 High
Voice quality      Native                  Customizable
Language support   English-focused         Multilingual
Cost               Higher (single model)   Modular pricing
Customization      Low                     High
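The comparison above can be condensed into a toy decision rule: pipeline wins whenever you need custom models, voices, or broad language support; speech-to-speech wins when latency dominates. `choose_architecture` is a hypothetical helper encoding that heuristic, not a library API.

```python
def choose_architecture(max_latency_ms, needs_custom_models, multilingual):
    """Pick an architecture from the comparison table's trade-offs."""
    if needs_custom_models or multilingual:
        return "pipeline"          # flexibility and language support win
    if max_latency_ms <= 500:
        return "speech-to-speech"  # only architecture that can hit <= 500ms
    return "pipeline"              # latency is relaxed; keep modularity

choose_architecture(400, needs_custom_models=False, multilingual=False)
# → "speech-to-speech"
```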

Turn-Taking

# Voice Activity Detection (VAD) settings
vad_config = {
    "type": "server_vad",
    "threshold": 0.5,            # Sensitivity (0-1)
    "prefix_padding_ms": 300,    # Audio before speech detected
    "silence_duration_ms": 500,  # Silence before turn ends
}

# Interruption handling
agent.on_interruption = "cancel_and_listen"  # Stop speaking, listen
# Options: "cancel_and_listen", "ignore", "queue"
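The `cancel_and_listen` behaviour can be sketched with plain asyncio: race agent playback against a "user started speaking" signal from VAD, and cancel playback if the user barges in. Everything below (`play_audio`, the event) is a stand-in for real transport code, not the library's internals.

```python
import asyncio

async def play_audio(chunks):
    """Stand-in for streaming agent audio out to the caller."""
    for chunk in chunks:
        await asyncio.sleep(0.01)  # pretend to send one audio frame

async def speak_with_barge_in(chunks, user_started_speaking):
    """Play audio, but cancel and go back to listening on barge-in."""
    playback = asyncio.create_task(play_audio(chunks))
    vad = asyncio.create_task(user_started_speaking.wait())
    done, _ = await asyncio.wait({playback, vad},
                                 return_when=asyncio.FIRST_COMPLETED)
    if vad in done and not playback.done():
        playback.cancel()  # stop speaking immediately
        return "listening"
    vad.cancel()
    return "finished"

async def demo():
    interrupted = asyncio.Event()
    interrupted.set()  # simulate the user barging in right away
    return await speak_with_barge_in([b"\x00"] * 100, interrupted)

state = asyncio.run(demo())  # → "listening"
```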

Configuration

Parameter              Default                       Description
architecture           "pipeline"                    "speech-to-speech" or "pipeline"
stt_model              "whisper-large-v3"            Speech recognition model
llm_model              "claude-sonnet-4-20250514"    Language model
tts_model              "elevenlabs-turbo-v2.5"       Speech synthesis model
streaming              True                          Stream TTS during LLM generation
vad_threshold          0.5                           Voice detection sensitivity
silence_duration_ms    500                           Silence before turn ends
max_turn_duration_s    30                            Maximum speaking turn

Best Practices

  1. Target < 1 second total latency — users perceive > 1.5s as unresponsive
  2. Stream TTS alongside LLM generation — don't wait for the full response before speaking
  3. Use server-side VAD — more reliable than client-side for turn detection
  4. Handle interruptions gracefully — stop speaking and listen when the user interrupts
  5. Optimize for first-byte latency — the time to first audio is more important than total duration
  6. Add filler words for long processing — "Let me check..." while LLM generates
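Practice 2 above can be sketched as buffering LLM tokens and flushing to TTS at sentence boundaries, so the first sentence is speakable long before the full response exists. `llm_tokens` and `synthesize` are placeholders for real model calls, not actual APIs.

```python
import asyncio

async def llm_tokens():
    """Stand-in for a streaming LLM response."""
    for tok in ["Sure, ", "I can ", "help. ", "One ", "moment..."]:
        yield tok

async def synthesize(text):
    """Stand-in for a TTS call; returns fake audio bytes."""
    return text.encode()

async def stream_response():
    """Flush buffered text to TTS at each sentence boundary."""
    buffer = ""
    async for token in llm_tokens():
        buffer += token
        if buffer.rstrip().endswith((".", "!", "?")):
            yield await synthesize(buffer)  # speak this sentence now
            buffer = ""
    if buffer:                              # flush any trailing text
        yield await synthesize(buffer)

async def main():
    return [chunk async for chunk in stream_response()]

chunks = asyncio.run(main())
# Two audio chunks, one per sentence; the first is playable before
# generation finishes
```

Sentence-level chunking is a common compromise: word-level chunks hurt TTS prosody, while waiting for the full response wastes the latency budget.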

Common Issues

High latency breaks conversation flow: Profile each component separately. Use streaming TTS. Consider speech-to-speech architecture for lowest latency. Pre-warm models to eliminate cold start.
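"Profile each component separately" can be as simple as a timing context manager wrapped around each stage. The `timed` helper below is illustrative, with `time.sleep` standing in for real STT/LLM calls.

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed milliseconds

@contextmanager
def timed(stage):
    """Record wall-clock time for one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = (time.perf_counter() - start) * 1000

with timed("stt"):
    time.sleep(0.01)  # stand-in for agent.stt.transcribe(...)
with timed("llm"):
    time.sleep(0.02)  # stand-in for agent.llm.generate(...)

# timings now holds per-stage milliseconds for this turn
```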

Poor speech recognition accuracy: Use domain-specific STT models. Add keyword boosting for proper nouns. Implement real-time transcription correction.

Unnatural turn-taking: Tune VAD silence duration — too short causes interruptions, too long adds latency. Add backchanneling ("mm-hmm", "I see") during user speech.
