Tools: How to Easily Build a Voice Agent with AssemblyAI
What Is an AI Voice Agent?
How AI Voice Agents Work
Core Components
Performance Requirements
Common Use Cases
Implementation Guide
Step 1: Environment Setup
Step 2: Audio Capture Class
Step 3: Streaming Speech-to-Text Integration
Step 4: LLM Response Generation
Step 5: Text-to-Speech
Complete Working Code
Production Considerations
Frequently Asked Questions

What Is an AI Voice Agent?

Voice agents are software systems that engage in natural speech conversations with users. Unlike traditional phone menus that require button presses, these agents process speech as you talk, understanding your words before you finish speaking.

How AI Voice Agents Work

The system processes conversations through a real-time pipeline with a target response time under one second.

1. Streaming Speech-to-Text. AssemblyAI's Universal-3 Pro Streaming achieves approximately 94% accuracy across varying audio conditions, and accuracy thresholds matter: a few percentage points separate a frustrating agent from a natural one.

2. LLM and Orchestration Layer. The language model serves as the agent's "brain," managing the conversation from understanding intent through producing each reply.

3. Text-to-Speech Synthesis. ElevenLabs, Google Cloud, and OpenAI offer natural-sounding voices. The key to real-time flow is starting speech generation before the language model has finished writing the complete response.

4. Integration and Business Logic. Voice agents connect to existing systems (CRM, calendars, payment processors, inventory management), with security considerations around API keys, encryption, and user authentication.

Run the complete script with: python voice_agent.py

Production agents are held to concrete targets for end-to-end response time, transcription accuracy, and voice quality. Beyond prototypes, infrastructure scaling involves WebSocket connection pooling, load balancing, database integration, and comprehensive monitoring.

Frequently Asked Questions

What response time targets matter?
Target under 1000ms total: ~200-400ms for speech recognition, ~300-600ms for language processing, ~200-400ms for synthesis, and ~100-200ms for network delays.

How do I handle user interruptions?

Implement barge-in detection by monitoring the audio stream during agent responses and stopping text-to-speech when user speech is detected. AssemblyAI's streaming API enables smooth interruption handling.

Which language model is best?

GPT-4 handles complex conversations requiring nuanced understanding; GPT-3.5 Turbo works for simpler interactions with lower latency and cost.

Do I need custom speech recognition models?

Modern pre-trained models handle most use cases without custom training. Universal-3 Pro achieves production accuracy out of the box. Use custom vocabulary features for specialized terminology.

How do I integrate with phone systems?
Cloud telephony providers like Twilio offer the easiest integration, or implement SIP trunking for direct infrastructure connection.
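As one illustration of the Twilio route, a call is typically handed off to your agent by returning TwiML whose Connect/Stream verb forwards the call's audio to a WebSocket server running the voice pipeline. A minimal sketch; the URL is a placeholder, not something from this article:

```python
def media_stream_twiml(ws_url: str) -> str:
    # Build a TwiML response that streams the caller's audio to a
    # WebSocket endpoint (where the STT -> LLM -> TTS loop would run).
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Connect>"
        f'<Stream url="{ws_url}" />'
        "</Connect>"
        "</Response>"
    )

print(media_stream_twiml("wss://example.com/agent-audio"))
```

Your webhook handler would return this XML for incoming calls; the WebSocket server then receives base64-encoded audio frames to feed into the transcriber.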
mkdir voice-agent
cd voice-agent
python -m venv venv
source venv/bin/activate  # Mac/Linux
# venv\Scripts\activate   # Windows
pip install assemblyai openai elevenlabs websockets pyaudio python-dotenv
ASSEMBLYAI_API_KEY=your_key
OPENAI_API_KEY=your_key
ELEVENLABS_API_KEY=your_key
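A missing key otherwise surfaces later as a confusing authentication error deep in the pipeline, so it can help to fail fast at startup. A small sketch using os.environ directly (load_dotenv() from python-dotenv would populate the environment from the .env file first); the helper name is illustrative:

```python
import os

REQUIRED_KEYS = ["ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]

def check_api_keys():
    # Return the names of any required keys that are unset or empty.
    return [k for k in REQUIRED_KEYS if not os.getenv(k)]

missing = check_api_keys()
print("Missing keys:", missing or "none")
```

In the agent itself you would raise or exit when the list is non-empty, before opening any streaming connections.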
class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()
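Note what chunk_size implies for latency: at 16 kHz, 8000 frames is half a second of audio per read, which sets a floor on how fresh the audio handed to the recognizer can be. The arithmetic:

```python
SAMPLE_RATE = 16000   # samples per second, as in AudioCapture
CHUNK_SIZE = 8000     # frames per buffer, as in AudioCapture

chunk_ms = CHUNK_SIZE / SAMPLE_RATE * 1000
print(f"Each chunk holds {chunk_ms:.0f} ms of audio")

# A smaller buffer trades more per-read overhead for lower capture
# latency; e.g. 1600 frames at 16 kHz would be 100 ms per read.
```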
def handle_transcript(self, transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"You: {transcript.text}")
        self.conversation_history.append(
            {"role": "user", "content": transcript.text}
        )
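The isinstance check is what keeps partial hypotheses out of the conversation history: a streaming recognizer emits revisable partial transcripts while you speak and a final transcript once you pause. The filtering logic can be illustrated with stand-in classes (these only mimic the shape of the SDK types; they are not the AssemblyAI SDK):

```python
# Stand-in transcript types that mimic the partial/final distinction
# of a streaming STT API; not the real AssemblyAI classes.
class PartialTranscript:
    def __init__(self, text):
        self.text = text

class FinalTranscript(PartialTranscript):
    pass

def collect_finals(events, history):
    # Append only finalized, non-empty utterances, mirroring
    # handle_transcript above.
    for t in events:
        if not t.text:
            continue
        if isinstance(t, FinalTranscript):
            history.append({"role": "user", "content": t.text})
    return history

events = [
    PartialTranscript("book a"),
    PartialTranscript("book a table"),
    FinalTranscript("Book a table for two."),
    FinalTranscript(""),
]
print(collect_finals(events, []))
```

Only the finalized sentence reaches the history; the intermediate partials are discarded rather than triggering duplicate LLM calls.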
def generate_and_speak_response(self):
    messages = [{
        "role": "system",
        "content": "Keep responses conversational and concise; aim for 1-2 sentences."
    }] + self.conversation_history
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0.7,
        max_tokens=150
    )
    ai_response = response.choices[0].message.content
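Because the full conversation_history is sent on every turn, long sessions steadily grow the prompt, and with it latency and token cost. One common mitigation, sketched here as an illustration rather than part of the original script, is to keep only the most recent turns (the system prompt is prepended separately, so it is never trimmed away):

```python
def trim_history(history, max_turns=10):
    # Keep only the last `max_turns` messages before building the
    # request; the system prompt is added per-request and unaffected.
    return history[-max_turns:]

history = [{"role": "user", "content": f"msg {i}"} for i in range(25)]
print(len(trim_history(history)))  # 10
```

More sophisticated agents summarize older turns instead of dropping them, but a sliding window is often enough for short task-oriented calls.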
def speak_text(self, text):
    audio = elevenlabs_client.generate(
        text=text,
        voice="Rachel",
        model="eleven_monolingual_v1"
    )
    stream(audio)
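To start playback before the whole reply is synthesized, a common trick is to split the reply at sentence boundaries and hand each sentence to TTS as soon as it is complete. A minimal splitter, shown as an illustration (a production agent would stream LLM tokens and cut on punctuation as they arrive):

```python
import re

def sentence_chunks(text):
    # Split on sentence-ending punctuation so each piece can be sent
    # to TTS while the rest of the reply is still being generated.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(sentence_chunks("Sure. Your order shipped yesterday! Anything else?"))
```

Feeding speak_text one chunk at a time lets the first sentence play while later ones are still in flight, shaving perceived latency.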
import os
import threading
from queue import Queue, Empty

from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))


class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()

    def _capture_audio(self):
        while self.recording:
            try:
                self.audio_queue.put(
                    self.stream.read(self.chunk_size, exception_on_overflow=False)
                )
            except Exception as e:
                print(f"Audio error: {e}")

    def get_audio_data(self):
        # Non-blocking read; returns None when no audio is buffered yet
        try:
            return self.audio_queue.get(timeout=0.1)
        except Empty:
            return None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()


class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append(
                {"role": "user", "content": transcript.text}
            )
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(
                    target=self.generate_and_speak_response, daemon=True
                ).start()

    def generate_and_speak_response(self):
        try:
            messages = [{
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses conversational."
            }] + self.conversation_history
            response = openai_client.chat.completions.create(
                model="gpt-4",
                messages=messages,
                temperature=0.7,
                max_tokens=150
            )
            ai_response = response.choices[0].message.content
            print(f"Agent: {ai_response}")
            self.conversation_history.append(
                {"role": "assistant", "content": ai_response}
            )
            audio = elevenlabs_client.generate(
                text=ai_response,
                voice="Rachel",
                model="eleven_monolingual_v1"
            )
            stream(audio)
        finally:
            self.is_processing = False

    def start_conversation(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )
        try:
            self.transcriber.connect()
            self.audio_capture.start_recording()
            print("Voice Agent ready - start speaking!")
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            self.audio_capture.stop_recording()
            self.transcriber.close()


if __name__ == "__main__":
    VoiceAgent().start_conversation()
Core pipeline components:
- Real-time streaming speech-to-text
- Language model comprehension
- Text-to-speech synthesis
- Conversation flow orchestration

Accuracy thresholds:
- Below 90%: Users experience frustration
- 90-93%: Functional but with occasional errors
- 93%+: Natural conversations with rare corrections

The orchestration layer manages:
- Intent recognition
- Context tracking
- Function calling for system integration
- Response generation

Target metrics for production agents:
- Total response time: <1000ms
- Speech accuracy: 93%+
- Name recognition: 95%+
- Number accuracy: 95%+
- Voice quality: Human-like

Common use cases:
- Customer support automation: Answer FAQs, check order status, escalate complex issues
- Appointment scheduling: Check availability, confirm details, send confirmations
- Lead qualification: Gather information, understand needs, route appropriately
- After-hours service: Extend availability beyond business hours

Key production requirements:
- Telephony integration (Twilio, SIP trunking)
- Concurrent conversation handling
- Error recovery for network failures
- Performance monitoring and metrics
- Security compliance and data protection
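For the error-recovery item above, transient failures (a dropped WebSocket, a timed-out API call) are usually handled with bounded retries and exponential backoff. A generic sketch, not tied to any particular SDK; the helper name is illustrative:

```python
import time

def retry_with_backoff(fn, attempts=4, base_delay=0.5):
    # Retry a flaky call with delays of base, 2x, 4x, ... between
    # tries, re-raising the last error once attempts are exhausted.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Example: a call that succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # ok
```

In a real agent you would wrap the transcriber reconnect and the LLM/TTS requests this way, and cap total retry time so the caller is never left in silence.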