Tools: How to Easily Build a Voice Agent with AssemblyAI
What Is an AI Voice Agent?
How AI Voice Agents Work
Core Components
Performance Requirements
Common Use Cases
Implementation Guide
Step 1: Environment Setup
Step 2: Audio Capture Class
Step 3: Streaming Speech-to-Text Integration
Step 4: LLM Response Generation
Step 5: Text-to-Speech
Complete Working Code
Production Considerations
Frequently Asked Questions

What Is an AI Voice Agent?

Voice agents are software systems that engage in natural speech conversations with users. Unlike traditional phone menus that require button presses, these agents process speech as you talk, understanding your words before you finish speaking.

How AI Voice Agents Work

The system processes conversations through a real-time pipeline with a target response time under one second.

1. Streaming Speech-to-Text. AssemblyAI's Universal-3 Pro Streaming achieves approximately 94% accuracy across varying audio conditions, and accuracy thresholds matter: a few percentage points separate a frustrating agent from a natural one.

2. LLM and Orchestration Layer. The language model serves as the agent's "brain," managing the conversation from understanding intent through producing each reply.

3. Text-to-Speech Synthesis. ElevenLabs, Google Cloud, and OpenAI offer natural-sounding voices. The key to real-time flow is starting speech generation before the language model has finished writing the complete response.

4. Integration and Business Logic. Voice agents connect to existing systems (CRM, calendars, payment processors, inventory management), with security considerations around API keys, encryption, and user authentication.

Run the complete script with: python voice_agent.py

Production agents are held to concrete targets for end-to-end response time, transcription accuracy, and voice quality. Beyond prototypes, infrastructure scaling involves WebSocket connection pooling, load balancing, database integration, and comprehensive monitoring.

Frequently Asked Questions

What response time targets matter?
Target under 1000ms total: ~200-400ms for speech recognition, ~300-600ms for language processing, ~200-400ms for synthesis, and ~100-200ms for network delays.

How do I handle user interruptions?

Implement barge-in detection by monitoring the audio stream during agent responses and stopping text-to-speech when user speech is detected. AssemblyAI's streaming API enables smooth interruption handling.

Which language model is best?

GPT-4 handles complex conversations requiring nuanced understanding; GPT-3.5 Turbo works for simpler interactions with lower latency and cost.

Do I need custom speech recognition models?

Modern pre-trained models handle most use cases without custom training. Universal-3 Pro achieves production accuracy out of the box. Use custom vocabulary features for specialized terminology.

How do I integrate with phone systems?
Cloud telephony providers like Twilio offer the easiest integration, or implement SIP trunking for direct infrastructure connection.
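As one illustration of the Twilio route, a call is typically handed off to your agent by returning TwiML whose Connect/Stream verb forwards the call's audio to a WebSocket server running the voice pipeline. A minimal sketch; the URL is a placeholder, not something from this article:

```python
def media_stream_twiml(ws_url: str) -> str:
    # Build a TwiML response that streams the caller's audio to a
    # WebSocket endpoint (where the STT -> LLM -> TTS loop would run).
    return (
        '<?xml version="1.0" encoding="UTF-8"?>'
        "<Response>"
        "<Connect>"
        f'<Stream url="{ws_url}" />'
        "</Connect>"
        "</Response>"
    )

print(media_stream_twiml("wss://example.com/agent-audio"))
```

Your webhook handler would return this XML for incoming calls; the WebSocket server then receives base64-encoded audio frames to feed into the transcriber.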
mkdir voice-agent
cd voice-agent
python -m venv venv
source venv/bin/activate  # Mac/Linux
# venv\Scripts\activate   # Windows
pip install assemblyai openai elevenlabs websockets pyaudio python-dotenv
ASSEMBLYAI_API_KEY=your_key
OPENAI_API_KEY=your_key
ELEVENLABS_API_KEY=your_key
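A missing key otherwise surfaces later as a confusing authentication error deep in the pipeline, so it can help to fail fast at startup. A small sketch using os.environ directly (load_dotenv() from python-dotenv would populate the environment from the .env file first); the helper name is illustrative:

```python
import os

REQUIRED_KEYS = ["ASSEMBLYAI_API_KEY", "OPENAI_API_KEY", "ELEVENLABS_API_KEY"]

def check_api_keys():
    # Return the names of any required keys that are unset or empty.
    return [k for k in REQUIRED_KEYS if not os.getenv(k)]

missing = check_api_keys()
print("Missing keys:", missing or "none")
```

In the agent itself you would raise or exit when the list is non-empty, before opening any streaming connections.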
class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()
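Note what chunk_size implies for latency: at 16 kHz, 8000 frames is half a second of audio per read, which sets a floor on how fresh the audio handed to the recognizer can be. The arithmetic:

```python
SAMPLE_RATE = 16000   # samples per second, as in AudioCapture
CHUNK_SIZE = 8000     # frames per buffer, as in AudioCapture

chunk_ms = CHUNK_SIZE / SAMPLE_RATE * 1000
print(f"Each chunk holds {chunk_ms:.0f} ms of audio")

# A smaller buffer trades more per-read overhead for lower capture
# latency; e.g. 1600 frames at 16 kHz would be 100 ms per read.
```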
def handle_transcript(self, transcript: aai.RealtimeTranscript):
    if not transcript.text:
        return
    if isinstance(transcript, aai.RealtimeFinalTranscript):
        print(f"You: {transcript.text}")
        self.conversation_history.append(
            {"role": "user", "content": transcript.text}
        )
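The isinstance check is what keeps partial hypotheses out of the conversation history: a streaming recognizer emits revisable partial transcripts while you speak and a final transcript once you pause. The filtering logic can be illustrated with stand-in classes (these only mimic the shape of the SDK types; they are not the AssemblyAI SDK):

```python
# Stand-in transcript types that mimic the partial/final distinction
# of a streaming STT API; not the real AssemblyAI classes.
class PartialTranscript:
    def __init__(self, text):
        self.text = text

class FinalTranscript(PartialTranscript):
    pass

def collect_finals(events, history):
    # Append only finalized, non-empty utterances, mirroring
    # handle_transcript above.
    for t in events:
        if not t.text:
            continue
        if isinstance(t, FinalTranscript):
            history.append({"role": "user", "content": t.text})
    return history

events = [
    PartialTranscript("book a"),
    PartialTranscript("book a table"),
    FinalTranscript("Book a table for two."),
    FinalTranscript(""),
]
print(collect_finals(events, []))
```

Only the finalized sentence reaches the history; the intermediate partials are discarded rather than triggering duplicate LLM calls.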
def generate_and_speak_response(self):
    messages = [{
        "role": "system",
        "content": "Keep responses conversational and concise; aim for 1-2 sentences."
    }] + self.conversation_history
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=messages,
        temperature=0.7,
        max_tokens=150
    )
    ai_response = response.choices[0].message.content
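Because the full conversation_history is sent on every turn, long sessions steadily grow the prompt, and with it latency and token cost. One common mitigation, sketched here as an illustration rather than part of the original script, is to keep only the most recent turns (the system prompt is prepended separately, so it is never trimmed away):

```python
def trim_history(history, max_turns=10):
    # Keep only the last `max_turns` messages before building the
    # request; the system prompt is added per-request and unaffected.
    return history[-max_turns:]

history = [{"role": "user", "content": f"msg {i}"} for i in range(25)]
print(len(trim_history(history)))  # 10
```

More sophisticated agents summarize older turns instead of dropping them, but a sliding window is often enough for short task-oriented calls.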
def speak_text(self, text):
    audio = elevenlabs_client.generate(
        text=text,
        voice="Rachel",
        model="eleven_monolingual_v1"
    )
    stream(audio)
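To start playback before the whole reply is synthesized, a common trick is to split the reply at sentence boundaries and hand each sentence to TTS as soon as it is complete. A minimal splitter, shown as an illustration (a production agent would stream LLM tokens and cut on punctuation as they arrive):

```python
import re

def sentence_chunks(text):
    # Split on sentence-ending punctuation so each piece can be sent
    # to TTS while the rest of the reply is still being generated.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

print(sentence_chunks("Sure. Your order shipped yesterday! Anything else?"))
```

Feeding speak_text one chunk at a time lets the first sentence play while later ones are still in flight, shaving perceived latency.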
import os
import threading
from queue import Queue, Empty

from dotenv import load_dotenv
import assemblyai as aai
from openai import OpenAI
from elevenlabs.client import ElevenLabs
from elevenlabs import stream
import pyaudio

load_dotenv()
aai.settings.api_key = os.getenv('ASSEMBLYAI_API_KEY')
openai_client = OpenAI(api_key=os.getenv('OPENAI_API_KEY'))
elevenlabs_client = ElevenLabs(api_key=os.getenv('ELEVENLABS_API_KEY'))


class AudioCapture:
    def __init__(self, sample_rate=16000):
        self.sample_rate = sample_rate
        self.chunk_size = 8000
        self.audio_queue = Queue()
        self.recording = False
        self.audio = pyaudio.PyAudio()
        self.stream = None

    def start_recording(self):
        self.recording = True
        self.stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )
        thread = threading.Thread(target=self._capture_audio, daemon=True)
        thread.start()

    def _capture_audio(self):
        while self.recording:
            try:
                self.audio_queue.put(
                    self.stream.read(self.chunk_size, exception_on_overflow=False)
                )
            except Exception as e:
                print(f"Audio error: {e}")

    def get_audio_data(self):
        # Non-blocking read; returns None when no audio is buffered yet
        try:
            return self.audio_queue.get(timeout=0.1)
        except Empty:
            return None

    def stop_recording(self):
        self.recording = False
        if self.stream:
            self.stream.stop_stream()
            self.stream.close()
        self.audio.terminate()


class VoiceAgent:
    def __init__(self):
        self.conversation_history = []
        self.is_processing = False
        self.audio_capture = AudioCapture()

    def handle_transcript(self, transcript: aai.RealtimeTranscript):
        if not transcript.text:
            return
        if isinstance(transcript, aai.RealtimeFinalTranscript):
            print(f"You: {transcript.text}")
            self.conversation_history.append(
                {"role": "user", "content": transcript.text}
            )
            if not self.is_processing:
                self.is_processing = True
                threading.Thread(
                    target=self.generate_and_speak_response, daemon=True
                ).start()

    def generate_and_speak_response(self):
        try:
            messages = [{
                "role": "system",
                "content": "You are a helpful voice assistant. Keep responses conversational."
            }] + self.conversation_history
            response = openai_client.chat.completions.create(
                model="gpt-4",
                messages=messages,
                temperature=0.7,
                max_tokens=150
            )
            ai_response = response.choices[0].message.content
            print(f"Agent: {ai_response}")
            self.conversation_history.append(
                {"role": "assistant", "content": ai_response}
            )
            audio = elevenlabs_client.generate(
                text=ai_response,
                voice="Rachel",
                model="eleven_monolingual_v1"
            )
            stream(audio)
        finally:
            self.is_processing = False

    def start_conversation(self):
        self.transcriber = aai.RealtimeTranscriber(
            sample_rate=16000,
            on_data=self.handle_transcript,
            on_error=lambda e: print(f"Speech error: {e}")
        )
        try:
            self.transcriber.connect()
            self.audio_capture.start_recording()
            print("Voice Agent ready - start speaking!")
            while True:
                audio_chunk = self.audio_capture.get_audio_data()
                if audio_chunk:
                    self.transcriber.stream(audio_chunk)
        except KeyboardInterrupt:
            self.audio_capture.stop_recording()
            self.transcriber.close()


if __name__ == "__main__":
    VoiceAgent().start_conversation()
Core pipeline components:
- Real-time streaming speech-to-text
- Language model comprehension
- Text-to-speech synthesis
- Conversation flow orchestration

Accuracy thresholds:
- Below 90%: Users experience frustration
- 90-93%: Functional but with occasional errors
- 93%+: Natural conversations with rare corrections

The orchestration layer manages:
- Intent recognition
- Context tracking
- Function calling for system integration
- Response generation

Target metrics for production agents:
- Total response time: <1000ms
- Speech accuracy: 93%+
- Name recognition: 95%+
- Number accuracy: 95%+
- Voice quality: Human-like

Common use cases:
- Customer support automation: Answer FAQs, check order status, escalate complex issues
- Appointment scheduling: Check availability, confirm details, send confirmations
- Lead qualification: Gather information, understand needs, route appropriately
- After-hours service: Extend availability beyond business hours

Key production requirements:
- Telephony integration (Twilio, SIP trunking)
- Concurrent conversation handling
- Error recovery for network failures
- Performance monitoring and metrics
- Security compliance and data protection
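For the error-recovery item above, transient failures (a dropped WebSocket, a timed-out API call) are usually handled with bounded retries and exponential backoff. A generic sketch, not tied to any particular SDK; the helper name is illustrative:

```python
import time

def retry_with_backoff(fn, attempts=4, base_delay=0.5):
    # Retry a flaky call with delays of base, 2x, 4x, ... between
    # tries, re-raising the last error once attempts are exhausted.
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))

# Example: a call that succeeds on the third attempt
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

print(retry_with_backoff(flaky, base_delay=0.01))  # ok
```

In a real agent you would wrap the transcriber reconnect and the LLM/TTS requests this way, and cap total retry time so the caller is never left in silence.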