# Build an Agora voice agent with AssemblyAI's Voice Agent API (2026)

- Why combine Agora and the Voice Agent API
- Architecture
- Prerequisites
- Quick start
  1. Clone and install
  2. Configure credentials
  3. Run the bot
  4. Connect a client
- How it works
  1. Connect to the Voice Agent API
  2. Pull caller audio out of Agora
  3. Stream audio to AssemblyAI
  4. Publish the reply back into Agora
  5. Handle barge-in
- Tuning
  - Pick a different voice
  - Adjust turn detection
  - Boost domain-specific words
  - Add tools
- Troubleshooting
- Known limitations
- Frequently asked questions

## Why combine Agora and the Voice Agent API

Agora gives you battle-tested WebRTC: low-latency audio routing across 200+ countries, automatic codec negotiation, jitter buffers, NAT traversal, and SDKs for every client platform. The Voice Agent API is the AI brain in one connection.

## Architecture

The system has three layers: clients connected to an Agora channel, a server-side Python bot that joins the same channel as a participant, and the Voice Agent API, which the bot talks to over a single WebSocket. The bot resamples between Agora's 16 kHz and the Voice Agent API's 24 kHz using SciPy's polyphase filter. Both sides use PCM16 mono.

## Prerequisites

- Python 3.10+
- An Agora project with an App ID (and App Certificate if enabled)
- An AssemblyAI API key — free tier available
- Linux or macOS (the Agora native server SDK does not officially ship Windows wheels; use WSL2 or a Linux container on Windows)

## Quick start

### 1. Clone and install

```shell
git clone https://github.com/kelsey-aai/voice-agent-agora
cd voice-agent-agora
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
```

### 2. Configure credentials

```shell
cp .env.example .env
```

```
ASSEMBLYAI_API_KEY=your_assemblyai_key
AGORA_APP_ID=your_agora_app_id
AGORA_APP_CERTIFICATE=your_agora_app_certificate
AGORA_CHANNEL=voice-agent-demo
AGORA_BOT_UID=9999
```

If your Agora project has App Certificate disabled, leave `AGORA_APP_CERTIFICATE` blank.

### 3. Run the bot

```shell
python bot.py --channel voice-agent-demo
```

### 4. Connect a client

Open Agora's Web demo, enter your App ID, the same channel name, and a different UID, then click Join. Speak — the bot transcribes you live, the LLM replies, and the synthesized voice plays back through your browser.

## How it works

The bridge is two cooperating asyncio tasks — one pulling caller audio out of Agora and pushing it to AssemblyAI, the other pulling reply audio out of AssemblyAI and pushing it back into Agora.

### 1. Connect to the Voice Agent API

`session.update` is the first message and configures personality, greeting, and voice. The default audio format is `audio/pcm` — 24 kHz, 16-bit signed LE, mono.

```python
URL = "wss://agents.assemblyai.com/v1/ws"
headers = {"Authorization": f"Bearer {API_KEY}"}

async with websockets.connect(URL, additional_headers=headers) as ws:
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "system_prompt": "You are a friendly voice assistant.",
            "greeting": "Hi — I just joined the call.",
            "input": {"format": {"encoding": "audio/pcm"}},
            "output": {"voice": "ivy", "format": {"encoding": "audio/pcm"}},
        },
    }))
```

### 2. Pull caller audio out of Agora

The bot registers an `IAudioFrameObserver` whose `on_playback_audio_frame_before_mixing` hook fires every 10 ms with one participant's audio frame. We resample 16 kHz → 24 kHz with SciPy's polyphase filter:

```python
def on_playback_audio_frame_before_mixing(self, channel_id, uid, frame):
    pcm16 = bytes(frame.buffer)  # 16 kHz PCM16
    pcm24 = resample_pcm16(pcm16, 16_000, 24_000)
    loop.call_soon_threadsafe(agent.inbound_audio.put_nowait, pcm24)
    return 0
```

`call_soon_threadsafe` is required because Agora's observer runs on a native C++ thread, not the asyncio loop.

### 3. Stream audio to AssemblyAI

The uplink task pulls each chunk off the queue, base64-encodes it, and sends it as an `input.audio` message:

```python
chunk = await mic_queue.get()
await ws.send(json.dumps({
    "type": "input.audio",
    "audio": base64.b64encode(chunk).decode(),
}))
```

### 4. Publish the reply back into Agora

When `reply.audio` events arrive, we decode the base64 PCM, resample 24 kHz → 16 kHz, and hand it to `AudioPcmDataSender`:

```python
# WebSocket reader: queue the decoded reply audio
elif t == "reply.audio":
    pcm24 = base64.b64decode(event["data"])
    await self.outbound_audio.put(pcm24)
```

```python
# Playback task: resample to 16 kHz and publish into Agora
pcm24 = await self.outbound_audio.get()
pcm16 = resample_pcm16(pcm24, 24_000, 16_000)
self.pcm_sender.send_audio_pcm_data(
    pcm16, 0, len(pcm16) // 2, 2, 1, 16_000,
)
```

We pace the pushes to wall-clock time so a long reply doesn't blast into Agora's buffer in one go — that keeps barge-in responsive.

### 5. Handle barge-in

When the caller talks over the bot, the Voice Agent API sends `reply.done` with `status: "interrupted"`, and the bridge flushes its outbound queue so playback stops immediately:

```python
elif t == "reply.done" and event.get("status") == "interrupted":
    while not outbound_audio.empty():
        outbound_audio.get_nowait()
```

The Voice Agent API also trims the `transcript.agent` event to what the bot actually got out before it was cut off — useful for accurate logging.

## Tuning

### Pick a different voice

```python
"output": {"voice": "james"}   # conversational US male
"output": {"voice": "sophie"}  # clear UK female
"output": {"voice": "diego"}   # Latin American Spanish
"output": {"voice": "arjun"}   # Hindi/Hinglish
```

Browse the full Voices catalog. Multilingual voices code-switch with English automatically.

### Adjust turn detection

```python
"input": {
    "turn_detection": {
        "vad_threshold": 0.5,
        "min_silence": 600,
        "max_silence": 1500,
        "interrupt_response": True,
    }
}
```

### Boost domain-specific words

```python
"input": {"keyterms": ["AssemblyAI", "Agora", "Universal-3"]}
```

### Add tools

Register functions on `session.tools` to let the agent look up data, hit APIs, or trigger workflows. Full pattern in the tool calling docs.

## Troubleshooting

- **`agora-python-server-sdk` install fails on macOS.** The package ships pre-built C++ wheels for Linux and macOS. If pip falls back to a source build, install the Xcode command-line tools (`xcode-select --install`) or run the bot in a Linux container.
- **Bot joins but stays silent.** Check that your client connected with the same `AGORA_CHANNEL` name and a different UID than `AGORA_BOT_UID`. Agora rejects duplicate UIDs.
- **UNAUTHORIZED close from AssemblyAI.** API key missing, expired, or wrong. Pull a fresh one from the AssemblyAI dashboard.
- **Audio sounds chipmunky or sluggish.** Sample-rate mismatch. Confirm `set_playback_audio_frame_before_mixing_parameters(channels=1, sample_rate_hz=16000)` and that resampling is on between Agora's 16 kHz and the API's 24 kHz.
- **Bot interrupts itself.** An acoustic loop somewhere — usually one client has speakers and mic open without echo cancellation. Browser clients should request `getUserMedia({ audio: { echoCancellation: true } })`.
- **Token errors from Agora.** If your project has App Certificate enabled, `AGORA_APP_CERTIFICATE` must be set and the bot UID + channel name must match what you signed.

Full troubleshooting guide: Voice Agent API docs.

## Known limitations

- `agora-python-server-sdk` is a beta wrapper around Agora's native C++ SDK. Class layouts have moved between minor versions. We pin 2.2.4 and document the exact API surface the bot uses.
- No Windows wheels. Run inside WSL2 or a Linux Docker container.
- Agora's recommended path for new voice-agent projects is the Conversational AI Engine, a hosted REST service. Use this tutorial when you want the full AI pipeline on AssemblyAI's Voice Agent API.

## Frequently asked questions

### What is the AssemblyAI Voice Agent API?

A single WebSocket endpoint that handles the entire voice agent pipeline server-side — speech recognition on Universal-3 Pro Streaming, LLM reasoning, and TTS with 30+ voices. It includes neural turn detection, barge-in, and tool calling.

### How do I connect the Voice Agent API to Agora?

Run a server-side bot with `agora-python-server-sdk`. The bot joins the Agora channel, registers an `IAudioFrameObserver` to capture caller audio (16 kHz PCM), resamples to 24 kHz, and forwards each chunk to the Voice Agent API. Reply audio comes back, gets resampled to 16 kHz, and is published via `AudioPcmDataSender`.

### Can I use Agora's Conversational AI Engine instead?

Yes — it supports AssemblyAI as the STT provider, but uses Agora's LLM and TTS layers. Use this tutorial when you want the full AI pipeline on AssemblyAI's Voice Agent API.

### What audio format does it use with Agora?

The Voice Agent API defaults to `audio/pcm` at 24 kHz. Agora delivers 16 kHz PCM, so the bot resamples 16 kHz ↔ 24 kHz on each side using SciPy's polyphase filter.

### How does barge-in work?

The Voice Agent API emits `reply.done` with `status: "interrupted"`. The bridge flushes its outbound audio queue so the bot stops talking immediately.

### Do I need an Agora App Certificate?

Only if your Agora project has it enabled. If so, set `AGORA_APP_CERTIFICATE` in `.env`. If disabled, leave it blank.

### How much does it cost?

AssemblyAI offers a free tier. For current pricing, see the AssemblyAI pricing page.
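The bridge code calls a `resample_pcm16` helper without ever defining it. Here is a minimal sketch of one possible implementation, built on SciPy's `resample_poly` (the polyphase filter the article refers to); the function name matches the snippets, but the body is an assumption, not the repo's actual code:

```python
import math

import numpy as np
from scipy.signal import resample_poly


def resample_pcm16(pcm: bytes, src_hz: int, dst_hz: int) -> bytes:
    """Resample mono PCM16 bytes between sample rates with a polyphase filter."""
    samples = np.frombuffer(pcm, dtype=np.int16)
    g = math.gcd(src_hz, dst_hz)
    out = resample_poly(samples, dst_hz // g, src_hz // g)
    # Clip before narrowing back to int16: the filter can overshoot slightly.
    return np.clip(out, -32768, 32767).astype(np.int16).tobytes()
```

A 10 ms Agora frame is 160 samples at 16 kHz, so upsampling to 24 kHz yields 240 samples, which keeps the frame cadence intact in both directions.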
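The article mentions pacing the reply pushes to wall-clock time but never shows the loop. One way to sketch it, assuming 10 ms frames; `playback_task` and `send_frame` are illustrative names, not identifiers from the repo:

```python
import asyncio
import time

FRAME_SEC = 0.01  # one 10 ms frame per iteration


async def playback_task(outbound_audio: asyncio.Queue, send_frame) -> None:
    """Drain reply audio at real-time speed so barge-in can cut playback short."""
    next_push = time.monotonic()
    while True:
        frame = await outbound_audio.get()
        send_frame(frame)
        # Schedule relative to the previous deadline, not "now", so small
        # scheduling delays don't accumulate into drift.
        next_push = max(next_push + FRAME_SEC, time.monotonic())
        await asyncio.sleep(next_push - time.monotonic())
```

Because the queue only ever holds audio that hasn't been played yet, a barge-in flush of that same queue silences the bot within roughly one frame.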
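The "two cooperating asyncio tasks" the article describes can be wired together with `asyncio.gather`. This is a sketch of the task topology only, with the Agora and WebSocket specifics abstracted behind two injected async callables (`ws_send` and `agora_send` are hypothetical names):

```python
import asyncio


async def run_bridge(inbound_audio: asyncio.Queue,
                     outbound_audio: asyncio.Queue,
                     ws_send, agora_send) -> None:
    """Run the uplink and downlink halves of the bridge concurrently."""

    async def uplink():
        # Caller audio: Agora observer -> queue -> Voice Agent API WebSocket.
        while True:
            await ws_send(await inbound_audio.get())

    async def downlink():
        # Reply audio: Voice Agent API -> queue -> Agora PCM sender.
        while True:
            await agora_send(await outbound_audio.get())

    await asyncio.gather(uplink(), downlink())
```

If either half raises (for example when the WebSocket closes), `gather` propagates the exception out of `run_bridge`, which makes it a convenient single place to tear the session down.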