Tools: Building Viva: A Real-Time AI Interview Coach with Gemini Live API

2026-03-08 admin

Source: Dev.to

TL;DR: I built Viva, a real-time AI interview coach that listens to your answers via bidirectional audio streaming and watches your body language through your webcam, all powered by Google's Gemini Live API and Gemini Vision, deployed on Cloud Run.

## The Problem

Job seekers practice interviews alone with zero feedback. You can record yourself on your phone and watch it back, but that doesn't tell you about your filler words, pacing, eye contact, or posture in real time. Human coaches cost $100-300 per session.

## What Viva Does

Viva is a full-stack interview coaching application that provides real-time feedback on both verbal answers and body language:

- Live audio conversation: bidirectional audio streaming via Gemini Live API (gemini-2.5-flash-native-audio-latest). The AI interviewer asks questions, listens to your answers, and responds naturally. You can interrupt mid-sentence (barge-in).
- Body language coaching: webcam frames analyzed every 2 seconds via Gemini Vision (gemini-2.5-flash). You get feedback on eye contact, posture, facial expressions, and confidence.
- Speech pattern tracking: filler word detection ("um", "uh", "like"), pace analysis, confidence scoring.
- Answer scoring: each answer scored on relevance, clarity, and depth.
- Post-interview report: full scorecard with per-question breakdown and aggregate stats.

## Architecture

```
Browser (Next.js)                   Google Cloud
├─ Mic → AudioWorklet               ┌──────────────────┐
│    → PCM 16kHz ──WebSocket──►     │ Cloud Run        │
│                                   │ (FastAPI)        │
│                                   │                  │
│    ◄── PCM 24kHz ◄──────────      │ Gemini Live API  │
│    → PcmPlayer → Speaker          │ (bidi audio)     │
│                                   │                  │
├─ Camera → JPEG frames             │ Gemini Vision    │
│    → POST /api/analyze-frame ──►  │ (body language)  │
│                                   │                  │
└─ Score/Report ◄──────────────     │ ADK Agent Tools  │
                                    └──────────────────┘
```

## The Gemini Live API Pipeline

The core of Viva is the bidirectional audio pipeline. Here's how it works:

1. Browser captures mic audio using Web Audio API's AudioWorklet. The worklet converts Float32 samples to 16-bit PCM at 16kHz.
2. PCM chunks stream over WebSocket to the FastAPI backend.
3. Backend forwards to Gemini Live API using the Google GenAI SDK:

   ```python
   from google import genai
   from google.genai import types

   # The wrapper name run_interview is illustrative; client and system_prompt
   # are created elsewhere in the backend.
   async def run_interview(client: genai.Client, system_prompt: str) -> None:
       config = types.LiveConnectConfig(
           response_modalities=["AUDIO"],  # the interviewer answers with speech
           system_instruction=types.Content(
               parts=[types.Part(text=system_prompt)]
           ),
           speech_config=types.SpeechConfig(
               voice_config=types.VoiceConfig(
                   prebuilt_voice_config=types.PrebuiltVoiceConfig(voice_name="Kore")
               )
           ),
       )
       # live.connect() is an async context manager in the google-genai SDK,
       # so the session is entered with `async with` rather than awaited.
       async with client.aio.live.connect(
           model="gemini-2.5-flash-native-audio-latest",
           config=config,
       ) as session:
           ...  # relay audio between the browser WebSocket and the session
   ```

4. Gemini responds with audio: the AI interviewer's voice streams back as 24kHz PCM.
5. Browser plays the response through a custom PcmPlayer that buffers and schedules audio chunks for smooth playback.

The barge-in capability is built into the Live API: when the user starts speaking while the AI is talking, the AI stops and listens.

## Body Language Analysis

Every 2 seconds, the frontend captures a JPEG frame from the webcam, downscales it to 640x480, and sends it to the backend. The backend uses Gemini Vision to analyze:

- Eye contact (looking at camera vs. looking away)
- Posture (sitting straight, slouching, leaning)
- Facial expressions (smiling, nervous, neutral)
- Hand gestures

The analysis is returned as structured coaching tips that appear as a live overlay on the interview screen.

## ADK Agent Tools

The backend uses Google's Agent Development Kit (ADK) with four tools.

## Infrastructure as Code

The entire deployment is automated via a single deploy.sh script:

```bash
./deploy.sh
```

This handles API enablement, Secret Manager setup, container building, Cloud Run deployment, and optional Vercel frontend deployment.

## Mock Mode

Viva runs fully without a Gemini API key: all AI features fall back to realistic mock responses. This makes local development and testing seamless.

## Try It

- GitHub: https://github.com/astraedus/viva
- Live Demo: https://viva-api-93135657352.us-central1.run.app

Built for the Gemini Live Agent Challenge. #GeminiLiveAgentChallenge

Built with Gemini Live API, Gemini Vision API, Google ADK, FastAPI, Next.js, and Cloud Run.
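The speech pattern tracking mentioned earlier (filler word detection and pace analysis) can be approximated on a plain-text transcript. The following is a minimal sketch, not Viva's actual implementation: the function name `analyze_speech`, the `FILLER_WORDS` tuple, and the shape of the returned dictionary are all hypothetical, and it assumes a transcript and an answer duration are already available.

```python
import re
from collections import Counter

# Hypothetical filler list; the article names "um", "uh", and "like" as examples.
FILLER_WORDS = ("um", "uh", "like")


def analyze_speech(transcript: str, duration_seconds: float) -> dict:
    """Count filler words and estimate speaking pace for a single answer."""
    # Lowercase and split into words, ignoring punctuation.
    words = re.findall(r"[a-z']+", transcript.lower())
    counts = Counter(w for w in words if w in FILLER_WORDS)
    # Words per minute; guard against a zero or negative duration.
    if duration_seconds > 0:
        wpm = round(len(words) * 60 / duration_seconds, 1)
    else:
        wpm = 0.0
    return {
        "word_count": len(words),
        "filler_counts": dict(counts),
        "filler_total": sum(counts.values()),
        "words_per_minute": wpm,
    }


if __name__ == "__main__":
    sample = "Um, I think, like, the project was, uh, like really good"
    print(analyze_speech(sample, duration_seconds=6.0))
```

Note that a plain word match flags every "like", including legitimate uses ("I like the team"), so a real coach would probably need context from the model or the audio itself to avoid false positives.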

🏷️ Tags

how-to, tutorial, guide, dev.to, ai, ml, git, github