# Tools: I Made a Product Demo Video Entirely with AI
2026-03-02
admin
I needed a demo video for an RFP automation platform I'm building. The typical approach: record your screen, stumble through clicks, re-record when something breaks, then spend an hour in a video editor syncing voiceover. I've done it before. It's painful.

So I tried a different approach: let AI do the whole thing.

The result: a 3-minute narrated demo with 14 scenes, variable-speed segments, and subtitles. No video editor. No screen-recording app. No manual voiceover.

🎥 Watch the video demo on Loom

Here's how the whole process worked — including the parts that broke.

## The Pipeline

```
Playwright (record) → edge-tts (voice) → ffmpeg (assemble) → Gemini (QA)
```

## Step 1: Claude Code Writes the Recorder

I described what I wanted: a demo video using Playwright to record the browser, with text-to-speech for coordinated voiceover, covering the full application workflow. Claude Code decided the structure — 14 scenes from dashboard to API docs — wrote the scene scripts, and generated the modules. Each one is a TypeScript function that drives the browser through a specific feature:

```typescript
export const scene: SceneFn = async (page) => {
  await page.getByText("Generate Answer").click();

  // Wait for AI to finish
  const indicator = page.getByText("Generating AI answer...");
  await indicator.waitFor({ state: "hidden", timeout: 90_000 });

  // Scroll through the answer smoothly
  const scrollPanel = page.locator("[data-demo-scroll=true]");
  await scrollPanel.evaluate((el) => {
    el.scrollTo({ top: el.scrollHeight / 3, behavior: "smooth" });
  });

  await longPause(page);
};
```

A test wrapper runs all 14 scenes in sequence, recording timestamps:

```typescript
for (const id of sceneIds) {
  const start = (Date.now() - videoStartTime) / 1000;
  await SCENES[id](page);
  timestamps.push({ id, start, end: (Date.now() - videoStartTime) / 1000 });
}
```

Output: one `.webm` video plus a `scene-timestamps.json` file. This separation is key — it lets us manipulate each scene independently during assembly.

## Step 2: AI Voice with edge-tts

Each scene has a narration line. edge-tts turns them into MP3 files using Microsoft's neural TTS — free, no API key, surprisingly natural:

```bash
edge-tts --text "Let's generate an AI answer..." \
  --voice en-US-GuyNeural \
  --write-media voice/08-generate-answer.mp3
```

14 scenes, 30 seconds to generate all the voices. Claude Code wrote the narration script too — I reviewed and tweaked the phrasing, but the drafting was AI.

## Step 3: Assembly — Where Everything Broke

Claude Code also wrote the assembly script. In theory, it's simple: split the video by timestamps, overlay voice, concatenate. In practice, this is where I spent most of the iteration time with Claude.

## Variable Speed: 30 Seconds of Spinner → 2 Seconds

Nobody wants to watch an AI loading spinner for 30 seconds. The solution: per-scene speed segments.

```typescript
{
  id: "08-generate-answer",
  speed: [
    { from: 0, to: 1, speed: 1 },   // Click button at normal speed
    { from: 1, to: 6, speed: 15 },  // AI generation: 30s → 2s
    { from: 6, to: 10, speed: 1 },  // Read the answer at normal speed
  ],
}
```

The `from`/`to` values are proportional (a 0–10 scale). ffmpeg applies this via a split/trim/setpts/concat filtergraph. The result: boring waits are compressed 15x while meaningful interactions play at real speed.

## The Audio Sync Nightmare

This is the lesson that took the most iterations to learn. When merging voice with video per scene, then concatenating, two things go wrong.

**Problem 1:** ffmpeg's `-shortest` flag silently truncates the longer stream. Voice gets cut mid-sentence.

**Problem 2 (the nasty one):** ffmpeg starts each concatenated clip's audio where the previous clip's audio ended, not where the video starts. If clip A has 30s of video but only 13s of audio, clip B's audio starts at t=13 instead of t=30. This causes progressive drift — by scene 10, the voice is over a minute behind the visuals.

The fix: every clip's audio track must exactly match its video duration:

```bash
ffmpeg -i video.mp4 -i voice.mp3 \
  -filter_complex "[1:a]adelay=500|500,apad=whole_dur=VIDEO_DURATION[audio]" \
  -map 0:v -map "[audio]" -c:v copy -c:a aac output.mp4
```

`apad=whole_dur` pads the audio with silence to exactly match the video length. No drift possible.

## Gemini 3.1 Pro as Video QA

Here's where it got really interesting. After fixing the pipeline, I needed to verify audio/video sync across 14 scenes. Watching the whole video manually each time is tedious, and my ears aren't reliable after the 10th iteration. So I uploaded the video to Gemini 3.1 Pro and asked it to analyze the synchronization — which actions happen visually vs. when the narration describes them.

### On the Broken Version

Gemini caught every single sync issue with precise timestamps: classic progressive drift, each scene's audio shifting further behind because the previous scene's audio track was shorter than its video.

### On the Fixed Version

After applying the `apad` fix, Gemini's analysis: 13 scenes in perfect sync, 1 flagged as ~2 seconds late. The flagged scene was actually fine — I'd intentionally added a 1.5-second voice delay to let a visual transition settle before narration began. Gemini was being slightly over-strict.

Score: 0 missed issues, 1 false positive out of 14 scenes. That's better QA than I'd get from watching the video myself.

## The Human-AI Feedback Loop

The process wasn't "ask Claude once, get a perfect video." It was iterative:

- **Me:** "I want a demo video using Playwright with TTS, covering the full workflow"
- **Claude:** Decides on 14 scenes, generates the pipeline. First recording works.
- **Me:** "The voice is desynced from scene 5 onwards"
- **Claude:** Debugs, discovers the `-shortest` issue. Fixes with `apad`.
- **Me:** "The answer doesn't scroll — you can't see the bullet points"
- **Claude:** Investigates the DOM, finds the wrong scroll container. Fixes with programmatic parent discovery.
- **Me:** "Still no bullet points in the generated answer"
- **Claude:** Tests via the API, finds the AI returns plain text despite HTML instructions. Adds a `normalizeAnswerHtml` post-processor.
- **Me:** "The KB upload scene has too much dead time, and the style-guide voice starts too early"
- **Claude:** Increases speed compression from 4x to 8x, adds a 2s voice delay to the style-guide scene.

Each round: I watch the video, describe what's wrong in plain language, and Claude debugs and fixes. The feedback loop is fast because re-recording takes 4 minutes and assembly takes 1 minute.

## What AI Did vs. What I Did

- **Claude Code** wrote the entire recording pipeline — Playwright scripts, ffmpeg assembly, speed control, subtitle generation
- **edge-tts** generated the narration with Microsoft's neural voices
- **Gemini 3.1 Pro** reviewed the final video for audio/video sync issues

The creative direction was mine. Everything else was AI. If the UI changes tomorrow, I update one scene file and re-run. The entire pipeline is version-controlled and reproducible:

```bash
# Record 14 scenes
npx playwright test e2e/tests/demo-record.spec.ts --headed   # ~4 min

# Generate voices
npx tsx scripts/demo-record.ts voice                         # ~30 sec

# Assemble with speed control + subtitles
npx tsx scripts/demo-record.ts assemble                      # ~1 min
```

## The Final Numbers

- **Output:** 3:47 narrated video, 14 scenes, variable speed, soft subtitles
- **Pipeline code:** ~800 lines of TypeScript (assembly) + ~200 lines (scenes)
- **Re-record time:** under 6 minutes end-to-end
- **Video editors used:** zero

## Honest Trade-off: Manual vs. Automated

For the first iteration, manually recording your screen while narrating would be faster. A screen-recording tool gives you a video in real time — no pipeline to build. But manual recording has its own costs: you need a quiet environment and a decent microphone, any stumble means re-recording, editing voiceover timing in a video editor is tedious, and audio quality depends entirely on your hardware.

The automated approach pays off from the second iteration onward. When the UI changed, I updated one scene file and re-ran. When the narration needed tweaking, I edited a text string — no re-recording my voice. After five rounds of feedback-and-fix, I'd have spent hours in a video editor doing the same thing manually. And if a client asks for a demo next month after a redesign, it's a 6-minute re-run, not a full re-shoot.

## Beyond Demos: Video as QA Evidence

This pipeline was built for a product demo, but the pattern — browser automation producing narrated video — has broader implications.

Think about QA. Today, test evidence is usually a CI log that says PASS or FAIL. When a client asks "show me that the payment flow works," you re-run the test and hope they trust a green checkmark. Imagine instead handing them a narrated video: the test runs, the voiceover explains each step, and the video is regenerated automatically on every release. Regression testing becomes not just a technical checkpoint but a reviewable artifact.

The same applies to compliance and auditing. Regulated industries need proof that systems work as specified. A version-controlled pipeline that produces timestamped video evidence on demand is fundamentally different from manual screen recordings buried in a shared drive. And onboarding — new team members could watch auto-generated walkthroughs that stay current with the actual UI, not documentation screenshots from six months ago.

The underlying shift is that video is becoming a programmatic output, not a creative production. When the cost of producing a video drops from hours to minutes, and re-producing it is a single command, you start using video in places where it was never practical before.

## Key Takeaways

- **Playwright's `recordVideo` is production-quality for demos** — 720p/25fps, no overhead.
- **Never use `-shortest` in ffmpeg when merging audio streams for concatenation.** Use `apad=whole_dur` to match audio duration to video duration exactly.
- **Variable-speed segments are the difference between a boring demo and a watchable one.** 15x compression for loading spinners, 1x for actual interactions.
- **Gemini 3.1 Pro is a legitimate video QA tool.** Upload a video, ask "is the audio synced with the visuals?", and it'll give you a timestamped report with near-perfect accuracy.
- **The human-AI feedback loop matters more than getting it right on the first try.** I described problems in plain language ("the scroll doesn't work"); Claude debugged and fixed. Five iterations to a polished result.
- **AI is great at automation; humans are great at judgment.** I shaped the narration and decided what to show. AI did everything else.
- **When video becomes a command, you use it everywhere.** The same pipeline that records a demo can generate QA evidence, onboarding walkthroughs, or compliance artifacts — all version-controlled and reproducible on every release.

*Tools used: Claude Code, Playwright, ffmpeg, edge-tts, Gemini 3.1 Pro*

*Originally published on javieraguilar.ai*

Want to see more AI agent projects? Check out my portfolio, where I showcase multi-agent systems, MCP development, and compliance automation.
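The scene-timestamps file is what makes per-scene assembly possible. As a rough sketch (the function and output paths here are illustrative, not the pipeline's actual code), each timestamp entry can be turned into an ffmpeg invocation that cuts that scene out of the full recording:

```typescript
type SceneStamp = { id: string; start: number; end: number };

// Build ffmpeg arguments that cut one scene out of the full recording.
// Illustrative sketch only; the real assembly script's structure may differ.
function sceneCutArgs(stamp: SceneStamp, input: string): string[] {
  const duration = stamp.end - stamp.start;
  return [
    "-ss", stamp.start.toFixed(3), // seek to the scene's start time
    "-i", input,
    "-t", duration.toFixed(3),     // keep only this scene's duration
    "-c", "copy",                  // stream copy (fast, keyframe-accurate only)
    `clips/${stamp.id}.mp4`,
  ];
}
```

Stream copy keeps the rough cut fast, though it snaps to keyframes; a step that applies speed or audio filters has to re-encode at that point anyway.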
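A split/trim/setpts/concat filtergraph of the kind described above can be generated directly from a proportional speed map. This is a hedged sketch, not the pipeline's actual filter construction:

```typescript
type SpeedSegment = { from: number; to: number; speed: number };

// Turn a proportional (0-10) speed map into a split/trim/setpts/concat
// filtergraph for one scene. sceneDuration is the clip's real length in seconds.
function speedFiltergraph(segments: SpeedSegment[], sceneDuration: number): string {
  const scale = sceneDuration / 10; // proportional units -> seconds
  const n = segments.length;
  // Fan the video stream out into one branch per segment
  const split = `[0:v]split=${n}` + segments.map((_, i) => `[s${i}]`).join("");
  const trims = segments.map((seg, i) => {
    const start = (seg.from * scale).toFixed(2);
    const end = (seg.to * scale).toFixed(2);
    // Trim the window, zero its timestamps, then divide PTS by the speed factor
    return `[s${i}]trim=start=${start}:end=${end},setpts=(PTS-STARTPTS)/${seg.speed}[v${i}]`;
  });
  const concat =
    segments.map((_, i) => `[v${i}]`).join("") + `concat=n=${n}:v=1:a=0[out]`;
  return [split, ...trims, concat].join(";");
}
```

With the scene-08 map and a 60-second clip, the 1-to-6 window (30 seconds of spinner) plays back in about 2 seconds at 15x, while the click and the answer review stay at real speed.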
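The progressive drift that `apad` prevents is easy to reproduce with plain arithmetic. Here is a toy model (not pipeline code), assuming each concatenated clip's audio starts where the previous clip's audio ended:

```typescript
type Clip = { videoDur: number; audioDur: number };

// For each clip, how many seconds does the voice lag behind the visuals?
// naive (padded = false): audio tracks are concatenated back-to-back.
// padded (padded = true): each audio track is padded to its video duration.
function audioDrift(clips: Clip[], padded: boolean): number[] {
  const drifts: number[] = [];
  let videoT = 0;
  let audioT = 0;
  for (const clip of clips) {
    drifts.push(videoT - audioT); // lag at this clip's start
    videoT += clip.videoDur;
    audioT += padded ? clip.videoDur : clip.audioDur;
  }
  return drifts;
}
```

With ten clips of 30s video and 13s audio, the naive model puts the voice 153 seconds behind by the tenth clip; with padding, every clip starts at zero lag, which is exactly what `apad=whole_dur` enforces.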
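The soft subtitles mentioned above can come straight from the same narration text and scene timestamps. A minimal sketch of generating an SRT track from them (names are illustrative, not the pipeline's actual module):

```typescript
type SubCue = { start: number; end: number; text: string };

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
function srtTime(t: number): string {
  const ms = Math.round(t * 1000);
  const pad = (n: number, w: number) => String(n).padStart(w, "0");
  const h = Math.floor(ms / 3_600_000);
  const m = Math.floor((ms % 3_600_000) / 60_000);
  const s = Math.floor((ms % 60_000) / 1000);
  return `${pad(h, 2)}:${pad(m, 2)}:${pad(s, 2)},${pad(ms % 1000, 3)}`;
}

// One numbered SRT block per scene's narration line.
function toSrt(cues: SubCue[]): string {
  return (
    cues
      .map((c, i) => `${i + 1}\n${srtTime(c.start)} --> ${srtTime(c.end)}\n${c.text}`)
      .join("\n\n") + "\n"
  );
}
```

The resulting `.srt` can be muxed as a soft track into an MP4 with `ffmpeg -i demo.mp4 -i demo.srt -c copy -c:s mov_text out.mp4`, so viewers can toggle subtitles on or off.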