# The AI That Never Forgets — My Vision Agents Hackathon Journey


Source: Dev.to

> How I solved video AI's biggest blind spot — amnesia — by building a real-time temporal memory engine on top of the Vision Agents SDK by Stream.

![ARGUS — The AI That Never Forgets](https://img.shields.io/badge/ARGUS-The_AI_That_Never_Forgets-00d4aa?style=for-the-badge&logo=openai&logoColor=white&labelColor=101010&logoWidth=30&pd=20)

🏆 Built for the Vision Possible: Agent Protocol Hackathon
⚡ Powered by Vision Agents SDK by Stream

## 🎥 Watch ARGUS in Action

Before I explain how I built it, you have to see it to believe it. Here is ARGUS detecting objects, tracking them over time, and answering questions about the past in real time. (If the video doesn't load, click here to watch the demo.)

## 🚨 The Problem: AI Has Amnesia

I realized something frustrating while testing modern video AI demos. They are brilliant at telling you what is happening right now, but they are terrible at telling you what happened 5 minutes ago.

If I drop my keys and ask a standard AI agent, "Where are my keys?", it looks at the current frame, sees nothing, and says: "I don't see any keys."

The Vision Agents SDK documentation actually highlights this limitation: "Longer videos can cause the AI to lose context. For instance, if it's watching a soccer match, it will get confused after 30 seconds."

That was my lightbulb moment. 💡 The Mission: Build ARGUS, a real-time agent that doesn't just "see" video — it remembers it.

## 🧠 What is ARGUS?

ARGUS is a multimodal AI agent that watches live video, tracks objects using computer vision, and maintains a Temporal Memory Engine.
Unlike standard agents that process Frame → Detect → Forget, ARGUS uses a stateful pipeline:

```mermaid
graph LR
    A[Camera Feed] --> B(YOLO26 Detection)
    B --> C{Temporal Memory Engine}
    C --> D[Update Object History]
    C --> E[Log Events]
    D & E --> F[LLM Context]
    F --> G((Voice Response))
```

Core logic: if an object moves zones, log it.

```python
if old_zone != zone:
    self._log("moved", f"{class_name} (ID:{track_id}) moved from {old_zone} to {zone}")
```

When I ask, "Where is the cup?", the LLM receives a context injection built from memory rather than from the current frame.

Picking a fast enough detector mattered, too:

| Model | Speed | Max FPS | Verdict |
|---|---|---|---|
| YOLO26 Nano | 130ms | 7.7 | ✅ Winner |
| YOLOv8 Nano | 138ms | 7.2 | Solid |
| YOLO11 Small | 310ms | 3.2 | Too slow |

## Key Capabilities

- 👁️ Real-time Tracking: Uses YOLO26 Nano + ByteTrack to assign persistent IDs to objects.
- 🕰️ Time Travel: Can answer "What did I hold up 2 minutes ago?"
- 📍 Spatial Awareness: Converts raw coordinates into human terms like "top-left" or "center."
- 🗣️ Voice Interaction: Full-duplex voice conversation with <1s latency.

## 💬 Real Conversations With ARGUS

These are actual interactions from my testing sessions:

*Real-time event logging showing objects appearing and moving.*

| I Said | ARGUS Responded |
|---|---|
| "What do you see?" | "Person ID:2 at middle-center, Cup ID:3 at bottom-right" |
| "What am I holding?" | "You appear to be holding a bottle, ID:7" |
| "What just moved?" | "Cup moved from bottom-left to bottom-right at 2:05 PM" |
| "Summarize everything" | "Person appeared at center 30s ago. Cup moved left to right at 2:05" |

⚡ Response time: ~1 second. 🧠 All answers came from temporal memory — not from re-analyzing the video frame.
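The zone names in those replies ("middle-center", "bottom-right") suggest a simple 3x3 grid laid over the frame. Here is a minimal sketch of that bucketing; the function name `zone_for` is hypothetical, not the repo's actual API:

```python
def zone_for(cx: float, cy: float, width: int, height: int) -> str:
    """Map a bounding-box center (cx, cy) in a width x height frame
    to a human-readable 3x3 grid zone like 'top-left' or 'middle-center'."""
    col = ["left", "center", "right"][min(int(3 * cx / width), 2)]
    row = ["top", "middle", "bottom"][min(int(3 * cy / height), 2)]
    return f"{row}-{col}"
```

Three words per object instead of four floating-point coordinates keeps the LLM context small and cheap.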
## 🏗️ The Architecture & Tech Stack

I needed a stack that was fast, cheap, and capable of handling real-time video streams without melting my laptop.

| Component | Technology | Why I Chose It |
|---|---|---|
| Framework | Vision Agents SDK | It handled all the WebRTC/audio/video piping for me. |
| Vision Model | YOLO26 Nano | Benchmarked at 130ms/frame on CPU. Fast and accurate. |
| Reasoning | Llama 3.3 via OpenRouter | Fast inference with tool-calling capabilities. |
| Speech | Deepgram (STT) + ElevenLabs (TTS) | The lowest-latency combo available. |
| Transport | Stream Edge Network | Kept video latency under 30ms. |

## 🛠️ The Build Journey

### 1. The "Secret Weapon": Temporal Memory Engine

This is the heart of the project. I wrote a custom Python class that sits between the vision processor and the LLM. Instead of feeding raw video frames to the LLM (which is slow and expensive), I feed it structured event logs.
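To make that concrete, here is a minimal sketch of what such an engine can look like. The class and method names (`TemporalMemory`, `update`, `context`) are my own illustration, not the repo's actual API:

```python
import time
from collections import deque

class TemporalMemory:
    """Sketch of a temporal memory engine: remembers where tracked
    objects were last seen and logs zone-change events."""

    def __init__(self, max_events: int = 100):
        self.objects = {}  # track_id -> {"class": str, "zone": str, "seen": float}
        self.events = deque(maxlen=max_events)  # bounded event log

    def _log(self, kind: str, message: str) -> None:
        self.events.append((time.time(), kind, message))

    def update(self, track_id: int, class_name: str, zone: str) -> None:
        """Call once per detection per frame."""
        old = self.objects.get(track_id)
        if old is None:
            self._log("appeared", f"{class_name} (ID:{track_id}) appeared at {zone}")
        elif old["zone"] != zone:
            self._log("moved", f"{class_name} (ID:{track_id}) moved from {old['zone']} to {zone}")
        self.objects[track_id] = {"class": class_name, "zone": zone, "seen": time.time()}

    def context(self) -> str:
        """Render memory as plain text for LLM context injection."""
        lines = ["[ARGUS MEMORY]"]
        for tid, obj in self.objects.items():
            age = int(time.time() - obj["seen"])
            lines.append(f"{obj['class']} (ID:{tid}): last seen at {obj['zone']} {age}s ago.")
        for _ts, _kind, msg in list(self.events)[-5:]:
            lines.append(f"Event: {msg}")
        return "\n".join(lines)
```

Bounding the event log with `deque(maxlen=...)` keeps memory flat no matter how long the stream runs.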
The LLM never sees pixels; it sees a rendered memory block like this:

```text
[ARGUS MEMORY]
Cup (ID:2): Last seen at bottom-right at 12:05 PM.
Person (ID:1): Currently visible at center.
Event: Cup moved from left to right 30 seconds ago.
```

### 2. Building the Custom Processor

Using the SDK's VideoProcessorPublisher pattern was intuitive. I could access the raw av.VideoFrame, run my YOLO inference, draw bounding boxes, and push the frame back to the browser.

*ARGUS tracking objects with persistent IDs and spatial zones.*

### 3. Solving the Latency Problem

My first prototype had 5-second delays. To fix this, I optimized ruthlessly:

- Switched from Gemini (rate limits) to OpenRouter/Llama.
- Switched from YOLO11 to YOLO26 Nano (7.7 FPS on CPU).
- Used human-readable zones ("top-left") instead of raw coordinates, reducing token usage for the LLM.

## 🧪 Benchmark Results

I ran a diagnostic script to prove efficiency on a standard laptop (no GPU).
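The actual diagnostic script isn't reproduced in the post, so here is a minimal timing harness of the same shape. `benchmark` is a hypothetical helper; the commented-out ultralytics call uses the library's real `model.track` tracking API, though the `yolo26n.pt` weights name is an assumption:

```python
import time

def benchmark(infer, frames, warmup: int = 3):
    """Time an inference callable over a list of frames.
    Returns (avg_ms_per_frame, fps)."""
    for frame in frames[:warmup]:  # warm-up runs are excluded from timing
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    avg_ms = 1000 * elapsed / len(frames)
    return avg_ms, 1000.0 / avg_ms

# With ultralytics, `infer` would wrap the real ByteTrack-enabled call, e.g.:
#   from ultralytics import YOLO
#   model = YOLO("yolo26n.pt")  # weights name assumed from the post
#   avg_ms, fps = benchmark(
#       lambda f: model.track(f, persist=True, tracker="bytetrack.yaml", verbose=False),
#       frames)
```

On the numbers in the table above, 130ms/frame works out to the quoted ~7.7 FPS ceiling.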
The Vision Agents SDK was crucial here. Because it handles the video transport efficiently, I could use all my CPU cycles for the actual detection logic.

## 💡 The "Aha!" Moment

The magic happened during a test run. I held up a water bottle, put it down, and waited. Then I asked:

> Me: "What did I just show you?"
> ARGUS: "You were holding a bottle (ID:7) at the center of the screen about 15 seconds ago."

It wasn't looking at the bottle now. It remembered.
That feeling of interacting with an AI that has object permanence is wild.

## 🌍 Why This Matters

Hackathons often produce cool demos that don't solve real problems. ARGUS solves the context-window problem for video. By abstracting video into structured temporal data, we can build agents that:

- Monitor security feeds for hours and summarize activity.
- Help find lost items in a room.
- Analyze workflow efficiency in factories.

The Vision Agents SDK made this possible by removing the complexity of WebRTC and audio handling, allowing me to focus entirely on the memory innovation.

## 🔗 Links & Resources

- Code Repository: [ARGUS](https://github.com/Vaibhav13Shukla/argus)
- Vision Agents SDK: Star the Repo!
- Hackathon: Vision Possible

Thanks to Stream and WeMakeDevs for this challenge. It pushed me to build something I didn't think was possible in a weekend!