# Beyond RAG: Building an AI Companion with "Deep Memory" using Knowledge Graphs


Source: Dev.to

I build AI tools to solve my own problems. A while back, I built NutriAgent to track my calories because I wanted to own my raw data. But recently, the problem wasn't mine; it was my wife's.

She uses LLMs differently than I do. While I use them for code or quick facts, she uses them as a therapist, a life coach, and a sounding board. Over the last year, she built a massive "Master Prompt" in Notion. It contained her medical history, key life events, emotional triggers, and ongoing projects. It was 35,000 tokens long. Every time she started a new chat, she had to manually copy-paste this wall of text just to get the AI up to speed. If she didn't, the advice was generic and useless.

She didn't need a search engine or a simple chat history. She needed a continuous brain. I realized that the standard way we build AI memory with RAG (Retrieval-Augmented Generation) wouldn't be enough. So I built Synapse AI Chat, an AI architecture that uses a Knowledge Graph to give an LLM "Deep Memory." Here is how I built it, why I chose Knowledge Graphs over Vectors (to be fair, I used both), and how I handled the engineering messiness of making it work.

## Why Standard RAG Wasn't Enough

Most AI memory systems today use Vector RAG: you chunk text, turn it into numbers (vectors), and find "similar" chunks later. This works great for finding a specific policy in a PDF, but not so great for modeling human relationships and history. Vectors find similarity, not structure.
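To make that distinction concrete, here is a toy sketch in plain Python. The triples and the `causal_chain` helper are illustrative only, not Synapse's actual schema: the point is that a graph can answer *why*, by walking explicit relations, while a vector store can only surface text that sounds similar.

```python
# Toy knowledge graph stored as (subject, RELATION, object) triples.
# Node and relation names are made up for illustration.
EDGES = [
    ("Project A", "CAUSED", "Stress"),
    ("Stress", "RESULTED_IN", "Overwhelm"),
    ("Project A", "OWNED_BY", "User"),
]

def causal_chain(start, end, edges=EDGES, path=None):
    """Depth-first walk: return the chain of relations linking two concepts,
    or None if no path exists."""
    path = path or []
    if start == end:
        return path
    for subj, rel, obj in edges:
        if subj == start:
            found = causal_chain(obj, end, edges, path + [(subj, rel, obj)])
            if found is not None:
                return found
    return None

print(causal_chain("Project A", "Overwhelm"))
# → [('Project A', 'CAUSED', 'Stress'), ('Stress', 'RESULTED_IN', 'Overwhelm')]
```

A similarity search over journal chunks would return "something about overwhelm"; the traversal returns the explicit chain of causes, which is exactly the kind of context the AI needs.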
If my wife tells the AI, "I'm feeling overwhelmed today," a Vector search might pull up a journal entry from three months ago where she mentioned "overwhelm." But a Knowledge Graph understands the story. It knows:

"Project A" -> CAUSED -> "Stress" -> RESULTED_IN -> "Overwhelm"

I needed the AI to understand causality, not just keywords.

## The Architecture Decision: Full Context Injection

Because I was using Google's Gemini models (which have a massive context window), I didn't need to retrieve just 5 small chunks of text. I could inject the entire compiled profile into the prompt. My goal was to turn the raw chat logs into a structured graph, then flatten it back into a comprehensive "User Manual" for the AI to read before every interaction.

Graphiti, the framework I used for the graph indexing, supports semantic search as a retrieval strategy, but I decided to take advantage of Gemini's big context window instead. The compiled graph output ended up being smaller than the source, shrinking from almost 35k tokens to ~14k, just by combining the entities with their descriptions and relations in plain text and avoiding the extra tokens her old Master Prompt spent building a narrative.

## Introducing Synapse: The Architecture

I split the project into two parts: the Body (the UI you talk to) and the Brain (the API that processes memory). Here is the high-level view:

- The Frontend (Body): React 19 + Convex. I chose Convex because it handles real-time database syncing effortlessly, which makes the chat feel snappy.
- The Cortex (Brain): Python + FastAPI. This does the heavy data processing.
- The Memory Engine: Graphiti + Neo4j.
- The Models: Gemini 3 Flash for the "heavy lifting" (building the graph), and Gemini 2.5 Flash for the actual chat (speed and cost).

## How It Works: The "Deep Memory" Pipeline

The system operates in three distinct phases.

## Phase A: Conversation (The Chat)

When my wife chats with Synapse, she is talking to Gemini 2.5 Flash. It's fast and fluid. The trick is that the System Prompt isn't static: before she sends her first message, I hydrate the prompt with a text summary of her entire Knowledge Graph. The AI immediately knows who she is, what she's worried about, and who her friends are.

## Phase B: Ingestion (The "Sleep" Cycle)

This is where the magic happens. When she finishes a conversation, either by not chatting for 3 hours or by manually clicking a Consolidate button, I treat it like the AI taking a nap to consolidate memories. We send the chat transcript to the Python Cortex. Here, I switch to Gemini 3 Flash. Why the upgrade?
Extracting entities from a messy human conversation is hard. If she says, "I stopped taking medication X and started Y," a weaker model might just add "Taking Y" to the graph. Gemini 3 is smart enough to apply the generic logic:

- Find node "Medication X".
- Mark the relationship as STOPPED.
- Create node "Medication Y".
- Create relationship STARTED.

## Phase C: Hydration (The Awakening)

When she returns, the next session is already prepared with the new compiled graph summary. It doesn't just dump a prompt; it compiles the nodes and edges into a natural-language narrative:

```python
def _format_compilation(definitions: list[str], relationships: list[str]) -> str:
    sections = []
    if definitions:
        sections.append(
            "#### 1. CONCEPTUAL DEFINITIONS & IDENTITY ####\n"
            "# (Understanding what these concepts mean specifically for this user)\n"
            + "\n".join(definitions)
        )
    if relationships:
        sections.append(
            "#### 2. RELATIONAL DYNAMICS & CAUSALITY ####\n"
            "# (How these concepts interact and evolve over time)\n"
            + "\n".join(relationships)
        )
    if not sections:
        return ""
    content = "\n\n".join(sections)
    return content
```

## The "Killer Feature": Memory Explorer

AI memory is usually a "Black Box," and users don't trust what they can't see. I wanted my wife to be able to audit her own brain, so I built a visualizer using react-force-graph. She can see bubbles representing her life: "Work," "Health," "Family."

If she sees a connection that is wrong (e.g., the AI thinks she likes a food she actually hates), she can edit the input and re-process the graph with new information like "I actually hate mushrooms now." The system then processes that new input and updates the graph, creating new nodes and relations or invalidating existing ones. This "Human-in-the-loop" approach builds massive trust.

## Engineering Challenges

Building this wasn't just about prompt engineering. There were real system challenges.

## 1. Handling Latency (The Job Queue)

Graph ingestion is slow. It takes anywhere from 60 to 200 seconds for Graphiti and Gemini to process a long conversation and update Neo4j. I couldn't have the UI hang for 3 minutes.

I used Convex as a Job Queue. When the session ends, the UI returns immediately. Convex processes the job in the background, updating the UI state to "Processing..." and then "Memory Updated" when it's done.

## 2. Handling Flakiness (The Retry Logic)

The Gemini API is powerful, but it occasionally throws 503 Service Unavailable errors, especially during heavy graph processing tasks. I implemented an "Event-Driven Retry" system: if the graph build fails, I don't just crash. I schedule a retry with exponential backoff.

```typescript
export const RETRY_DELAYS_MS = [
  0,            // Attempt 1: Immediate
  2 * 60_000,   // Attempt 2: +2 minutes (let the API cool down)
  10 * 60_000,  // Attempt 3: +10 minutes
  30 * 60_000,  // Attempt 4: +30 minutes
];

export const processJob = internalAction({
  args: { jobId: v.id("cortex_jobs") },
  handler: async (ctx, args) => {
    const job = await ctx.runQuery(internal.cortexJobs.get, { id: args.jobId });
    try {
      // 1. Do the heavy lifting (Call Gemini 3 Flash)
      // This is where 503 errors usually happen
      await ingestGraphData(ctx, job.payload);

      // 2. Mark complete if successful
      await ctx.runMutation(internal.cortexJobs.complete, { jobId: args.jobId });
    } catch (error) {
      const nextAttempt = job.attempts + 1;
      if (nextAttempt >= job.maxAttempts) {
        // Stop the loop if we've tried too many times
        await ctx.runMutation(internal.cortexJobs.fail, { jobId: args.jobId, error: String(error) });
      } else {
        // 3. Schedule the retry using Convex's scheduler
        const delay = RETRY_DELAYS_MS[nextAttempt] ?? 30 * 60_000;
        await ctx.scheduler.runAfter(delay, internal.processor.processJob, { jobId: args.jobId });
      }
    }
  },
});
```

## 3. Snappy UX

Convex's real-time sync was a lifesaver here. I didn't have to write complex WebSocket code: if the Python backend updates the status of a memory job in the database, the React UI updates instantly.

Token streaming is also better with Convex in the middle, since the backend is connected to Convex. If the user's browser is closed or the connection drops, token generation continues, passing the answer to Convex and streaming it to the user when possible. The catch is that this can increase function usage, since every update counts, so streaming updates are throttled to 100ms intervals to balance responsiveness with database write efficiency.

## The Result

The difference is night and day.

Before: My wife dreaded starting a new thread because of the "context set-up" tax. She felt like she was constantly repeating herself, and she carried the responsibility of constantly creating breakpoints to update the Master Prompt with new data before starting a new thread.

Now: She just talks. The system has a "Deep Memory" of about 10,000 tokens (compressed from months of chats) that is injected automatically. She has different threads for different topics, but they all share the same Cortex. If she mentions a health issue in the "Work" thread (e.g., "My back hurts from sitting"), the "Health" thread knows about it the next time she logs in.

## Conclusion

This project taught me that we are moving from "Horizontal" AI platforms (like ChatGPT, which knows a little about everything) to "Vertical" AI stacks that know everything about you. I've been watching how the ChatGPT and Gemini apps are starting to create user profiles and thread summaries to build this kind of memory. They are chasing the same goal: a truly personalized experience. The key takeaway for me is that Vectors are great for search, but Knowledge Graphs are essential for understanding.

I keep enjoying building solutions for real problems. Nowadays, we have powerful tools to build awesome software faster than ever, but I found that having a product vision and the technical understanding to architect a solution is still critical. That is the difference between building a quick prototype and solving a real problem.

This project is being used for real by my wife and me, and honestly, this is my favorite part of building products. The fun doesn't end when the architecture is done; it begins when people actually use it. Watching the product evolve, finding bugs, pivoting features, or even realizing that an initial idea didn't make sense at all: that is the journey. Building software is fun, but seeing it come alive and solve actual problems is magical.

The project is live at synapse-chat.juandago.dev if you want to see it in action. I'd love to hear your impressions and thoughts; let's continue the conversation on X or connect on LinkedIn. The code is open source if you want to dig into the implementation:
- Frontend (Body): synapse-chat-ai
- Backend (Cortex): synapse-cortex