Tools: How to Build a Simple Persistent Memory Layer for LLM Apps (With Code)

Source: Dev.to

Most LLM-powered apps feel impressive for five minutes. Then they forget everything.

You ask a chatbot something. It responds intelligently. You close the tab, come back later, and it behaves like you've never met. That's not a model problem. That's an architecture problem.

In this article, we'll build a simple persistent memory layer for an LLM app using:

- OpenAI embeddings
- A lightweight vector store (FAISS)
- Basic retrieval logic

By the end, you'll understand how to move from "stateless prompt wrapper" to a structured LLM system.

## Why Stateless LLM Apps Break in Production

Most basic LLM apps work like this:

- User sends input
- Input is sent to model
- Model responds
- Conversation disappears

Even if you store chat history, once you exceed the context window, you're forced to truncate earlier messages.

Problems this creates:

- No long-term personalization
- No user memory
- Repeated explanations
- Poor multi-session experience

If you're building anything beyond a demo, you need persistent memory.

## What Is a Persistent Memory Layer?

A persistent memory layer:

- Stores meaningful interactions
- Converts them into embeddings
- Saves them in a vector database
- Retrieves relevant memories for future conversations

Instead of stuffing everything into context, you retrieve only what matters.

Architecture overview:

```
User Input
   ↓
Embed Input
   ↓
Store in Vector DB
   ↓
Retrieve Relevant Past Memories
   ↓
Build Context
   ↓
Send to LLM
```

Let's build a minimal memory system.

## Step 1: Install Dependencies

```bash
pip install openai faiss-cpu numpy
```

## Step 2: Create a Memory Store

```python
import faiss
import numpy as np
from openai import OpenAI

client = OpenAI()

dimension = 1536  # OpenAI embedding size
index = faiss.IndexFlatL2(dimension)
memory_texts = []
```

This creates a simple in-memory FAISS vector store.

## Step 3: Store Memories

Every time the user sends something meaningful, embed and store it.

```python
def add_memory(text):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    embedding = np.array(response.data[0].embedding).astype('float32')
    index.add(np.array([embedding]))
    memory_texts.append(text)
```

```python
add_memory("User prefers short technical explanations.")
add_memory("User is building a SaaS AI tool.")
```

Now we can persist interactions semantically.

## Step 4: Retrieve Relevant Memories

When the user sends a new query, embed it and search the vector index.

```python
def retrieve_memories(query, k=3):
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    )
    query_embedding = np.array(response.data[0].embedding).astype('float32')
    distances, indices = index.search(np.array([query_embedding]), k)
    return [memory_texts[i] for i in indices[0] if i < len(memory_texts)]
```

Now we can pull relevant historical context.

## Step 5: Build the Context for the LLM

```python
def build_prompt(user_input):
    relevant_memories = retrieve_memories(user_input)
    memory_section = "\n".join(relevant_memories)

    return f"""
You are an AI assistant.

Relevant past information:
{memory_section}

Current user message:
{user_input}

Respond accordingly.
"""
```

The prompt combines:

- Retrieved memory
- Current user input

This ensures the model receives structured context, not raw history.

## Step 6: Generate Response

```python
def generate_response(user_input):
    prompt = build_prompt(user_input)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "user", "content": prompt}
        ]
    )
    return response.choices[0].message.content
```

Now your app has semantic long-term memory.

## Why This Works

Instead of "dump entire conversation into context," the app now does "retrieve only relevant past knowledge." That gives you:

- Scalability
- Token efficiency
- Personalization

And most importantly, it shifts your app from demo-tier to architecture-tier.

## Common Pitfalls

### 1. Storing Everything

Don't embed trivial small talk. Store meaningful information only.

### 2. Memory Drift

Over time, irrelevant memories may surface. Consider tagging or pruning.

### 3. Cost Explosion

Embedding every interaction can become expensive. Add filtering logic.

### 4. Latency

Vector search is fast, but remote DB calls add delay. Optimize if needed.

## Taking This Further

You can improve this system by:

- Adding user IDs for multi-user support
- Using persistent storage (e.g., Pinecone, Weaviate, Redis)
- Creating memory types (preferences, facts, decisions)
- Adding time-decay weighting

This article shows the core pattern. From here, you can productionize.

## Final Thoughts

Prompt engineering is not enough for serious AI products. If your system forgets everything, it's not intelligent; it's reactive.

Adding a memory layer is one of the simplest architectural upgrades you can make to move beyond basic wrappers. It's not complicated. It's just structured design.