Tools: Log Entry 002 - RAG in a box
Source: Dev.to
Today I decided to refresh my Python skills. Nothing fancy, just the basics that carried me through the past years. I installed VS Code, added some helpful extensions, set up a venv, and wrote a few test lines. Simple steps, but it felt good to get back into it.

My main goal was to build a small RAG application: a simple retrieval-augmented generation setup that I could take with me and run anywhere, as long as Docker is installed. That is why I started calling it RAG in a Box.

Before writing any code, I watched some YouTube videos to get an idea of where things stand today. It was a bit underwhelming. Most videos focused on clicking things together with n8n or Zapier. Nice tools, but not what I wanted right now. I wanted to get my hands dirty :D
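For reference, the environment setup was just the usual venv routine (Python 3 assumed; the exact packages to install depend on the stack described below):

```shell
# Create and activate an isolated environment for the project
python3 -m venv .venv
. .venv/bin/activate

# Confirm the venv's interpreter is the one in use
python -c "import sys; print(sys.prefix)"
```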
Spoiler: nothing got dirty here. My expectations were probably too high.

So I approached the problem like I would approach a new feature at work: I broke it down into smaller parts that, together, would make this tiny project work. Here is what I noted down:

- must be portable, so a Docker container is needed
- must be exchangeable, so instead of a container I could plug in an Azure service
- must read files from a directory and load them into a vector database
- must ingest this data properly
- must retrieve data properly
- must work fully locally, for example with LM Studio, but also offer the option to switch to another LLM provider if needed

With that list in mind, I looked around for tools that could make this doable. I found several posts and examples using Qdrant. It is open source, runs locally, and seems perfect for a project like this. The stack also pulls a model at startup, which is fine for now.

To orchestrate the entire workflow, I picked LlamaIndex. I wanted as much flexibility as possible, and LlamaIndex makes it easy to connect my data to any LLM backend. But the real heavy lifting happens before that: to turn my documents into searchable numbers (embeddings), I am running a local Hugging Face model. This brings torch and transformers into the picture and lets me process data privately without sending it to an external API.

To ensure the "exchangeability" I listed in my requirements, I implemented a simple factory pattern. I didn't want to hardcode the connection to LM Studio: if I want to switch to GPT-4 or Azure tomorrow, I want to do it by changing a single environment variable, not by rewriting code. Since LM Studio mimics the OpenAI API, I can use the standard OpenAI driver and just point it at my local machine instead of their servers (the factory code is below).

The next steps were straightforward. I wrote a Docker Compose file and defined the containers: one for the database, one for ingestion, and one for retrieval via API. The rest was mostly following the documentation of each module. Preparation took me around two to three hours; the actual implementation maybe thirty to forty minutes. It is not an enterprise-grade RAG system, but it worked. I added a PDF file to the designated folder, ran `docker compose up --build`, and then used curl to ask my system about the document. That moment made me happy.
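The three-container layout could look roughly like this. Service names, build paths, ports, and volumes here are illustrative guesses, not my actual file:

```yaml
services:
  qdrant:
    image: qdrant/qdrant
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

  ingest:
    build: ./ingest          # reads ./data and pushes embeddings into Qdrant
    volumes:
      - ./data:/app/data
    depends_on:
      - qdrant

  api:
    build: ./api             # exposes the retrieval /query endpoint
    ports:
      - "8000:8000"
    depends_on:
      - qdrant

volumes:
  qdrant_data:
```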
But then I thought, this cannot be everything that AI engineering is about. So I started reading about chunking and clustering. It is impressive how the tools split the work: LlamaIndex handles the chunking automatically, while Qdrant takes care of the vector indexing. All this logic is hidden from me. I am just a handyman with a set of tools. I know how to replace a light bulb, but I do not need to understand how energy travels through the entire grid to make it shine. I know where the fuses are and that is enough for now. This was just the first tiny dip into the water with a very tiny toe.
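To get a feel for what the library is doing behind the scenes, here is a deliberately naive sketch of fixed-size chunking with overlap. This is a toy illustration of the idea, not LlamaIndex's actual splitter (which works on sentences and tokens, not raw characters):

```python
def chunk_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Split text into windows of `chunk_size` characters.

    Consecutive chunks share `overlap` characters, so content cut at a
    boundary still appears whole in at least one chunk.
    """
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(str(i % 10) for i in range(250))
chunks = chunk_text(text)
# 250 characters with step 80 -> windows starting at 0, 80, 160, 240
```

If a chunk boundary lands mid-paragraph, the overlap is the only thing keeping that paragraph retrievable as a unit, which is exactly the failure mode I worry about below.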
But it is a start, and I like where this is going. After the initial high of getting a response, I’ve quickly realized that building the "box" was just the tip of the iceberg. It’s one thing to get a clean summary from a simple PDF, but it’s another thing entirely to ensure the system doesn't confidently hallucinate when the data gets messy. I’ve learned that the "magic" is actually in the plumbing, specifically how you chunk the data and how you verify the source of the answer. By letting LlamaIndex handle everything "automatically," I’ve essentially handed over the keys to a black box. If the chunking slices a paragraph in the wrong place, the retrieval fails; if the retrieval fails, the LLM just starts guessing. Moving forward, my focus has to shift from just "orchestrating" tools to actually validating the data pipeline. The infrastructure is solid, but the real engineering challenge is turning "it looks right" into "I can prove this is right."
```python
import os

from llama_index.llms.openai import OpenAI
from llama_index.llms.azure_openai import AzureOpenAI


def get_llm(provider: str):
    """Factory to switch between Local, OpenAI, or Azure drivers."""
    if provider == "local":
        # LM Studio mimics the OpenAI API, so we just point the base_url locally.
        # No real API key is needed, but the library expects a string.
        return OpenAI(
            api_base="http://host.docker.internal:1234/v1",
            api_key="not-needed",
            model="local-model",
        )
    elif provider == "openai":
        return OpenAI(model="gpt-4-turbo")
    elif provider == "azure":
        return AzureOpenAI(
            engine="my-deployment",
            api_version="2023-05-15",
        )
    else:
        raise ValueError(f"Unknown provider: {provider}")


# Selected at startup via a single environment variable:
llm = get_llm(os.environ.get("LLM_PROVIDER", "local"))
```
```shell
docker compose up --build
```
```shell
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"query": "give me 5 key facts of the document. each should not be longer than a normal sentence"}'
```
```json
{
  "response": "1. The document discusses an algorithm for generating test cases from sequence diagrams.\n2. It presents a method to transform sequence diagrams into tree representations for analysis.\n3. The system was evaluated using a login feature's sequence diagram and test cases.\n4. The results showed that the generated test cases matched the sequence of messages in the diagram.\n5. Sequence Dependency Tables (SDT) are used to analyze message dependencies in sequence diagrams."
}
```