# Building a Production-Ready Healthcare RAG System: A Complete Guide

2026-02-02
**Contents:**

- System Architecture
- Implementation: Building the System (document ingestion & chunking, vector store setup, query pipeline & LLM generation)
- Evaluation: Measuring RAG Quality
- Production Considerations
- Lessons Learned: What I Wish I Knew Before Starting
- What I'd Do Differently
- Next Steps
- Conclusion

Healthcare professionals face a growing challenge: critical information is scattered across hundreds of documents, from equipment manuals and hospital policies to SOPs and clinical guidelines. When a nurse needs to know the defibrillation protocol during an emergency, or when a biomedical engineer troubleshoots an X-ray machine at 2 AM, they can't afford to spend 20 minutes searching through PDFs. This is where Retrieval-Augmented Generation (RAG) becomes transformative.
Unlike simple document search or standalone LLMs, RAG combines the precision of semantic search with the natural language understanding of large language models. The result: staff can ask questions in plain English and get accurate, source-backed answers instantly.

In this tutorial, I'll walk you through building a production-ready healthcare RAG system from scratch. All code is available on GitHub, and by the end, you'll have a working system you can adapt for your own use case.

## System Architecture

Before diving into code, let's understand how the pieces fit together. Our RAG system consists of four main components:

1. Document Processor
2. Vector Store (ChromaDB)

### Key Design Decisions

**Why ChromaDB over Pinecone?**
For this proof-of-concept, ChromaDB offers:

**Why Custom Chunking?**
Medical documents have a unique structure. Generic chunking (e.g., splitting every 500 tokens) breaks that context; we need domain-aware chunking that preserves semantic units.

**Why Metadata Filtering?**
Imagine asking: "What's the defibrillation protocol?"

- Without filtering: you might get results from policies, SOPs, AND training materials, which is overwhelming and potentially conflicting.
- With metadata: filter by doc_type: "SOP" and department: "Emergency" → a precise, actionable answer.

## Implementation: Building the System

Now that we understand the architecture, let's build it.

### Part 1: Document Ingestion & Chunking

The foundation of any RAG system is how you process documents. Poor chunking = poor retrieval = poor answers.

**The Chunking Challenge.** Consider the excerpt from a defibrillation SOP reproduced in the listings below: a naive splitter might break it at "PROCEDURE:" or mid-step. That destroys the critical context that steps 3-5 must happen together.

**Implementation Results.** I tested the system with three core healthcare documents. This chunking strategy ensured that:

### Part 2: Vector Store Setup

For development and cost control, I started with a fully local stack.

Retrieval Statistics:

Why I started with local models:

**Local vs. Cloud Trade-offs.** For production systems handling real-time queries, upgrading to OpenAI's text-embedding-3-small would deliver:

The architecture supports easy swapping:

**Key insight:** Start local for development, upgrade embeddings for production. The ~$50/month cost is justified by a 20x speed improvement.

### Part 3: Query Pipeline & LLM Generation

Test Queries (5 samples):

**The Speed Problem.** 36 seconds per query is unacceptable for production. Users won't wait.

Performance Comparison:

**My recommendation:** Use GPT-3.5-turbo for production. The 30x speed improvement costs ~$60/month for 30,000 queries, easily justified by user experience.

**Source Citation.** Regardless of LLM choice, always return sources. Why source citation matters in healthcare:

## Evaluation: Measuring RAG Quality

Building a RAG system is one thing. Proving it works is another. In healthcare, wrong answers aren't just annoying, they're dangerous. We need rigorous evaluation to ensure our system is both accurate and trustworthy.

Traditional ML metrics (accuracy, F1) don't work for RAG systems because:

Enter RAGAS (Retrieval-Augmented Generation Assessment).
I implemented the evaluation in two phases.

### Layer 1: Basic Performance Metrics (Implemented)

These metrics run automatically on every query. Actual results from my test queries are shown in the listings below. What these metrics tell us:

- ✅ Answer length variability (45-154 words): the system adapts response length to question complexity, concise for simple queries, detailed for complex ones.
- ✅ Consistent retrieval (5.0 docs avg): the system reliably finds relevant context for every query.
- ⚠️ Response time (36s): unacceptable for production. The local LLM is the bottleneck; upgrading to GPT-3.5/4 would reduce this to 1-2 seconds.
- ✅ Document coverage: all 3 source documents were accessed across queries, indicating good index coverage.

### Layer 2: RAGAS Framework (Ready for Production)

For production deployment, I implemented the infrastructure for RAGAS (Retrieval-Augmented Generation Assessment), the industry standard for evaluating RAG systems. RAGAS measures four critical dimensions: faithfulness, answer relevancy, context precision, and context recall.

**Why I haven't run full RAGAS yet:** RAGAS requires API calls to OpenAI (for LLM-as-judge evaluation), which incurs costs. For this proof-of-concept using local models, I prioritized building the evaluation infrastructure over spending API credits. In production, RAGAS would run:

Target RAGAS metrics for production:

### Hallucination Detection (Manual Testing)

Even without automated RAGAS, I tested hallucination resistance manually.

✅ Result: the system correctly refuses to hallucinate information it doesn't have. This is critical in healthcare: better to say "I don't know" than to fabricate potentially dangerous instructions.

### The Bottom Line

Evaluation isn't optional in healthcare AI. My two-layer approach:

For production deployment, I'd allocate ~$50/month for RAGAS API calls, a small price for confidence that the system isn't hallucinating medical instructions.

**Key takeaway:** Build evaluation infrastructure early, even if you don't run expensive metrics until production. The ability to prove quality is as important as the system itself.

## Production Considerations

You've built a working RAG system. Now comes the hard part: deploying it safely in a healthcare environment.
Production isn't just about making code run; it's about reliability, compliance, cost control, and user trust.

### Cost Optimization

Running a production RAG system isn't free. Here's the breakdown of the monthly estimate (1,000 users, ~30K queries/month):

Cost per query: ~$0.009 (less than 1 cent)

### Privacy & Compliance

Healthcare data is highly regulated. Here's how to stay compliant.

**HIPAA Considerations.** Our system handles medical documents, but NOT patient data.

✅ Safe: "What is the defibrillation procedure?"
❌ Unsafe: "What is John Doe's treatment plan?" (would require PHI handling)

If you need to process patient data:

- Use BAA-compliant services
- Encrypt data at rest and in transit
- Implement access controls and authentication
- Maintain audit logs (6-year retention for HIPAA)

## Lessons Learned: What I Wish I Knew Before Starting

Building this Healthcare RAG system taught me lessons that no tutorial covered. Here's what I learned the hard way.

### 1. Chunking Strategy Makes or Breaks Your System

What I thought: "I'll just use LangChain's default text splitter."

Reality: generic chunking destroyed context in medical documents.

Lesson: spend time understanding your document structure. Domain-specific chunking isn't optional; it's the foundation of good retrieval.

### 2. Local Models Are Great for Privacy, Terrible for Speed

What I thought: "I'll save money with Ollama and avoid API costs."

Reality: 36-second response times killed the user experience.

Lesson: use local models for experimentation, then switch to GPT-3.5 for production. The $60/month is easily justified by a 30x speed improvement.

### 3. Metadata Filtering Is Your Secret Weapon

Before metadata filtering:

After metadata filtering:

Lesson: metadata isn't just for organization; it's for precision retrieval. Always design your metadata schema upfront.

### 4. Source Citation Builds Trust

User reaction without sources: "The system says to do X, but I'm not sure I trust it."

User reaction with sources: "The system says to do X [from defibrillation_sop.md, page 3]. Got it, that's from our official SOP."

Lesson: in high-stakes domains like healthcare, users need to verify your system's answers. Citations aren't optional; they're essential for trust.

### 5. Cost Optimization Isn't Premature

My initial thinking: "I'll optimize costs once it's in production."

Reality: the unoptimized prototype was projecting $800/month in API costs.

Quick wins that saved $500/month:

Lesson: implement basic cost controls from the start. It's easier than refactoring later.

## What I'd Do Differently

If I started over today:

## Next Steps

This project proved the concept. To make it production-ready, I'd focus on:

- Short-term (1-2 months):
- Medium-term (3-6 months):
- Long-term (6-12 months):
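The dollar figures throughout this article (~$0.009 per query, ~$60/month for 30K GPT-3.5 queries, the $800/month unoptimized projection) all come from the same simple arithmetic. Here is a sketch of that model; the token counts and per-1K-token prices below are illustrative assumptions, not quoted rates:

```python
# Hypothetical cost model: token counts and per-1K-token prices are
# illustrative assumptions, not quoted OpenAI rates.
def estimate_monthly_cost(queries_per_month,
                          prompt_tokens,
                          completion_tokens,
                          price_in_per_1k,
                          price_out_per_1k):
    """Return the projected monthly LLM spend in dollars."""
    per_query = (prompt_tokens / 1000) * price_in_per_1k \
              + (completion_tokens / 1000) * price_out_per_1k
    return per_query * queries_per_month

# Example: ~2,000 prompt tokens (question + 5 retrieved chunks),
# ~150 completion tokens, assumed prices, 30K queries/month.
monthly = estimate_monthly_cost(30_000, 2_000, 150, 0.0035, 0.004)
print(f"~${monthly:,.0f}/month")  # → ~$228/month
```

Tune the token counts and prices to your own model and prompt size; with 5 retrieved chunks per query, the prompt dominates the cost, which is why trimming retrieved context is one of the cheapest optimizations.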
The hard parts are: The good news: RAG systems are incredibly powerful when done right. The ability to instantly search thousands of medical documents and get accurate, source-backed answers is transformative for healthcare workers. The reality: Getting from prototype to production takes 3-4x longer than you think. But it's worth it. If you're building a RAG system: All code for this project is on GitHub: https://github.com/nourhan-ali-ml/Healthcare-RAG-Assistant Questions? Feedback? Find me on LinkedIn or open an issue on GitHub. This article is based on a real implementation but uses synthetic medical documents for demonstration. Always consult official hospital policies and procedures for actual medical guidance. Templates let you quickly answer FAQs or store snippets for re-use. Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment's permalink. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuse CODE_BLOCK:
---

## Appendix: Code & Output Listings

High-level design:

```
┌─────────────────┐
│  Medical Docs   │
│ (PDF, DOCX, MD) │
└────────┬────────┘
         │
         ▼
┌─────────────────────┐
│ Document Processor  │
│ - Chunking          │
│ - Metadata Extract  │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Vector Store        │
│ (ChromaDB)          │
│ - Embeddings        │
│ - Similarity Search │
└────────┬────────────┘
         │
         ▼
┌─────────────────────┐
│ Query Pipeline      │
│ - Retrieval         │
│ - LLM Generation    │
│ - Source Citation   │
└─────────────────────┘
```
The defibrillation SOP excerpt:

```
WARNING: Ensure patient is not in contact with metal surfaces.

PROCEDURE:
1. Turn on the defibrillator
2. Attach electrode pads to the patient's chest
3. Ensure everyone stands clear
4. Press the ANALYZE button
5. If shock advised, press the SHOCK button
```
The chunking implementation:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter
from typing import List, Dict


class MedicalDocumentChunker:
    def __init__(self):
        self.chunk_size = 800      # tokens, not characters
        self.chunk_overlap = 150   # preserve context across chunks
        # Medical documents have special separators
        self.separators = [
            "\n## ",         # Section headers
            "\n### ",        # Subsections
            "\nWARNING:",    # Critical safety info
            "\nPROCEDURE:",  # Step-by-step instructions
            "\n\n",          # Paragraph breaks
            "\n",            # Line breaks
            ". ",            # Sentences
        ]

    def chunk_document(self, text: str, metadata: Dict) -> List[Dict]:
        """Chunk document while preserving semantic units"""
        splitter = RecursiveCharacterTextSplitter(
            chunk_size=self.chunk_size,
            chunk_overlap=self.chunk_overlap,
            separators=self.separators,
            length_function=self._token_length
        )
        chunks = splitter.split_text(text)
        # Enrich each chunk with metadata
        return [
            {
                "text": chunk,
                "metadata": {
                    **metadata,
                    "chunk_index": i,
                    "total_chunks": len(chunks)
                }
            }
            for i, chunk in enumerate(chunks)
        ]

    def _token_length(self, text: str) -> int:
        """Approximate token count (OpenAI uses ~4 chars per token)"""
        return len(text) // 4


# Usage
chunker = MedicalDocumentChunker()

# Process a defibrillation SOP
with open("data/sops/defibrillation.md", "r") as f:
    text = f.read()

chunks = chunker.chunk_document(
    text=text,
    metadata={
        "doc_type": "SOP",
        "department": "Emergency",
        "equipment": "Defibrillator",
        "last_updated": "2024-01"
    }
)

print(f"Created {len(chunks)} chunks")
# Output: Created 4 chunks
```
Indexing results:

```
Total indexed chunks: 100
Average chunk size: ~500 characters
Unique sources: 3 documents
Chunk overlap: 150 tokens
```
Vector store setup:

```python
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import Chroma
import chromadb

# Local embedding model (runs on CPU)
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
    model_kwargs={"device": "cpu"}
)

# Initialize ChromaDB with persistence
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with metadata
collection = chroma_client.get_or_create_collection(
    name="healthcare_docs",
    metadata={
        "description": "Medical policies, SOPs, and equipment manuals",
        "embedding_model": "all-MiniLM-L6-v2"
    }
)

# Create vector store
vectorstore = Chroma(
    client=chroma_client,
    collection_name="healthcare_docs",
    embedding_function=embeddings
)

# Add documents with metadata (used later for filtering)
for chunk in chunks:
    vectorstore.add_texts(
        texts=[chunk["text"]],
        metadatas=[chunk["metadata"]]
    )

print(f"✅ Indexed {len(chunks)} chunks")
```
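One note on the indexing loop above: it calls `add_texts` once per chunk. LangChain's `add_texts` accepts lists, so batching the chunks into a single call avoids per-chunk overhead. A sketch (the `chunks` list here is a small stand-in with the same shape as the article's):

```python
# Stand-in chunks with the same {"text", "metadata"} shape as the chunker's output.
chunks = [
    {"text": "WARNING: Stand clear.", "metadata": {"doc_type": "SOP", "chunk_index": 0}},
    {"text": "1. Turn on the defibrillator", "metadata": {"doc_type": "SOP", "chunk_index": 1}},
]

# Collect parallel lists of texts and metadata dicts
texts = [c["text"] for c in chunks]
metadatas = [c["metadata"] for c in chunks]

# One call instead of len(chunks) separate calls (uncomment with a real store):
# vectorstore.add_texts(texts=texts, metadatas=metadatas)
print(len(texts), len(metadatas))
```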
Retrieval with metadata filtering:

```python
def retrieve_with_filter(query: str, doc_type: str = None,
                         department: str = None, k: int = 5):
    """Retrieve relevant chunks with optional metadata filtering"""
    # Build metadata filter
    filter_dict = {}
    if doc_type:
        filter_dict["doc_type"] = doc_type
    if department:
        filter_dict["department"] = department

    # Perform similarity search
    results = vectorstore.similarity_search(
        query=query,
        k=k,
        filter=filter_dict if filter_dict else None
    )
    return results


# Example: Get defibrillation procedure from SOPs only
results = retrieve_with_filter(
    query="What is the defibrillation procedure?",
    doc_type="SOP",
    department="Emergency",
    k=5
)

print(f"Retrieved {len(results)} relevant chunks")
# Output: Retrieved 5 relevant chunks
```
Retrieval configuration:

```
Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (384 dimensions)
Vector Store: ChromaDB (persistent, local)
Top-K: 5 results per query
Total indexed: 100 chunks across 3 documents
```
Retrieval statistics:

```
Average chunks retrieved per query: 5.0
Retrieval success rate: 100% (all queries returned results)
```
Embedding and retrieval timing:

```
Embedding generation: ~2-3s for 100 chunks
Query embedding: ~0.3s per query
Total retrieval: ~0.5s per query
```
Swapping to OpenAI embeddings for production:

```python
# Drop-in replacement for production
import os
from langchain.embeddings import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

# Rest of code remains identical
vectorstore = Chroma(
    embedding_function=embeddings,
    # ... same setup
)
```
The fully local query pipeline:

```python
from langchain.llms import Ollama
from langchain.chains import RetrievalQA

# Local LLM (runs on CPU/GPU)
llm = Ollama(
    model="llama3.2:3b",
    temperature=0.1  # Low temperature for factual responses
)

# Build RAG chain
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    return_source_documents=True
)

# Query the system
response = qa_chain({"query": "What is the defibrillation procedure?"})

print(f"Answer: {response['result']}")
print(f"Sources: {[doc.metadata['source'] for doc in response['source_documents']]}")
```
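A quick sanity check worth running against a chain like the one above: ask about equipment that is *not* in the index and verify the answer admits uncertainty rather than fabricating a procedure. The helper below is a heuristic sketch; the function names, refusal markers, and probe queries are illustrative, not from the original project:

```python
# Illustrative refusal markers; tune to the phrasing your LLM actually uses.
REFUSAL_MARKERS = (
    "i don't know", "don't have", "no information",
    "not mentioned", "cannot find", "not in the provided",
)


def looks_like_refusal(answer):
    """Heuristic: does the answer admit it lacks the information?"""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)


def hallucination_probe(qa_fn, out_of_scope_queries):
    """Map each out-of-scope query to True if the system refused to answer."""
    return {q: looks_like_refusal(qa_fn(q)) for q in out_of_scope_queries}


# Usage sketch with the chain above (queries are examples of topics NOT indexed):
# probe = hallucination_probe(
#     lambda q: qa_chain({"query": q})["result"],
#     ["What is the maintenance schedule for the MRI scanner?"],
# )
```

Any `False` in the result flags a query where the model answered confidently without supporting documents, which is exactly the failure mode the manual testing described later is designed to catch.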
Performance reality check:

```
Answer Length Statistics:
- Average: 87.6 words
- Min: 45 words
- Max: 154 words

Response Time (with local Llama 3.2):
- Average: ~36 seconds per query
- Retrieval: ~0.5s
- LLM generation: ~35.5s (bottleneck!)
```
Production LLM swap:

```python
import os
from langchain.chat_models import ChatOpenAI

# Production LLM
llm = ChatOpenAI(
    model="gpt-4-turbo",
    temperature=0.1,
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

# Same chain, 30x faster
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    # ... rest identical
)

# Now: ~1-2 seconds per query (retrieval + generation)
```
Returning sources with every answer:

```python
def query_with_sources(question: str):
    response = qa_chain({"query": question})
    answer = response['result']
    sources = [
        {
            "text": doc.page_content[:200],
            "source": doc.metadata['source'],
            "doc_type": doc.metadata['doc_type']
        }
        for doc in response['source_documents']
    ]
    return {
        "answer": answer,
        "sources": sources,
        "num_sources": len(sources)
    }
```
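To surface those sources to users, a numbered citation footer works well. A small formatter over the dict shape `query_with_sources` returns (a sketch; the rendering format is my choice, not from the original project):

```python
def format_citations(result):
    """Render the answer followed by numbered, source-backed citations."""
    lines = [result["answer"], "", "Sources:"]
    for i, src in enumerate(result["sources"], start=1):
        lines.append(f"  [{i}] {src['source']} ({src['doc_type']})")
    return "\n".join(lines)


# Example with a stubbed result in the same shape:
demo = {
    "answer": "Attach the pads, stand clear, then press ANALYZE.",
    "sources": [
        {"text": "…", "source": "defibrillation_sop.md", "doc_type": "SOP"},
    ],
    "num_sources": 1,
}
print(format_citations(demo))
```

In a UI, each citation line can link back to the original document so staff can verify the answer against the official SOP.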
Layer 1 metrics tracking:

```python
import time
from collections import defaultdict


class RAGMetrics:
    def __init__(self):
        self.stats = defaultdict(list)

    def track_query(self, query, answer, retrieved_docs, response_time):
        """Track basic metrics for each query"""
        self.stats['answer_lengths'].append(len(answer.split()))
        self.stats['num_retrieved'].append(len(retrieved_docs))
        self.stats['response_times'].append(response_time)

    def get_summary(self):
        return {
            'avg_answer_length': sum(self.stats['answer_lengths']) / len(self.stats['answer_lengths']),
            'min_answer_length': min(self.stats['answer_lengths']),
            'max_answer_length': max(self.stats['answer_lengths']),
            'avg_response_time': sum(self.stats['response_times']) / len(self.stats['response_times'])
        }


metrics = RAGMetrics()

# Track each query
question = "What is the defibrillation procedure?"  # example query
start = time.time()
response = qa_chain({"query": question})
elapsed = time.time() - start

metrics.track_query(
    query=question,
    answer=response['result'],
    retrieved_docs=response['source_documents'],
    response_time=elapsed
)
```
```
================================================================================
Answer Length Statistics
================================================================================
- Average: 87.6 words
- Min: 45 words
- Max: 154 words

Retrieval Statistics:
- Average documents retrieved: 5.0
- Unique sources accessed: 3 documents

Response Time Metrics:
- Average: ~36.32 seconds (local Llama 3.2)
- Retrieval: ~0.5s
- LLM generation: ~35.8s
================================================================================
```
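Averages hide tail latency; for user experience, the p95 matters more than the mean. A small stdlib sketch for extending the summary above — the sample values are illustrative, not measurements from this system:

```python
import statistics

def latency_summary(times_s):
    """Summarize response times; p95 reflects worst-case UX better than the mean."""
    ordered = sorted(times_s)
    # Index of the value at or just below the 95th percentile
    p95_index = max(0, int(round(0.95 * len(ordered))) - 1)
    return {
        "mean": statistics.mean(ordered),
        "median": statistics.median(ordered),
        "p95": ordered[p95_index],
    }

# Illustrative values only (seconds per query, local Llama 3.2)
times = [34.1, 36.5, 35.9, 38.2, 36.0, 41.3, 35.5, 36.8]
summary = latency_summary(times)
print(f"mean={summary['mean']:.1f}s median={summary['median']:.1f}s p95={summary['p95']:.1f}s")
```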
```python
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall
)
from datasets import Dataset

# Prepare evaluation dataset
eval_data = {
    "question": [
        "What is the defibrillation procedure?",
        "What PPE is required in isolation rooms?",
        "How do I operate the X-ray machine?",
        "What is the infection control policy for visitors?",
        "How often should equipment be calibrated?"
    ],
    "answer": [],
    "contexts": [],
    "ground_truth": [
        "Turn on defibrillator, attach pads, stand clear, analyze, shock if advised",
        "Gown, gloves, mask, eye protection for contact with bodily fluids",
        "Power on, set exposure parameters, position patient, press exposure button",
        "Visitors must check in, receive PPE instructions, and limit to 2 per patient",
        "Critical equipment: monthly. Non-critical: quarterly. Annual external audit"
    ]
}

# Collect RAG outputs
for question in eval_data["question"]:
    response = query_with_sources(question)
    eval_data["answer"].append(response["answer"])
    eval_data["contexts"].append([src["text"] for src in response["sources"]])

# Convert to RAGAS format
dataset = Dataset.from_dict(eval_data)

# Run evaluation (requires OpenAI API)
results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall]
)
print(results)
```
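Once RAGAS scores exist, they can gate a CI pipeline so retrieval regressions block deployment. A minimal sketch — the threshold values are assumptions to tune per deployment, and `scores` stands in for the dict-like result RAGAS returns:

```python
# Minimum acceptable scores (assumed thresholds -- tune per deployment)
THRESHOLDS = {
    "faithfulness": 0.90,
    "answer_relevancy": 0.80,
    "context_precision": 0.75,
    "context_recall": 0.75,
}

def quality_gate(scores, thresholds=THRESHOLDS):
    """Return (passed, failures) so CI can fail the build with a clear message."""
    failures = {
        metric: (scores.get(metric, 0.0), minimum)
        for metric, minimum in thresholds.items()
        if scores.get(metric, 0.0) < minimum
    }
    return (len(failures) == 0, failures)

# Example with made-up scores
passed, failures = quality_gate({
    "faithfulness": 0.94,
    "answer_relevancy": 0.85,
    "context_precision": 0.70,  # below threshold -> blocks deployment
    "context_recall": 0.81,
})
print("PASS" if passed else f"FAIL: {failures}")
```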
```python
# Query about non-existent equipment
response = qa_chain({
    "query": "What is the procedure for operating the MRI machine?"
})
print(response['result'])
```
```
I don't have information about the MRI machine operation procedures in the provided documents. The available manuals cover X-ray equipment and defibrillators. Please consult the MRI-specific manual or contact the radiology department.
```
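This refusal behavior can be spot-checked automatically by scanning answers for honest-uncertainty phrasing. A rough heuristic sketch — the phrase list is an assumption and would need tuning against your model's actual refusal style:

```python
# Phrases that typically signal the model is refusing rather than guessing.
# This list is a guess; inspect real refusals from your own model to extend it.
REFUSAL_MARKERS = [
    "i don't have information",
    "not covered in the provided documents",
    "please consult",
    "no relevant information",
]

def is_refusal(answer: str) -> bool:
    """True if the answer looks like an honest 'not in my documents' refusal."""
    lowered = answer.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

# Questions about equipment NOT in the corpus should produce refusals,
# not fabricated procedures -- flag confident answers here for manual review.
answer = ("I don't have information about the MRI machine operation "
          "procedures in the provided documents.")
print(is_refusal(answer))  # True
```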
```python
# Custom separators that respect medical document structure
separators = [
    "\n## ",         # Major sections
    "\nWARNING:",    # Safety info (always keep together)
    "\nPROCEDURE:",  # Step-by-step (keep complete)
    "\n\n",          # Paragraphs
]
```
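These separators are passed to LangChain's `RecursiveCharacterTextSplitter` in priority order, but the core idea can be shown without the library. A simplified sketch of priority-ordered splitting (not LangChain's exact algorithm — real splitters also merge small pieces and add overlap):

```python
def split_by_priority(text, separators, max_len=500):
    """Split on the highest-priority separator first; recurse on long pieces."""
    if len(text) <= max_len or not separators:
        return [text]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        # This separator doesn't appear; fall through to the next priority
        return split_by_priority(text, rest, max_len)
    chunks = []
    for i, piece in enumerate(text.split(sep)):
        # Re-attach the separator so "WARNING:" headers stay with their body
        piece = (sep + piece) if i > 0 else piece
        chunks.extend(split_by_priority(piece, rest, max_len))
    return [c for c in chunks if c.strip()]

doc = ("\n## Setup\nConnect power." + " Check cables." * 40 +
       "\nWARNING:\nDo not operate near water.")
for chunk in split_by_priority(doc, ["\n## ", "\nWARNING:", "\n\n"], max_len=300):
    print(repr(chunk[:40]))
```

Because `"\nWARNING:"` outranks `"\n\n"`, a warning block is cut out as its own unit before paragraph splitting can fragment it.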
```
Query: "What is the defibrillation procedure?"
Retrieved: 5 chunks from policies, SOPs, AND training manuals
Result: Confusing, contradictory information
```
```
filter = {"doc_type": "SOP", "department": "Emergency"}
Retrieved: 5 chunks, all from the official defibrillation SOP
Result: Clear, actionable procedure
```
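The effect of that filter is easy to see in miniature: restrict the candidate set by metadata first, then rank by similarity. A toy sketch — the mini-corpus and term-overlap scoring are invented for illustration (ChromaDB accepts a similar `where` dict on its query API):

```python
# Invented mini-corpus: each chunk carries metadata assigned at ingestion time
chunks = [
    {"text": "Defibrillation SOP step 1: power on unit",
     "doc_type": "SOP", "department": "Emergency"},
    {"text": "Training quiz: defibrillation basics",
     "doc_type": "training", "department": "Education"},
    {"text": "Policy: defibrillators audited yearly",
     "doc_type": "policy", "department": "Compliance"},
    {"text": "Defibrillation SOP step 2: attach pads",
     "doc_type": "SOP", "department": "Emergency"},
]

def retrieve(query_terms, where, top_k=5):
    """Filter by metadata first, then rank by a toy term-overlap score."""
    candidates = [c for c in chunks
                  if all(c.get(k) == v for k, v in where.items())]
    scored = sorted(candidates,
                    key=lambda c: -sum(t in c["text"].lower() for t in query_terms))
    return scored[:top_k]

hits = retrieve(["defibrillation"], where={"doc_type": "SOP", "department": "Emergency"})
print([h["text"] for h in hits])  # only the official SOP chunks survive
```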
- Document processing: How to chunk medical documents for optimal retrieval
- Vector search: Setting up ChromaDB with metadata filtering
- Query pipeline: Building an intelligent retrieval system
- Evaluation: Using RAGAS metrics to ensure quality
- Production considerations: Privacy, cost, and deployment

- Extracts text from various formats (PDF, DOCX, Markdown)
- Implements smart chunking strategies for medical documents
- Preserves document metadata (type, department, equipment model)

- Stores document embeddings using OpenAI's text-embedding-3-small
- Enables sub-200ms semantic search
- Supports metadata filtering for precise retrieval

- Retrieves relevant chunks based on semantic similarity
- Uses LLM (GPT-4) to generate contextual answers
- Includes source citations for transparency

- RAGAS metrics for faithfulness and relevancy
- Hallucination detection to ensure factual accuracy
- Performance monitoring

- Zero infrastructure setup (runs locally)
- Perfect for <100K documents
- Easy migration to Pinecone/Weaviate for production scale

- Equipment manuals contain step-by-step procedures
- Policies have hierarchical sections
- SOPs include warnings and contraindications

- Hospital infection control policy (≈15 pages)
- X-ray equipment user manual (≈25 pages)
- Emergency defibrillation SOP (≈8 pages)

- Medical procedures stayed intact (no mid-step breaks)
- Warning sections remained complete
- Equipment specs weren't fragmented

- Cost: $0 during development and testing
- Privacy: Medical documents never leave the system
- Experimentation: Easy to iterate without API rate limits
- Offline capability: Works in air-gapped healthcare environments

- 10-20x faster embedding generation
- 1536 dimensions (vs 384) = better semantic understanding
- Sub-200ms retrieval latency
- ~$0.0001 per query (negligible cost)

- "What is the infection control policy?"
- "How do I operate the X-ray machine?"
- "What is the defibrillation procedure?"
- "What PPE is required in isolation rooms?"
- "How often should medical equipment be calibrated?"

- Llama 3.2 (3B parameters) runs on CPU → slow token generation
- Even with GPU, local models are 5-10x slower than OpenAI API
- Good for offline/privacy-critical deployments, terrible for UX

- Accountability: Staff can verify information
- Compliance: Audit trail for regulatory requirements
- Trust: Users see where answers come from

- Answers are generated text, not classifications
- Multiple valid answers exist for the same question
- We care about both retrieval quality AND generation quality

- Faithfulness: Does the answer stick to the retrieved documents? (no hallucinations)
- Answer Relevancy: Does the answer actually address the question?
- Context Precision: Are the top retrieved chunks actually relevant?
- Context Recall: Did we retrieve all relevant information?

- During development: After major chunking or retrieval changes
- In CI/CD: Automated tests blocking deployments if scores drop
- In production: Regular sampling (e.g., 100 queries/week) to monitor quality

- ✅ Layer 1 runs continuously, catching performance regressions
- ✅ Layer 2 (RAGAS) provides deep quality validation when needed

- Use BAA-compliant services:
  - OpenAI Enterprise (BAA available)
  - Azure OpenAI (HIPAA-compliant)
  - Self-hosted models (Ollama, local LLMs)
- Encrypt data at rest and in transit
- Implement access controls and authentication
- Maintain audit logs (6-year retention for HIPAA)
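The audit-log requirement can start as one structured record per query. A minimal sketch — the field names, file format, and hashing choice are assumptions, not a HIPAA-certified design:

```python
import hashlib
import json
import time

def audit_record(user_id, query, sources, log_path="audit.jsonl"):
    """Append one JSON line per query: who asked what, and which docs answered.

    Sketch only -- a real deployment needs tamper-evident storage and the
    6-year retention HIPAA requires.
    """
    record = {
        "timestamp": time.time(),
        # Hash the user ID so logs are linkable but not directly identifying
        "user_hash": hashlib.sha256(user_id.encode()).hexdigest()[:16],
        "query": query,
        "sources": sources,
    }
    with open(log_path, "a") as f:
        f.write(json.dumps(record) + "\n")
    return record

rec = audit_record("nurse42", "defibrillation procedure?", ["defib_sop.pdf"])
print(rec["user_hash"])
```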
- [ ] Security audit: No exposed credentials, encrypted data
- [ ] Backup strategy: Daily vector store snapshots
- [ ] Rollback plan: Can revert to previous version in <5 minutes
- [ ] Documentation: API docs, runbooks, incident response
- [ ] User training: Healthcare staff know how to use system
- [ ] Pilot program: Test with 10-20 users before full rollout
- [ ] Evaluation baseline: RAGAS scores recorded for comparison
- [ ] Compliance review: Legal/compliance team sign-off
- [ ] On-call rotation: 24/7 engineering support

- Caching common queries (saved $200/month)
- Using GPT-3.5 for simple queries (saved $180/month)
- Batch processing overnight reports (saved $120/month)

- Create test dataset FIRST (Day 1, not Day 20)
- Start with OpenAI embeddings (optimise later, not during POC)
- Design metadata schema upfront (before any document ingestion)
- Implement basic RAGAS from the start
- Build query expansion early (users won't write perfect queries)
- Add caching on Day 1 (saves money and improves speed immediately)

- Upgrade to OpenAI embeddings (speed improvement)
- Implement full RAGAS evaluation pipeline
- Add query expansion for better retrieval
- Build an admin dashboard for document management
- User testing with 10-20 healthcare staff

- Deploy to the staging environment with real hospital documents
- Implement access controls and audit logging
- Add reranking for improved retrieval precision
- Build feedback loop (thumbs up/down on answers)
- Scale to 100+ users in pilot program

- Multi-hospital deployment
- Real-time document ingestion pipeline
- Advanced features (summarisation, comparison, alerts)
- Integration with hospital EHR systems
- Full HIPAA compliance for patient data handling

- Understanding your domain deeply (healthcare document structure)
- Designing for your users (nurses don't query like engineers)
- Building trust through transparency (source citations, confidence scores)
- Planning for scale, cost, and compliance from day 1

- Start simple (local models, basic chunking)
- Test continuously (don't wait until the end)
- Optimise strategically (fix retrieval before tuning prompts)
- Document everything (your future self will thank you)
- Plan for production early (compliance, cost, scale)