Tools: RAG Pipeline Deep Dive: Ingestion, Chunking, Embedding, and Vector Search
2026-01-21
admin
Retrieval-Augmented Generation (RAG) has become the default approach for providing context and memory to AI applications. Understanding the RAG pipeline—from document ingestion to final response generation—is essential for building high-performance AI systems. This comprehensive guide explores each stage of the RAG workflow, with special focus on Oracle Database 23ai's native vector capabilities.

## The RAG Pipeline: An Overview

A complete RAG pipeline consists of two main phases:

Ingestion Phase (Offline):
Documents → Chunking → Embedding → Index/Store in Vector Database

Retrieval Phase (Runtime):
User Query → Embed Query → Similarity Search → Top-K Results → LLM Generation → Response

The RAG pipeline connects the model to the information it needs at query time. When a user asks a question, the pipeline retrieves the most relevant documents, prepares that text as context, and includes it in the prompt so the model can generate an answer grounded in retrieved material rather than training data alone.

## Phase 1: Document Ingestion

Ingestion is the process of loading documents from multiple sources and in multiple formats, transforming them into a structured form suitable for embedding and retrieval. Raw documents—whether PDFs, Word documents, web pages, or database records—must be converted into manageable chunks suitable for retrieval.

## Text Extraction and Cleaning

A clean corpus improves both retrieval accuracy and embedding quality. Use parsing libraries to extract only the primary content from web pages or PDFs, and remove boilerplate, navigation links, and disclaimers.

## Metadata Enrichment

Each chunk should carry metadata such as topic, date, and source, so that retrieval can filter and rank results more effectively. The most important metadata fields are listed at the end of this post.

## Phase 2: Chunking - The Critical Step

Chunking is simply the act of splitting larger documents into smaller units ("chunks"). Each chunk can be individually indexed, embedded, and retrieved. The chunking process feeds directly into the embedding step, where each chunk is converted into a vector representation that captures its semantic meaning.

## Why Chunking Matters

A common source of inaccurate or unstable answers lies in a structural conflict within the traditional "chunk-embed-retrieve" pipeline: a single-granularity, fixed-size text chunk is asked to perform two inherently conflicting tasks, precise semantic matching (recall) and coherent context understanding (utilization). This creates a difficult trade-off between "precise but fragmented" and "complete but vague."

## Pre-Chunking vs Post-Chunking

Pre-chunking is the most common method. It processes documents asynchronously by breaking them into smaller pieces before embedding and storing them in the vector database. This enables fast retrieval at query time since all chunks are pre-computed. Post-chunking takes a different approach: it embeds entire documents first, then performs chunking at query time only on the documents that are actually retrieved. The chunked results can be cached, so the system becomes faster over time.

## Chunking Strategies

Fixed-Length Chunking: This is the simplest approach, segmenting text into equally sized pieces using character, token, or word counts. In practice, adopt recursive or semantic chunkers with 10-20% overlap to maintain context across boundaries.

Advanced chunking approaches:

- Recursive Chunking: A smarter method that splits text hierarchically based on cues like paragraphs, line breaks, or sentences; if a section is too long, it is recursively divided further.
- Semantic Chunking: Instead of letting token limits decide where ideas get chopped, it uses embeddings to detect where the meaning actually shifts, grouping semantically related sentences together.
- Document-Aware Chunking: Preserves document structure (sections, paragraphs, tables) while chunking.

## Chunk Overlap

Chunk overlap is the number of overlapping characters (or tokens) between adjacent chunks. It ensures that context isn't lost when information spans chunk boundaries (see the token-overlap example in the code section below).

## Splitting Process

Documents are split recursively until the chunks are small enough, using text splitters that respect sentence, paragraph, and section boundaries. For this, the LLM community often turns to two powerful open-source libraries, LangChain and LlamaIndex; a comparison of popular chunking libraries appears at the end of this post. Chunking makes information retrievable; embeddings make it searchable by meaning rather than keywords.
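As a concrete illustration of recursive chunking with overlap, here is a minimal sketch using LangChain's RecursiveCharacterTextSplitter; the chunk size, overlap, and input file are illustrative assumptions rather than values prescribed by this article.

```python
# Recursive chunking with roughly 10% overlap using LangChain's text splitter
# (package: langchain-text-splitters). Sizes here are characters, not tokens.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,       # target chunk size
    chunk_overlap=50,     # ~10% overlap to preserve context across boundaries
    separators=["\n\n", "\n", ". ", " ", ""],  # paragraphs, lines, sentences, words
)

with open("my_document.txt", encoding="utf-8") as f:  # hypothetical cleaned document
    document_text = f.read()

chunks = splitter.split_text(document_text)
print(f"Produced {len(chunks)} chunks; first chunk starts with: {chunks[0][:120]!r}")
```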
## Phase 3: Embedding and Storage

## What Are Embeddings?

Embeddings are numerical representations that capture semantic relationships between text. A piece of text—whether a word, phrase, sentence, or paragraph—is converted to a vector of numbers (typically 384-1536 dimensions).

## Embedding Models

Choose MiniLM, SBERT, Instruct, or E5 models for text, and CLIP for images. Normalize embeddings for consistent similarity calculations.

## Embedding Generation: Outside vs Inside the Database

## Option 1: Generate Embeddings Outside Oracle Database 23ai

Vector embeddings can be generated outside Oracle Database 23ai using third-party embedding models, then imported into the database for storage and search.

## Option 2: Generate Embeddings Inside Oracle Database 23ai

Embeddings can also be generated inside Oracle Database 23ai by downloading a pretrained embedding model, converting it to ONNX format, and importing the ONNX model into the database. ONNX (Open Neural Network Exchange) is an open-source format designed to represent deep learning models and to provide interoperability between different deep learning frameworks. The ONNX file is imported into the database with the DBMS_VECTOR package (see the code examples later in this post).

## Phase 4: Vector Indexes for Fast Search

## Why Vector Indexes?

Once embeddings are generated and stored, vector indexes come into play. They are specialized data structures designed for similarity searches in high-dimensional vector spaces, and they accelerate those searches by reducing the number of distance calculations needed. An exact search for the closest matches to a given vector is accurate but can be slow, since costly vector distance computations are needed for every vector in a column. Vector indexes use techniques like clustering, partitioning, and neighbor graphs to group similar vectors together, reducing the search space dramatically.

## Oracle AI Vector Search: Supported Index Types

Oracle AI Vector Search supports Hierarchical Navigable Small World (HNSW) indexes and Inverted File Flat (IVF) indexes.

## 1. HNSW: In-Memory Neighbor Graph

Hierarchical Navigable Small World (HNSW) creates a navigable graph structure to locate similar vectors with exceptional speed and accuracy. HNSW is an in-memory neighbor graph vector index built fully in memory: you need to set up a memory pool with VECTOR_MEMORY_SIZE in the SGA to accommodate it. Because an HNSW index consumes a lot of memory, Oracle Database 23ai allocates a dedicated memory area for it called the Vector Pool. To roughly determine the memory needed to store an HNSW index, use the formula 1.3 × number of vectors × number of dimensions × size of the vector dimension type (a FLOAT32 is 4 bytes). For example, one million 384-dimensional FLOAT32 vectors need roughly 1.3 × 1,000,000 × 384 × 4 bytes ≈ 2 GB. HNSW enables sub-100ms retrieval at 95%+ recall—extremely fast for real-time applications. DML operations (insert, update, delete) are not allowed after creating an HNSW index, so it is designed for read-heavy workloads, and HNSW indexes are not available in RAC environments.

## 2. IVF: Neighbor Partition Index

Inverted File Flat (IVF) indexes partition data points into clusters, enabling efficient searching within the relevant clusters. An IVF Flat index is a neighbor partition vector index built on disk, and its blocks are cached in the regular buffer cache. A significant advantage of an IVF vector index is that it is not constrained by the amount of memory available in the vector pool the way an HNSW vector index is. It enhances search efficiency by narrowing the search area through neighbor partitions, or clusters. The center of each partition, the centroid, represents the average vector for that partition. By default, the number of partitions (and corresponding centroids) is set to the square root of the dataset's size, so a dataset of one million vectors defaults to roughly 1,000 partitions. Although an IVF index won't be as fast as an equivalent HNSW index, it can be used for very large data sets and still provide excellent performance compared to an exhaustive similarity search.
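Before moving on to hybrid indexes and retrieval, here is a small sketch tying together two earlier points: embeddings as vectors, and normalizing them for consistent similarity calculations. It assumes the sentence-transformers and NumPy packages; the model name and sentences are purely illustrative.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Illustrative model; any text embedding model could be substituted.
model = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dimensional embeddings

sentences = [
    "Oracle Database 23ai supports native vector search",
    "How do I index embeddings in Oracle 23ai?",
]
# normalize_embeddings=True returns unit-length (L2-normalized) vectors.
emb = model.encode(sentences, normalize_embeddings=True)

# For unit vectors, the dot product equals the cosine similarity.
cosine_similarity = float(np.dot(emb[0], emb[1]))
print(f"cosine similarity: {cosine_similarity:.3f}")
```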
## 3. Hybrid Vector Indexes

A hybrid vector index combines full-text and semantic search in one index, allowing users to easily index and query their documents using a combination of full-text search and semantic vector search.

Use case:
When you need to match both semantic meaning and specific keywords or terms in a single query.

## GPU Acceleration

At Oracle CloudWorld 2024, Oracle demonstrated GPU-accelerated creation of vector embeddings and HNSW indexes using NVIDIA GPUs. When a user sends a request to create an HNSW vector index, the database intelligently compiles and redirects the task to a Compute Offload Server process that leverages GPUs and the CAGRA algorithm from the NVIDIA cuVS library to rapidly generate a graph index.

## Phase 5: Retrieval - Finding Relevant Context

At runtime, when a user submits a query, the retrieval phase locates the most relevant chunks to provide as context to the LLM.

## Query Embedding

The system encodes the user's query with the same model used to embed documents during ingestion.

## Similarity Search

The query vector is compared against the document vectors in the database using a distance metric. Common options include EUCLIDEAN, L2_SQUARED, COSINE, DOT, MANHATTAN, and HAMMING. You can specify the metric during index creation or within the search query itself. To perform the similarity search, the system searches the index for the nearest vectors, retrieves the top-k chunks, and includes those chunks with the query in the prompt (the SQL examples appear later in this post).

## Top-K Results

Two common failure modes to watch for: plain top-k retrieval can return near-duplicates or shallow snippets, so the prompt lacks diversity, and purely semantic retrieval can miss proper nouns, IDs, or acronyms because they occur rarely. Reranking, hybrid keyword-plus-vector search, and metadata filtering, listed at the end of this post, help mitigate both issues.

## Monitoring Index Accuracy

A valuable feature is Oracle's provision of functions to report index accuracy based on the query vectors used in approximate searches. This allows you to continuously monitor and optimize the accuracy of your vector search implementation.

## Phase 6: Generation - Creating the Final Response

Once relevant chunks are retrieved, they're combined with the user's query and sent to an LLM for response generation.

## Prompt Assembly

Retrieved chunks are formatted into a coherent context block (see the prompt template in the code examples below).

## LLM Generation

The complete prompt (context + query) is sent to the LLM, which generates a response grounded in the retrieved information rather than relying solely on parametric knowledge.

## Best Practices for Production RAG Pipelines

Align chunk size with context windows, use task-specific sentence transformers, and default to HNSW plus metadata filtering for sub-100ms retrieval at 95%+ recall. Monitor recall@k, latency, and index memory, and re-embed when you upgrade the embedding model.

## Advanced RAG Techniques

## 1. Self-Correction (CRAG)

Even strong retrieval pipelines sometimes return incomplete or irrelevant context. CRAG adds a lightweight feedback loop: before generating, the system checks whether the retrieved set is good enough. If it looks weak, the system triggers another retrieval pass.

## 2. Agentic RAG

Agentic RAG systems plan retrieval strategies and iterate based on partial results, dynamically deciding what to retrieve next.

## 3. GraphRAG

Build an advanced RAG system by first structuring data into a knowledge graph, then layering graph-aware retrieval on top to capture entity relationships.

## 4. Multimodal RAG

Multimodal RAG combines text, image, audio, and video retrieval for richer context.

A robust RAG pipeline requires careful attention to every stage:

- Ingestion: Clean, structured data with rich metadata
- Chunking: Balanced chunk sizes with appropriate overlap (10-20%)
- Embedding: Semantic representations using appropriate models
- Indexing: Fast similarity search with HNSW or IVF indexes
- Retrieval: Accurate top-K results with metadata filtering
- Generation: Grounded responses with proper citations

Oracle Database 23ai provides a complete, integrated platform for RAG with native vector support, an ONNX model runtime, advanced indexing (HNSW, IVF, and hybrid), and SQL-native operations—eliminating the need for a separate vector database. By mastering these components and following best practices, you can build production-grade RAG systems that deliver accurate, relevant, and trustworthy AI-powered experiences.

The code snippets referenced throughout this post are collected below.
Chunk overlap between adjacent 500-token chunks:

```text
Chunk 1: [Tokens 0-500]
Chunk 2: [Tokens 450-950]   ← 50 token overlap
Chunk 3: [Tokens 900-1400]  ← 50 token overlap
```
Generating embeddings outside the database with sentence-transformers (Option 1):

```python
from sentence_transformers import SentenceTransformer

# Load the embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Generate embeddings for each chunk
chunks = ["First chunk text", "Second chunk text"]
embeddings = model.encode(chunks)

# Store in Oracle Database 23ai, for example:
# INSERT INTO documents (text, embedding) VALUES (?, ?)
```
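Continuing the snippet above, the commented-out INSERT could be performed with the python-oracledb driver. This is a hedged sketch: it assumes python-oracledb 2.x (which can bind array.array('f', ...) values to a VECTOR column), and the connection details and VECTOR dimension are illustrative.

```python
import array

import oracledb
from sentence_transformers import SentenceTransformer

chunks = ["First chunk text", "Second chunk text"]
embeddings = SentenceTransformer("all-MiniLM-L6-v2").encode(chunks)

# Connection details are placeholders.
conn = oracledb.connect(user="rag_user", password="change_me", dsn="localhost/FREEPDB1")
with conn.cursor() as cur:
    rows = [(text, array.array("f", vec)) for text, vec in zip(chunks, embeddings)]
    # Assumes a table such as: CREATE TABLE documents (text CLOB, embedding VECTOR(384, FLOAT32))
    cur.executemany("INSERT INTO documents (text, embedding) VALUES (:1, :2)", rows)
conn.commit()
```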
Importing the ONNX model into the database with the DBMS_VECTOR package:

```sql
EXEC dbms_vector.load_onnx_model('DM_DUMP', 'all_MiniLM_L12_v2.onnx', 'minilm_l12_v2', JSON('{"function": "embedding", "embeddingOutput": "embedding", "input": {"input": ["DATA"]}}'));
```
Generating an embedding inside the database with the imported model:

```sql
SELECT VECTOR_EMBEDDING(minilm_l12_v2 USING 'Your text here' AS data) AS embedding
FROM DUAL;
```
Configuring the vector pool and creating an HNSW index:

```sql
-- Configure the vector pool (instance restart required)
ALTER SYSTEM SET vector_memory_size=800M SCOPE=SPFILE;

-- Create the HNSW index
CREATE VECTOR INDEX documents_hnsw_idx ON documents (embedding)
ORGANIZATION NEIGHBOR GRAPH
DISTANCE COSINE
WITH TARGET ACCURACY 95;
```
Creating an IVF index:

```sql
CREATE VECTOR INDEX documents_ivf_idx ON documents (embedding)
ORGANIZATION NEIGHBOR PARTITIONS
DISTANCE COSINE
WITH TARGET ACCURACY 95;
```
Embedding the user query at runtime:

```sql
-- Generate the query embedding
SELECT VECTOR_EMBEDDING(minilm_l12_v2 USING 'What is RAG?' AS data) AS query_vector
FROM DUAL;
```
Running an approximate top-5 similarity search with the query vector:

```sql
SELECT text, VECTOR_DISTANCE(embedding, :query_vector, COSINE) AS distance
FROM documents
ORDER BY distance
FETCH APPROX FIRST 5 ROWS ONLY;
```
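For completeness, here is a hedged Python sketch of the same top-k query run through python-oracledb, adding the kind of metadata pre-filter described in the retrieval section. The document_type column, connection details, and model name are illustrative assumptions, not part of the article's schema.

```python
import array

import oracledb
from sentence_transformers import SentenceTransformer

oracledb.defaults.fetch_lobs = False  # return CLOB columns as Python strings

# Embed the user query with the same model used at ingestion time.
model = SentenceTransformer("all-MiniLM-L6-v2")
query_vec = array.array("f", model.encode("What is RAG?"))

conn = oracledb.connect(user="rag_user", password="change_me", dsn="localhost/FREEPDB1")
sql = """
    SELECT text,
           VECTOR_DISTANCE(embedding, :qv, COSINE) AS distance
      FROM documents
     WHERE document_type = :dt          -- hypothetical metadata column used as a pre-filter
     ORDER BY distance
     FETCH APPROX FIRST 5 ROWS ONLY
"""
with conn.cursor() as cur:
    cur.execute(sql, qv=query_vec, dt="webpage")
    top_chunks = [text for text, distance in cur]
```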
Prompt template combining retrieved context with the user query:

```text
Context:
[Chunk 1 text]
[Chunk 2 text]
[Chunk 3 text]

Question: [User query]

Answer based on the provided context:
```
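To show how the template above might be used in code, here is a minimal, illustrative sketch of prompt assembly plus an LLM call. The OpenAI client and model name are assumptions; any chat-capable LLM could be substituted.

```python
from openai import OpenAI

def build_prompt(chunks: list[str], question: str) -> str:
    """Assemble the retrieved chunks and the question into a single grounded prompt."""
    context = "\n".join(f"[Chunk {i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer based on the provided context:"
    )

client = OpenAI()  # reads OPENAI_API_KEY from the environment
retrieved_chunks = [
    "RAG retrieves relevant documents and supplies them to the model as context.",
    "Oracle Database 23ai stores embeddings natively in VECTOR columns.",
]
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumed model name
    messages=[{"role": "user", "content": build_prompt(retrieved_chunks, "What is RAG?")}],
)
print(response.choices[0].message.content)
```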
The reference lists from the sections above follow.

Common document sources:

- Local files (PDF, DOCX, TXT, CSV)
- Web pages and APIs
- Databases and data warehouses
- Cloud storage (S3, GCS, Azure Blob)
- Enterprise systems (SharePoint, Confluence, Notion)

Document loading libraries:

- PyPDF2: For PDF documents
- python-docx: For Word documents
- LangChain Document Loaders: Unified interface for multiple formats
- LlamaIndex Data Connectors: Specialized loaders for various sources
- Oracle OracleDocLoader: Native loader for Oracle Database 23ai

Text extraction and cleaning steps:

- Remove Noise: Headers, footers, navigation, ads
- Normalize Structure: Convert to a consistent format (JSONL, Parquet)
- Preserve Metadata: Source, timestamp, author, document type
- Handle Special Characters: Normalize encoding (UTF-8)
- Maintain Logical Structure: Keep headings, lists, tables

Important metadata fields:

- source_id: Unique document identifier
- title: Document title
- created_date: Document creation timestamp
- author: Document creator
- document_type: PDF, email, webpage
- topic/category: For domain filtering
- page_number: For PDF sources
- chunk_id: Unique identifier for the chunk

The two conflicting tasks a single fixed-size chunk must serve:

- Semantic matching (recall): Smaller chunks (100-256 tokens) needed for precise similarity search
- Context understanding (utilization): Larger chunks (1024+ tokens) needed for coherent LLM generation

Typical chunk sizes:

- 300-500 tokens: Typical chunks for most use cases, depending on the model and content type
- 100-256 tokens: For high-precision recall in semantic search
- 1024+ tokens: For providing complete context to LLMs

Chunk overlap guidance:

- 10-20% overlap is standard practice
- For 500-token chunks, use 50-100 token overlap
- Maintains continuity without excessive redundancy

Boundaries that text splitters should respect:

- Sentence boundaries (don't break mid-sentence)
- Paragraph boundaries (maintain logical grouping)
- Section boundaries (preserve document structure)

Popular chunking libraries:

- LangChain: Broad framework with flexible chunking strategies
- LlamaIndex: Designed specifically for RAG pipelines with sophisticated NodeParsers
- Chonkie: Lightweight, dedicated chunking library focusing solely on text splitting

Key properties of embeddings:

- Semantic similarity: Embeddings that are numerically similar are also semantically similar
- Dense representation: High-dimensional vectors capture nuanced meaning
- Model-specific: Different embedding models produce different vector spaces

Common embedding models:

- OpenAI text-embedding-3-large: 3072 dimensions, high quality
- Cohere embed-v4.0: 1024 dimensions, multimodal (text + images)
- Sentence-BERT (SBERT): 768 dimensions, open-source
- all-MiniLM-L6-v2: 384 dimensions, lightweight and fast

Advantages of generating embeddings outside the database (Option 1):

- Flexibility in choosing any embedding model
- Can use cloud-based embedding services
- Easy integration with existing ML pipelines

Steps for generating embeddings inside the database (Option 2):

- Download a pretrained model from Hugging Face or a similar source
- Convert it to ONNX format using the Oracle OML4Py utility
- Import it into Database 23ai using the DBMS_VECTOR package
- Generate embeddings in SQL: Oracle Database implements an ONNX runtime directly within the database, which allows you to generate vector embeddings using SQL

Advantages of generating embeddings inside the database:

- Eliminates data movement between systems
- Simplified architecture and security
- Native integration with Oracle's vector capabilities

IVF index characteristics:

- Works with massive datasets
- Supports global and local indexes on partitioned tables
- DML operations can degrade accuracy over time; periodically rebuilding the index may be necessary
- More memory-efficient than HNSW

Common distance metrics:

- Cosine Similarity: Most common for text (measures the angle between vectors)
- Euclidean Distance: Straight-line distance in vector space
- Dot Product: Efficient for normalized vectors

Choosing top-K:

- K=3-5: For focused, specific queries
- K=10-20: For broader research queries
- K=50+: For comprehensive document analysis

Techniques to improve retrieval quality:

- Reranking: Rerank with cross-encoders (Cohere ReRank, ColBERT) when precision matters
- Hybrid Search: Combine BM25 (keyword) with dense vectors to boost tail recall
- Metadata Filtering: Filter by date, source, or document type before retrieval

Benefits of grounding generation in retrieved context:

- Reduces hallucinations
- Provides traceable sources
- Enables up-to-date information
- Maintains factual accuracy

Choosing embedding models:

- Match embedding dimensions to your dataset size
- Consider domain-specific models for specialized content
- Test multiple models and measure retrieval quality

Index selection guidance:

- Small-medium datasets (< 1M vectors): HNSW for maximum speed
- Large datasets (1M+ vectors): IVF for scalability
- Keyword + semantic: Hybrid vector indexes

Metrics to monitor:

- Retrieval accuracy (precision@k, recall@k)
- Query latency (p50, p95, p99)
- Index build time
- Storage requirements

Edge cases to handle:

- Query reformulation for better retrieval
- Fallback strategies when retrieval fails
- Confidence thresholds for generated responses
- Citation and source tracking