Tools: Building an Agentic RAG Support Assistant with Elastic & Jina

Source: Dev.to

We built an agentic RAG support assistant using Elasticsearch, Jina, and Ollama. It understands natural language questions, retrieves the right docs, reranks them, and returns answers with sources. Here's how it works and how to run it.

Priya is a support engineer drowning in tickets. Every morning she opens her dashboard to forty-plus questions from customers. Half of them have answers buried somewhere in the company's knowledge base: thousands of pages of docs, FAQs, and troubleshooting guides. The problem is the search box. It uses keyword matching. When a customer writes "My dashboard shows yesterday's data," the search returns articles about "data export" and "dashboard customization." Technically it found the words "dashboard" and "data." But it completely missed the intent. Priya ends up manually hunting through docs, wasting twenty minutes per ticket on information that should be instant.
Multiply that across a fifteen-person support team and you're burning fifty hours a week on search that doesn't work. At typical support salaries, that's tens of thousands of dollars per year in lost productivity—not counting the frustrated customers who churn because they waited too long. The knowledge is there. The search just can't surface it.
We built something better: an agentic RAG system that understands intent, retrieves the right docs, and generates answers with sources. No OpenAI—just Elasticsearch, Jina for embeddings and reranking, and Ollama for the LLM. The whole thing runs end-to-end in under two seconds.
The key difference? Keyword search finds words. Vector search finds meaning. When someone asks why their dashboard is stale, we want articles about cache refresh intervals and data pipeline latency, even if those exact phrases never appear in the query. That's what dense embeddings give you. The search finally understands what people are actually asking.

The pipeline is straightforward. A user asks a question in plain English. We embed that question with Jina's API (the same model we used at ingest time) and run a KNN search in Elasticsearch. That gives us the top twenty most similar chunks from the knowledge base. Then Jina's reranker rescores them. We keep the top five and pass them to Ollama, which generates a concise answer with source citations. If Ollama isn't running, we simply return the top passages with their sources. Either way, the user gets the right info fast.

Query → Jina Embed → Elasticsearch (KNN) → Jina Rerank → Ollama → Answer

Reranking matters. Vector search is fast but not always precise. It casts a wide net of twenty candidates, and Jina narrows it down. The reranker uses a cross-encoder to score each passage against the query. The top five are usually exactly what you need. Without that step, the wrong article often sneaks in and the LLM ends up parroting irrelevant content.

Knowledge base chunks get embedded at ingest time with the same Jina model: 768 dimensions, cosine similarity in Elasticsearch. The index is ready for semantic search out of the box.
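The index settings described above (768-dimensional vectors compared by cosine similarity) could be expressed as a mapping along the following lines. This is a minimal sketch, not the repo's actual ingest code: only the `content_embedding` field name comes from the search call in this post, while the `title`, `content`, and `source` fields and the `build_index_body` helper are assumptions for illustration.

```python
def build_index_body(dims: int = 768) -> dict:
    """Return an index body with a cosine-similarity dense_vector field."""
    return {
        "mappings": {
            "properties": {
                # Assumed text/metadata fields; the real schema may differ.
                "title": {"type": "text"},
                "content": {"type": "text"},
                "source": {"type": "keyword"},
                # Field queried by the KNN search in this post.
                "content_embedding": {
                    "type": "dense_vector",
                    "dims": dims,          # must match the embedding model's output size
                    "index": True,         # make the field searchable with KNN
                    "similarity": "cosine",
                },
            }
        }
    }


body = build_index_body()
print(body["mappings"]["properties"]["content_embedding"])
```

With the official Python client, a body like this would typically be passed once, before ingest, as `es.indices.create(index="support-kb", mappings=body["mappings"])`. The important constraint is that `dims` matches the vector length the embedding model returns at both ingest and query time.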
The Elasticsearch KNN search call is shown in the code listings at the end of this post.

Elastic Cloud has a fourteen-day free trial: create a deployment, pick the Elasticsearch solution view, and you're set. Jina's free tier covers plenty of embedding and reranking calls for a demo. Ollama is free and runs entirely offline. Python 3.10 or newer, plus a handful of pip packages, rounds it out. No credit card required for Jina or Ollama; Elastic will ask for one but won't charge during the trial.

GitHub (full source): https://github.com/d2Anubis/Agentic-RAG-Support-Assistant

Clone the repo and add your keys to .env: ELASTIC_CLOUD_ID, ELASTIC_API_KEY, and JINA_API_KEY. One gotcha: use the Cloud ID from Elastic (the short string that looks like deployment-name:base64stuff), not the full deployment URL. You'll find it in your deployment's connection details. Then run the ingest script once to index the sample knowledge base, and you're ready to ask questions.
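Because pasting the deployment URL instead of the Cloud ID is an easy mistake, a small pre-flight check can catch it before the first request. The helper below is a hypothetical sketch (it is not part of the repo) and only a heuristic: it assumes a Cloud ID has the shape `name:base64` and that the base64 part decodes to a `$`-separated string beginning with a host name.

```python
import base64


def looks_like_cloud_id(value: str) -> bool:
    """Heuristic: True if `value` is shaped like an Elastic Cloud ID, not a URL."""
    if value.startswith(("http://", "https://")):
        return False  # a deployment URL was pasted instead of the Cloud ID
    # Split on the last colon; the base64 alphabet itself contains no colons.
    name, sep, encoded = value.rpartition(":")
    if not sep or not name or not encoded:
        return False
    try:
        decoded = base64.b64decode(encoded, validate=True).decode("utf-8")
    except ValueError:  # covers invalid base64 and invalid UTF-8
        return False
    # Assumed payload layout: "<host>$<id>$<id>", with a dotted host name first.
    return "$" in decoded and "." in decoded.split("$", 1)[0]


# Build a fake Cloud ID purely for demonstration.
fake_payload = base64.b64encode(b"us-east-1.aws.example.io$abc123$def456").decode()
print(looks_like_cloud_id("my-deployment:" + fake_payload))
print(looks_like_cloud_id("https://my-deployment.es.example.io:443"))
```

A check like this could run right after the .env file is loaded, so a misconfigured ELASTIC_CLOUD_ID fails with a clear message rather than a connection error.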
Add your credentials to .env, install the dependencies, and run the CLI; the .env template and commands are listed at the end of this post.

If Ollama isn't running, you still get the top reranked passages with sources. No LLM needed for that. It's useful on its own: support engineers can skim the passages and craft their own reply. With Ollama, you get a synthesized answer in one shot. Both paths work.
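The two paths can be sketched as a single function that tries the LLM and falls back to the formatted passages when the call fails. This is an illustration under assumptions, not the repo's code: the function names, the passage keys (`title`, `source`, `text`), the prompt wording, and the `llama3` model name are all hypothetical. The HTTP call targets Ollama's default local endpoint, `/api/generate` on port 11434.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def call_ollama(prompt: str, model: str = "llama3", timeout: float = 30.0) -> str:
    """Send one non-streaming generation request to a local Ollama server."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()
    request = urllib.request.Request(
        OLLAMA_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["response"]


def format_passages(passages: list) -> str:
    """Number each passage so an answer can cite it as [1], [2], ..."""
    return "\n\n".join(
        f"[{i}] {p['title']} ({p['source']})\n{p['text']}"
        for i, p in enumerate(passages, start=1)
    )


def answer_with_fallback(query: str, passages: list, llm=call_ollama) -> str:
    """Return an LLM answer, or the numbered passages if the LLM is unreachable."""
    context = format_passages(passages)
    prompt = (
        "Answer using only the passages below and cite them as [n].\n\n"
        f"{context}\n\nQuestion: {query}\nAnswer:"
    )
    try:
        return llm(prompt)
    except OSError:  # connection refused, timeout, and URLError are all OSError
        return context
```

Passing the LLM call in as a parameter keeps the fallback easy to exercise without a running server: a stub that raises `ConnectionRefusedError` triggers the passages-only path.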
The core RAG pipeline (search, rerank, generate) fits in a few lines; the ask function is listed at the end of this post. search_and_rerank embeds the query with Jina, runs KNN in Elasticsearch, then reranks the hits. generate_answer builds a prompt from the top passages and calls Ollama, or returns the passages if Ollama isn't running. Full implementation and sample KB are in the repo.

Jina handles both embeddings and reranking with a single API key. Ollama runs the LLM locally: free, private, and fast enough for this use case. Elasticsearch gives you a proper vector store that scales. Everything has a free tier or trial, so you can run the whole pipeline without spending a dime.
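To make the retrieve-then-rerank flow concrete, here is a minimal sketch of how a function like search_and_rerank might be structured. It is not the repo's implementation: the three callables (`embed`, `knn_search`, `rerank`) are hypothetical stand-ins for the Jina embedding call, the Elasticsearch KNN query, and the Jina reranker, and the `text` key on each hit is an assumption.

```python
def search_and_rerank(query, embed, knn_search, rerank, k=20, top_n=5):
    """Fetch k candidates by vector similarity, then keep the top_n after reranking.

    embed(query) -> list of floats
    knn_search(vector, k) -> list of hit dicts, each with a "text" key
    rerank(query, texts) -> list of relevance scores, one per text, same order
    """
    query_vector = embed(query)
    hits = knn_search(query_vector, k)
    if not hits:
        return []
    scores = rerank(query, [hit["text"] for hit in hits])
    # Sort by reranker score only (highest first); ties keep retrieval order.
    ranked = sorted(zip(scores, hits), key=lambda pair: pair[0], reverse=True)
    return [hit for _, hit in ranked[:top_n]]


# Toy stand-ins so the flow can be run without any external service.
docs = [{"text": "billing cycles"}, {"text": "cache refresh interval"}, {"text": "login help"}]
top = search_and_rerank(
    "stale dashboard",
    embed=lambda q: [float(len(q))],
    knn_search=lambda vec, k: docs[:k],
    rerank=lambda q, texts: [0.1, 0.9, 0.3],
    top_n=2,
)
print([hit["text"] for hit in top])
```

The wide first stage (k=20) favors recall, and the narrower second stage (top_n=5) favors precision, which mirrors the twenty-to-five narrowing described earlier.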
The repo includes a sample knowledge base: seven chunks covering dashboard issues, pipeline latency, cache, exports, billing, and login. You can run it immediately. Swap in your own docs by changing the data path and re-running ingest. The pipeline stays the same: chunk your content, embed it, index it, then query away. The architecture generalizes to any knowledge base.

Ask "Why is my dashboard showing stale data?" and you'll see the top passages: Dashboard data refresh, Data pipeline latency, Cache and real-time sync. With Ollama on, the agent synthesizes them into a short answer: check the refresh interval, enable live sync, verify the pipeline. Without Ollama, you get the raw passages and sources. Either way, Priya gets the right info in seconds instead of twenty minutes. The agent cites sources in brackets so you can trace every claim back to the docs. No more guessing.

Priya's search box used to miss the point. Now it understands "dashboard showing stale data" and pulls the right articles: cache refresh, pipeline latency, live sync. The agent turns that into a clear answer with sources, with a sub-two-second response to the same question that used to take twenty minutes.

You can plug in your own knowledge base, add hybrid search with BM25, or wrap the agent in a simple UI. The repo has everything you need: config, ingest script, RAG pipeline, CLI. From there it's straightforward to point it at Confluence, Notion, or your internal docs. The hard part, semantic search and reranking, is done. The rest is plumbing.
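For the "chunk your content" step when swapping in your own docs, one common approach is overlapping word windows. The sketch below is an assumption about how chunking could be done, not how the repo's ingest script does it; the `chunk_text` name and the default sizes are hypothetical.

```python
def chunk_text(text: str, max_words: int = 120, overlap: int = 20) -> list:
    """Split text into word windows of max_words that share `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last window already reaches the end of the text
    return chunks


sample = " ".join(f"w{i}" for i in range(10))
for chunk in chunk_text(sample, max_words=4, overlap=1):
    print(chunk)
```

The overlap keeps a sentence that straddles a boundary retrievable from either neighboring chunk, at the cost of indexing a few extra words per chunk.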
*This doc was submitted as part of the Elastic Blogathon.*
Example: the Elasticsearch KNN search call:

```python
# qvec is the query embedding returned by Jina (same model as at ingest time)
es.search(
    index="support-kb",
    knn={
        "field": "content_embedding",
        "query_vector": qvec,
        "k": 20,
        "num_candidates": 100,
    },
)
```
Add your credentials to .env:

```
ELASTIC_CLOUD_ID=your-deployment:dXMt...
ELASTIC_API_KEY=your-api-key
JINA_API_KEY=jina_xxxxxxxxxxxx
```

Install dependencies and run:

```
pip install -r requirements.txt
python -m src.ingest   # once, to index the sample KB
python -m src.main "Why is my dashboard showing stale data?"
```
The core RAG pipeline:

```python
def ask(query):
    passages = search_and_rerank(query)  # Elastic KNN + Jina rerank
    if not passages:
        return "No relevant docs found.", []
    answer = generate_answer(query, passages)  # Ollama or fallback
    return answer, passages
```