# I built a GraphRAG demo with FalkorDB's new SDK, then benchmarked it against Neo4j
**TL;DR**

- A post-processing pass promotes every generic `RELATES` edge into real typed edges.
- Every token from both stacks runs through one price sheet in `bench/costs.py` — the whole thing.
- Blind pairing so the judge never sees "FalkorDB" vs "Neo4j".

FalkorDB shipped graphrag-sdk v1.0.0rc1 and I wanted to see how it feels on real content, not a toy dataset. An afternoon of "let me just try it" turned into a few days of "if I'm going to have an opinion, I should measure it against something." The something, obviously, was neo4j-graphrag. Same corpus, same LLM, same embedder, same 25-question set, same blind judge. The whole thing — ingest, 25 queries, and the judge rubric across both stacks — costs about $0.15 to reproduce end-to-end.

This is a write-up of what I did, what broke, what the numbers actually say, and what I'd do differently. I'm not here to crown a winner. I'm here to show what it took to compare them honestly. Repo: github.com/FalkorDB/graphrag-sdk-demo.

## The corpus and the pipeline

The corpus is 8 FalkorDB blog posts and case studies, roughly 140 KB of Markdown. Topics range from "what is GraphRAG" to the Securin threat-intel case study to a March 2026 cybersecurity webinar announcement. That mix matters later: some questions are short factual lookups, some need multi-hop joining across documents, some ask for specific numbers buried in a single paragraph.

Stack: Python 3.14, graphrag-sdk[litellm]==1.0.0rc1, neo4j-graphrag[openai]>=1.14, LiteLLM, FalkorDB and Neo4j in docker-compose.yml. The LLM is gpt-4o-mini (extraction + generation) at temperature=0. The embedder is text-embedding-3-small (1536 dim). The judge is gpt-4o with temperature=0 and seed=42. Chunking on both sides is fixed-size 1000 with 100 overlap, no approximation. Extraction is open-schema on both sides — no hand-tuned GraphSchema. I wanted the libraries to show their defaults, not my schema design.

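To make the chunking setting concrete, here's a minimal sketch of what fixed-size 1000/100 splitting means. The function is illustrative only; both libraries use their own splitters, and this helper is not part of either SDK or the repo.

```python
def fixed_size_chunks(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    """Illustrative only: 1000-character windows, each overlapping the previous by 100."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```
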
## Part 1 — Getting the FalkorDB demo working

The FalkorDB ingest is almost boring to write down, which is the point. The whole thing fits in roughly 60 lines:

```python
async with GraphRAG(
    connection=ConnectionConfig(host="localhost", graph_name="falkordb_blog_kg"),
    llm=LiteLLM(model="openai/gpt-4o-mini"),
    embedder=LiteLLMEmbedder(model="openai/text-embedding-3-small"),
) as rag:
    for path in content_files:
        text = path.read_text(encoding="utf-8")
        await rag.ingest(path.stem, text=text)
```

ingest() is per-file; finalize() is the cleanup pass that deduplicates entities, backfills embeddings, and creates the HNSW indexes. After the 8 files, I had 509 nodes, 1 228 edges, 160 LLM calls, ~230 seconds wall time, and a cost of $0.054. For an afternoon of "try the SDK," this is a good story.

Then I opened the FalkorDB browser and it was a hairball. Every relationship came out as [:RELATES {rel_type: "USES"}], [:RELATES {rel_type: "INTEGRATES_WITH"}], and so on. The relation type lives as a property on a single generic edge, not as the edge label itself. This is fine for the retriever — it reads the property — but it's ugly in the browser and it's a pain to query by hand (WHERE r.rel_type = 'USES' is not index-accelerated in FalkorDB; the skill docs are explicit about this).

So I wrote postprocess.py. Two idempotent passes:

```python
import re

TYPE_SAFE = re.compile(r"[^A-Z0-9_]")  # strip anything outside [A-Z0-9_] before interpolating

# Promote (:Entity)-[:RELATES {rel_type:'INTEGRATES_WITH'}]->(:Entity)
for raw in distinct_rel_types:
    safe = TYPE_SAFE.sub("", raw.upper().replace(" ", "").replace("-", "_"))
    graph.query(
        "MATCH (a)-[r:RELATES {rel_type: $t}]->(b) "
        f"MERGE (a)-[r2:{safe}]->(b) "
        "SET r2.fact = r.fact, r2.description = r.description, "
        "    r2.source_chunk_ids = r.source_chunk_ids, r2.spans = r.spans "
        "DELETE r",
        params={"t": raw},
    )
```

The string-substitution into the Cypher is necessary because relationship types can't be parameterized — so I sanitize the type name to [A-Z0-9_] before interpolating. The result: 336 generic RELATES became 161 real typed edges (INTEGRATES_WITH, SUPPORTS, USES, ...), idempotently, on every re-ingest.

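This isn't one of the repo's scripts, but a quick sanity check of the promotion is a one-query affair. A sketch using the falkordb Python client, assuming the default localhost:6379 and the graph name from the ingest snippet: after postprocess.py runs, the generic RELATES edges should be gone and the typed ones should dominate the counts.

```python
from falkordb import FalkorDB

# Count edges per relationship type; after postprocess.py the generic RELATES
# edges should be gone and the typed ones (INTEGRATES_WITH, USES, ...) present.
graph = FalkorDB(host="localhost", port=6379).select_graph("falkordb_blog_kg")
result = graph.query(
    "MATCH ()-[r]->() RETURN type(r) AS rel_type, count(*) AS n ORDER BY n DESC"
)
for rel_type, n in result.result_set:
    print(f"{rel_type:<24} {n}")
```
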
The other thing I learned by looking, not by reading docs, is that the SDK does two LLM calls per query: a keyword-extraction pass over the question, then the final generation. Between them the retriever ranks candidate entities deterministically by term frequency — no LLM. I only nailed this down when I ran GRAPH.SLOWLOG against a corrected benchmark harness (my first version was double-counting a call; see the addendum). It matters later in the numbers.

## Part 2 — "Is this actually good?"

My five demo questions produced answers that looked great. That proved nothing. Cherry-picking five queries against a knowledge graph is not evidence; it is ambience. I needed three things: a question set with known ground truth, a comparable second stack so the numbers meant something in context, and a judge that didn't know which stack produced which answer.

Before writing any benchmark code I wrote down the fairness constraints: same corpus, same chunking (1000/100, fixed-size), same LLM, same embedder, same 25 questions, same judge with a fixed seed, same price sheet for cost math. Anything I couldn't equalize, I would disclose.

## Part 3 — Building the Neo4j side

This was where it got interesting. neo4j-graphrag is a good library, but the defaults don't give you parity with FalkorDB's out-of-the-box retrieval; you have to build it. Four non-obvious things bit me:

**Document dedup on a stale path.** SimpleKGPipeline writes Document.path = 'document.txt' for every file and deduplicates on that path. If you loop over your files, the second file silently merges into the first. The fix is to rename the freshly-created Document node to the source slug right after each run_async().

**Missing chunk vector index.** The pipeline writes chunk embeddings as properties, but doesn't always create the vector index to query them. create_vector_index(driver, CHUNK_INDEX, ...) after ingest, manually.

**No entity embeddings at all.** This one took me a while. FalkorDB builds an entity HNSW as part of finalize(); SimpleKGPipeline does not. So after ingest, I walk every Entity, embed name + description in batches of 64, write with db.create.setNodeVectorProperty, and then create an entity_embedding_idx. Without this pass, "entity vector search" on the Neo4j side would have been meaningless and the comparison would have been dishonest.

**GraphRAG rejects custom composite retrievers.** This is the fun one. I wanted a retriever that mirrors FalkorDB's MultiPathRetrieval: vector search over entities with 1-hop fact expansion, plus vector search over chunks. In neo4j-graphrag, the obvious shape is two VectorCypherRetrievers composed into a wrapper. But when you pass a wrapper into GraphRAG(...), its pydantic validation rejects anything that isn't a Retriever subclass. I drove the retrievers directly instead. About 40 lines cleaner:

```python
ENTITY_QUERY = """WITH node, score
OPTIONAL MATCH (node)-[r]-(nbr:Entity)
WITH node, score, collect(DISTINCT {
  rel: type(r),
  neighbour: coalesce(nbr.name, nbr.id, ''),
  fact: coalesce(r.fact, r.description, '')
})[..8] AS facts
RETURN coalesce(node.name, node.id, '') AS entity_name,
       labels(node) AS entity_labels,
       coalesce(node.description, '') AS entity_description,
       facts, score"""

self.entity = VectorCypherRetriever(
    driver=driver,
    index_name="entity_embedding_idx",
    retrieval_query=ENTITY_QUERY,
    result_formatter=_entity_fmt,
    embedder=embedder,
    neo4j_database=NEO4J_DATABASE,
)
self.chunk = VectorCypherRetriever(
    driver=driver,
    index_name="chunk_embedding_idx",
    retrieval_query=CHUNK_QUERY,
    result_formatter=_chunk_fmt,
    embedder=embedder,
    neo4j_database=NEO4J_DATABASE,
)
```

Then: build context → prompt → llm.invoke(). Behavior is equivalent to what GraphRAG would do internally; I just don't get the pydantic validator in my way.

There's a tradeoff worth naming: I did not write a Neo4j equivalent of postprocess.py's typed-edge promotion. I thought about it. I decided that giving Neo4j a cleanup pass that its pipeline doesn't provide would bias toward Neo4j, and the whole point was to compare defaults. So Neo4j keeps its out-of-the-box relation-type distribution (226 types across 785 nodes) while FalkorDB gets the cleanup it natively benefits from (154 types across 504 nodes). I'll come back to this in the asymmetries section.

## Part 4 — Making cost and token tracking honest

Token counting is the part of a benchmark that you'd think is easy and is not.

FalkorDB side. LiteLLM returns usage and exposes litellm.completion_cost, but I didn't want two different price sheets (one from LiteLLM's snapshot, one for Neo4j's raw OpenAI usage). I subclassed LiteLLM and LiteLLMEmbedder, overrode ainvoke / ainvoke_messages / _raw_embed_async, captured usage from each call, and sent the numbers through a single price sheet:

```python
PRICES = {
    "openai/gpt-4o-mini": {"in": 0.15, "out": 0.60},  # per 1M tokens
    "openai/gpt-4o": {"in": 2.50, "out": 10.00},
    "openai/text-embedding-3-small": {"in": 0.02},
}
```

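Every token either stack spends is converted to dollars through that one dict. A sketch of the arithmetic; the function name here is hypothetical, and the real conversion lives in bench/costs.py:

```python
def call_cost(model: str, prompt_tokens: int, completion_tokens: int = 0) -> float:
    """Dollar cost of one call under the PRICES sheet above (rates are per 1M tokens)."""
    rate = PRICES[model]
    return (prompt_tokens * rate["in"] + completion_tokens * rate.get("out", 0.0)) / 1_000_000
```
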
Neo4j side. This is ugly. neo4j-graphrag's OpenAILLM wraps responses in an LLMResponse that doesn't expose token usage at all. The underlying OpenAI client has it on the response object, but the wrapper drops it. So I subclassed OpenAILLM and monkey-patched the client:

```python
class TrackingOpenAILLM(OpenAILLM):
    def __init__(self, *a, **kw):
        super().__init__(*a, **kw)
        self.call_count = 0
        self.prompt_tokens = 0
        self.completion_tokens = 0
        sync_create = self.client.chat.completions.create

        def sync_wrap(**kwargs):
            r = sync_create(**kwargs)
            self._record(r, self.model_name)  # _record (elsewhere in the harness) tallies usage
            return r

        self.client.chat.completions.create = sync_wrap
        # same for async_client
```

Same trick for OpenAIEmbeddings.client.embeddings.create. This is the ugliest code in the repo and I'm at peace with it — it's a benchmark harness, not a library. Both stacks' numbers now go through the same bench/costs.py. No drift, no surprises.

## Part 5 — The 25-question set and the judge

I wrote 25 questions in four categories:

- Factual (8): single-hop lookups. "What is GraphRAG?" "What ports does FalkorDB expose?"
- Multi-hop (6): joins across entities or documents. "Which FalkorDB integrations does Securin use together?"
- Comparative (5): "How does FalkorDB compare to Neo4j for knowledge graphs?" "In-memory vs on-disk tradeoffs?"
- Numeric (6): specific numbers from the corpus. "What was Securin's average query latency?" "When is the cybersecurity webinar?"

Every question is paired with reference_facts (the ground truth from the source document) and expected_source_docs (which files should be hit). The dataclass is frozen and asserts exactly 25 with unique IDs, so you can't silently drift the set.

The judge lives in bench/judge.py. It's blind A/B:

```python
rng = random.Random(42)
for q in questions:
    a, b = (falk, neo) if rng.random() < 0.5 else (neo, falk)
    # ... send judge Answer A vs Answer B ...
```

Rubric: four dimensions (groundedness, correctness, completeness, conciseness), integer 1–5 per dimension per answer, plus a one-sentence rationale. The judge is given the reference_facts from the question set, so the scoring is grounded in known-correct content, not in the judge's vibes about the question. The judge is gpt-4o with temperature=0, seed=42, response_format={"type": "json_object"}. Running the 25-question rubric cost $0.0456. Every run writes a timestamped JSON under results/ so I can re-render the comparison report later without re-running.

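Before the numbers, for concreteness: this is roughly the shape of the frozen question record described above. It's a sketch, not the repo's actual class; only reference_facts, expected_source_docs, the four categories, and the "exactly 25, unique IDs" assertion come from the text. The class and field names beyond those are mine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchQuestion:
    id: str                                # e.g. "m4", "c2", "n5"
    category: str                          # "factual" | "multi-hop" | "comparative" | "numeric"
    question: str
    reference_facts: tuple[str, ...]       # ground truth lifted from the source document
    expected_source_docs: tuple[str, ...]  # files the retriever is expected to hit

def validate(questions: tuple[BenchQuestion, ...]) -> None:
    # The real set asserts exactly 25 questions with unique IDs so it can't silently drift.
    assert len(questions) == 25
    assert len({q.id for q in questions}) == 25
```
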
## Three tables

This is the whole benchmark, stripped of narration.

### Ingestion

| Metric | FalkorDB | Neo4j |
|---|---|---|
| Wall time | 233.4 s | 251.7 s |
| LLM calls | 160 | 159 |
| Input / output / embedding tokens | 112 925 / 60 211 / 43 759 | 172 419 / 28 208 / 35 130 |
| Cost | $0.0539 | $0.0435 |
| Nodes | 504 | 785 |
| Edges | 1 202 | 1 632 |
| Entities / chunks | 335 / 160 | 612 / 159 |
| Relationship types | 154 | 226 |

FalkorDB's extractor produces a tighter graph (fewer entities, fewer rel types); Neo4j's produces more fragments. Neither is better in the abstract — different defaults. FalkorDB reads less, writes more output tokens; Neo4j reads more (more prompt context per extraction call), writes less.

### Per-query aggregates (25 questions)

| Metric | FalkorDB | Neo4j |
|---|---|---|
| Avg retrieve ms | 1 493 | 496 |
| Avg LLM ms | 1 641 | 1 759 |
| Avg total ms | 3 094 | 2 255 |
| Avg LLM calls per Q | 2.0 | 1.0 |
| Avg input / output tokens | 4 125 / 59 | 2 952 / 55 |
| Avg cost per Q | $0.000654 | $0.000476 |
| p95 latency | 4 793 ms | 4 506 ms |
| Avg retrieved entities / chunks / docs | 11.4 / 1.0 / 3.5 | 3.7 / 10.0 / 2.6 |
| 25-Q total cost | $0.01635 | $0.01191 |

Neo4j wins latency (−27 %) and cost (−27 %). Structurally this is because Neo4j does one LLM call per query and FalkorDB does two — keyword extraction, then generation, with deterministic ranking in between. The difference is not a configuration bug; it is what the SDK is doing on your behalf.

### Judge rubric (gpt-4o, seed=42, blind A/B)

| Dimension | FalkorDB | Neo4j | Δ |
|---|---|---|---|
| Groundedness | 3.88 | 3.84 | +0.04 |
| Correctness | 3.84 | 3.60 | +0.24 |
| Completeness | 3.52 | 3.24 | +0.28 |
| Conciseness | 4.56 | 4.60 | −0.04 |
| Overall | 3.95 | 3.82 | +0.13 |

| Category | n | FalkorDB | Neo4j |
|---|---|---|---|
| Factual | 8 | 4.38 | 4.00 |
| Multi-hop | 6 | 3.83 | 3.50 |
| Numeric | 6 | 3.92 | 3.92 |
| Comparative | 5 | 3.45 | 3.80 |

Win/loss/tie over 25 questions (tie threshold |Δ| ≤ 0.125): 5 / 5 / 15 — tied on wins, but FalkorDB has the higher overall mean. Source-document recall is 97 % vs 90 % in FalkorDB's favour.

Reframe, plainly: FalkorDB pays a real latency and cost premium per query (Neo4j is roughly 27 % faster and 27 % cheaper). In exchange it wins factual (+0.38) and multi-hop (+0.33) quality, leads on correctness (+0.24) and completeness (+0.28), and retrieves the right source document more often (97 % vs 90 %). It loses comparative questions (−0.35) where its pipeline tends to over-elaborate, and ties numeric extraction — both stacks get the same 4/6 numbers right and fail the same 2. The extra LLM call isn't free, but it's doing work on the substantive categories.

## Part 7 — Where each one actually failed

Aggregates hide the interesting failures.

FalkorDB's worst: the comparative category. On c2 ("How does GraphRAG outperform vector RAG on complex questions?") and c3 ("What are the tradeoffs of in-memory vs on-disk graph storage?") FalkorDB produced longer, more elaborated answers than Neo4j. The elaborations weren't fabricated — they were grounded in the retrieved entities — but they drifted past the reference facts the judge was scoring against. Neo4j's tighter single-pass answers hewed closer to exactly what the source said and won both questions. Average comparative score: FalkorDB 3.45 vs Neo4j 3.80. Lesson: the keyword-extraction pre-step pulls a wider entity set into context, which helps on factual and multi-hop questions but can encourage over-explanation on contrast questions where terseness is a virtue. The extra call is doing work; on some categories that work is counterproductive.

Neo4j's worst: abstention on broad multi-hop and numeric questions. On m4 ("Which companies or products are described as using or integrating with FalkorDB?") and n5 ("What are FalkorDB's default ports?") Neo4j returned "I don't know based on the provided context." FalkorDB attempted both — correctly naming Snowflake and LangChain on m4, though it happened to fail n5 on this run too (retrieval variance; the context didn't surface the ports chunk). In general Neo4j's retriever had the information in context and the single-pass prompt declined to use it. This is the flip side of no rewrite loop. FalkorDB's extra keyword-extraction call pushes the model to use what was retrieved; Neo4j's cautious single prompt occasionally refuses when a broader context pull would have landed the answer.

I want to state this plainly: fabrication and abstention are both real failure modes. Neither is strictly worse. In a production system you'd probably tune the prompts to move each one toward the safer behavior for your use case. The point is not that one stack is wrong — it's that they fail differently.

## The asymmetries I did not fix

Three of them, and I called every one out in the repo, in COMPARISON_FULL.md, and I'll call them out here too.

1. **Typed-edge promotion runs only on FalkorDB.** Porting postprocess.py to Neo4j would have given Neo4j a cleanup its pipeline doesn't provide. I chose to benchmark defaults.
2. **Retrieval shape differs.** FalkorDB's MultiPathRetrieval returns ~11 entities + 1 chunk per query. My Neo4j composite returns ~4 entities + 10 chunks. Both are tunable; I left them at reasonable defaults for each side. This likely explains part of FalkorDB's edge on multi-hop.
3. **Two LLM calls vs one.** I did not strip out FalkorDB's keyword-extraction pre-step to "match" Neo4j. It's what the SDK does by default, and measuring the SDK means measuring that work.

If you want the benchmark to tell a different story, you can rerun it with adjusted parameters. The harness is ~500 lines and the whole comparison costs fifteen cents.

## What I'd do differently

- Start with the benchmark harness, not the demo. The demo's code shaped itself around "five cool queries" and I ended up rewriting half of it when the 25-question set arrived. The right order is: questions → stacks → demo as a special case.
- Put bench/costs.py in from day one. I burned time reconciling LiteLLM's cost calc with the Neo4j-side raw usage before I realized a single price dict would erase the drift entirely.
- Expose community summaries in FalkorDB's retrieval. finalize() generates them but MultiPathRetrieval doesn't surface them on short queries. A custom retriever that includes community summaries for broad thematic questions (where the multi-hop expansion doesn't cover the space) is probably worth 15 minutes.
- Add a "refusal" dimension to the judge rubric (or a fifth score). Right now "I don't know" scores 1/5 on correctness, which is mathematically right — it isn't correct — but doesn't distinguish hallucination from honest abstention. A production benchmark should treat those separately.
- Use gpt-4o-mini as the judge on a larger sample. gpt-4o on 25 questions is fine for signal; gpt-4o-mini on 250 questions would probably be noisier per-question but more robust in aggregate, for the same budget.

## A note on the FalkorDB skills pack

One thing that shaped how I worked on this: the repo ships a .falkordb-skills/ pack — SKILL.md plus cypher-skills/, operations-skills/, and udf-skills/ subfolders, each containing narrow, tested "how to do X in FalkorDB" notes. Copilot loads them automatically when I'm writing Cypher or operating the container. They're not tutorials; they're a short, opinionated reference for the things that are easy to get wrong. A few places they saved me time — or saved me from shipping something subtly broken:

- `use-merge-to-avoid-duplicates` and `update-and-remove-properties`. My postprocess.py rewrites relationships on every re-ingest. The skill pack is explicit that FalkorDB has no REMOVE clause (set to NULL instead) and that MERGE is the idiom for idempotent upserts. Both shaped the final form of the typed-edge promotion code.
- `use-parameterized-queries`. The same skill is what prompted me to pass rel_type as a parameter (params={"t": raw}) while interpolating the sanitized type name into the Cypher. One is untrusted data; the other is part of the query structure. The skill makes the distinction concrete.
- `track-slow-queries` (GRAPH.SLOWLOG). This is how I eventually pinned down that the SDK makes two LLM calls per query — keyword extraction and generation — not three as I initially thought (my benchmark harness was double-counting a call; see the addendum). I wasn't looking for the call count at all; I was looking at what was slow during ingest. GRAPH.SLOWLOG also surfaced the real ingest bottleneck: a single `UNWIND $batch AS item MATCH (e:Entity {id:item.eid}) SET e.embedding = vecf32(item.vector)` at ~333 ms per call, which is finalize() backfilling entity embeddings. Knowing the bottleneck is the embedding write, not the extraction, changes which optimizations are worth attempting. (There's a short sketch of pulling the slowlog after this list.)
- `inspect-graphs-and-memory` (GRAPH.LIST / GRAPH.INFO / GRAPH.MEMORY USAGE). This is what query_demo.py uses to print that the whole 509-node / 1 228-edge graph is 4 MB resident. That number is genuinely useful for capacity planning; it's also the kind of thing that's easy to forget to measure.
- `apply-cypher-limitations-correctly`. Specifically: <> filters aren't index-accelerated. I avoided at least one "let me just exclude this one type" query in postprocess.py that would have degraded on larger graphs.
- `inspect-query-plans` / `profile-query-runtime`. GRAPH.EXPLAIN and GRAPH.PROFILE. Not cited explicitly in the demo, but referenced in my .github/copilot-instructions.md so that any Cypher generated in this workspace is validated against an explain plan before being considered optimized.

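The slowlog pull mentioned in the track-slow-queries item is a one-liner. A sketch using plain redis-py against the default port; FalkorDB's GRAPH.SLOWLOG returns, per logged query, roughly a timestamp, the command, the query text, and the execution time in milliseconds.

```python
import redis

# Pull the slow-query log for the demo graph. Each entry is roughly
# (timestamp, command, query text, execution time in ms) — enough to spot
# the ~333 ms embedding-backfill UNWIND described above.
r = redis.Redis(host="localhost", port=6379, decode_responses=True)
for ts, command, query, took_ms in r.execute_command("GRAPH.SLOWLOG", "falkordb_blog_kg"):
    print(f"{float(took_ms):>9.1f} ms  {query[:80]}")
```
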
The value of the pack, to me, is less that it teaches me Cypher — I can read the docs — and more that it encodes the operational patterns that distinguish a working demo from a working system. Idempotency, introspection, index-awareness, parameter safety. A lot of LLM-generated Cypher looks fine and is quietly not idempotent or quietly scans a whole table. Having the skills loaded means the assistant writes code that wouldn't embarrass me in a review. If you adopt the SDK, copy the skills pack too. It's in the repo under .falkordb-skills/, MIT-licensed.

## Who actually needs this

FalkorDB's graphrag-sdk is not the right tool for every retrieval problem. Being concrete about who it is for:

You should reach for it when:

- Your corpus has implicit structure the model has to discover — case studies, research reports, customer-support tickets, product documentation, threat-intel feeds. Anything where "what relates to what" isn't already a table. The SDK's open-schema extraction is how you turn that into a graph without designing the graph yourself.
- Your questions regularly span multiple documents or require joining facts — "which of our customers using product X are on plan tier Y and had an incident last month?" The multi-hop and numeric wins on the judge rubric weren't theoretical; they were on exactly these question shapes.
- You need provenance on every answer — "which document did that claim come from?" The SDK tracks chunk provenance end-to-end; my query_demo.py prints it. This is the baseline for anything regulated or customer-facing.
- You're a Python team shipping a chat or agent feature and you do not want to stand up a separate Neo4j instance, a separate vector DB, and a separate ingestion pipeline. FalkorDB runs as a single container alongside your cache (it is a Redis module). The whole stack for this demo is docker compose up -d.
- You want to keep the graph small and fast — in-memory graph, 4 MB resident for this corpus, sub-millisecond single-hop Cypher. This matters when the graph is inline with a request path, not a nightly batch job.

It's less obviously a fit when:

- Your "knowledge base" is actually a structured database (SaaS CRM, an ERP) where the relationships are already explicit. A graph projection over SQL, or a text-to-SQL pipeline, is a shorter path. The SDK's extraction cost is wasted on content you already have structured.
- Your workload is pure short-factual lookups where sub-second latency matters more than nuance. Neo4j's single-pass pipeline wins on latency and cost; vector RAG alone would win by even more. Do not pay for a keyword-extraction pre-pass if the question is "what's the phone number on page 3."
- You need a globally-distributed, multi-region write path. FalkorDB is single-instance or primary-replica. Fine for most apps; not for multi-region active-active.
- Your corpus is so large (hundreds of GB) that in-memory is not an option. FalkorDB can persist to disk, but the design center is "graph fits comfortably in RAM."

If I had to describe the ideal user in one sentence: a Python developer building an agent, copilot, or customer-facing Q&A feature over a few hundred MB to a few GB of unstructured domain content, who needs multi-hop reasoning with source citations, wants ~60 lines of pipeline code to do most of the work, and cares more about being right than being the fastest responder by 300 ms. If that's you, the ~$0.05 ingest and ~$0.00065 per-query numbers from this benchmark are the shape you'll see in production. If that's not you, use something else — and the same harness in this repo will tell you that honestly.

Everything is in the repo: the 8 Markdown files, the 25-question set, the judge prompt, all four raw-result JSON directories, the auto-rendered COMPARISON.md, and the hand-written COMPARISON_FULL.md with per-question transcripts and judge rationales.

```bash
git clone https://github.com/FalkorDB/graphrag-sdk-demo
cd graphrag-sdk-demo
cp .env.example .env && $EDITOR .env   # OPENAI_API_KEY
docker compose up -d
python3.14 -m venv .venv && .venv/bin/pip install -r requirements.txt
.venv/bin/python benchmark_compare.py all
```
About fifteen minutes, about fifteen cents.

If you want a recommendation, here's the one I'm willing to commit to: if your workload is dominated by factual completeness and multi-hop reasoning where correctness and source recall matter more than latency, FalkorDB's extra keyword-extraction pass is paying for something measurable (+0.38 factual, +0.33 multi-hop, 97 % vs 90 % source-doc recall, +0.24 correctness, +0.28 completeness). If your workload is latency- or cost-sensitive — especially short comparative questions where Neo4j's tighter single-pass prompt actually wins, or numeric extraction where the two stacks tie — Neo4j is 27 % faster and 27 % cheaper and gives equivalent answers.