# Introducing `everyrow.io/dedupe`: An LLM-based approach to semantic deduplication


Source: Dev.to

Deduplicating records is a recurring problem in data engineering, and several challenges make it difficult: scale, surface-level variation, context-dependent equivalence, and world knowledge.

## The problem

Let's look at an example. We wanted to build a database of AI researchers from academic lab websites. We used a dataset of 200 researcher profiles scraped from those sites, then manually reviewed it to establish ground-truth clusters for evaluation. The data covers name, position, organisation, email, university, and GitHub. GitHub handles are present in ~40% of rows and act as a high-precision but low-recall signal. Scraping produced messy records: name variations, typos, career changes, and username-only entries (concrete examples are listed later in the post).

## The baseline: fuzzy string matching

A go-to approach is fuzzy string matching using libraries like fuzzywuzzy or rapidfuzz. However, these suffer from the threshold problem: set it too low and you catch false positives; set it too high and you miss semantic duplicates like "A. Butoi" ↔ "Alexandra Butoi", which have low character overlap despite referring to the same person. The alternative is manual review, but with 200 rows requiring ~5 comparisons each, that's hours of tedious work.

We benchmarked fuzzy string matching as a baseline. It compares all row pairs using token-sorted string similarity and groups rows exceeding a threshold with Union-Find clustering (a graph algorithm that efficiently merges items into equivalence classes by treating each match as an edge). At t=0.75 it catches more duplicates but risks false merges. At t=0.90 it avoids false merges but misses most semantic duplicates like "T. Gupta" ↔ "Tejus Gupta".

## What about ChatGPT?

We next tried ChatGPT: we uploaded the CSV and asked it to deduplicate. As the results below show, it merged aggressively and lost many unique people.

## Introducing everyrow.io/dedupe

Instead of relying on string-similarity thresholds, everyrow.io/dedupe uses LLMs to make contextual judgments about whether two records represent the same entity. The system exposes a high-level deduplication operation that accepts a dataset and a natural-language equivalence definition.
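To make the threshold problem concrete, here is a minimal sketch of the fuzzy baseline described above: token-sorted similarity plus Union-Find clustering. This is an illustration, not the benchmarked implementation; it uses Python's `difflib` as a stand-in for rapidfuzz, and the function names are my own.

```python
import re
from difflib import SequenceMatcher


def token_sort_ratio(a: str, b: str) -> float:
    """Token-sorted similarity in [0, 1]; a difflib stand-in for
    rapidfuzz's token_sort_ratio (which scores 0-100)."""
    norm = lambda s: " ".join(sorted(re.findall(r"\w+", s.lower())))
    return SequenceMatcher(None, norm(a), norm(b)).ratio()


class UnionFind:
    """Merge matched pairs into equivalence classes."""

    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> None:
        self.parent[self.find(a)] = self.find(b)


def fuzzy_dedupe(names: list[str], threshold: float = 0.9) -> list[list[str]]:
    uf = UnionFind(len(names))
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            # Each pair above the threshold becomes an edge in the match graph.
            if token_sort_ratio(names[i], names[j]) >= threshold:
                uf.union(i, j)
    clusters: dict[int, list[str]] = {}
    for i, name in enumerate(names):
        clusters.setdefault(uf.find(i), []).append(name)
    return list(clusters.values())


names = ["Vincenzo Moscato", "Moscato, Vincenzo", "A. Butoi", "Alexandra Butoi"]
clusters = fuzzy_dedupe(names, threshold=0.9)
```

At t=0.9 the reordered "Moscato, Vincenzo" pair merges, while "A. Butoi" / "Alexandra Butoi" stays apart: exactly the semantic gap the LLM-based approach targets.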
The equivalence relation can be as descriptive as needed and can also include examples.

## Evaluation

Accuracy was evaluated by comparing predicted equivalence classes against manually labeled ground truth. We report both row-level accuracy (whether a row is assigned to the correct cluster) and cluster-level accuracy (whether an entire entity cluster is correctly reconstructed).

A few examples from everyrow.io/dedupe, starting with the matches it found:

- ✓ Match: Name abbreviation + org typo
- ✓ Match: Typo in first name
- ✓ Match: Career transition
- ✓ Match: Username-only name
- ✗ Correctly identified as different people
- ⚠ Over-merged: Same institution
- ⚠ Over-merged: Co-authors
- ⚠ Over-merged: Co-authors + username names

(The underlying rows for these examples are listed later in the post.)

## How does it work?

The system implements a multi-stage deduplication pipeline designed to reduce pairwise comparisons while preserving semantic recall:

1. Semantic Item Comparison: Each row is compared against others using an LLM that understands context, recognising that "A. Butoi" and "Alexandra Butoi" are likely the same person, or that "BAIR Lab (Former)" indicates a career transition rather than a different organisation.
2. Association Matrix Construction: Pairwise comparison results are assembled into a matrix of match/no-match decisions. To scale efficiently, items are first clustered by embedding similarity, so only semantically similar items are compared.
3. Equivalence Class Creation: Connected components in the association graph form equivalence classes. If A matches B and B matches C, then A, B, and C form a single cluster representing one entity.
4. Validation: Each multi-member cluster is re-evaluated to catch false positives, i.e. cases where the initial comparison was too aggressive. Validation mitigates the error propagation introduced by transitive closure in the association graph.
5. Candidate Selection: For each equivalence class, the most complete/canonical record is selected as the representative (e.g., preferring "Alexandra Butoi" over "A. Butoi").
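The graph-side stages (association matrix, equivalence classes, candidate selection) can be sketched in plain Python. This is a hedged illustration, not the actual implementation: `llm_same_entity` is a stub standing in for the contextual LLM judgment, and the embedding-similarity blocking and validation passes are omitted for brevity.

```python
from collections import defaultdict, deque


def llm_same_entity(a: dict, b: dict) -> bool:
    # Stub: in the real pipeline this is an LLM call. Here, a shared
    # GitHub handle stands in for "same person".
    return bool(a.get("github")) and a.get("github") == b.get("github")


def dedupe_pipeline(rows: list[dict]) -> list[dict]:
    n = len(rows)
    # Association matrix: pairwise match/no-match decisions, stored as a graph.
    adj: dict[int, set[int]] = defaultdict(set)
    for i in range(n):
        for j in range(i + 1, n):
            if llm_same_entity(rows[i], rows[j]):
                adj[i].add(j)
                adj[j].add(i)
    # Equivalence classes: connected components via BFS.
    completeness = lambda r: sum(1 for v in r.values() if v)
    seen: set[int] = set()
    reps = []
    for i in range(n):
        if i in seen:
            continue
        comp, queue = [], deque([i])
        seen.add(i)
        while queue:
            u = queue.popleft()
            comp.append(u)
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        # Candidate selection: the most complete record represents the class.
        reps.append(max((rows[k] for k in comp), key=completeness))
    return reps


rows = [  # emails are hypothetical placeholders
    {"name": "A. Butoi", "org": "Rycolab, ETH Zurich", "github": "butoialexandra", "email": ""},
    {"name": "Alexandra Butoi", "org": "Rycolab", "github": "butoialexandra", "email": "ab@example.org"},
    {"name": "Rohan Chandra", "org": "UT Austin", "github": "rohanchandra30", "email": ""},
]
reps = dedupe_pipeline(rows)
```

Note how the two Butoi rows collapse into one class and the fuller record survives; the validation pass in the real system would then re-check any multi-member class before accepting it.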
## The tradeoff

Fuzzy matching is roughly 2000x faster and free, but it comes with a 12-16% accuracy gap. For datasets where false merges are costly, the LLM-based approach may be worth the additional runtime and cost.

## Use it yourself

```python
import asyncio

import pandas as pd

from everyrow import create_client, create_session
from everyrow.ops import dedupe


async def main() -> None:
    input_df = pd.read_csv("researchers.csv")
    async with create_client() as client:
        async with create_session(client, name="Researcher Dedupe") as session:
            result = await dedupe(
                session=session,
                input=input_df,
                equivalence_relation=(
                    "Two rows are duplicates if they represent the same person "
                    "despite different email/organization (career changes). "
                    "Consider name variations like typos, nicknames (Robert/Bob), "
                    "and format differences (John Smith/J. Smith)."
                ),
            )
            result.data.to_csv("deduplicated.csv", index=False)


asyncio.run(main())
```

## The examples in detail

What scraping produced:

- Name variations: "Julie Kallini" vs "J. Kallini", "Moscato, Vincenzo" vs "Vincenzo Moscato"
- Typos: "Namoi Saphra" vs "Naomi Saphra", "Bryan Wiledr" vs "Bryan Wilder"
- Career changes: the same person listed at "AUTON Lab" and later at "AUTON Lab (Former)" with different emails
- GitHub handles: sometimes the only reliable link between records, e.g. "A. Butoi" and "Alexandra Butoi" sharing butoialexandra
- Username-only names: researchers who listed their GitHub handle ("smirchan", "VSAnimator") instead of their real name

How ChatGPT fared:

- 88 clusters lost — unique people incorrectly merged into other records
- Only 33% of singletons preserved — people with no duplicates were merged into unrelated records
- Only 25% distractor accuracy — people with the same first name but different identities (like "Rohan Saha" and "Rohan Chandra") were incorrectly merged

The rows behind the matches everyrow.io/dedupe found, and the distractor pair it kept apart:

- Row 2: "A. Butoi" — Rycolab, ETH Zurich, butoialexandra
- Row 8: "Alexandra Butoi" — Ryoclab (typo), butoialexandra
- Row 43: "Namoi Saphra" — nsaphra
- Row 47: "Naomi Saphra" — Harvard/BU/EleutherAI, nsaphra
- Row 18: "T. Gupta" — AUTON Lab (Former), tejus-gupta
- Row 26: "Tejus Gupta" — AUTON Lab, tejus-gupta
- Row 55: "smirchan" — Stanford University, smirchan
- Row 155: "Suvir Mirchandani" — Stanford CRFM, smirchan
- Row 6: "Rohan Saha" — Alberta, simpleParadox
- Row 141: "Rohan Chandra" — UT Austin, rohanchandra30

The over-merged clusters:

- "Sarah Ball" and "Wen (Lavine) Lai" — both at MCML, PhD students
- "Marwa Abdulhai" and "Tejus Gupta" — they co-authored a paper
- "Suvir Mirchandani", "Igor Oliveira", and "Vishnu Sarukkai" — all three co-authored the same paper; username-only names made disambiguation harder
## Results

- 98% row-level accuracy on a dataset with conflicting signals
- 90 seconds of processing time and $0.42 in LLM cost for 200 records
- 4 false-positive clusters, due to co-authorship signals and shared institutions

## When to use it

This approach is most appropriate when:

- Semantic judgment is required: name variations, abbreviations, nicknames
- Conflicting signals exist: the same person appears with different emails/organisations over time
- There is no single reliable key: you can't rely on email or ID alone

## Getting started

- Obtain an API key at everyrow.io
- Install the SDK (`uv pip install everyrow`) or visit the GitHub page: https://github.com/futuresearch/everyrow-sdk
- Define your equivalence relation in natural language
- Compare results against your ground-truth labels
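That last step, comparing results against ground-truth labels, can be sketched as follows. The post doesn't pin down exact metric definitions, so this is one reasonable reading of row-level and cluster-level accuracy; all function names are hypothetical.

```python
def _clusters(assign: dict) -> dict:
    """Map each row id to the set of row ids sharing its cluster label."""
    groups: dict = {}
    for row, label in assign.items():
        groups.setdefault(label, set()).add(row)
    return {row: groups[label] for row, label in assign.items()}


def row_accuracy(pred: dict, truth: dict) -> float:
    # A row counts as correct when its predicted cluster contains exactly
    # the same rows as its ground-truth cluster.
    p, t = _clusters(pred), _clusters(truth)
    return sum(p[r] == t[r] for r in truth) / len(truth)


def cluster_accuracy(pred: dict, truth: dict) -> float:
    # Fraction of ground-truth clusters reconstructed exactly.
    p = {frozenset(m) for m in _clusters(pred).values()}
    t = {frozenset(m) for m in _clusters(truth).values()}
    return len(p & t) / len(t)


pred = {1: "a", 2: "a", 3: "b", 4: "c"}    # predicted label per row id
truth = {1: "x", 2: "x", 3: "y", 4: "y"}   # ground-truth label per row id
```

Here rows 1 and 2 are clustered correctly while the {3, 4} entity is split, so both metrics come out at 0.5; on real data the two metrics diverge whenever large clusters are almost, but not exactly, reconstructed.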