Zero-Knowledge AI Matching: Binarized Embeddings + Hamming Distance

2026-03-06 0 views admin

Source: Dev.to

Part 3 of a series on building a privacy-first dating platform for HIV-positive communities. "Building a Zero-Knowledge Dating Platform for HIV-Positive Communities" covers the architecture. "Matching in the Dark: Zero‑Knowledge Filtering Using 32‑Bit Bitmasks" covers bitmask filtering.

Two people can match on gender, region, marital status, and relationship intent, all without the server understanding any of it. That's the hard filter layer, and it works beautifully.

But here's what bitmasks can't tell you: whether two people will actually connect. Someone can check every categorical box and still be a terrible match. The things that create real compatibility (how someone writes about themselves, what they care about, how they think about life) are too rich, too nuanced, too human to reduce to a set of switches.

So how do you compute soft compatibility when the server isn't allowed to read a single word of anyone's profile?

This is the second half of the matching engine: client-side embeddings, binarization, and Hamming distance. AI-powered matching. Zero semantic leakage.

## The Problem with Sending Embeddings to the Server

The obvious approach: generate embeddings in the browser, send the float vectors to the server, compute similarity there.

The problem: embeddings leak meaning. A 512-dimensional float vector like [0.12, -0.03, 0.88, ...] isn't random noise. It encodes semantic structure. With the right ML tools, you can extract approximate meaning from embeddings: infer topics, reconstruct phrases, identify patterns. Researchers have demonstrated embedding inversion attacks that recover sensitive information from vectors alone.

For a general dating app, that's a privacy concern. For HIV-positive users, it's a potential exposure vector.

So we can't send floats. We need something the server can compare without being able to understand.

## 🧬 Step 1: Generate Embeddings Locally

The browser uses Universal Sentence Encoder (USE) to convert profile text into a 512-dimensional embedding. This runs entirely client-side, on fields like:

- Education & Employment
- Hobbies & Interests

```text
"I love hiking and finding good espresso" → [0.12, -0.03, 0.88, ...]
```

The server never sees the text. It never sees the floats. Everything that follows happens before anything leaves the browser.

## 🧪 Step 2: Normalize

We normalize the vector so its magnitude doesn't affect comparisons; only direction matters:

```javascript
// L2 norm (Euclidean length) of the embedding vector
const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
// Scale every component so the vector has unit length
const normalized = vec.map(x => x / norm);
```

This ensures consistent behaviour across different devices, browsers, and profile lengths.

## ⚫ Step 3: Binarize — Floats to Bits

This is where it gets elegant. We convert each float into a single bit based on its sign:

```javascript
// 1 for non-negative components, 0 for negative ones
const bits = normalized.map(x => (x >= 0 ? 1 : 0));
```

512 floats become a 512-bit binary vector.

What this destroys (intentionally):

- Directionality
- Semantic structure
- Reversibility

What survives:

- Relative similarity between profiles
- Compatibility with fast bitwise operations

Two people with similar embeddings will have similar binarized vectors. Two people who are very different will have vectors that diverge significantly. The signal survives. The meaning doesn't.

This is the step that makes the system genuinely zero-knowledge.

## 🔐 Step 4: Hash for Integrity

As an additional safeguard, we hash the binary vector:

```javascript
// sha256 is assumed to be a helper provided elsewhere (it is not a built-in global)
const hash = sha256(bits.join(""));
```

The server stores both:

- The 512-bit vector, used for matching
- The SHA-256 hash, used for integrity verification

Neither reveals the original text. Neither reveals the original floats. A brute-force attack on the hash would require iterating over a space so large (up to 2^512 possible 512-bit vectors) that it's computationally infeasible.

## ⚡ Step 5: Hamming Distance on the Backend

Now the server can compute similarity without understanding what it's comparing. Hamming distance counts the number of bit positions where two vectors differ:

```text
User A:  1 0 1 1 0 1 0 0 1 ...
User B:  1 0 1 0 0 1 0 0 1 ...
               ↑
         1 bit differs → distance = 1
```

Lower distance = more similar profiles.

```text
Distance = hamming(BinaryA, BinaryB)
Score    = 1 / (1 + Distance)
```

This gives a similarity score between 0 and 1. The server ranks potential matches by score, returns the ranked IDs, and the frontend decrypts each profile locally.

## 🧩 The Full Pipeline

```text
Profile text (browser only)
        │
        ▼
Universal Sentence Encoder (runs locally in browser)
        │  512 floats
        ▼
Normalize vector
        │  512 floats (unit length)
        ▼
Binarize (sign threshold)
        │  512 bits
        ▼
SHA-256 hash
        │
        ▼
Send to backend: [512-bit vector] + [hash]
        │
        ▼
Hamming distance matching (server sees only math)
```

The server computed meaningful compatibility rankings while knowing nothing about what made those profiles compatible.

## 🧩 How the Two Layers Work Together

The bitmask layer finds possible matches: people who meet categorical criteria. The embedding layer ranks those matches by actual compatibility, the things that are harder to quantify but matter more.

Together they form a two-stage zero-knowledge pipeline:

Hard filter (bitmask) → Soft ranking (Hamming distance)

Neither stage requires the server to read, store, or understand a single word about any user.

## ⚠️ What This Doesn't Protect Against

Being honest about the limits matters, especially for this community.

**Binarization loses information.** Converting 512 floats to 512 bits is lossy. Two people who are 92% similar and 78% similar might end up with the same Hamming distance. The ranking is a useful signal, not a precise measurement.

**USE itself has biases.** Universal Sentence Encoder was trained on general internet text. It may encode cultural, linguistic, or demographic biases in ways that affect match quality for some communities. This is an active area of research and a known limitation of off-the-shelf embedding models.

**The embedding model is public.** USE is open-source. An attacker who knows the model and captures a binary vector could attempt partial reconstruction. Binarization makes this significantly harder, but not theoretically impossible for a well-resourced adversary. The threat model assumes the server is the primary attack surface, not a compromised client.

**Embedding quality depends on input quality.** Short or generic "about me" text produces less useful embeddings. Users who write more give the system more signal to work with, but that also means their vectors carry more information. The tradeoff is inherent.

## 🌑 What This Means for the People Using It

There's a version of this system that would be easier to build: store everything in plaintext, use a recommendation engine, optimize for engagement metrics.

That version would work. It would also mean that a single subpoena, a single disgruntled employee, or a single breach could expose the health status, location, and intimate preferences of every person on the platform.

For most people, that's a risk worth taking for convenience. For the communities this platform was built for, it isn't.

The binarized embedding system isn't perfect. But it means that even if everything goes wrong (the database is leaked, the server is compromised, the company is pressured) the attacker still gets binary vectors and Hamming distances. They don't get profiles. They don't get health information. They don't get names.

That gap, between what the system knows and what an attacker could extract, is the whole point.

The platform is live at HIVPositiveMatches.com, built on everything this series covers.

## ▶️ Coming Next: Key-Wrapping in Practice

The matching engine is now complete: bitmasks for hard filtering, embeddings for soft ranking, all computed without the server reading anything.

But what happens when a match is made and two users want to actually communicate?

In Part 4, I'll walk through the key-wrapping flow that allows two users to exchange encrypted messages, where even the server facilitating the exchange cannot read what's being said.
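The article describes the Hamming step only as a formula, so here is a minimal sketch of how the server-side comparison might look once the 512 bits are packed into bytes. The helper names (`packBits`, `popcount8`, `hammingDistance`, `similarityScore`) and the most-significant-bit-first packing are assumptions made for illustration; the article does not say how the platform stores or compares its vectors.

```javascript
// Sketch only: packBits, popcount8, hammingDistance and similarityScore are
// hypothetical helper names, not taken from the article.

// Pack an array of 0/1 values into bytes, most significant bit first.
function packBits(bits) {
  const bytes = new Uint8Array(Math.ceil(bits.length / 8));
  bits.forEach((bit, i) => {
    if (bit) bytes[i >> 3] |= 0x80 >> (i & 7);
  });
  return bytes;
}

// Count the set bits in a single byte.
function popcount8(byte) {
  let count = 0;
  while (byte) {
    byte &= byte - 1; // clears the lowest set bit
    count++;
  }
  return count;
}

// Hamming distance: number of bit positions where the two vectors differ.
function hammingDistance(a, b) {
  if (a.length !== b.length) throw new Error("vectors must have equal length");
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    distance += popcount8(a[i] ^ b[i]);
  }
  return distance;
}

// Score from the article: 1 / (1 + distance), so identical vectors score 1.
function similarityScore(a, b) {
  return 1 / (1 + hammingDistance(a, b));
}

// The nine-bit example from Step 5: only the fourth bit differs.
const userA = packBits([1, 0, 1, 1, 0, 1, 0, 0, 1]);
const userB = packBits([1, 0, 1, 0, 0, 1, 0, 0, 1]);
console.log(hammingDistance(userA, userB)); // 1
console.log(similarityScore(userA, userB)); // 0.5
```

Because 1 / (1 + distance) decreases strictly as distance grows, sorting by descending score gives the same order as sorting by ascending distance. The score also falls off quickly: a single differing bit already halves it, and a distance of 99 gives 0.01.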

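The two-stage flow (hard filter by bitmask, then soft ranking by Hamming distance) can also be sketched end to end. Everything specific here is hypothetical: the candidate shape (`id`, `mask`, `bits`), the `requiredMask` field, and especially the filter rule in `passesHardFilter`, since the actual bitmask semantics belong to Part 2 and are not described in this article.

```javascript
// Sketch only: the object shapes, passesHardFilter and rankCandidates are
// hypothetical; the real bitmask rule is covered in Part 2 of the series.

// Hamming distance between two equal-length byte arrays.
function hamming(a, b) {
  let distance = 0;
  for (let i = 0; i < a.length; i++) {
    let diff = a[i] ^ b[i];
    while (diff) {
      diff &= diff - 1; // clear the lowest set bit
      distance++;
    }
  }
  return distance;
}

// Assumed hard-filter rule for illustration: every bit the seeker requires
// must also be set in the candidate's 32-bit mask.
function passesHardFilter(requiredMask, candidateMask) {
  return ((candidateMask & requiredMask) >>> 0) === (requiredMask >>> 0);
}

// Stage 1 filters by bitmask, stage 2 ranks survivors by Hamming-based score.
function rankCandidates(seeker, candidates) {
  return candidates
    .filter(c => passesHardFilter(seeker.requiredMask, c.mask))
    .map(c => ({ id: c.id, score: 1 / (1 + hamming(seeker.bits, c.bits)) }))
    .sort((x, y) => y.score - x.score)
    .map(c => c.id); // only ranked IDs are returned
}

const seeker = { requiredMask: 0b0101, bits: Uint8Array.of(0b10110100) };
const candidates = [
  { id: "u1", mask: 0b0111, bits: Uint8Array.of(0b10110111) }, // distance 2
  { id: "u2", mask: 0b1101, bits: Uint8Array.of(0b10110100) }, // distance 0
  { id: "u3", mask: 0b0001, bits: Uint8Array.of(0b10110100) }, // fails filter
];
console.log(rankCandidates(seeker, candidates)); // [ 'u2', 'u1' ]
```

Running the cheap bitmask test first means the Hamming comparison is only computed for candidates that already satisfy the categorical criteria, which matches the filter-then-rank order the article describes.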