28m Hacker News Comments As Vector Embedding Search Dataset 2025

2025-11-28, admin

The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated with the SentenceTransformers model all-MiniLM-L6-v2, and each embedding vector has 384 dimensions.

This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on user-generated textual data.

The complete dataset with vector embeddings is made available by ClickHouse as a single Parquet file in an S3 bucket.

We recommend that users first run a sizing exercise, following the documentation, to estimate the storage and memory requirements for this dataset.
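As a rough starting point for that sizing exercise, the raw size of the embeddings can be estimated from the row count and vector dimension given above. The sketch below assumes each vector element is stored as a 32-bit float (4 bytes) and, as a second hypothetical case, as a 16-bit value (2 bytes). It ignores compression, the other columns, and any index overhead, so treat the result as an approximation of the vector data alone rather than a full estimate.

```python
# Back-of-the-envelope sizing for the raw embedding data only.
ROWS = 28_740_000   # 28.74 million postings (from the text)
DIMENSIONS = 384    # embedding size of all-MiniLM-L6-v2 (from the text)


def raw_vector_bytes(rows: int, dims: int, bytes_per_value: int) -> int:
    """Bytes needed to hold every vector uncompressed."""
    return rows * dims * bytes_per_value


# Assumption: elements stored as 32-bit floats (4 bytes each).
total_bytes = raw_vector_bytes(ROWS, DIMENSIONS, 4)

# Assumption: a hypothetical 16-bit representation (2 bytes each).
half_precision_bytes = raw_vector_bytes(ROWS, DIMENSIONS, 2)

print(f"float32 vectors: {total_bytes / 1e9:.2f} GB")
print(f"16-bit vectors:  {half_precision_bytes / 1e9:.2f} GB")
```

Under these assumptions the 32-bit vectors alone come to roughly 44 GB, which is why checking memory and storage before loading the data is worthwhile.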

Create the hackernews table to store the postings, their embeddings, and the associated attributes:

The id is just an incrementing integer. The additional attributes can be used in predicates to explore vector similarity search combined with pre-filtering or post-filtering, as explained in the documentation.
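The difference between the two filtering strategies can be illustrated with a small brute-force sketch. This is not how ClickHouse executes such queries; it only shows the ordering of the steps. The toy rows, the `type` attribute, and all function names are hypothetical and chosen for illustration.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)


# Hypothetical toy rows: (id, type, vector). Real vectors have 384 dimensions.
ROWS = [
    (1, "comment", [1.0, 0.0]),
    (2, "story", [0.9, 0.1]),
    (3, "story", [0.8, 0.2]),
    (4, "comment", [0.0, 1.0]),
]


def nearest(rows, query, k):
    """Exact k-nearest rows by cosine distance (brute force)."""
    return sorted(rows, key=lambda r: cosine_distance(r[2], query))[:k]


def pre_filter_search(rows, query, k, predicate):
    """Apply the predicate first, then search only the matching rows."""
    return nearest([r for r in rows if predicate(r)], query, k)


def post_filter_search(rows, query, k, predicate):
    """Search all rows first, then drop results that fail the predicate.
    This can return fewer than k rows."""
    return [r for r in nearest(rows, query, k) if predicate(r)]


def is_comment(row):
    return row[1] == "comment"


query = [1.0, 0.0]
pre = [r[0] for r in pre_filter_search(ROWS, query, 2, is_comment)]
post = [r[0] for r in post_filter_search(ROWS, query, 2, is_comment)]
print("pre-filter ids: ", pre)
print("post-filter ids:", post)
```

In this toy data, pre-filtering returns two comments, while post-filtering returns only one because one of the two nearest rows overall is a story and gets discarded after the search.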

To load the dataset from the Parquet file, run the following SQL statement:

Inserting 28.74 million rows into the table will take a few minutes.

Run the following SQL to define and build a vector similarity index on the vector column of the hackernews table:
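A minimal sketch of what such statements could look like, written as a small Python helper that assembles the SQL text. It assumes ClickHouse's `vector_similarity` index syntax with an HNSW index; the index name `vector_index`, the `cosineDistance` distance function, and the `bf16` quantization are assumptions made for illustration and are not stated in the text. Only the table name, the column name, the dimension of 384, and the values 64 and 512 come from this article.

```python
def build_index_statements(table, column, dimensions, m, ef_construction,
                           index_name="vector_index",      # hypothetical name
                           distance="cosineDistance",      # assumed distance function
                           quantization="bf16"):           # assumed quantization
    """Return SQL text that defines and then builds a vector similarity index."""
    add_index = (
        f"ALTER TABLE {table} ADD INDEX {index_name} {column} "
        f"TYPE vector_similarity('hnsw', '{distance}', {dimensions}, "
        f"'{quantization}', {m}, {ef_construction})"
    )
    # Defining the index does not build it for existing data;
    # MATERIALIZE INDEX builds it for rows already in the table.
    materialize = f"ALTER TABLE {table} MATERIALIZE INDEX {index_name}"
    return [add_index, materialize]


statements = build_index_statements("hackernews", "vector", 384, 64, 512)
for statement in statements:
    print(statement + ";")
```

The printed statements can then be run in a ClickHouse client after checking the argument order and options against the documentation for the server version in use.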

The parameters and performance considerations for index creation and search are described in the documentation. The statement above uses values of 64 and 512 for the HNSW hyperparameters M and ef_construction, respectively. Users should choose these values carefully by measuring the index build time and the quality of the search results for each candidate setting.
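One common way to quantify search result quality is recall@k: the fraction of the true k nearest neighbours (from an exact, brute-force search) that the approximate index also returns. The sketch below is an illustration of that metric, not a procedure prescribed by the documentation; the function name and the example ids are hypothetical.

```python
def recall_at_k(approximate_ids, exact_ids, k):
    """Fraction of the exact top-k ids that also appear in the approximate top-k."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    found = set(approximate_ids[:k]) & exact_top
    return len(found) / len(exact_top)


# Hypothetical result ids for one query, best match first.
exact = [11, 42, 7, 93, 5]     # from an exact search without the index
approx = [11, 7, 42, 8, 5]     # from a search that uses the HNSW index

score = recall_at_k(approx, exact, 5)
print(f"recall@5 = {score:.2f}")
```

Averaging this score over a set of test queries, and recording the build time for each combination of M and ef_construction, gives a simple basis for comparing settings.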

Source: HackerNews
