28m Hacker News Comments As Vector Embedding Search Dataset 2025

The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated using the SentenceTransformers model all-MiniLM-L6-v2. Each embedding vector has 384 dimensions.

This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on top of user-generated textual data.

The complete dataset with vector embeddings is made available by ClickHouse as a single Parquet file in an S3 bucket.

We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.
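As a rough starting point for that sizing exercise, the space taken by the vectors alone can be estimated from the row count and dimension given above. The sketch below assumes uncompressed storage with 4 bytes per element (Float32) and, for comparison, a hypothetical 2-byte element type. It ignores the text columns, compression, and any index overhead, so the documentation remains the authoritative reference. The function name `raw_vector_bytes` is illustrative.

```python
def raw_vector_bytes(num_vectors: int, dimensions: int, bytes_per_element: int) -> int:
    """Bytes needed to hold the vectors alone, uncompressed."""
    return num_vectors * dimensions * bytes_per_element


NUM_VECTORS = 28_740_000  # 28.74 million rows, from the dataset description
DIMENSIONS = 384          # embedding size of all-MiniLM-L6-v2

# Assumption: 4-byte Float32 elements.
float32_bytes = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, 4)
# Assumption: a 2-byte element type, shown only for comparison.
half_bytes = raw_vector_bytes(NUM_VECTORS, DIMENSIONS, 2)

print(f"4-byte elements: {float32_bytes / 1e9:.2f} GB ({float32_bytes / 2**30:.2f} GiB)")
print(f"2-byte elements: {half_bytes / 1e9:.2f} GB ({half_bytes / 2**30:.2f} GiB)")
```

Under these assumptions the Float32 vectors alone occupy roughly 44 GB before compression, which is why the sizing step is worth doing before loading the data.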

Create the hackernews table to store the postings, their embeddings, and associated attributes:

The id is simply an incrementing integer. The additional attributes can be used in predicates to explore vector similarity search combined with post-filtering or pre-filtering, as explained in the documentation.
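To illustrate the difference between the two filtering strategies, here is a toy brute-force sketch in plain Python. It is not how ClickHouse executes such queries. The row layout, the `type` attribute and its values, and both function names are hypothetical placeholders chosen for the example.

```python
import math


def cosine_distance(a, b):
    """Cosine distance between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def post_filter_search(rows, query, k, predicate):
    # Find the k nearest rows first, then drop those that fail the predicate.
    nearest = sorted(rows, key=lambda r: cosine_distance(r["vector"], query))[:k]
    return [r["id"] for r in nearest if predicate(r)]


def pre_filter_search(rows, query, k, predicate):
    # Apply the predicate first, then find the k nearest among the survivors.
    candidates = [r for r in rows if predicate(r)]
    nearest = sorted(candidates, key=lambda r: cosine_distance(r["vector"], query))[:k]
    return [r["id"] for r in nearest]


# Hypothetical toy rows with 2-dimensional vectors.
rows = [
    {"id": 1, "type": "comment", "vector": [1.0, 0.0]},
    {"id": 2, "type": "story", "vector": [0.9, 0.1]},
    {"id": 3, "type": "story", "vector": [0.5, 0.5]},
    {"id": 4, "type": "comment", "vector": [0.0, 1.0]},
]
query = [1.0, 0.0]


def is_story(row):
    return row["type"] == "story"


post_ids = post_filter_search(rows, query, 2, is_story)
pre_ids = pre_filter_search(rows, query, 2, is_story)
print("post-filtering:", post_ids)
print("pre-filtering: ", pre_ids)
```

In this toy data, post-filtering returns fewer than k rows because one of the two nearest neighbours fails the predicate, while pre-filtering still returns k matching rows at the cost of filtering before the distance search.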

To load the dataset from the Parquet file, run the following SQL statement:

Inserting 28.74 million rows into the table will take a few minutes.

Run the following SQL to define and build a vector similarity index on the vector column of the hackernews table:

The parameters and performance considerations for index creation and search are described in the documentation. The statement above uses 64 and 512 for the HNSW hyperparameters M and ef_construction, respectively. Choose these values carefully by measuring index build time and search result quality for each candidate setting.
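One common way to quantify search result quality, though not one this guide prescribes, is recall@k: the fraction of the exact (brute-force) top-k neighbours that the approximate index search also returns. The helper below is a minimal sketch. The function name and the sample id lists are made up for illustration.

```python
def recall_at_k(approx_ids, exact_ids, k):
    """Fraction of the exact top-k neighbours found by the approximate search."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)


# Hypothetical result ids for a single query.
exact = [11, 42, 7, 93, 5]    # from an exact, brute-force search
approx = [11, 7, 42, 18, 5]   # from a search that used the HNSW index

score = recall_at_k(approx, exact, 5)
print(f"recall@5 = {score:.2f}")
```

Averaging this score over a sample of queries for each candidate pair of M and ef_construction, alongside the measured build time, gives a simple basis for comparing settings.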

Source: HackerNews