28M Hacker News Comments as a Vector Embedding Search Dataset (2025)
The Hacker News dataset contains 28.74 million postings and their vector embeddings. The embeddings were generated with the SentenceTransformers model all-MiniLM-L6-v2, so each embedding vector has 384 dimensions.
This dataset can be used to walk through the design, sizing, and performance aspects of a large-scale, real-world vector search application built on user-generated textual data.
The complete dataset, including the vector embeddings, is made available by ClickHouse as a single Parquet file in an S3 bucket.
We recommend users first run a sizing exercise to estimate the storage and memory requirements for this dataset by referring to the documentation.
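As a rough starting point for that sizing exercise, the raw size of the vectors can be estimated as rows × dimensions × bytes per value. The query below is only a sketch: it assumes the embeddings are stored as Float32 (4 bytes per value) and, for comparison, shows the size if values were quantized to 2 bytes (for example bf16). It ignores compression, the other columns, and index overhead.

```sql
-- Back-of-the-envelope vector sizing (assumptions: Float32 storage, optional 2-byte quantization).
SELECT
    28740000 AS num_rows,      -- 28.74 million postings, from the dataset description
    384 AS dimensions,         -- embedding size of all-MiniLM-L6-v2
    formatReadableSize(num_rows * dimensions * 4) AS float32_vectors,   -- 4 bytes per value
    formatReadableSize(num_rows * dimensions * 2) AS two_byte_vectors;  -- 2 bytes per value
```

For Float32 this works out to 28,740,000 × 384 × 4 = 44,144,640,000 bytes, or about 44 GB before compression.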
Create the hackernews table to store the postings, their embeddings, and the associated attributes:
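A table definition along the following lines could be used. This is a sketch, not a confirmed schema: only the table name, the id column, and a 384-dimensional vector column follow from the text, while the remaining column names and types are assumptions that must be checked against the actual Parquet schema (for example with DESCRIBE on the s3 table function) before loading.

```sql
-- Sketch of the hackernews table; columns other than id and vector are assumed.
CREATE TABLE hackernews
(
    `id` Int32,                      -- incrementing integer
    `doc_id` Int32,
    `text` String,                   -- posting text that was embedded
    `vector` Array(Float32),         -- 384-dimensional embedding
    `node_info` Tuple(
        `start` Nullable(UInt64),
        `end` Nullable(UInt64)),
    `metadata` String,
    `type` Enum8('story' = 1, 'comment' = 2, 'poll' = 3, 'pollopt' = 4, 'job' = 5),
    `by` LowCardinality(String),     -- author of the posting
    `time` DateTime,
    `title` String,
    `post_score` Int32,
    `dead` UInt8,
    `deleted` UInt8,
    `length` UInt32
)
ENGINE = MergeTree
ORDER BY id;
```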
The id column is an incrementing integer. The additional attributes can be used in predicates to explore vector similarity search combined with post-filtering or pre-filtering, as explained in the documentation.
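Once the table is loaded, a similarity search combined with an attribute predicate might look like the sketch below. It assumes columns named vector, text, and type, and it uses the embedding of an existing row (id = 1, an arbitrary and hypothetical choice) as the reference vector, since generating a fresh embedding happens outside the database. Whether filtering is applied before or after the vector search depends on the server settings described in the documentation.

```sql
-- Nearest neighbours of an existing posting, restricted by an attribute predicate.
WITH (SELECT vector FROM hackernews WHERE id = 1 LIMIT 1) AS reference_vec  -- hypothetical reference row
SELECT
    id,
    text,
    cosineDistance(vector, reference_vec) AS distance
FROM hackernews
WHERE type = 'comment'    -- assumed attribute column and value
ORDER BY distance ASC
LIMIT 10;
```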
To load the dataset from the Parquet file, run the following SQL statement:
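A load statement could take roughly the following form, using the s3 table function. The URL is a placeholder, not the real location of the file; substitute the Parquet file URL published for this dataset. SELECT * also assumes that the table's columns match the Parquet file's columns in number and order.

```sql
-- Load the Parquet file from S3 into the table (placeholder URL).
INSERT INTO hackernews
SELECT *
FROM s3('https://<bucket>.s3.amazonaws.com/<path>/<file>.parquet', 'Parquet');
```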
Inserting 28.74 million rows into the table will take a few minutes.
Run the following SQL to define and build a vector similarity index on the vector column of the hackernews table:
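A sketch of such a statement is shown below. The index name vector_index, the cosineDistance function, and the bf16 quantization are assumptions made for illustration. The dimension (384) and the HNSW values 64 and 512 come from the text. Depending on the ClickHouse version, the vector similarity index may first need to be enabled through a setting.

```sql
-- Define an HNSW vector similarity index on the vector column.
-- Arguments: method, distance function, dimensions, quantization, M, ef_construction.
ALTER TABLE hackernews
    ADD INDEX vector_index vector
    TYPE vector_similarity('hnsw', 'cosineDistance', 384, 'bf16', 64, 512);

-- Build the index for the rows already in the table and wait for completion.
ALTER TABLE hackernews MATERIALIZE INDEX vector_index SETTINGS mutations_sync = 2;
```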
The parameters and performance considerations for index creation and search are described in the documentation. The statement above uses 64 and 512 for the HNSW hyperparameters M and ef_construction, respectively. Choose these values carefully: larger values generally improve search quality but lengthen the index build, so evaluate index build time and search result quality for each candidate setting.
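Memory is another input to the choice of M. The query below is a rule-of-thumb sketch, not an exact formula: it assumes each neighbour link costs 4 bytes and that links are stored on average about twice per vector across the HNSW layers. Treat the results as order-of-magnitude estimates only.

```sql
-- Approximate HNSW graph memory for a few candidate values of M (assumed cost model).
SELECT
    m AS hnsw_m,
    formatReadableSize(28740000 * m * 4 * 2) AS approx_graph_memory  -- rows * M * 4 bytes * 2
FROM (SELECT arrayJoin([16, 32, 64]) AS m);
```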
Source: HackerNews