Exa-d: How to Store the Web in S3
Building a modern search engine requires ingesting the entire web and keeping it queryable as it changes in real time. The web has a few properties that make this challenging.
To ensure our index stays current, our crawlers must detect changes on the web, reprocess pages, and regenerate embeddings before queries arrive. Each change triggers a messy cascade of derived features (embeddings, extracted text, metadata), each with its own dependencies and update logic.
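To make that cascade concrete, here is a minimal sketch of how a single detected change fans out into derived-feature recomputation. The names (DERIVES, features_to_refresh) and the graph itself are hypothetical, not exa-d's actual API:

```python
from collections import deque

# Hypothetical map from each feature to the features derived from it.
DERIVES: dict[str, list[str]] = {
    "raw_html": ["extracted_text", "metadata"],
    "extracted_text": ["embedding"],
    "metadata": [],
    "embedding": [],
}

def features_to_refresh(changed: str) -> list[str]:
    """Walk the cascade: everything downstream of the changed feature."""
    order: list[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        for dependent in DERIVES[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order

# A crawler that sees a page's HTML change would refresh, in order:
# ['extracted_text', 'metadata', 'embedding']
print(features_to_refresh("raw_html"))
```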
How do you store and retrieve information from the web in a database?
In this post, we will walk through exa-d, our in-house data processing framework designed to handle this complexity at scale.
Before building exa-d, we evaluated traditional data management stacks: data warehouses, SQL transformation layers, and orchestrators. We ultimately decided to build our own data framework, optimized around a specific set of priorities.
At Exa, many team members need to simultaneously iterate on new search signals derived from existing data. If each team member wrote bespoke scripts for calculating and updating different columns, this would not only lead to excessive code duplication, but also hamper iteration speed by making it difficult to predict the downstream impact of a change.
A core design choice for exa-d was that engineers interact by declaring relationships between data, not the steps to update them. A good analogy here is to spreadsheets, where formulas reference other cells. In exa-d, engineers can focus on making sure their formulas are correct, and trust the framework to handle other concerns such as state, retries, and scheduling. This declarative pattern also allows columns and their relationships to be strictly typed, catching invalid transformations immediately as the code is written.
exa-d was built with these developer ergonomics in mind: engineers declare the dependency graph between artifacts, and the framework handles execution automatically.
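As a rough illustration of this declarative pattern, the sketch below assumes a hypothetical decorator-based API; exa-d's real interface may differ. The point is that engineers write only the formula, while the framework owns the graph, state, retries, and scheduling:

```python
from typing import Callable

# Registry mapping each derived column to (its inputs, its formula).
REGISTRY: dict[str, tuple[tuple[str, ...], Callable[..., object]]] = {}

def column(*depends_on: str):
    """Declare a derived column and the columns it reads from."""
    def wrap(fn):
        REGISTRY[fn.__name__] = (depends_on, fn)
        return fn
    return wrap

@column("raw_html")
def extracted_text(raw_html: str) -> str:
    # The "formula": how to derive text from HTML (toy implementation).
    return raw_html.replace("<p>", "").replace("</p>", "\n")

@column("extracted_text")
def embedding(extracted_text: str) -> list[float]:
    # Toy embedding; in production this would call a model.
    return [float(len(extracted_text))]

# The framework walks REGISTRY to build the dependency graph, and can
# check each formula's type annotations against its declared inputs.
```

Because each formula declares its inputs and carries type annotations, invalid transformations can be flagged at definition time rather than at runtime.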
The dynamic nature of content on the web, and our need for rapid iteration, mean that our data cannot just be stored as a static record; it must support many kinds of flexible updates and augmentations.
Some parts of the web update daily or even hourly, requiring precise replacement of small sections of the index. If a bug gets introduced into our update pipeline, we want to repair exactly the rows that were affected. Other operations occur at a much larger scale, such as when we ship a new model and calculate new embeddings over the entire index.
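The sketch below contrasts those two scales of update against a hypothetical in-memory row store; repair_rows and backfill are illustrative names, not exa-d's API:

```python
from datetime import datetime, timezone
from typing import Callable

Row = dict[str, object]

def repair_rows(table: dict[str, Row],
                predicate: Callable[[Row], bool],
                fix: Callable[[Row], Row]) -> None:
    """Surgical repair: recompute only the rows a bad deploy touched."""
    for key, row in table.items():
        if predicate(row):
            table[key] = fix(row)

def backfill(table: dict[str, Row], transform: Callable[[Row], Row]) -> None:
    """Full-index backfill, e.g. new embeddings after a model upgrade."""
    for key, row in table.items():
        table[key] = transform(row)

pages: dict[str, Row] = {
    "https://a.example": {"embedding": [0.1],
                          "updated": datetime(2025, 1, 3, tzinfo=timezone.utc)},
    "https://b.example": {"embedding": [0.2],
                          "updated": datetime(2025, 1, 9, tzinfo=timezone.utc)},
}

# Repair exactly the rows written while the buggy pipeline was live.
bad_deploy = datetime(2025, 1, 8, tzinfo=timezone.utc)
repair_rows(pages,
            lambda r: r["updated"] >= bad_deploy,
            lambda r: {**r, "embedding": [0.0]})
```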