Exa-d: How to Store the Web in S3
Building a modern search engine requires ingesting the entire web and keeping it queryable as it changes in real time. The web has a few properties that make this challenging.
To ensure our index stays current, our crawlers must detect changes on the web, reprocess pages, and regenerate embeddings before queries arrive. Each change triggers a messy cascade of derived features (embeddings, extracted text, metadata), each with its own dependencies and update logic.
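To make that cascade concrete, here is a minimal sketch of how a single detected change fans out into derived-feature recomputation. The names (DERIVES, features_to_refresh) and the graph itself are hypothetical, not exa-d's actual API:

```python
from collections import deque

# Hypothetical map from each feature to the features derived from it.
DERIVES: dict[str, list[str]] = {
    "raw_html": ["extracted_text", "metadata"],
    "extracted_text": ["embedding"],
    "metadata": [],
    "embedding": [],
}

def features_to_refresh(changed: str) -> list[str]:
    """Walk the cascade: everything downstream of the changed feature."""
    order: list[str] = []
    seen = {changed}
    queue = deque([changed])
    while queue:
        for dependent in DERIVES[queue.popleft()]:
            if dependent not in seen:
                seen.add(dependent)
                order.append(dependent)
                queue.append(dependent)
    return order

# A crawler that sees a page's HTML change would refresh, in order:
# ['extracted_text', 'metadata', 'embedding']
print(features_to_refresh("raw_html"))
```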
How do you store and retrieve information from the web in a database?
In this post, we will walk through exa-d, our in-house data processing framework designed to handle this complexity at scale.
Before building exa-d, we evaluated traditional data management stacks: data warehouses, SQL transformation layers, and orchestrators. We ultimately decided to build our own data framework, optimized around a specific set of priorities.
At Exa, many team members need to simultaneously iterate on new search signals derived from existing data. If each team member wrote bespoke scripts for calculating and updating different columns, this would not only lead to excessive code duplication, but also hamper iteration speed by making it difficult to predict the downstream impact of a change.
A core design choice for exa-d was that engineers interact by declaring relationships between data, not the steps to update them. A good analogy here is to spreadsheets, where formulas reference other cells. In exa-d, engineers can focus on making sure their formulas are correct, and trust the framework to handle other concerns such as state, retries, and scheduling. This declarative pattern also allows columns and their relationships to be strictly typed, catching invalid transformations immediately as the code is written.
exa-d was built with these developer ergonomics in mind: engineers declare the dependency graph between artifacts, and the framework handles execution automatically.
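As a rough illustration of this declarative pattern, the sketch below assumes a hypothetical decorator-based API; exa-d's real interface may differ. The point is that engineers write only the formula, while the framework owns the graph, state, retries, and scheduling:

```python
from typing import Callable

# Registry mapping each derived column to (its inputs, its formula).
REGISTRY: dict[str, tuple[tuple[str, ...], Callable[..., object]]] = {}

def column(*depends_on: str):
    """Declare a derived column and the columns it reads from."""
    def wrap(fn):
        REGISTRY[fn.__name__] = (depends_on, fn)
        return fn
    return wrap

@column("raw_html")
def extracted_text(raw_html: str) -> str:
    # The "formula": how to derive text from HTML (toy implementation).
    return raw_html.replace("<p>", "").replace("</p>", "\n")

@column("extracted_text")
def embedding(extracted_text: str) -> list[float]:
    # Toy embedding; in production this would call a model.
    return [float(len(extracted_text))]

# The framework walks REGISTRY to build the dependency graph, and can
# check each formula's type annotations against its declared inputs.
```

Because each formula declares its inputs and carries type annotations, invalid transformations can be flagged at definition time rather than at runtime.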
The dynamic nature of content on the web, and our need for rapid iteration, mean that our data cannot just be stored as a static record; it must support many kinds of flexible updates and augmentations.
Some parts of the web update daily or even hourly, requiring precise replacement of small sections of the index. If a bug gets introduced into our update pipeline, we want to repair exactly the rows that were affected. Other operations occur at a much larger scale, such as when we ship a new model and calculate new embeddings over the entire index.
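The sketch below contrasts those two scales of update against a hypothetical in-memory row store; repair_rows and backfill are illustrative names, not exa-d's API:

```python
from datetime import datetime, timezone
from typing import Callable

Row = dict[str, object]

def repair_rows(table: dict[str, Row],
                predicate: Callable[[Row], bool],
                fix: Callable[[Row], Row]) -> None:
    """Surgical repair: recompute only the rows a bad deploy touched."""
    for key, row in table.items():
        if predicate(row):
            table[key] = fix(row)

def backfill(table: dict[str, Row], transform: Callable[[Row], Row]) -> None:
    """Full-index backfill, e.g. new embeddings after a model upgrade."""
    for key, row in table.items():
        table[key] = transform(row)

pages: dict[str, Row] = {
    "https://a.example": {"embedding": [0.1],
                          "updated": datetime(2025, 1, 3, tzinfo=timezone.utc)},
    "https://b.example": {"embedding": [0.2],
                          "updated": datetime(2025, 1, 9, tzinfo=timezone.utc)},
}

# Repair exactly the rows written while the buggy pipeline was live.
bad_deploy = datetime(2025, 1, 8, tzinfo=timezone.utc)
repair_rows(pages,
            lambda r: r["updated"] >= bad_deploy,
            lambda r: {**r, "embedding": [0.0]})
```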