Scaling PostgreSQL without Microservices: Lessons from Notion’s 480 Shards

Source: Dev.to

I’ve been using Notion to manage my projects for a long time; it’s a faithful friend in my workflow. Recently, while studying database scaling, a thought hit me: how does the "manager" manage itself? With millions of users reading and writing data every second, the infrastructure behind the scenes must be immense. I decided to dive deep into their architecture, and here is what I learned about the scaling strategy that keeps Notion running.

📝 TL;DR: Scaling Notion’s Monolith

I spent the last few days deconstructing how Notion scaled their PostgreSQL database to handle billions of blocks while keeping their Node.js monolith. Here is the blueprint of what I learned:

- Application-Level Sharding: Instead of one massive DB, they use 480 logical shards mapped to a smaller set of physical nodes.
- The Shard Router: The routing logic lives in the TypeScript code, using simple space_id % 480 arithmetic to route requests instantly.
- PgBouncer: They use this as a "traffic controller" to pool connections and prevent the database from choking under high load.
- Zero-Downtime Migrations: I broke down how they moved billions of rows using a "Shadow Write" strategy to keep the app live during the transition.

The Architecture at a Glance

Chapter 1: The Problem with the Monolith

In its early days, Notion followed a simple architecture: a Node.js backend paired with a single PostgreSQL instance. But eventually, they hit the ceiling:

- CPU Saturation: Daily spikes were hitting 90%+.
- The Vacuum Problem: Autovacuum couldn't keep up, risking a Transaction ID Wraparound, a state where the database stops accepting writes to prevent corruption.

Chapter 2: Why Not Microservices?

The common logic is: split the code, split the load.
Notion took the opposite approach. They kept the monolithic backend to maintain operational velocity and data locality (essential for their complex "block" graph) and focused entirely on sharding the persistence layer.

Chapter 3: The 480-Shard Blueprint

The "pro move" here was decoupling data from hardware using logical shards:

- The Key: They used space_id as the partition key, so all data for one workspace stays together for fast joins.
- The Setup: They created 480 independent schemas (logical shards) and distributed them across 32 physical AWS RDS instances.
- The Result: When a server gets overwhelmed, they just "pick up" a logical schema and move it to a new server. Linear scalability.

Chapter 4: The Great "Shadow" Migration

How do you move 480 terabytes of data while the plane is mid-flight?

- The Backfill: Historical data moved to the shards in the background.
- Double Writing: The code wrote new changes to both the old DB and the new shards simultaneously.
- The Cutover: Once a comparison engine verified the data was identical, they flipped the switch.
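The double-writing phase above can be sketched in a few lines. This is a minimal illustration, not Notion's actual code: the names (DoubleWriter, MemStore, Block) are mine, and I'm assuming the old database stays the source of truth until cutover, so a failed shadow write is logged for the comparison engine rather than surfaced to the user.

```typescript
type Block = { id: string; spaceId: string; content: string };

interface Store {
  write(block: Block): Promise<void>;
}

// Tiny in-memory stand-in for a real database, used for the demo.
class MemStore implements Store {
  rows = new Map<string, Block>();
  async write(b: Block): Promise<void> {
    this.rows.set(b.id, b);
  }
}

class DoubleWriter {
  constructor(
    private legacy: Store,              // the original monolithic DB
    private sharded: Store,             // the new sharded fleet
    private log: (msg: string) => void, // divergences feed the comparison engine
  ) {}

  // Every live write goes to both stores. The legacy write must succeed
  // (it is still the source of truth); a shadow-write failure is only
  // logged so it can be reconciled before the cutover.
  async write(block: Block): Promise<void> {
    await this.legacy.write(block);
    try {
      await this.sharded.write(block);
    } catch (err) {
      this.log(`shadow write failed for block ${block.id}: ${err}`);
    }
  }
}
```

The ordering is the important design choice: because the old DB is written first and awaited, user-visible behavior is unchanged even if the new shards are slow or down during the migration.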
Some Future Improvements

Scaling to 96 Nodes

By 2023, the original 32 servers hit their limit. Because Notion had 480 logical shards, scaling was simple: they tripled capacity to 96 nodes. They didn't change any application code; they just redistributed the shards across the larger fleet. This is the beauty of linear scalability.
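This two-level scheme, and the reason the re-balance needed no code changes, can be sketched as follows. A hedged illustration, not Notion's real router: space_id is in practice a UUID string, so I hash it to an integer before the modulo (FNV-1a here is my choice), and the shard-to-host mapping is a simple range split that assumes the host count divides 480 evenly, as both 32 and 96 do.

```typescript
const LOGICAL_SHARDS = 480;

// Deterministic 32-bit FNV-1a hash so every app process routes identically.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h;
}

// space_id -> logical shard (schema). Fixed at 480 forever, so this
// function never changes as hardware grows.
function logicalShard(spaceId: string): number {
  return fnv1a(spaceId) % LOGICAL_SHARDS;
}

// logical shard -> physical host. This map is the ONLY thing that moves
// during a re-balance: 480/32 = 15 schemas per host originally,
// 480/96 = 5 per host after tripling capacity.
function physicalHost(shard: number, hostCount: number): number {
  return Math.floor(shard / (LOGICAL_SHARDS / hostCount));
}
```

Because callers only ever see `logicalShard`, going from 32 to 96 hosts is purely an operational change to the shard-to-host map; the routing math the application depends on is untouched.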
It’s Blocks All the Way Down

Why the rapid growth? In Notion, everything is a block. A single page is actually a tree of dozens of individual units (text, toggles, images). Notion isn't just scaling pages; it is managing billions of atomic blocks.
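The block model above is worth making concrete. A minimal sketch with illustrative field names (not Notion's actual schema): every unit on a page is a row pointing at its parent, and every row carries the space_id partition key so an entire workspace's tree lands on the same shard.

```typescript
type BlockType = "page" | "text" | "toggle" | "image";

interface BlockRow {
  id: string;
  type: BlockType;
  spaceId: string;         // partition key: keeps the whole tree co-located
  parentId: string | null; // null for a top-level page
  content: string;
}

// Counting the atomic blocks behind one "page" shows why block counts
// grow much faster than page counts.
function countBlocks(rows: BlockRow[], rootId: string): number {
  // Index children by parent so the walk is O(n).
  const children = new Map<string | null, BlockRow[]>();
  for (const r of rows) {
    const list = children.get(r.parentId) ?? [];
    list.push(r);
    children.set(r.parentId, list);
  }
  const walk = (id: string): number =>
    1 + (children.get(id) ?? []).reduce((n, c) => n + walk(c.id), 0);
  return walk(rootId);
}
```

Even this four-row toy page is four blocks, not one, which is the whole point: storage grows with blocks, and blocks per page multiply quickly.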
Data Lakes & Connection Hubs

- Analytics: To run reports across 480 separate logical databases, Notion piped everything into a central data lake using tools like Fivetran and Snowflake.
- Networking: They used PgBouncer for connection pooling, preventing the backend from choking while trying to talk to hundreds of shards at once.
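To make the PgBouncer role concrete, here is a sketch of what such a pooling layer's configuration can look like. The hostnames, entry names, and pool sizes are made up for illustration and are not Notion's actual settings; only the option names are real PgBouncer options.

```ini
; Illustrative pgbouncer.ini fragment (values are assumptions, not Notion's).
[databases]
; one entry per physical Postgres host, each host serving many schemas
shard_host_00 = host=pg-00.example.internal port=5432 dbname=notion
shard_host_01 = host=pg-01.example.internal port=5432 dbname=notion

[pgbouncer]
pool_mode = transaction      ; hand the server connection back after each transaction
default_pool_size = 20       ; server connections kept per user/database pair
max_client_conn = 10000      ; thousands of app connections funnel into a few server ones
```

Transaction pooling is the key idea: many short-lived application connections share a small, fixed set of real Postgres connections, so the database never sees the full fan-out of the backend fleet.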
Wrapping Up

Notion’s journey proves that you don't always need to chase the latest architectural trends. By focusing on the actual bottleneck, the persistence layer, they scaled to billions of blocks while keeping their team lean and their code manageable. I am in the early stages of my engineering journey and would be happy to learn from and contribute to more conversations like this. Do share your thoughts in the comments!