# How GitHub Broke Apart Its Massive Database — Without Anyone Noticing
2026-02-07
## 1. Context — Why Should You Care?

Imagine you live in a city with one single hospital. Every person — from a child with a cold, to someone needing heart surgery, to someone getting an eye test — goes to the same building, the same reception desk, the same set of doctors. When the city had 10,000 people, this was fine. But now the city has 10 million people. The waiting room is overflowing.
The reception computer is crashing. A patient getting a routine blood test is accidentally blocking the queue for someone who needs emergency surgery. You don't shut down the hospital and build new ones. People are already inside, being treated. You need to split the hospital into specialized clinics — a heart clinic, an eye clinic, a general clinic — while patients are still being treated, without anyone noticing the transition.

That is exactly what GitHub did with its database. GitHub — the platform where 100+ million developers store their code — was running on a single, massive MySQL database. As it grew, the database started groaning under the weight. Queries were slow. One team's heavy workload was ruining performance for everyone else. Something had to change. But GitHub is a 24/7 platform. They couldn't just "turn it off for maintenance." They had to split their giant database into smaller, independent databases — while the platform was live and millions of developers were pushing code.

### What You Will Learn in This Blog

If you ever work at a company that is growing fast, you will face database scaling problems. Understanding how GitHub solved this gives you a mental framework for one of the hardest problems in backend engineering — scaling databases without downtime.

## 2. Jargon & Terminology Breakdown

Before we touch any concept, let's make sure every single term is crystal clear. Read this section like a mini-dictionary. Come back to it anytime you get confused later.

- **Monolithic Database**: a single database (or cluster) that holds all of an application's data.
- **Sharding**: splitting one large database into multiple smaller, independent databases.
- **Schema**: the definitions of tables and columns that describe how data is structured.
- **Schema Domain**: a named group of closely related tables belonging to one functional area (e.g., Notifications).
- **Cluster**: a set of database servers (one primary plus replicas) acting as one logical database.
- **Primary (Server)**: the server in a cluster that accepts writes.
- **Replica (Server)**: a server that holds a copy of the primary's data and serves reads.
- **Replication**: the continuous streaming of changes from one server to another.
- **Replication Lag**: how far behind a replica is relative to its source.
- **ProxySQL**: an open-source proxy that sits between applications and MySQL and routes queries.
- **Cut-Over**: the moment traffic is switched from the old database to the new one.
- **Snapshot**: a point-in-time copy of a set of tables.
- **Noisy Neighbor Problem**: one workload degrading others because they share the same resources.
- **Query Linter**: a tool that automatically scans queries and flags rule violations.
- **Cross-Domain Join**: a query that joins tables from two different schema domains.
- **Downtime**: a period during which the service is unavailable to users.

## 3. The Big Picture — High-Level Mental Model

Before diving into the details, let's understand the overall story in simple terms.

### The Problem

GitHub's entire platform — repositories, pull requests, issues, users, notifications, everything — was stored in one MySQL database cluster. Think of it as one massive warehouse where every department stores their goods. As GitHub grew to serve 100+ million developers, two problems became unbearable: the sheer volume of queries, and the noisy neighbor problem (one workload degrading everyone else's performance).

### The Solution (at 30,000 feet)

GitHub decided to shard — break the one giant database into many smaller, independent databases, each responsible for one "domain" of data.
But here's the brilliance: they didn't do it all at once. They did it in two careful phases:

- **Phase 1 — Virtual Partitioning (logical separation):** "Before we physically move anything, let's first draw boundaries inside the existing database and make sure no one crosses them."
- **Phase 2 — Physical Partitioning (actual migration):** "Now that boundaries are clean, let's physically move each domain to its own database cluster — without downtime."

### The Analogy

Imagine you have a huge shared apartment with 6 roommates. Everyone's stuff is everywhere — someone's books are on your shelf, your clothes are in their closet.

Phase 1 (Virtual Partitioning): Before anyone moves out, you first sort everything. Each person labels their stuff. You make sure no one is using someone else's belongings. You draw invisible boundaries.

Phase 2 (Physical Partitioning): Now that everything is sorted, each person moves into their own apartment — taking only their labeled stuff, without any mix-ups.

## 4. Concept-by-Concept Deep Dive

### 4.1 — The Starting Point: GitHub's Monolithic MySQL Database

GitHub stored all its data — for repositories, users, issues, pull requests, notifications, actions, and more — in a single MySQL database cluster.

**Why this existed.** Every startup and growing company starts here. A monolithic database is simple to operate, easy to reason about, and lets any table join any other. It's the natural starting point. The problem isn't that GitHub chose this — it's that they outgrew it.

**The problems it caused.**

Problem 1: Query Volume. Millions of developers, millions of repositories, billions of commits. The single database was receiving a staggering volume of queries — reads and writes, all funneled into the same cluster.

Problem 2: Noisy Neighbor. This is the critical one. Let's say the "Notifications" system runs a heavy batch query to send email digests every morning. While this query runs, it hogs database resources (CPU, memory, I/O). At the same time, a developer is trying to push code to a repository. The push is slow or fails — not because of anything wrong with the "Repositories" tables, but because the "Notifications" tables are hogging resources.

**Common mistakes & misunderstandings.**

❌ "Monolithic databases are bad."
No. They're fine when you're small or medium-sized. The problem is when you scale to GitHub-level traffic. Don't prematurely shard — it adds enormous complexity.

❌ "Just buy a bigger server."
This is called "vertical scaling" and it has a ceiling. There's only so much CPU and RAM you can add to a single machine. And it doesn't solve the noisy neighbor problem — all domains still share the same resources.

### 4.2 — Phase 1: Virtual Partitioning (Logical Separation)

Virtual partitioning means drawing invisible boundaries inside the existing database so that different groups of tables (schema domains) stop interacting with each other — even though they still physically live in the same database.

**Why it exists.** You can't just rip tables out of a database and move them elsewhere if your application code is constantly doing queries that JOIN tables from different domains, or running transactions that span multiple domains. If "Repository" code is doing a JOIN with "Notifications" tables, and you move the Notifications tables to a separate database, that JOIN will break. The application will crash. So before physically moving anything, you need to guarantee that each domain is self-contained — it doesn't depend on tables from other domains.

**How it works — step by step.**

Step 1: Define Schema Domains. GitHub categorized their tables into logical groups. Each domain contains only the tables that are tightly related to each other.

Step 2: Enforce Boundaries — Eliminate Cross-Domain Queries. Once domains were defined, GitHub's goal was: no query or transaction should touch tables from more than one domain. A cross-domain join (see the example queries at the end of this post) reaches across two domains; if we later move Notifications to a separate database, that query will break. A clean query touches tables from a single domain only.

Step 3: Use Query Linters in Dev/Test Environments. GitHub built a tool — a query linter — that automatically scanned every database query in the codebase during development and testing. If a developer wrote a query that crossed domain boundaries, the linter would flag it with an error — just like a spell-checker flags a misspelled word.
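The linter idea can be sketched in a few lines. This is a minimal illustration, not GitHub's actual tool — the table-to-domain mapping and the regex-based table extraction are hypothetical simplifications:

```python
# Minimal sketch of a cross-domain query linter. Illustrative only:
# the domain mapping and regex extraction are simplified assumptions.
import re

# Hypothetical mapping of tables to schema domains.
TABLE_DOMAINS = {
    "repositories": "repos", "branches": "repos", "commits": "repos",
    "notifications": "notifications", "email_preferences": "notifications",
    "users": "users", "profiles": "users",
}

def referenced_tables(sql: str) -> set:
    """Crude extraction of table names following FROM/JOIN keywords."""
    return set(re.findall(r"\b(?:FROM|JOIN)\s+(\w+)", sql, re.IGNORECASE))

def lint(sql: str) -> list:
    """Return violations: a query must stay inside one schema domain."""
    domains = {TABLE_DOMAINS[t] for t in referenced_tables(sql)
               if t in TABLE_DOMAINS}
    if len(domains) > 1:
        return [f"cross-domain query touches: {sorted(domains)}"]
    return []

# Flags the repos ↔ notifications join; passes the single-domain query.
print(lint("SELECT r.name, n.message FROM repositories r "
           "JOIN notifications n ON r.user_id = n.user_id"))
print(lint("SELECT r.name FROM repositories r "
           "JOIN branches b ON r.id = b.repository_id"))
```

A real linter would parse SQL properly (aliases, subqueries, CTEs), but the principle is the same: map every referenced table to its domain and fail the build if more than one domain appears.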
Step 4: Production Alerts for Cross-Domain Transactions. Even after linting, some cross-domain transactions might slip through to production (maybe from older code, or edge cases). GitHub added monitoring in production that raised alerts when a transaction spanned multiple schema domains. This didn't block the transaction (that would cause downtime), but it notified the team so they could fix it. Think of it like a security camera: it doesn't physically stop the thief, but it records and alerts so you can respond.

**Real-world analogy.** Imagine a company with one giant open-plan office. Marketing, Engineering, Sales, and HR all share the same space, same printer, same coffee machine. Virtual partitioning is like assigning each team its own clearly marked zone and making a rule that no team may use another team's equipment. Nobody has physically moved yet. Everyone is still in the same building. But the boundaries are now clear and enforced.

**Common mistakes & misunderstandings.**

❌ "Virtual partitioning alone solves the scaling problem."
No. The data is still on the same physical server. You still have shared resources and the noisy neighbor problem. Virtual partitioning is a prerequisite for physical partitioning, not a substitute.

❌ "We can skip virtual partitioning and directly move tables."
Dangerous. If your code has cross-domain dependencies, moving tables will break your application. Virtual partitioning ensures you're safe to move.

### 4.3 — Phase 2: Physical Partitioning (The Actual Migration)

Physical partitioning means actually moving a schema domain's tables from the original database cluster to a brand-new, independent database cluster — while the application is live and serving users.

**Why it exists.** Virtual partitioning drew boundaries, but everything is still on the same server. The noisy neighbor problem persists. Physical partitioning gives each domain its own hardware resources — its own CPU, memory, and disk I/O. After physical partitioning, the "Notifications" domain running a heavy query will only affect its own cluster, not the "Repositories" cluster.

**How it works — step by step (the 5-step process).** This is the core of GitHub's engineering achievement. Let's go step by step. For clarity, let's say we're migrating the "Notifications" domain from the old cluster (Cluster A) to a new cluster (Cluster B).

Step 1: Take a Snapshot of the Domain's Tables from Cluster A. A snapshot is like taking a photograph of the Notifications tables at a specific moment in time. It captures all the rows, all the data — as it exists right now. Important: while the snapshot is being taken, Cluster A continues serving traffic normally. No downtime.

Step 2: Load the Snapshot into Cluster B. The snapshot is loaded into the new Cluster B — including its primary and replica servers. At this point, Cluster B has the Notifications data, but it's frozen at the time the snapshot was taken. Any changes that happened on Cluster A after the snapshot are not yet in Cluster B.

Step 3: Set Up Replication from Cluster A → Cluster B. Now GitHub sets up live replication from Cluster A's primary to Cluster B's primary. This means any new changes to Notifications data on Cluster A are automatically streamed to Cluster B. Cluster B is essentially a replica of Cluster A — but only for the Notifications tables. It's catching up on all the changes that happened since the snapshot.
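As a rough sketch, the replication step could look like the following, run against Cluster B's primary. The statements use real MySQL 8.0 replication syntax, but the hostnames, the `notifications_db` schema name, and the overall procedure are illustrative assumptions — GitHub has not published its exact commands:

```python
# Illustrative only: configure Cluster B's primary as a table-filtered
# replica of Cluster A's primary, applying just the Notifications tables.
# Host names and the 'notifications_db' schema are hypothetical.
REPLICATION_SETUP = [
    # Point Cluster B at Cluster A (MySQL 8.0.23+ syntax).
    "CHANGE REPLICATION SOURCE TO "
    "SOURCE_HOST='cluster-a-primary.internal', "
    "SOURCE_USER='repl', SOURCE_AUTO_POSITION=1",
    # Only apply changes for the Notifications domain's tables.
    "CHANGE REPLICATION FILTER REPLICATE_WILD_DO_TABLE = ('notifications_db.%')",
    # Begin streaming and applying changes made since the snapshot.
    "START REPLICA",
]

for statement in REPLICATION_SETUP:
    print(statement)  # in practice: execute each on Cluster B's primary
```

The filter is the key design choice: Cluster B receives Cluster A's full change stream but applies only the Notifications domain's writes, so it converges on exactly the data it will eventually own.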
Step 4: Redirect Traffic via ProxySQL (But Still to Cluster A). Here's the clever part. GitHub now updates the application to route all Notifications-related queries through Cluster B's ProxySQL. But — and this is key — that ProxySQL is configured to forward all queries back to Cluster A. Why? Because Cluster B might not be fully caught up yet. You can't serve queries from Cluster B if it's behind Cluster A. This is like changing the receptionist — the new receptionist (Cluster B's ProxySQL) is sitting at the front desk, but for now, she's forwarding all patients to the old hospital. The patients (the application) don't notice any difference. Why do this intermediate step? Because when it's time for the final cut-over, the application is already talking to Cluster B's ProxySQL. You only need to change ProxySQL's routing from "forward to A" → "serve from B directly." No application code changes needed.

Step 5: The Cut-Over (The Big Moment). This is the most critical step. Let's break it down into micro-steps:

① Check Replication Lag:
Before initiating the cut-over, GitHub monitors the replication lag. The cut-over only begins when the lag is less than 1 second — meaning Cluster B is almost perfectly in sync with Cluster A.

② Temporarily Block All Requests:
ProxySQL briefly holds all incoming queries. No queries reach either Cluster A or Cluster B. This creates a tiny window where no new writes happen. Think of it like a traffic cop stopping all cars at an intersection for 2 seconds to let an ambulance pass.

③ Wait for Full Synchronization:
In this blocked state (which lasts milliseconds), Cluster B processes the final remaining replication data from Cluster A. After this, Cluster B has 100% of the Notifications data — exactly matching Cluster A.

④ Stop Replication:
The replication link from A to B is cut. Cluster B is now a standalone, independent cluster. It no longer needs Cluster A.

⑤ Update ProxySQL Routing:
ProxySQL's configuration is updated: instead of forwarding queries to Cluster A, it now sends them directly to Cluster B's primary.

⑥ Unblock Requests:
All the held queries are released. They flow to Cluster B, which is now the authoritative database for Notifications.

The entire process — from blocking requests to unblocking — takes less than 100 milliseconds. That's 0.1 seconds. A human blink takes 300-400 milliseconds. The "downtime" is shorter than a blink.

**Real-world analogy (complete).** Let's use the hospital analogy one final time.

Snapshot: You photocopy all Cardiology patient records from the main hospital's filing cabinet.

Load into new clinic: You bring the photocopies to the new Heart Clinic across the street and set up their filing cabinet.

Replication: You set up a live fax machine between the main hospital and the Heart Clinic. Any new Cardiology records added to the main hospital are automatically faxed to the Heart Clinic.

Redirect reception: You tell the main hospital's receptionist: "When a Cardiology patient comes in, send them to the Heart Clinic's receptionist." But the Heart Clinic's receptionist, for now, sends them back to the main hospital (because the Heart Clinic isn't fully set up yet).

Cut-over: The Heart Clinic is fully caught up. In one swift moment, you pause new arrivals, confirm the last faxed records have been filed, unplug the fax line, and tell reception to send Cardiology patients straight to the Heart Clinic. Nobody in the waiting room even noticed the switch.

**Common mistakes & misunderstandings.**

❌ "The cut-over requires the app to be shut down."
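The six micro-steps of the cut-over can be condensed into one orchestration sketch. Everything here is a hypothetical stand-in — `proxy`, `cluster_a`, and `cluster_b` represent ProxySQL admin operations and replication checks that GitHub's real (internal) tooling performs:

```python
# Hypothetical sketch of the cut-over sequence (steps ① through ⑥).
# The proxy/cluster objects stand in for real ProxySQL admin commands
# and MySQL replication checks; this is not GitHub's actual code.
import time

MAX_LAG_SECONDS = 1.0

def cut_over(proxy, cluster_a, cluster_b):
    # ① Only begin when Cluster B is nearly in sync with Cluster A.
    while cluster_b.replication_lag() >= MAX_LAG_SECONDS:
        time.sleep(0.1)
    # ② Briefly hold all incoming queries at ProxySQL.
    proxy.block_traffic()
    try:
        # ③ Drain the last replicated writes from A into B.
        cluster_b.wait_until_synced_with(cluster_a)
        # ④ Cut the replication link; B is now standalone.
        cluster_b.stop_replication()
        # ⑤ Route queries directly to Cluster B's primary.
        proxy.route_to(cluster_b)
    finally:
        # ⑥ Release the held queries; the total pause is ~100 ms.
        proxy.unblock_traffic()
```

Note the `try/finally`: whatever happens mid-switch, traffic must be unblocked, because a stuck block is exactly the downtime the whole design avoids.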
No. ProxySQL acts as a middleman. The app doesn't know or care whether queries go to Cluster A or B. ProxySQL handles the routing transparently.

❌ "100 milliseconds sounds too good to be true."
It's achievable because most of the work is done before the cut-over. The snapshot, loading, replication, and catch-up happen over hours or days. The cut-over itself is just: block → sync the last few transactions → switch routing → unblock.

❌ "What about data loss?"
The blocking step ensures no new writes happen during the switch. The synchronization step ensures Cluster B has 100% of the data. There is no window where data can be lost.

❌ "Can you do this for all domains at once?"
GitHub did it one domain at a time. Migrating everything at once would be too risky. Each domain was virtually partitioned, validated, and then physically migrated independently.

## 5. Visual Explanation Section

The diagrams referenced throughout this post are collected at the end of the article, one for each stage of the strategy, from the monolithic cluster through the final cut-over.

## 6. Realistic Use Cases

GitHub's approach isn't unique to GitHub. The pattern of virtual partitioning → physical partitioning is used widely in the industry wherever a large shared database must be split without taking the product offline.

## 7. Connecting All Concepts Together

Let's zoom out and see how everything connects into one coherent system. GitHub had one giant database that was struggling. They couldn't just split it randomly — their code had queries reaching across different parts of the database. So first, they drew invisible boundaries (virtual partitioning) and made sure nothing crossed them. Once a domain was cleanly isolated, they physically moved it to its own dedicated server using a snapshot-replicate-cutover strategy that caused less than 100ms of disruption. They repeated this process for each domain, one at a time, until the monolith was fully broken up.

## 8. Final Summary — Professor Style

Let me recap what we've learned today. If you remember nothing else, remember these five things:

1. Monolithic databases are a fine starting point, but they eventually become bottlenecks as you scale — both in performance (query volume) and in isolation (noisy neighbor problem).
2. Sharding is the solution, but you can't just rip tables apart. You need a disciplined approach.
3. Virtual Partitioning comes first. Before you physically move data, you must logically isolate each schema domain. Kill cross-domain queries. Kill cross-domain transactions. Use linters and monitoring to enforce this.
4. Physical Partitioning is the actual migration. It follows a careful 5-step process: Snapshot → Load → Replicate → Redirect → Cut-over. The magic is in ProxySQL (the middleman that makes the switch invisible to the application).
5. The cut-over takes less than 100ms. This is possible because all the heavy lifting (snapshot, replication, catch-up) happens before the cut-over.
The cut-over itself is just: block briefly → sync → switch → unblock.

## Wrapping Up

If this topic excites you, dig deeper into MySQL replication, ProxySQL, and zero-downtime migration patterns, the building blocks of everything described here.

What GitHub achieved is genuinely impressive engineering. They didn't shut down a platform used by 100+ million developers. They didn't lose a single byte of data. They didn't introduce hours of downtime. They drew boundaries. They enforced discipline. They moved carefully, one domain at a time. And the result? A database architecture that can scale with them for the next decade.

The next time someone tells you "just shard the database," you'll know it's not that simple — and you'll know exactly what it takes.

Thanks for reading. If this blog helped you understand database sharding, consider bookmarking it and sharing it with someone who's preparing for system design interviews. This is the kind of knowledge that separates good engineers from great ones.

**Appendix: Diagrams and Code Examples**
┌─────────────────────────────────────────────────┐
│ GITHUB'S SHARDING STRATEGY │
│ │
│ ┌──────────────────────────────────────────┐ │
│ │ PHASE 1: VIRTUAL PARTITIONING │ │
│ │ ───────────────────────────────── │ │
│ │ • Define schema domains │ │
│ │ • Enforce boundaries (no cross-domain │ │
│ │ queries or transactions) │ │
│ │ • Use linters + alerts to catch │ │
│ │ violations │ │
│ │ • All still in ONE physical database │ │
│ └──────────────┬───────────────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────────────────┐ │
│ │ PHASE 2: PHYSICAL PARTITIONING │ │
│ │ ───────────────────────────────── │ │
│ │ • Snapshot domain tables │ │
│ │ • Set up new cluster │ │
│ │ • Replicate data │ │
│ │ • Redirect traffic via ProxySQL │ │
│ │ • Cut-over in < 100ms │ │
│ └──────────────────────────────────────────┘ │
│ │
│ Result: Independent databases per domain, │
│ zero visible downtime │
└─────────────────────────────────────────────────┘
┌─────────────────────────────────────────────────────────┐
│ SINGLE DATABASE CLUSTER │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌────────┐ │
│ │ Repos │ │ Users │ │ Notifs │ │ Issues │ │
│ │ Tables │ │ Tables │ │ Tables │ │ Tables │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ └───┬────┘ │
│ │ │ │ │ │
│ └──────────────┴──────┬───────┴─────────────┘ │
│ │ │
│ SHARED RESOURCES │
│ (CPU, Memory, Disk I/O) │
│ │ │
│ ┌──────────────┴──────────────┐ │
│ │ 🔴 NOISY NEIGHBOR EFFECT │ │
│ │ Heavy Notifs query slows │ │
│ │ down Repos, Users, Issues │ │
│ └─────────────────────────────┘ │
└─────────────────────────────────────────────────────────┘
┌─────────────────────────────────────────────┐
│ SCHEMA DOMAIN EXAMPLES │
│ │
│ Domain: "Repositories" │
│ ├── repositories table │
│ ├── commits table │
│ ├── branches table │
│ └── pull_requests table │
│ │
│ Domain: "Users" │
│ ├── users table │
│ ├── profiles table │
│ └── sessions table │
│ │
│ Domain: "Notifications" │
│ ├── notifications table │
│ ├── email_preferences table │
│ └── notification_logs table │
│ │
│ Domain: "Issues" │
│ ├── issues table │
│ ├── comments table │
│ └── labels table │
└─────────────────────────────────────────────┘
-- ❌ BAD: This query joins Repositories tables with Notifications tables
SELECT r.name, n.message
FROM repositories r
JOIN notifications n ON r.user_id = n.user_id
WHERE r.id = 12345;
-- ✅ GOOD: This query only touches Repositories tables
SELECT r.name, b.branch_name
FROM repositories r
JOIN branches b ON r.id = b.repository_id
WHERE r.id = 12345;
┌─────────────────────────────────────────────────┐
│ QUERY LINTER IN ACTION │
│ │
│ Developer writes query ──► Linter checks it │
│ │
│ ┌──────────────────────────────┐ │
│ │ Query: SELECT * FROM repos │ │
│ │ JOIN notifications ON ... │ │
│ └──────────┬───────────────────┘ │
│ │ │
│ ▼ │
│ ┌──────────────────────────────┐ │
│ │ 🔴 LINTER ERROR: │ │
│ │ "Cross-domain join │ │
│ │ detected! repos domain │ │
│ │ cannot join with │ │
│ │ notifications domain." │ │
│ └──────────────────────────────┘ │
│ │
│ Developer must refactor the query │
│ into two separate domain-specific queries. │
└─────────────────────────────────────────────────┘
┌────────────────────────────────┐
│ CLUSTER A │
│ (Original Database) │
│ │
│ ┌──────────┐ ┌──────────────┐ │
│ │ Repos │ │ Notifications│ │ ──── Snapshot taken
│ │ Tables │ │ Tables │ │ of Notifications
│ └──────────┘ └──────┬───────┘ │ tables only
│ │ │
└──────────────────────┼─────────┘
                       │
                       ▼
              ┌────────────────┐
              │    SNAPSHOT    │
              │ (Point-in-time │
              │    copy of     │
              │ Notifications  │
              │     data)      │
              └────────────────┘
┌────────────────┐
│    SNAPSHOT    │
└───────┬────────┘
        │
        ▼
┌────────────────────────┐
│       CLUSTER B        │
│     (New Database)     │
│                        │
│  ┌──────────────────┐  │
│  │  Notifications   │  │
│  │     Tables       │  │
│  │ (from snapshot)  │  │
│  └──────────────────┘  │
│                        │
│  Primary + Replicas    │
└────────────────────────┘
┌────────────────────┐ ┌────────────────────┐
│ CLUSTER A │ │ CLUSTER B │
│ (Original) │ │ (New) │
│ │ │ │
│ ┌──────────────┐ │ ─────── │ ┌──────────────┐ │
│ │ Primary │──┼─Replication──│ Primary │ │
│ └──────────────┘ │ ─────► │ └──────────────┘ │
│ │ │ │
│ All live changes │ │ Receiving changes │
│ happen here │ │ in real-time │
└────────────────────┘ └────────────────────┘
┌──────────┐ ┌──────────────────┐ ┌────────────┐
│ │ │ CLUSTER B's │ │ │
│ App │────►│ ProxySQL │────►│ CLUSTER A │
│ Server │ │ │ │ (still │
│ │ │ (Middleman - │ │ serving │
│ │ │ routes to A │ │ data) │
│ │ │ for now) │ │ │
└──────────┘     └──────────────────┘     └────────────┘
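The "middleman" trick is easy to express as code. The sketch below is hypothetical (real ProxySQL routes at the MySQL protocol level using hostgroup rules, not a Python class), but it shows why the application never needs to know which cluster is behind the proxy:

```python
# Sketch of the ProxySQL middleman: the app always talks to Cluster B's
# proxy, but the proxy's backend is still Cluster A until the cut-over.
class Proxy:
    def __init__(self, backend):
        self.backend = backend          # where queries currently go

    def query(self, sql):
        return f"{self.backend}: handled {sql!r}"

    def switch_backend(self, new_backend):
        self.backend = new_backend      # the one-line "cut-over"

proxy_b = Proxy(backend="cluster_a")
print(proxy_b.query("SELECT * FROM notifications"))
# cluster_a: handled 'SELECT * FROM notifications'

proxy_b.switch_backend("cluster_b")
print(proxy_b.query("SELECT * FROM notifications"))
# cluster_b: handled 'SELECT * FROM notifications'
```

The application code is identical before and after the switch — that indirection is what makes the migration invisible.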
┌──────────────────────────────────────────────────────────┐
│ CUT-OVER PROCESS │
│ │
│ ① Check replication lag between Cluster A & B │
│ └── Must be < 1 second │
│ │
│ ② Temporarily BLOCK all requests │
│ └── ProxySQL holds all incoming queries briefly │
│ └── This ensures no new data is written to A │
│ │
│ ③ Wait for Cluster B to FULLY synchronize │
│ └── Cluster B processes the last remaining │
│ replicated changes from A │
│ │
│ ④ STOP replication from Cluster A │
│ └── Cluster B is now independent │
│ │
│ ⑤ Update ProxySQL routing │
│ └── Route traffic DIRECTLY to Cluster B's primary │
│ │
│ ⑥ UNBLOCK all requests │
│ └── Traffic now flows to Cluster B │
│ │
│ Total time: < 100 milliseconds │
└──────────────────────────────────────────────────────────┘
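The six steps can be written as one function over toy data structures. Everything here is a stand-in (a list for the binlog, dicts for the replica and proxy — none of these names come from GitHub's tooling), but it shows the key property: all the slow work happened earlier, so the blocked window only covers "drain the last few events and flip a pointer":

```python
def cut_over(source_log, replica, proxy, max_lag_entries=5):
    # Step 1: refuse to start if the replica is too far behind.
    lag = len(source_log) - replica["applied"]
    if lag > max_lag_entries:
        raise RuntimeError("replication lag too high; abort cut-over")

    # Step 2: briefly hold all incoming queries.
    proxy["blocked"] = True

    # Step 3: let the replica drain the last replicated events.
    while replica["applied"] < len(source_log):
        replica["rows"].append(source_log[replica["applied"]])
        replica["applied"] += 1

    # Step 4: stop replication - the replica is now independent.
    replica["replicating"] = False

    # Step 5: point the proxy directly at the new cluster.
    proxy["backend"] = "cluster_b"

    # Step 6: release the held queries.
    proxy["blocked"] = False

source_log = [{"id": i} for i in range(3)]
replica = {"rows": [], "applied": 0, "replicating": True}
proxy = {"backend": "cluster_a", "blocked": False}

cut_over(source_log, replica, proxy)
print(proxy["backend"])    # cluster_b
print(replica["applied"])  # 3 - fully drained before the switch
```

If step 1 fails, nothing has been blocked yet, so aborting is free — which is exactly why the lag check comes first.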
┌─────────────────────────────────────────────────────────────────┐
│ PHYSICAL PARTITIONING - COMPLETE FLOW │
│ │
│ Step 1: Snapshot │
│ ┌──────────┐ │
│ │Cluster A │──── Take snapshot of ────► ┌──────────┐ │
│ │(Original)│ Notifications tables │ Snapshot │ │
│ └──────────┘ └────┬─────┘ │
│ │ │
│ Step 2: Load snapshot into Cluster B │ │
│ ▼ │
│ ┌──────────┐ │
│ │Cluster B │ │
│ │ (New) │ │
│ └────┬─────┘ │
│ │ │
│ Step 3: Set up replication A → B │ │
│ ┌──────────┐ ┌────┴─────┐ │
│ │Cluster A │════ Replication Stream ═══►│Cluster B │ │
│ │ Primary │ │ Primary │ │
│ └──────────┘ └────┬─────┘ │
│ │ │
│ Step 4: Redirect app → Cluster B's ProxySQL │ │
│ ┌──────┐ ┌──────────────┐ ┌────┴─────┐ │
│ │ App │───►│ B's ProxySQL │───────────►│Cluster A │ │
│ └──────┘ └──────────────┘ (still └──────────┘ │
│ routes │
│ to A) │
│ │
│ Step 5: Cut-over (< 100ms) │
│ ┌──────┐ ┌──────────────┐ ┌──────────┐ │
│ │ App │───►│ B's ProxySQL │───────────►│Cluster B │ ✅ │
│ └──────┘ └──────────────┘ (now └──────────┘ │
│ routes │
│ to B!) │
│ │
│ Cluster A no longer handles Notifications. │
│ Cluster B is fully independent. │
└─────────────────────────────────────────────────────────────────┘
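The five steps fit end to end in a few lines. Everything here is a miniature stand-in (dicts for clusters, a list for the binlog); it only demonstrates the ordering that makes the final switch cheap:

```python
# Steps 1-2: snapshot Notifications from Cluster A, load into Cluster B.
cluster_a = {"notifications": [{"id": 1}], "binlog": []}
cluster_b = {"notifications": list(cluster_a["notifications"]), "applied": 0}

# Step 3: replication - new writes on A also reach B.
def write_a(row):
    cluster_a["notifications"].append(row)
    cluster_a["binlog"].append(row)

def replicate():
    while cluster_b["applied"] < len(cluster_a["binlog"]):
        cluster_b["notifications"].append(
            cluster_a["binlog"][cluster_b["applied"]])
        cluster_b["applied"] += 1

# Step 4: the app talks to B's proxy, which still routes to A.
route = {"backend": "A"}

write_a({"id": 2})   # a live write arriving mid-migration
replicate()          # ...is replicated to B before the switch

# Step 5: cut-over - B is in sync, so just flip the route.
route["backend"] = "B"

print(cluster_b["notifications"])  # [{'id': 1}, {'id': 2}]
print(route["backend"])            # B
```

Notice that the write arriving mid-migration still lands on Cluster B — replication, not the snapshot, is what makes live traffic safe.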
BEFORE                            AFTER
──────                            ─────
  ┌──────────┐                      ┌────────────────────────┐
  │   App    │                      │          App           │
  └────┬─────┘                      └───┬───────┬───────┬────┘
       │                                │       │       │
       ▼                                ▼       ▼       ▼
┌──────────────┐               ┌────────┐ ┌────────┐ ┌────────┐
│  ONE GIANT   │               │ProxySQL│ │ProxySQL│ │ProxySQL│
│  DATABASE    │               │  (A)   │ │  (B)   │ │  (C)   │
│              │               └───┬────┘ └───┬────┘ └───┬────┘
│ ┌────┐┌────┐ │                   ▼          ▼          ▼
│ │Repo││User│ │               ┌────────┐ ┌────────┐ ┌────────┐
│ └────┘└────┘ │               │Cluster │ │Cluster │ │Cluster │
│ ┌────┐┌────┐ │               │  for   │ │  for   │ │  for   │
│ │Noti││Issu│ │               │ Repos  │ │ Notifs │ │ Users  │
│ └────┘└────┘ │               └────────┘ └────────┘ └────────┘
└──────────────┘
┌──────────────────────────────────────────────────────────────┐
│ THE COMPLETE PICTURE │
│ │
│ 1. GitHub starts with ONE big MySQL database (Monolith) │
│ │ │
│ ▼ │
│ 2. Problems emerge: Query overload + Noisy neighbors │
│ │ │
│ ▼ │
│ 3. Decision: SHARD the database │
│ │ │
│ ┌─────────┴──────────┐ │
│ ▼ ▼ │
│ 4. PHASE 1: Virtual 5. PHASE 2: Physical │
│ Partitioning Partitioning │
│ ┌─────────────────┐ ┌──────────────────────┐ │
│ │• Define domains │ │• Snapshot tables │ │
│ │• Kill cross- │ │• Load into new │ │
│ │ domain queries │ │ cluster │ │
│ │• Add linters │ │• Set up replication │ │
│ │• Add production │ │• Redirect via │ │
│ │ alerts │ │ ProxySQL │ │
│ │ │ │• Cut-over (<100ms) │ │
│ │ (Prerequisite │ │ │ │
│ │ for Phase 2) │ │ (Repeat per domain) │ │
│ └─────────────────┘ └──────────────────────┘ │
│ │ │
│ ▼ │
│ 6. Result: Multiple independent database clusters, │
│ each serving one domain, zero visible downtime, │
│ no noisy neighbors, independently scalable. │
└──────────────────────────────────────────────────────────────┘
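Phase 1's "add linters" box deserves one concrete illustration. The sketch below flags queries that touch more than one schema domain, given a table-to-domain map. It is a naive regex stand-in — real linters hook into the ORM or query layer, and the table names here are hypothetical:

```python
# Toy query linter: reject any SQL that spans multiple schema domains.
import re

TABLE_DOMAIN = {
    "repositories": "repos",
    "issues": "repos",
    "notifications": "notifications",
    "users": "users",
}

def domains_touched(sql: str) -> set[str]:
    words = re.findall(r"[a-z_]+", sql.lower())
    return {TABLE_DOMAIN[w] for w in words if w in TABLE_DOMAIN}

def lint(sql: str) -> bool:
    """Return True if the query stays within a single schema domain."""
    return len(domains_touched(sql)) <= 1

print(lint("SELECT * FROM notifications WHERE user_id = 7"))
# True - only the notifications domain

print(lint("SELECT * FROM notifications JOIN repositories ON repo_id = id"))
# False - a cross-domain join that would block the physical split
```

Running a check like this in CI (and alerting on violations in production) is what guarantees that, by the time Phase 2 starts, no query will break when the tables live on different servers.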
1. SQL fundamentals (JOINs, transactions, indexes)
        │
        ▼
2. Database replication (primary-replica architecture)
        │
        ▼
3. Read replicas and load balancing
        │
        ▼
4. Database sharding strategies (horizontal vs. vertical)
        │
        ▼
5. Proxy layers (ProxySQL, PgBouncer, Vitess)
        │
        ▼
6. Zero-downtime migration patterns
        │
        ▼
7. Distributed systems fundamentals (CAP theorem, consistency)
- What a monolithic database is and why it becomes a problem
- What "sharding" means and why companies do it
- GitHub's exact two-phase strategy: Virtual Partitioning → Physical Partitioning
- How they achieved a cut-over in under 100 milliseconds
- Every technical term explained from scratch with analogies
- Step-by-step walkthrough with diagrams

- High query volume: Too many people asking for too many things from the same warehouse at the same time.
- Noisy neighbor problem: The "Repositories" team's heavy operations slowed down the "Notifications" team, even though their data had nothing to do with each other.

- Simple to set up
- Simple to query (you can JOIN any table with any other table)
- Simple to maintain (one place, one backup)

- Defining zones: "Marketing sits in the east wing, Engineering in the west wing."
- Making sure Marketing documents don't reference Engineering's internal files.
- Installing a system that alerts if someone from Marketing accidentally uses Engineering's private printer.

- Snapshot: You photocopy all Cardiology patient records from the main hospital's filing cabinet.
- Load into new clinic: You bring the photocopies to the new Heart Clinic across the street and set up their filing cabinet.
- Replication: You set up a live fax machine between the main hospital and the Heart Clinic. Any new Cardiology records added to the main hospital are automatically faxed to the Heart Clinic.
- Redirect reception: You tell the main hospital's receptionist: "When a Cardiology patient comes in, send them to the Heart Clinic's receptionist." But the Heart Clinic's receptionist, for now, sends them back to the main hospital (because the Heart Clinic isn't fully set up yet).
- Cut-over: The Heart Clinic is fully caught up. In one swift moment:
  - The receptionist stops accepting patients for 0.1 seconds
  - The fax machine confirms all records are synced
  - The fax line is disconnected
  - The Heart Clinic's receptionist starts directing patients to the Heart Clinic's own doctors
  - Patients resume flowing — now to the Heart Clinic directly

- Reliability: One domain's failure doesn't bring down the entire platform.
- Performance: Each domain gets dedicated resources — no more noisy neighbors.
- Independent scaling: The "Repositories" domain needs more powerful hardware? Scale just that cluster, not the entire database.
- Team independence: The Notifications team can deploy changes to their database without coordinating with the Repositories team.

- Monolithic databases are a fine starting point, but they eventually become bottlenecks as you scale — both in performance (query volume) and in isolation (noisy neighbor problem).
- Sharding is the solution, but you can't just rip tables apart. You need a disciplined approach.
- Virtual Partitioning comes first. Before you physically move data, you must logically isolate each schema domain. Kill cross-domain queries. Kill cross-domain transactions. Use linters and monitoring to enforce this.
- Physical Partitioning is the actual migration. It follows a careful 5-step process: Snapshot → Load → Replicate → Redirect → Cut-over. The magic is in ProxySQL (the middleman that makes the switch invisible to the application).
- The cut-over takes less than 100ms. This is possible because all the heavy lifting (snapshot, replication, catch-up) happens before the cut-over. The cut-over itself is just: block briefly → sync → switch → unblock.

- If you're a backend developer, you now understand one of the most complex database operations companies face. This is senior-level knowledge.
- If you're preparing for system design interviews, database sharding and zero-downtime migrations are frequently asked topics. You now have a concrete, real-world example to reference.
- If you're a startup developer, you know the roadmap: start with a monolith, and when you outgrow it, follow this phased approach.