Multi-region active-passive / active-active
When to reach for this
Reach for this when…
- Global user base with regional latency SLOs (<100 ms)
- Data-residency compliance (GDPR, LGPD, China PIPL)
- DR requirements measured in minutes not hours (RTO <15 min)
- Business cannot tolerate a full regional outage
- Traffic volume exceeds single-region capacity
Not really this pattern when…
- A cold backup in another region satisfies the DR requirement
- Team does not have 24/7 SRE to operate multi-region
- Cross-region latency would break the user experience regardless
- Single-region with AZ redundancy already meets the SLA
Good vs bad answer
Interviewer probe
“Take this design multi-region for a global user base with GDPR compliance.”
Weak answer
"Deploy to 3 regions with a multi-region database. DynamoDB Global Tables handles replication. Route 53 for DNS."
Strong answer
"Partition users by home region — EU users write to EU, US users write to US. No cross-region PII replication, which satisfies GDPR by construction. Product catalogue replicated async to both regions for local reads. Inventory centralised in one region (checkout pays ~100 ms cross-region RTT). GeoDNS routes users to their nearest region; travelling users are proxied to their home region for writes. For DR, if one region fails, the other absorbs its browsing traffic — user-specific data is unavailable until the region recovers, which we show as a degraded-mode banner. Active-passive is half the complexity if the business only needs DR without latency improvement."
Why it wins: Names the partition strategy, explicitly addresses GDPR, handles the hard problem (shared inventory), describes degraded mode on failure, and offers the simpler alternative.
Cheat sheet
- Default: single region. Multi-region only when DR, latency, or compliance demands it.
- Active-passive for DR. Active-active only when active-passive is insufficient.
- Active-active partition-by-user is the "no-conflict" pattern. Use it by default.
- LWW silently loses data. Fine for caches; unacceptable for business data.
- Cross-region RTT: 80–100 ms transatlantic, 150–200 ms US-Asia. Keep writes in-region.
- Async replication RPO: seconds. Sync RPO: zero but writes pay full RTT.
- Failover must be drilled quarterly in production. Untested failover = no failover.
- DNS TTL of 60 s for fast failover. Anycast for instant (<5 s) failover.
- GDPR: region-pin EU PII. No cross-border replication of personal data without safeguards.
- Operational cost > infrastructure cost. Multi-region needs a funded SRE team.
- CRDTs: G-Counter, PN-Counter, OR-Set. Conflict-free by math for specific data types.
- Schema migrations must be backward-compatible in active-active. Expand-contract pattern.
Core concept
Why multi-region?
Three forces push a system beyond a single region: latency, disaster recovery, and compliance. A user in Frankfurt hitting an origin in us-east-1 pays 80–100 ms of network round-trip before the server even begins processing. Multiply by the number of serial requests in a page load and the experience degrades measurably. DR demands that a regional outage — cloud provider failure, natural disaster, misconfigured firewall — does not take the product offline. And regulations like GDPR, LGPD, and China's PIPL may require that user data physically resides in a specific geography.
GeoDNS routes users to the nearest region. Each region has its own app tier and DB. Async bidirectional replication keeps data eventually consistent across regions.
The three postures
Single region. One region serves everything. Other regions exist only as cold backups (if at all). Cheapest to build and operate; a regional outage equals a total site outage. This is where every system starts, and where many should stay until a concrete force pushes them further.
Active-passive. A primary region handles all writes and reads. A secondary region maintains a replica database updated via asynchronous WAL streaming. On failure, DNS is re-pointed to the standby and the replica is promoted to primary. RTO is minutes to tens of minutes (DNS TTL + promotion time). RPO is seconds — the amount of WAL data in-flight at the moment of failure. Active-passive is the workhorse DR pattern: it doubles infrastructure cost but keeps operational complexity manageable because only one region accepts writes at any time.
Active-active. Both (or all) regions accept reads and writes simultaneously. Cross-region replication is bidirectional. This unlocks the lowest latency (every user writes to their nearest region) and the highest availability (no failover — both regions are already live). The cost is conflict resolution: when two regions write the same row concurrently, the system must decide which write wins. This is not a database problem — it is a product-level design decision.
The replication spectrum
| Mode | RPO | Write latency | Conflict risk |
|---|---|---|---|
| Synchronous | 0 (zero data loss) | +80–200 ms (cross-region RTT) | None — one region waits |
| Semi-synchronous | Near-zero | +40–100 ms (one ack) | Minimal |
| Asynchronous | Seconds | No penalty | Concurrent writes possible |
Synchronous replication guarantees zero data loss but makes every write as slow as the cross-region round-trip — 80–200 ms depending on geography. Semi-synchronous requires at least one remote ack, splitting the difference. Asynchronous replication imposes no write-path latency penalty but opens a window where both regions can write the same key before either sees the other's change.
RTO and RPO — the numbers that drive the decision
RTO (Recovery Time Objective): how long the system can be down after a regional failure. Active-passive with automated failover achieves 2–5 minutes. Active-active has effectively zero RTO because the other region is already serving traffic.
RPO (Recovery Point Objective): how much data you can afford to lose. Async replication has seconds of RPO (the in-flight WAL at failure). Sync replication has zero RPO. The business must name these numbers — they drive the entire architecture.
Cross-region latency budget
A cross-region round-trip is 80–100 ms for transatlantic, 150–200 ms for US-to-Asia, and 250–300 ms for a full globe hop. Any synchronous cross-region call on the hot path adds this latency to every request. The design principle is simple: writes stay in-region; reads can be local replicas; cross-region calls are async. If a write must be confirmed in a remote region (sync replication, distributed transactions), that latency is the price of consistency. Name the budget explicitly in your design.
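As a rough illustration of the budget, the sketch below adds serial cross-region round-trips to a request's local processing time. The function name and the 20 ms local figure are assumptions made for the example; the RTT values are midpoints of the ranges quoted above.

```python
# Approximate cross-region RTTs in milliseconds (midpoints of the ranges above).
RTT_MS = {"transatlantic": 90, "us_asia": 175, "globe_hop": 275}


def request_latency_ms(local_ms: float, sync_cross_region_calls: int, route: str) -> float:
    """Latency of one request that makes N serial, synchronous cross-region calls."""
    return local_ms + sync_cross_region_calls * RTT_MS[route]


# A 20 ms handler stays at 20 ms when all cross-region work is async...
print(request_latency_ms(20, 0, "transatlantic"))  # 20
# ...but two serial sync calls across the Atlantic push it to 200 ms.
print(request_latency_ms(20, 2, "transatlantic"))  # 200
```

Under these assumptions a single synchronous hop already dominates the request, which is why the budget should be stated before the data flow is drawn.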
Conflict resolution — the hard part
Active-active conflict resolution is the single hardest design decision in multi-region. Four strategies, in order of preference:
1. Partition by user / tenant. Each row has a home region determined by user ID or tenant. Only that region writes the row. No conflicts by construction. This is what WhatsApp, Slack, and most large-scale systems use. It is active-active for the fleet but single-writer for each row.
2. Last-write-wins (LWW). Timestamps decide conflicts. The write with the higher timestamp survives; the other is silently dropped. Simple to implement but silently loses data. Acceptable for ephemeral caches, session state, analytics counters. Unacceptable for payments, messages, or any auditable business data.
3. CRDTs (Conflict-free Replicated Data Types). Data structures that merge without conflicts by mathematical construction: counters, sets, registers, collaborative text (Yjs, Automerge). Elegant when the data model fits; limited applicability. CRDTs cannot express arbitrary business logic.
4. Application merge. Show users a conflict UI (Google Docs-style "someone else edited this"). Heavy to build, confusing for users. Last resort.
Data residency and compliance
GDPR requires that personal data of people in the EU be processed under EU data protection law, wherever it is stored. The simplest compliance architecture is region-pinning: EU users' data lives in an EU region and never leaves. Cross-region replication of EU PII to a US region may violate GDPR unless the transfer is covered by adequate safeguards (SCCs, adequacy decisions). Name this constraint early in the design; it often forces partition-by-user with geographic affinity.
Canonical examples
- Global SaaS (Slack, Notion) with regional primaries
- Messaging apps (WhatsApp) partitioned per user home region
- Streaming services for regional content licensing + latency
- Financial systems with strict RTO/RPO (<5 min / <1 s)
- E-commerce platforms with US + EU presence and GDPR obligations
Variants
Single region — no DR
One region serves everything. Simplest topology; a regional outage is a total outage.
All users hit one region. A regional outage means total site outage. Simplest but riskiest.
The single-region topology is where every system begins and where many should stay. All users — regardless of geography — hit one region's load balancer, application tier, and database. There is no replica, no standby, no failover path. If the region goes down, the product is offline.
Why it works at small-to-medium scale. Operational simplicity is dramatically undervalued. One region means one database, one deploy pipeline, one set of monitoring dashboards, one on-call rotation. There are no replication lag alerts, no conflict-resolution bugs, no split-brain incidents at 3 AM. For a team of 5–20 engineers serving users primarily in one geography, single-region is not a compromise — it is the right choice.
Cost model. Multi-region at least doubles infrastructure cost (two full stacks, plus replication bandwidth and cross-region data transfer) and roughly triples operational complexity. If the business SLA allows about 4.4 hours of downtime per year (99.95%), a well-architected single-region setup with multi-AZ redundancy can meet it.
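The downtime budget behind an availability target is simple arithmetic, sketched here with a hypothetical helper name. It assumes a 365-day year and counts all downtime equally.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, ignoring leap years


def allowed_downtime_hours(availability_pct: float) -> float:
    """Yearly downtime budget implied by an availability target."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


for target in (99.9, 99.95, 99.99):
    print(f"{target}% -> {allowed_downtime_hours(target):.2f} h/year")
```

A 99.95% target works out to about 4.4 hours per year, which is the figure to compare against the expected frequency and duration of regional outages.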
When to leave single-region. Three triggers: (1) regulatory requirement to store data in a specific geography (GDPR, PIPL), (2) latency SLO that a single region cannot meet for a global user base, (3) business-critical RTO that requires automated regional failover. If none of these apply, stay single-region and invest the engineering effort elsewhere.
AZ redundancy is not multi-region. Running across 3 availability zones within one region provides hardware-level fault tolerance (rack, power, network diversity) without the complexity of cross-region replication. Multi-AZ is the standard baseline; multi-region is the next step only when AZ redundancy is insufficient.
Pros
- +Simplest to build, deploy, and operate
- +No replication lag or conflict-resolution bugs
- +Lowest infrastructure cost
Cons
- −Regional outage = total outage
- −High latency for geographically distant users
- −Cannot meet data-residency requirements for multiple jurisdictions
Choose this variant when
- Users primarily in one geography
- SLA achievable with multi-AZ within one region
- Team size does not support multi-region operations
Active-passive — primary + warm standby
Primary region handles all traffic. Standby replays WAL and is promoted on failure.
Primary region serves all traffic. Standby region has a read replica updated via async replication. Failover requires DNS cut-over.
Active-passive is the workhorse DR topology. The primary region handles 100% of reads and writes. The standby region maintains a continuously-updated replica via asynchronous WAL streaming (PostgreSQL) or binlog replication (MySQL). The standby application tier is deployed but idle — it receives no user traffic until failover.
Failover mechanics. When the primary region fails: (1) health checks detect the failure (30–90 seconds, depending on probe interval and failure threshold), (2) the standby database is promoted to primary (10–30 seconds), (3) DNS is updated to point at the standby region (propagation 60–300 seconds depending on TTL). Total RTO: 2–5 minutes with automation, 15–60 minutes with manual intervention. RPO depends on replication lag, typically 0.1–5 seconds of async lag.
The untested-failover trap. The single most common failure mode of active-passive is an untested failover. Configs drift: the standby's environment variables point to a stale secret, the connection string references the wrong DB, the SSL certificate expired. Quarterly failover drills in production are not optional — they are the only way to know your failover actually works.
Cost model. You pay for a full second region (compute + storage + networking) that handles zero traffic in steady state. This feels wasteful, but the cost of a 4-hour outage at a company doing $10M ARR is roughly $4,500 in direct revenue loss, plus customer trust and SLA credits. The standby region costs far less.
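The direct-loss figure can be reproduced with a one-line calculation. The helper name is hypothetical, and the model assumes revenue accrues evenly across the year, which understates losses for outages at peak hours.

```python
def direct_revenue_loss(annual_revenue: float, outage_hours: float) -> float:
    """Revenue lost during an outage if income accrues evenly over the year."""
    return annual_revenue / (365 * 24) * outage_hours


print(round(direct_revenue_loss(10_000_000, 4)))  # 4566
```

The result covers only direct revenue; trust and SLA credits are not modelled and would need their own estimates.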
Semi-sync for tighter RPO. If the business requires RPO < 1 second, enable semi-synchronous replication: the primary waits for at least one standby ack before confirming the write. This adds 40–100 ms to write latency (the round trip to the nearest acknowledging standby) but guarantees that committed data exists in at least two regions. Full synchronous replication (wait for all replicas) is gated by the farthest replica, roughly doubling the latency penalty, and is rarely worth it.
Pros
- +Simple mental model — one writer, one reader
- +No conflict resolution needed
- +Proven, well-understood pattern
Cons
- −Standby wastes resources in steady state
- −Failover takes minutes (RTO)
- −Distant users still hit the primary (no latency benefit)
Choose this variant when
- DR is the primary driver, not latency
- RTO of 2–5 minutes is acceptable
- Team prefers operational simplicity over active-active complexity
Active-active — eventual consistency
Both regions serve reads and writes with async replication. Conflict resolution required.
Both regions accept reads and writes. Async bidirectional replication with eventual consistency. Requires conflict resolution strategy.
Active-active with eventual consistency is the target state for global-scale systems. Both regions accept reads and writes. Async bidirectional replication propagates changes between regions with a lag window of 0.1–5 seconds. During that window, the two databases may have different values for the same key — this is the consistency window, and the system must handle it.
Partition-by-user: the no-conflict pattern. The most common active-active architecture avoids conflicts entirely by partitioning data by user or tenant. Each user has a home region determined by geography, account creation location, or explicit choice. All writes for that user go to their home region only. Other regions have a read replica for cross-region reads (e.g., when a US user views an EU user's profile). This is active-active at the fleet level but single-writer at the row level.
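A minimal sketch of the routing rule described above: writes always land in the user's home region, reads are served from wherever the user currently is. The `Route` type, the function name, and the region labels are illustrative assumptions, not part of any particular framework.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    region: str    # region that will execute the request
    proxied: bool  # True when forwarded away from the nearest region


def route_request(home_region: str, nearest_region: str, is_write: bool) -> Route:
    """Writes go to the home region (single writer per row); reads stay local."""
    if is_write and home_region != nearest_region:
        return Route(region=home_region, proxied=True)
    return Route(region=nearest_region, proxied=False)


# An EU user travelling in the US: reads hit the US replica, writes are proxied home.
print(route_request("eu", "us", is_write=False))  # Route(region='us', proxied=False)
print(route_request("eu", "us", is_write=True))   # Route(region='eu', proxied=True)
```

Because only the home region ever writes a given user's rows, replication in the other direction never has to resolve a conflict for them.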
When partition-by-user breaks. Shared resources that multiple users write concurrently: a collaborative document, a shared shopping cart, an inventory counter, a leaderboard. These entities have no single "home user" and cannot be cleanly partitioned. For these, you need real conflict resolution — LWW, CRDTs, or application merge.
Replication lag monitoring. Async replication lag is the heartbeat of active-active. Monitor it continuously. Normal: <1 second. Concerning: 1–5 seconds. Critical: >5 seconds (risk of significant data divergence on failover). Alert on lag spikes and investigate causes: network congestion, large transactions, schema migrations that block replication.
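The thresholds above translate directly into an alerting rule. This is a sketch with a hypothetical function name; how the lag value is obtained depends on the database and is not shown.

```python
def classify_lag(lag_seconds: float) -> str:
    """Map replication lag to the severity bands named in the text."""
    if lag_seconds < 1:
        return "normal"
    if lag_seconds <= 5:
        return "concerning"
    return "critical"


for lag in (0.2, 3.0, 12.0):
    print(lag, classify_lag(lag))
```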
Read-your-own-writes. After a US user writes a post, they immediately read the page and expect to see their post. If the read hits the EU replica (which has not received the write yet), the user sees stale data. Solutions: (1) sticky sessions route the user to their home region for reads after a write (simplest), (2) read-after-write consistency via synchronous cross-region read confirmation (expensive), (3) client-side optimistic UI shows the write immediately and reconciles later.
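Option (1) can be sketched as a small router that pins a user's reads to their home region for a short window after each write. The class name, the 5-second default window (chosen to match the upper end of the lag range above), and the explicit `now` argument are all assumptions made to keep the example deterministic.

```python
class ReadRouter:
    """Pins a user's reads to their home region for a window after each write."""

    def __init__(self, pin_seconds: float = 5.0):
        self.pin_seconds = pin_seconds
        self._last_write = {}  # user_id -> time of the most recent write

    def record_write(self, user_id: str, now: float) -> None:
        self._last_write[user_id] = now

    def read_region(self, user_id: str, home_region: str,
                    nearest_region: str, now: float) -> str:
        last = self._last_write.get(user_id)
        if last is not None and now - last < self.pin_seconds:
            return home_region  # replica may not have the write yet
        return nearest_region


router = ReadRouter()
router.record_write("alice", now=100.0)
print(router.read_region("alice", "us", "eu", now=101.0))  # us
print(router.read_region("alice", "us", "eu", now=110.0))  # eu
```

In a real deployment the last-write time would typically travel in a cookie or token rather than in process memory, so that any front-end node can apply the rule.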
Pros
- +Lowest latency — users write to nearest region
- +No failover needed — both regions are live
- +Scales total write throughput across regions
Cons
- −Conflict resolution is mandatory for shared data
- −Replication lag creates consistency windows
- −Operational complexity ~3× single-region
Choose this variant when
- Global user base needs sub-100 ms write latency
- Data model supports partition-by-user
- Team has SRE capacity for multi-region operations
Active-active — strong consistency
Both regions serve reads and writes with synchronous replication or distributed consensus. No conflicts, but write latency includes cross-region RTT.
Both regions read+write with bidirectional replication. A conflict resolver inspects concurrent writes and applies LWW or CRDT merge before persisting.
Active-active with strong consistency eliminates the conflict-resolution problem entirely: every write is confirmed in all regions before being acknowledged. No replication lag, no stale reads, no conflict resolution. The price is write latency — every write pays the full cross-region round-trip time (80–200 ms).
How it works. Distributed databases like CockroachDB, Google Spanner, and YugabyteDB use Raft or Paxos consensus across regions. A write is proposed by the local node, replicated to a quorum of nodes across regions, and committed only when a majority acknowledges. The write latency equals the time to reach quorum — typically one cross-region RTT for a 3-region setup (the local node + one remote node form a majority of 3).
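The quorum arithmetic can be made concrete with a simplified model: the proposing node acks instantly, each remote replica acks after one round trip, and the write commits when a majority has acked. The function name is hypothetical, and real systems add leader forwarding, disk sync, and retries on top of this.

```python
def quorum_commit_latency_ms(rtts_to_other_replicas_ms: list[float]) -> float:
    """Time until a majority of replicas (local node included) has acknowledged."""
    acks = sorted([0.0] + list(rtts_to_other_replicas_ms))  # local ack at 0 ms
    majority = len(acks) // 2 + 1
    return acks[majority - 1]


# Three regions: local + the nearer remote form a majority of 3.
print(quorum_commit_latency_ms([90, 175]))           # 90
# Five replicas: a second local-ish replica does not help; the third ack decides.
print(quorum_commit_latency_ms([10, 90, 175, 275]))  # 90
```

Under this model the commit latency is set by the nearest replicas that complete a majority, so replica placement matters more than the distance to the farthest region.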
Spanner and TrueTime. Google Spanner achieves externally consistent transactions by using clocks synchronised with GPS receivers and atomic clocks (TrueTime), which expose a bounded uncertainty (~7 ms). A commit waits out the uncertainty window before it becomes visible, so timestamp order matches real-time order and a read at a given timestamp sees every write committed before it. This is an engineering marvel but requires specialised hardware (GPS receivers and atomic clocks in every data centre). CockroachDB approximates this with hybrid logical clocks (HLCs): larger uncertainty windows, handled by occasionally restarting reads, but no specialised hardware.
When strong consistency is worth the latency. Financial systems where double-spending or phantom reads cause real money loss. Inventory systems where overselling is worse than slower checkout. Any domain where "eventually consistent" means "eventually wrong." Name the use case explicitly — most systems do not need cross-region strong consistency and should not pay the latency tax.
Geo-partitioned leaseholders. CockroachDB and Spanner allow pinning table partitions to specific regions. EU user data has its leaseholder in the EU region, so EU reads and writes hit local storage with no cross-region penalty. Only cross-partition transactions (rare) pay the full RTT. This combines the performance of partition-by-user with the consistency of distributed consensus.
Pros
- +Zero RPO — no data loss on regional failure
- +No conflict resolution needed
- +Globally consistent reads
Cons
- −Write latency includes cross-region RTT (80–200 ms)
- −Requires distributed consensus infrastructure
- −Higher operational complexity and cost
Choose this variant when
- Business cannot tolerate any data loss (RPO = 0)
- Domain requires strict consistency (financial, inventory)
- Write latency of 100–200 ms is acceptable
Scaling path
Single region — one origin, no DR
Ship the product. All traffic in one region.
All users hit one region. Multi-AZ for hardware fault tolerance. No cross-region replication. Simplest to operate. Regional outage = total outage.
What triggers the next iteration
- Regional outage takes the entire product offline
- Distant users pay 100–300 ms RTT per request
- Cannot meet data-residency regulations for multiple jurisdictions
Active-passive — warm standby with async replication
Add DR. Survive a full regional failure with minutes of RTO.
Primary region serves all traffic. Standby region maintains an async replica. Automated health checks trigger DNS failover on primary failure. Quarterly failover drills validate the setup.
What triggers the next iteration
- Failover takes 2–5 minutes (DNS TTL + promotion)
- Standby resources idle in steady state
- Global users still hit the primary — no latency benefit
Active-active — partition-by-user, async replication
Serve users from their nearest region. Eliminate failover RTO.
Both regions accept traffic. Users partitioned by home region — each user writes only to their home region. Async replication provides cross-region read replicas. No conflicts by construction. Zero-RTO for the fleet — if one region fails, the other absorbs its users.
What triggers the next iteration
- Shared resources (inventory, leaderboards) need real conflict resolution
- Replication lag creates read-your-writes consistency challenges
- Operational complexity ~3× single-region
Active-active — distributed consensus, strong consistency
Globally consistent reads and writes. Zero RPO, zero RTO.
Distributed consensus (Raft/Paxos) across regions. Every write confirmed by a quorum before ack. No conflicts, no replication lag. Write latency includes cross-region RTT. Geo-partitioned leaseholders minimise latency for partitioned data.
What triggers the next iteration
- Write latency 80–200 ms for cross-region consensus
- Distributed consensus infrastructure is complex to operate
- Cost of running consensus nodes in every region
Deep dives
Active-passive mechanics — WAL streaming, promotion, and DNS cut-over
Primary serves all writes. Standby replays WAL stream continuously. On failure, health check triggers DNS cut-over to standby which is promoted to primary.
Active-passive failover has three phases: detection, promotion, and redirection. Each phase has its own failure modes, and the total RTO is the sum of all three.
Phase 1: Detection. Health checkers (Route 53, Cloudflare Health Checks, custom probes) ping the primary region every 10–30 seconds. A failure is declared after N consecutive failures (typically 3). Detection time: 30–90 seconds. False positives (transient network blip triggers failover) are the main risk — set the threshold high enough to avoid flapping but low enough for timely detection.
Phase 2: Promotion. The standby database is promoted from read-replica to read-write primary. In PostgreSQL, this is 'pg_ctl promote' or an RDS API call. The promotion itself takes 10–30 seconds. During promotion, the standby replays any remaining WAL segments — this is the RPO window. If async replication was 2 seconds behind, those 2 seconds of transactions are lost (or must be reconciled manually after the original primary recovers).
Phase 3: Redirection. DNS records are updated to point the domain at the standby region's load balancer. DNS propagation depends on TTL: a 60-second TTL means most resolvers switch within 60–120 seconds; a 300-second TTL means up to 5 minutes. During propagation, some users still hit the (dead) primary. Use low DNS TTLs (30–60 s) in production, and consider AWS Global Accelerator or Cloudflare Anycast for instant redirection that bypasses DNS TTL entirely.
Total RTO. Detection (60 s) + promotion (20 s) + DNS propagation (60 s) = 140 s, or roughly 2.5 minutes in the automated case. Manual failover with a human in the loop: add 5–30 minutes for paging, assessment, and decision. Automated failover is the only way to hit sub-5-minute RTO consistently.
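The additive model used here is easy to turn into a budget calculator. The function and parameter names are assumptions; the inputs are the example values from the text, with 15 minutes picked from the 5–30 minute range for the manual case.

```python
def rto_seconds(detection_s: float, promotion_s: float,
                dns_propagation_s: float, human_delay_s: float = 0) -> float:
    """Total RTO as the sum of the three failover phases plus any human delay."""
    return detection_s + promotion_s + dns_propagation_s + human_delay_s


automated = rto_seconds(60, 20, 60)
manual = rto_seconds(60, 20, 60, human_delay_s=15 * 60)
print(automated, manual)  # 140 1040
```

The human delay is the largest single term, which is the quantitative case for automating the decision as well as the execution.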
The recovery problem. After failover, the old primary is now a stale, possibly corrupted node. Recovery options: (1) rebuild it as a new replica from the new primary (safest, slowest), (2) replay its divergent WAL and reconcile (risky, complex), (3) discard it and provision a fresh standby. Most teams choose option 1 — treat the old primary as disposable.
Active-active conflict resolution — LWW, CRDTs, and partition-by-user
Region A writes X=1 at T1. Region B writes X=2 at T2 (T2 > T1). Async replication delivers both. LWW resolver picks B (higher timestamp). Final value: X=2.
Conflict resolution is the defining challenge of active-active architectures. Two regions write the same key concurrently — which write survives?
Last-write-wins (LWW). Each write carries a timestamp. When two writes to the same key arrive via replication, the one with the higher timestamp wins; the other is silently discarded. Implementation: DynamoDB Global Tables, Cassandra, Cosmos DB default policy.
Strengths: trivially simple, no application changes. Weaknesses: data loss. If Region A sets price=100 at T1 and Region B sets price=200 at T2, LWW keeps 200 and drops 100 — even if 100 was the correct price. Clock skew between regions can cause the "older" write to win if its clock is ahead. NTP synchronisation helps but cannot eliminate skew entirely.
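The clock-skew failure is easiest to see in a toy resolver. In this sketch the `Write` type and its fields are hypothetical; `true_time` exists only so the example can show what really happened, and a real resolver sees nothing but the recorded timestamp.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: int
    true_time: float   # when the write really happened (invisible to the resolver)
    clock_skew: float  # how far the writing region's clock runs ahead

    @property
    def timestamp(self) -> float:
        return self.true_time + self.clock_skew


def lww(a: Write, b: Write) -> Write:
    """Keep the write with the higher recorded timestamp; drop the other."""
    return a if a.timestamp >= b.timestamp else b


older = Write(value=100, true_time=10.0, clock_skew=0.5)  # Region A, clock 500 ms fast
newer = Write(value=200, true_time=10.2, clock_skew=0.0)  # Region B, 200 ms later
print(lww(older, newer).value)  # 100: the older write wins because of skew
```

Ties are broken in favour of the first argument here for brevity; production systems usually add a deterministic tiebreaker such as a region or node ID so that every replica converges on the same winner.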
CRDTs (Conflict-free Replicated Data Types). Mathematical structures that merge without conflicts:
- G-Counter (grow-only counter): each region maintains its own counter; the merged value is the sum. Perfect for view counts, like counts.
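- PN-Counter (positive-negative counter): two G-Counters, one for increments and one for decrements. Net value = positive - negative. Works for inventory with add/remove, as long as a temporarily negative count is tolerable, because the counter cannot reject a decrement.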
- OR-Set (observed-remove set): elements can be added and removed without conflicts. A remove only affects adds the remover has observed. Good for shopping carts, tag sets.
- LWW-Register: a single value with a timestamp — essentially LWW for one field. Useful for profile fields where the latest edit should win.
CRDTs are powerful but limited: they cannot express arbitrary business rules (e.g., "reject if balance would go negative"). They shine for specific data shapes — counters, sets, collaborative text — and should be used surgically, not as a general solution.
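To make the counter types concrete, here is a minimal in-memory sketch of a G-Counter and a PN-Counter. The class and method names are illustrative rather than taken from a library. Merge keeps the per-region maximum, and the value is the sum across regions, which is what makes merging commutative and idempotent.

```python
class GCounter:
    """Grow-only counter: one slot per region; merge takes the per-region max."""

    def __init__(self, counts=None):
        self.counts = dict(counts or {})

    def increment(self, region: str, n: int = 1) -> None:
        self.counts[region] = self.counts.get(region, 0) + n

    def merge(self, other: "GCounter") -> "GCounter":
        regions = self.counts.keys() | other.counts.keys()
        return GCounter({r: max(self.counts.get(r, 0), other.counts.get(r, 0))
                         for r in regions})

    @property
    def value(self) -> int:
        return sum(self.counts.values())


class PNCounter:
    """Two G-Counters: one for increments, one for decrements."""

    def __init__(self, p=None, n=None):
        self.p = p if p is not None else GCounter()
        self.n = n if n is not None else GCounter()

    def increment(self, region: str, n: int = 1) -> None:
        self.p.increment(region, n)

    def decrement(self, region: str, n: int = 1) -> None:
        self.n.increment(region, n)

    def merge(self, other: "PNCounter") -> "PNCounter":
        return PNCounter(self.p.merge(other.p), self.n.merge(other.n))

    @property
    def value(self) -> int:
        return self.p.value - self.n.value


us, eu = PNCounter(), PNCounter()
us.increment("us", 5)   # US region adds 5 units
eu.decrement("eu", 2)   # EU region removes 2 units concurrently
print(us.merge(eu).value, eu.merge(us).value)  # 3 3
```

Nothing in the merge can veto a decrement, which illustrates why a rule like "reject if the result would be negative" has to live outside the CRDT.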
Partition-by-user. Assign each user (or tenant, or entity) a home region. All writes for that entity go to its home region only. Other regions have read replicas. No conflicts because each row has exactly one writer. This is the default recommendation for most active-active systems: WhatsApp, Slack, Notion, most SaaS platforms. The 1% of requests from travelling users are proxied cross-region — a small latency penalty for operational simplicity.
Application merge. Present both versions to the user and let them choose. Collaborative editors avoid the prompt by merging automatically: Google Docs uses Operational Transformation (OT); Figma uses a CRDT-inspired model. An explicit merge UI is the right approach where human intent matters, but it is heavy to build and confusing for non-collaborative use cases.
DNS-based routing — latency, geo, and failover policies
Client queries DNS. DNS resolves to the nearest region based on latency measurements. Regional LB distributes to app tier.
DNS is the first routing layer in a multi-region architecture. The DNS resolver decides which region's IP address to return to the client, and that decision shapes the user's entire request path.
Latency-based routing. The DNS provider measures round-trip time from the resolver to each region's health endpoint and returns the IP of the region with the lowest latency. AWS Route 53, Cloudflare, and Google Cloud DNS all support this. It works well when regions are well-distributed and the resolver is geographically close to the user (which is usually but not always true — some ISPs route DNS through distant resolvers).
Geolocation routing. Returns the IP of the region mapped to the user's geographic location (determined by the resolver's IP or EDNS Client Subnet). Simpler than latency-based but less accurate, because the geographically nearest region is not always the one with the lowest network latency. Use geo routing when data residency requires region affinity (EU users must hit EU region) rather than pure latency optimisation.
Failover routing. Primary/secondary configuration: DNS returns the primary region's IP unless health checks fail, then switches to the secondary. This is the DNS layer for active-passive failover. Health check interval (10–30 s) × failure threshold (3) = detection time. TTL determines propagation speed. Low TTLs (30–60 s) speed up failover but increase DNS query volume.
Weighted routing. Distribute traffic by percentage: 90% to US, 10% to EU. Useful for canary deployments (route 5% of traffic to a new region) or gradual migration from one region to another. Not a steady-state routing strategy.
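Weighted routing amounts to a weighted random choice per resolution. The sketch below simulates that choice in-process to show the resulting split. The function name and region labels are assumptions, and a managed DNS service would apply the weights itself.

```python
import random


def pick_region(weights: dict[str, int], rng: random.Random) -> str:
    """Pick one region with probability proportional to its weight."""
    regions = list(weights)
    return rng.choices(regions, weights=[weights[r] for r in regions], k=1)[0]


rng = random.Random(42)  # seeded so repeated runs give the same sample
sample = [pick_region({"us": 90, "eu": 10}, rng) for _ in range(10_000)]
print(sample.count("eu") / len(sample))  # close to 0.10
```

Because resolvers cache answers for the TTL, the split observed at the servers converges on the weights only over many clients and many resolutions.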
DNS TTL trade-off. Low TTL (30 s): fast failover, high DNS query volume, slightly higher latency (more DNS lookups). High TTL (300 s): slow failover, lower DNS volume, cached results serve faster. Production recommendation: 60 s TTL for services that need automated failover. If using Anycast (Cloudflare, AWS Global Accelerator), DNS TTL matters less because the Anycast IP does not change — routing shifts at the network layer instead.
Beyond DNS: Anycast and Global Accelerator. DNS-based routing has a fundamental limitation: TTL-bound propagation delay. Anycast eliminates this: the same IP address is announced from multiple regions via BGP. The network routes each packet to the nearest announcing region. Failover is instant — BGP reconvergence happens in seconds, not minutes. AWS Global Accelerator and Cloudflare use Anycast for the edge, terminating TCP at the nearest PoP and forwarding to the backend region. This is the fastest failover mechanism available.
Failover automation — health checks, traffic shifting, and runbooks
Health checker detects primary failure. DNS updated to point at standby. Standby promoted to primary. Traffic shifts within TTL window.
Automated failover is the difference between a 2-minute RTO and a 30-minute RTO. Manual failover requires paging an on-call engineer, assessing the situation, deciding to fail over, and executing the runbook — all while the product is down. Automated failover compresses this to detection + promotion + redirection, all executed by software.
Health check design. A good health check verifies the entire request path, not just "is the server up." The health endpoint should: (1) query the database (proves DB connectivity), (2) check cache connectivity, (3) verify that the application can process a request end-to-end. Return 200 if all checks pass, 503 if any fail. The health checker (Route 53, CloudWatch, custom) polls this endpoint every 10–30 seconds.
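A deep health check of this kind can be sketched as a function that runs each dependency probe and maps the results to a status code. The function name and the lambda probes are placeholders; real probes would issue a cheap database query, a cache ping, and so on.

```python
from typing import Callable, Dict, Tuple


def health_status(checks: Dict[str, Callable[[], bool]]) -> Tuple[int, Dict[str, bool]]:
    """Run every dependency check; return 200 only if all pass, otherwise 503."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False  # a crashing probe counts as a failure
    return (200 if all(results.values()) else 503), results


status, detail = health_status({
    "database": lambda: True,   # stand-in for a trivial query against the DB
    "cache": lambda: False,     # stand-in for a cache ping that failed
})
print(status, detail)  # 503 {'database': True, 'cache': False}
```

Returning the per-check detail alongside the status makes it easier to tell a regional failure from a single broken dependency when the alert fires.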
Failure detection thresholds. Require N consecutive failures (typically 3) before declaring a region down. This prevents flapping: a transient network glitch or a brief GC pause should not trigger a failover. But the threshold must be low enough for timely detection. At 10-second intervals and 3 failures: detection takes 30 seconds worst case. At 30 seconds × 3: 90 seconds. Choose based on your RTO budget.
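The consecutive-failure rule is a small state machine: any healthy probe resets the count, and the region is declared down only when the count reaches the threshold. The class and function names below are hypothetical.

```python
class FailureDetector:
    """Declares a region down after `threshold` consecutive failed probes."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.consecutive_failures = 0

    def observe(self, healthy: bool) -> bool:
        """Record one probe result; return True once the region counts as down."""
        self.consecutive_failures = 0 if healthy else self.consecutive_failures + 1
        return self.consecutive_failures >= self.threshold


def detection_time_s(interval_s: float, threshold: int) -> float:
    return interval_s * threshold


detector = FailureDetector(threshold=3)
probes = [False, False, True, False, False, False]  # one blip, then a real outage
print([detector.observe(p) for p in probes])
# [False, False, False, False, False, True]
print(detection_time_s(10, 3), detection_time_s(30, 3))  # 30 90
```

The single healthy probe in the middle resets the counter, which is the behaviour that prevents a transient glitch from triggering a failover.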
Traffic shifting strategies. (1) DNS failover: update DNS records. Propagation bounded by TTL — 30–300 seconds. (2) Global Accelerator / Anycast: shift traffic at the network layer. Propagation in seconds via BGP. (3) Application-layer routing: a global load balancer (Cloudflare, AWS ALB with Global Accelerator) routes based on backend health. Instant for new connections; existing connections drain.
Automated promotion. Cloud providers offer managed failover: RDS Multi-AZ promotes the standby automatically within a region, and Aurora Global Database can promote a secondary cluster in another region. For self-managed databases, promotion scripts must be battle-tested: stop replication, promote replica, update connection strings, validate data integrity, report status.
Failover drills. The most important operational practice in multi-region is scheduled failover in production. At least quarterly, intentionally fail over to the standby region during business hours with the team watching. This catches expired credentials, drifted configs, broken runbooks, untested promotion scripts, DNS TTL surprises, and monitoring gaps. Fault-injection tools such as Netflix's Chaos Monkey or Gremlin can automate parts of this, but a simple "turn off the primary region in the console" is the minimum viable drill.
Failback. After failover, the old primary must be recovered (rebuilt as a replica of the new primary) and traffic gradually shifted back. Failback is often harder than failover because the old primary may have divergent data. Plan for it explicitly — many teams discover failback is broken only after their first real failover.
Data residency and compliance — GDPR, PIPL, and region-pinning
Data residency requirements constrain where user data can physically reside and how it can be transferred across borders. These constraints are not optional: GDPR fines can reach 4% of global annual turnover (or €20 million, whichever is higher), and PIPL violations can result in app bans in China.
GDPR (EU). Personal data of EU residents must be processed under EU data protection law. This does not strictly require data to stay in the EU — it requires "adequate protection" (adequacy decisions, Standard Contractual Clauses, Binding Corporate Rules). However, the simplest compliance architecture is region-pinning: EU users' data lives in an EU region and never crosses borders.
PIPL (China). Stricter than GDPR for cross-border transfers. Personal information of Chinese users generally must be stored in China and can only be exported after security assessment by the Cyberspace Administration of China. In practice, most companies run a fully isolated China region with no cross-border data flow.
Architecture implications.
- 1. Region affinity per user. Each user is assigned a home region based on their legal jurisdiction at account creation. All their personal data is stored in that region. DNS routing sends them to their home region. This is partition-by-user with a legal constraint on the partition key.
- 2. No cross-border replication of PII. Async replication of EU PII to a US region may violate GDPR. Either exclude PII from cross-region replication (replicate only anonymised or aggregated data) or encrypt PII with EU-controlled keys that cannot be decrypted in the US.
- 3. Audit trail. Maintain records of where each piece of data resides, when it was created, and whether it was ever transferred. GDPR Article 30 requires records of processing activities, including transfers to third countries. Use database-level tagging or a metadata service that tracks data residency per user.
- 4. Right to erasure (Article 17). When a user requests deletion, all copies across all regions must be deleted, including replicas, backups, and cached data. Multi-region makes this harder because data may exist in multiple replicas. Build a deletion pipeline that propagates erasure requests to all regions and confirms completion.
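Region affinity and PII exclusion can be sketched in a few lines. This is a minimal illustration, assuming a two-region deployment; the country subset, the `PII_FIELDS` list, and both function names are hypothetical, and a real classification of personal data needs legal review.

```typescript
type HomeRegion = 'eu-west-1' | 'us-east-1';

// Illustrative subset of EU country codes; a real list would cover every member state.
const EU_COUNTRIES = new Set<string>(['DE', 'FR', 'IE', 'NL', 'ES', 'IT']);

// Assign the home region once, at account creation, from the user's jurisdiction.
function assignHomeRegion(countryCode: string): HomeRegion {
  return EU_COUNTRIES.has(countryCode.toUpperCase()) ? 'eu-west-1' : 'us-east-1';
}

// Hypothetical set of field names treated as PII.
const PII_FIELDS = new Set<string>(['email', 'name', 'address', 'phone']);

// Return a copy of the record with PII fields removed, so only the
// remaining fields enter the cross-region replication stream.
function stripPii(record: Record<string, unknown>): Record<string, unknown> {
  const safe: Record<string, unknown> = {};
  for (const key of Object.keys(record)) {
    if (!PII_FIELDS.has(key)) safe[key] = record[key];
  }
  return safe;
}
```

A deny-list like this fails open when a new PII field is added; an allow-list of fields known to be safe is the more conservative design.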
Interview signal. When a candidate mentions multi-region, asking "how do you handle GDPR?" tests whether they understand that multi-region is not just a latency and DR pattern — it is a compliance architecture. Name region-pinning, PII exclusion from replication, and deletion propagation.
Cross-region cache coherence — keeping Redis / Memcached consistent across regions
In a multi-region architecture, each region typically has its own cache cluster (Redis, Memcached). The challenge: when data changes in Region A's database, Region A's cache is invalidated — but Region B's cache still holds stale data. Without cross-region cache invalidation, users in Region B see stale content until the cache's TTL expires.
Option 1: Short TTLs (simplest). Set cache TTL to 30–60 seconds. Stale data is bounded by the TTL. No cross-region invalidation needed. Works when seconds of staleness are acceptable — product catalogues, user profiles, content feeds. Most systems should start here.
Option 2: Event-driven invalidation. When a write occurs in Region A, publish an invalidation event to a cross-region message bus (Kafka MirrorMaker, SNS + SQS, Pub/Sub). Each region's cache subscriber receives the event and deletes the stale key. Propagation time: 0.5–5 seconds (replication lag + processing). More complex but bounds staleness to seconds regardless of TTL.
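The subscriber side of event-driven invalidation can be sketched as follows. This is an in-memory stand-in, not a Kafka or SNS client: `RegionalCache`, `InvalidationEvent`, and `publishInvalidation` are hypothetical names, and a real bus would deliver events asynchronously with the lag described above.

```typescript
interface InvalidationEvent {
  key: string;
  originRegion: string;
}

class RegionalCache {
  readonly region: string;
  private readonly entries = new Map<string, string>();

  constructor(region: string) {
    this.region = region;
  }

  set(key: string, value: string): void {
    this.entries.set(key, value);
  }

  get(key: string): string | undefined {
    return this.entries.get(key);
  }

  // Handler that a bus subscriber in this region would call for each event.
  onInvalidation(event: InvalidationEvent): void {
    // Deleting a missing key is a no-op, so duplicate or replayed events are safe.
    this.entries.delete(event.key);
  }
}

// Stand-in for the cross-region bus: deliver the event to every region's subscriber.
function publishInvalidation(event: InvalidationEvent, caches: RegionalCache[]): void {
  for (const cache of caches) {
    cache.onInvalidation(event);
  }
}
```

Deleting the key rather than writing the new value keeps the handler idempotent and avoids ordering problems when events arrive out of order.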
Option 3: Write-through to remote caches. On every write, the application updates the local cache AND sends an update to remote region caches. This is eager invalidation — the remote cache has fresh data almost immediately. The risk: if the remote cache update fails (network partition), the remote cache is silently stale. Use a retry queue to handle transient failures.
Option 4: Redis Global Replication. AWS ElastiCache Global Datastore and Redis Enterprise Active-Active replicate cache data across regions automatically. The cache replication runs independently of database replication. This is the managed solution: zero application code for cross-region cache consistency. Trade-off: vendor lock-in and cost.
Cache stampede across regions. If a popular key expires simultaneously in all regions, all regions miss and hit the database concurrently — a global cache stampede. Solutions: (1) jitter TTLs by ± random seconds so expiry is desynchronised, (2) use SWR-style lazy refresh (serve stale, refresh async), (3) single-flight / request coalescing per region so only one process refills the cache.
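Two of these mitigations fit in a short sketch: TTL jitter and request coalescing. Both functions are hypothetical helpers, and `singleFlight` here coalesces only within one process; coalescing across a whole region would need a shared lock, which is outside this sketch.

```typescript
// Spread expiry times so a hot key does not expire in every region at once.
// `rand` is injectable for deterministic tests and defaults to Math.random.
function jitteredTtlSeconds(
  baseSeconds: number,
  jitterSeconds: number,
  rand: () => number = Math.random,
): number {
  const offset = (rand() * 2 - 1) * jitterSeconds; // uniform in [-jitter, +jitter)
  return Math.max(1, Math.round(baseSeconds + offset));
}

// Per-process request coalescing: concurrent misses for one key share one load.
const inFlight = new Map<string, Promise<string>>();

function singleFlight(key: string, load: () => Promise<string>): Promise<string> {
  const existing = inFlight.get(key);
  if (existing) return existing;

  // Clear the entry on both success and failure so a later miss can retry.
  const pending = load().then(
    (value) => {
      inFlight.delete(key);
      return value;
    },
    (err) => {
      inFlight.delete(key);
      throw err;
    },
  );
  inFlight.set(key, pending);
  return pending;
}
```

With a 60-second base and 10 seconds of jitter, each entry expires somewhere between 50 and 70 seconds after it was written.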
Interview tip. When discussing multi-region caching, name the trade-off: short TTLs are simple but allow seconds of staleness; event-driven invalidation is tighter but requires a cross-region messaging layer. Don't claim "Redis handles it" — Redis replication lag is real (100–500 ms cross-region), and the application must decide what "consistent" means for each cached entity.
Case studies
Netflix multi-region — active-active with Zuul, EVCache, and Cassandra
Netflix runs active-active across three AWS regions (us-east-1, us-west-2, eu-west-1). Every region can serve 100% of global traffic independently — there is no "primary" region. This is full active-active with no failover concept: if a region fails, the other two absorb its traffic immediately via DNS-based routing.
Routing. Netflix uses internal DNS (Denominator) and Zuul (API gateway) for region-aware routing. Users are routed to the nearest region by latency. Within a region, Zuul routes to backend microservices.
Data layer. Cassandra is the primary data store, configured for multi-region replication with LOCAL_QUORUM consistency for reads and writes. This means each region reads and writes to its local Cassandra ring with quorum consistency, and replication to other regions is asynchronous. Conflict resolution uses LWW (Cassandra's default) — acceptable because Netflix's data model is append-heavy (viewing history, bookmarks) where LWW is semantically correct.
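The quorum arithmetic behind LOCAL_QUORUM is worth making explicit. The sketch below assumes a per-region replication factor of 3, which is a common setting and an assumption here, not a figure from Netflix; the function names are hypothetical.

```typescript
// Quorum size for a given replication factor: floor(rf / 2) + 1.
function quorumSize(replicationFactor: number): number {
  return Math.floor(replicationFactor / 2) + 1;
}

// Replicas in the local region that can be down while LOCAL_QUORUM still succeeds.
function localFailuresTolerated(localReplicationFactor: number): number {
  return localReplicationFactor - quorumSize(localReplicationFactor);
}
```

With three local replicas, a read or write needs two local acknowledgements and survives one local replica failure, and no acknowledgement from another region is on the request path.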
Caching. EVCache (a Memcached wrapper) runs per-region. Cache invalidation is event-driven: when a write occurs, an invalidation event is published to a cross-region queue, and each region's EVCache subscriber deletes the stale key. Typical invalidation propagation: 1–3 seconds.
Failover drills. Netflix runs "region evacuation" exercises regularly, routing all traffic away from one region to validate that the remaining regions handle the load. This is Chaos Engineering at the regional level — not just killing individual instances but simulating a full region failure.
Numbers. 200M+ subscribers, hundreds of microservices per region, ~1M requests/second per region at peak. Each region is fully autonomous — it has its own Cassandra ring, EVCache cluster, and full microservice fleet. Cross-region data transfer: ~100 TB/month for replication.
Takeaway
Netflix active-active works because the data model is append-heavy (viewing history, bookmarks) where LWW is semantically correct. For write-conflicting data, they use partition-by-user. The lesson: choose active-active only when your data model supports it.
CockroachDB geo-partitioning — strong consistency with region-pinned leaseholders
CockroachDB is a distributed SQL database that provides serializable consistency across regions using Raft consensus. Its geo-partitioning feature directly addresses the multi-region latency vs consistency trade-off.
How geo-partitioning works. Tables can be partitioned by a column (e.g., country or region). Each partition's Raft leaseholder (the node that serves reads and coordinates writes) is pinned to a specific region. EU user data has its leaseholder in the EU region; US user data in the US region. Reads and writes for partitioned data hit local storage — no cross-region RTT.
Cross-partition transactions. When a transaction spans partitions in different regions (e.g., a US user transfers money to an EU user), the transaction coordinator must reach consensus across regions. This pays the full cross-region RTT (80–200 ms) but guarantees serializability. In practice, cross-partition transactions are rare (<1% of queries) — most operations touch a single user's data.
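A quick way to see why rare cross-partition transactions are tolerable is to compute the blended mean. This is back-of-the-envelope arithmetic, not a CockroachDB measurement; `meanLatencyMs` is a hypothetical helper and the inputs are taken from the ranges quoted in this section.

```typescript
// Mean latency when a fraction of transactions must reach consensus across regions.
function meanLatencyMs(
  localMs: number,
  crossRegionMs: number,
  crossRegionFraction: number,
): number {
  return (1 - crossRegionFraction) * localMs + crossRegionFraction * crossRegionMs;
}
```

With 5 ms local writes, 100 ms cross-region writes, and 1% of transactions crossing regions, the mean is about 6 ms. The mean hides the tail, though: that 1% still pays the full round trip, so P99 budgets should be set with the cross-region figure in mind.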
Compliance. Geo-partitioning directly satisfies GDPR data residency: EU users' data physically resides on EU nodes and never leaves. CockroachDB's constraint-based placement ensures replicas of EU partitions are also in EU zones.
Numbers (public benchmarks). Single-region write latency: 2–5 ms. Cross-region write latency (US ↔ EU consensus): 80–120 ms. Read latency (local leaseholder): 1–3 ms. CockroachDB achieves 100K+ TPS per region with geo-partitioning, scaling horizontally by adding nodes.
Trade-off. CockroachDB trades write latency for consistency. For workloads where 100 ms write latency is unacceptable (real-time gaming, high-frequency trading), async replication with application-level conflict resolution is faster. But for most SaaS, e-commerce, and financial applications, the consistency guarantee is worth the latency.
Takeaway
CockroachDB geo-partitioning shows that strong consistency and data residency can coexist with acceptable latency — the key is pinning leaseholders to regions so most operations are local.
Slack active-active migration — from single-region to partitioned multi-region
Slack migrated from a single-region architecture (us-east-1) to active-active across multiple regions. The migration took over two years and is one of the best-documented active-active migrations in the industry.
The partition-by-workspace strategy. Slack partitions data by workspace (team). Each workspace has a home region determined by the workspace creator's geography. All messages, channels, files, and metadata for that workspace live in its home region. Users in other regions who belong to the same workspace are routed to the workspace's home region — not their nearest region. This is partition-by-entity, not partition-by-user.
Why workspace, not user? A user can belong to multiple workspaces across regions. If data were partitioned by user, a single channel with members in different regions would require cross-region writes on every message. By partitioning by workspace, all members of a workspace write to the same region, eliminating write conflicts for the most common operation (sending a message).
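The routing rule that follows from this choice is short. The sketch below illustrates the strategy as described here and is not Slack's code; `Workspace`, `WriteRoute`, and `routeMessageWrite` are hypothetical names.

```typescript
type Region = 'us-east-1' | 'eu-west-1';

interface Workspace {
  id: string;
  homeRegion: Region;
}

interface WriteRoute {
  target: Region;
  crossRegion: boolean;
}

// Route a message write by the workspace's home region, not the sender's location.
function routeMessageWrite(workspace: Workspace, senderNearestRegion: Region): WriteRoute {
  return {
    target: workspace.homeRegion,
    crossRegion: workspace.homeRegion !== senderNearestRegion,
  };
}
```

Every member of a workspace resolves to the same target, so two members can never write the same channel in two different regions.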
Migration approach. Slack migrated workspaces one at a time from us-east-1 to their target home region. Each migration involved: (1) creating the workspace's data in the target region, (2) replicating historical data, (3) switching the workspace's routing to the new region, (4) verifying data integrity, (5) deleting the old data. This was a live migration with zero downtime per workspace.
Cross-region operations. Some operations span regions: shared channels (Enterprise Grid feature where two workspaces in different regions share a channel), user profile updates (a user's profile must be consistent across all their workspaces), and search indexing (global search across all workspaces a user belongs to). These use async event-driven replication with eventual consistency — acceptable because a few seconds of delay for cross-region profile updates is not user-visible.
Numbers. Millions of concurrent connections, billions of messages per month. Each region handles 100K+ WebSocket connections per host. Cross-region replication bandwidth: tens of GB/hour for event streams. Migration moved petabytes of historical data without user-facing downtime.
Takeaway
Slack partitions by workspace (not by user) because the collaboration unit is the workspace. This eliminates write conflicts for the dominant operation (messaging) while accepting async consistency for cross-workspace features.
Decision levers
Posture (single / AP / AA)
Default single region. Active-passive when DR is required (RTO < 15 min). Active-active only when regional latency OR compliance forces it. Active-active is ~3× the ops commitment of single-region — the business must fund the SRE team, not just the infrastructure.
Replication mode (sync / async)
Async is the default: no write latency penalty, RPO of seconds. Semi-sync for tighter RPO (<1 s). Full sync for RPO=0 — but writes pay 80–200 ms cross-region RTT. Match the mode to the business's stated RPO requirement.
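The mapping from stated RPO to replication mode can be written down directly. This is a sketch of the rule of thumb above, with the 1-second boundary taken from the text; `chooseReplicationMode` is a hypothetical helper, not a database setting.

```typescript
type ReplicationMode = 'async' | 'semi-sync' | 'sync';

// Pick the cheapest mode that still meets the business's stated RPO.
function chooseReplicationMode(rpoSeconds: number): ReplicationMode {
  if (rpoSeconds <= 0) return 'sync';     // RPO = 0: every write waits for the remote region
  if (rpoSeconds < 1) return 'semi-sync'; // sub-second RPO
  return 'async';                         // seconds of loss are acceptable
}
```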
Conflict resolution strategy
Partition-by-user whenever possible — no conflicts by construction. LWW for ephemeral data (caches, session state). CRDTs for counters and sets. Application merge as last resort for collaborative editing. Never use LWW for business-critical data without documenting the data-loss risk.
DNS routing policy
Latency-based routing is the default. Geolocation routing when compliance requires region affinity. Failover routing for active-passive. Use 60-second TTL for fast failover. Add Anycast (Global Accelerator, Cloudflare) when DNS TTL propagation delay is unacceptable.
Failover automation level
Manual failover is acceptable for RTO >30 min. Automated failover for RTO <5 min. Fully automated requires: health checks (30 s interval × 3 threshold), auto-promotion, auto-DNS-update. Drill quarterly in production regardless of automation level.
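The RTO implied by these settings can be estimated by adding the phases. The sketch assumes the phases run back to back with no overlap, which is pessimistic; `FailoverBudget` and `estimateRtoSeconds` are hypothetical names, and the 20-second promotion time used in the note below is an illustrative assumption.

```typescript
interface FailoverBudget {
  healthCheckIntervalSeconds: number;
  failureThreshold: number;
  promotionSeconds: number;
  dnsTtlSeconds: number;
}

// Worst-case RTO: detection, then promotion, then DNS caches expiring.
function estimateRtoSeconds(budget: FailoverBudget): number {
  const detection = budget.healthCheckIntervalSeconds * budget.failureThreshold;
  return detection + budget.promotionSeconds + budget.dnsTtlSeconds;
}
```

With a 30-second interval, a threshold of 3, 20 seconds of promotion, and a 60-second TTL, the estimate is 170 seconds, inside a 5-minute target. Dropping the interval to 10 seconds brings it to 110.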
Failure modes
Two regions write the same row; the loser's write is silently dropped. Unacceptable for payments, messages, or any auditable business data. Use partition-by-user or distributed consensus instead.
Active-passive that has never failed over will fail when you need it. Configs drift, creds expire, promotion scripts have untested code paths. Quarterly production drills are the only mitigation.
100 ms transatlantic RTT on every login or page load. Design writes to stay in-region; cross-region calls must be async. If sync is required (inventory check), budget the latency explicitly.
Both regions believe they are primary and accept writes independently. When the partition heals, conflicting writes must be reconciled — possibly with data loss. Prevention: fencing tokens, lease-based leadership, witness nodes.
Aurora Global, DynamoDB Global Tables, and Cosmos DB handle replication and low-level conflict detection. They do NOT define business-level conflict policy. "Two regions updated the same order" is your problem, not the database's.
NTP skew between regions can cause a logically-earlier write to win LWW if its clock is ahead. Hybrid logical clocks (HLCs) mitigate this but add complexity. If using LWW, monitor clock skew and alert on divergence >100 ms.
Async replication of EU PII to a US region may violate GDPR. Exclude PII from cross-region replication or encrypt with EU-controlled keys. Audit the replication stream for PII leaks.
Decision table
Multi-region posture comparison
| Dimension | Single Region | Active-Passive | AA Eventual | AA Strong |
|---|---|---|---|---|
| RTO | Hours (manual) | 2–5 min | ~0 (both live) | ~0 (both live) |
| RPO | Last backup | 0.1–5 s (async lag) | 0.1–5 s (async lag) | 0 (consensus) |
| Write latency | Local (~5 ms) | Local (~5 ms) | Local (~5 ms) | +80–200 ms (RTT) |
| Conflict risk | None | None | Yes (shared data) | None (consensus) |
| Ops complexity | 1× | 2× | 3× | 3–4× |
| Infra cost | 1× | ~2× | ~2.5× | ~3× |
| GDPR compliance | Trivial | Region-pin standby | Region-pin home | Geo-partition |
- Active-passive is the right default for DR. Active-active only when latency or compliance demands it.
- AA eventual with partition-by-user eliminates most conflict-resolution complexity.
- AA strong (distributed consensus) is warranted only for financial / inventory data where RPO=0 is non-negotiable.
Worked example
Worked example: global e-commerce with US + EU regions
Prompt: "Design a global e-commerce platform serving customers in North America and Europe. GDPR compliance required for EU users. Target: 99.99% availability, <100 ms P95 latency for product page loads."
Step 1: Choose the multi-region posture
The requirements demand both latency (<100 ms P95 rules out a single US region for EU users) and compliance (GDPR requires EU data residency). Active-passive does not solve the latency problem — EU users would still hit a US primary. Active-active with partition-by-user is the right posture: each user has a home region (US or EU) determined by their shipping address or account creation geography.
GeoDNS routes users to the nearest region. Each region has its own app tier and DB. Async replication keeps shared, non-PII data (such as the product catalogue) eventually consistent across regions; user PII stays in its home region.
Step 2: Data partitioning
Partition users by home region. A US user's orders, cart, profile, and payment methods live in the US region. An EU user's data lives in the EU region. This satisfies GDPR: EU PII never leaves the EU region. Cross-region data: product catalogue (shared, read-only in both regions via async replication) and aggregated analytics (no PII).
Step 3: DNS routing
Use latency-based DNS (Route 53 or Cloudflare) with 60-second TTL. US users resolve to the US LB; EU users resolve to the EU LB. The rare case of a US user travelling in Europe: they hit the EU region's edge, which proxies writes to the US region for their account. Adds ~100 ms but affects <1% of requests.
Step 4: Database replication
Each region has its own PostgreSQL cluster (multi-AZ within the region). Product catalogue replication: async from a canonical source to both regions. User data: NO cross-region replication of PII. EU user data exists only in the EU region. US user data exists only in the US region. This eliminates conflict resolution for user data and satisfies GDPR.
Step 5: Inventory — the hard problem
Inventory is shared across regions (a product has one global stock count). Options:
- Centralised inventory in one region. Both regions call the US inventory service for stock checks and reservations. Adds 80 ms cross-region RTT for EU checkout. Acceptable if checkout latency SLO is 300 ms.
- CRDT counter. Each region maintains a local stock counter. Decrements are CRDTs (PN-Counter). Merge is conflict-free. Risk: brief overselling during replication lag window. Mitigate with a small buffer (reserve 5% of stock as safety margin).
Choose centralised inventory for correctness; accept the latency on checkout (which is not in the <100 ms P95 target — that is for product page reads).
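For comparison, the CRDT option can be sketched as a PN-Counter. This is a minimal illustration of the data structure, not a production inventory service; all names are hypothetical, and state is exchanged by merging whole counters.

```typescript
type Counts = Record<string, number>;

interface PNCounter {
  increments: Counts; // per-region totals of stock added
  decrements: Counts; // per-region totals of stock sold
}

function bump(counts: Counts, region: string, n: number): Counts {
  return { ...counts, [region]: (counts[region] ?? 0) + n };
}

function restock(c: PNCounter, region: string, n: number): PNCounter {
  return { increments: bump(c.increments, region, n), decrements: c.decrements };
}

function sell(c: PNCounter, region: string, n: number): PNCounter {
  return { increments: c.increments, decrements: bump(c.decrements, region, n) };
}

function total(counts: Counts): number {
  return Object.keys(counts).reduce((sum, region) => sum + (counts[region] ?? 0), 0);
}

function stock(c: PNCounter): number {
  return total(c.increments) - total(c.decrements);
}

// Element-wise max per region: commutative, associative, and idempotent.
function mergeMax(a: Counts, b: Counts): Counts {
  const out: Counts = { ...a };
  for (const region of Object.keys(b)) {
    out[region] = Math.max(out[region] ?? 0, b[region] ?? 0);
  }
  return out;
}

function merge(a: PNCounter, b: PNCounter): PNCounter {
  return {
    increments: mergeMax(a.increments, b.increments),
    decrements: mergeMax(a.decrements, b.decrements),
  };
}
```

The merge never conflicts, but it does not prevent overselling: if each region sells the last unit before the counters merge, the merged stock is -1. That is the risk the safety buffer is meant to absorb, and the reason the centralised service is chosen here.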
Step 6: Caching
Each region has its own Redis cluster. Product catalogue cached with 60-second TTL. Invalidation: async event from the catalogue service triggers cache delete in both regions. User-specific data cached locally, never replicated cross-region.
Step 7: Failover
If the US region fails: US users are re-routed to the EU region via DNS. The EU region can serve product pages and browsing (product catalogue is replicated). US user data (orders, cart) is unavailable until the US region recovers — but the site is not down. Display a banner: "Some account features are temporarily unavailable." RTO for browsing: <2 minutes (DNS). RTO for US user data: depends on US region recovery.
If the EU region fails: mirror scenario. EU user data unavailable; browsing works from US region.
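The degraded-mode decision can be expressed as a small function of the user's home region and the set of healthy regions. This is a sketch of the behaviour described above; `ServiceState` and `serviceStateFor` are hypothetical names.

```typescript
type Region = 'us' | 'eu';

interface ServiceState {
  browsing: boolean;        // catalogue is replicated, so any healthy region can serve it
  accountFeatures: boolean; // orders, cart, and profile live only in the home region
  banner: string | null;
}

function serviceStateFor(homeRegion: Region, healthyRegions: Region[]): ServiceState {
  const browsing = healthyRegions.length > 0;
  const accountFeatures = healthyRegions.indexOf(homeRegion) !== -1;
  const banner =
    browsing && !accountFeatures
      ? 'Some account features are temporarily unavailable.'
      : null;
  return { browsing, accountFeatures, banner };
}
```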
Step 8: Numbers
| Metric | Value |
|---|---|
| Regions | 2 (US-East, EU-West) |
| Product page P95 latency | <50 ms (local CDN + cache) |
| Checkout latency (cross-region inventory) | ~200 ms |
| RTO (browsing) | <2 minutes |
| RPO (user data) | 0 within the home region (multi-AZ); no cross-region copy, so user data is unavailable rather than lost during a regional outage |
| Infrastructure cost vs single-region | ~2.5× |
| GDPR compliance | By construction (region-pinned PII) |
Interview playbook
When it comes up
- Prompt mentions "global user base" or "users on multiple continents"
- Data-residency or GDPR compliance requirement
- DR / availability requirement with RTO < 15 minutes
- "What happens if us-east-1 goes down?"
- Explicit multi-region / geo-distributed requirement
Order of reveal
- 1. Name the posture. Start with the posture: single-region, active-passive, or active-active. Default to active-passive for DR; active-active only when latency or compliance demands it. Active-active is 3× the ops complexity.
- 2. Define the partition strategy. Partition by user or tenant, so each entity has a home region. Writes stay in-region. No conflicts by construction. This is what Slack, WhatsApp, and most global SaaS use.
- 3. Choose the replication mode. Async replication for most data, where seconds of lag are acceptable. Semi-sync for financial data where RPO must be <1 second. Full sync only if the business demands zero RPO and can tolerate 100+ ms write latency.
- 4. Design DNS routing. Latency-based DNS with 60-second TTL. Route 53 or Cloudflare. For instant failover, add Global Accelerator or Anycast to bypass DNS TTL.
- 5. Address conflict resolution. Partition-by-user eliminates conflicts for user data. Shared resources (inventory, leaderboards) need explicit handling: a centralised service, CRDT counters, or LWW with documented data-loss risk.
- 6. Describe failover mechanics. Automated health checks detect failure in 30–60 seconds. DNS failover in 60 seconds. Standby promoted in 20 seconds. Total RTO: 2–3 minutes. Drill quarterly in production.
- 7. Name compliance constraints. GDPR: region-pin EU user PII. No cross-border replication of PII unless adequate safeguards exist. Build deletion propagation for right-to-erasure. This is not an afterthought; it constrains the partition design.
Signature phrases
- “Active-active is a conflict-resolution project, not a deployment project.” — Signals that you understand the core complexity of multi-region, not just "deploy to two regions."
- “Partition by user — no conflicts by construction.” — Names the most practical active-active pattern that most systems should use.
- “Untested failover does not work. Drill quarterly.” — Shows operational maturity — configs drift, creds expire, scripts rot.
- “Cross-region RTT is 80–100 ms. Keep writes in-region.” — Quantitative constraint that drives the partition design.
- “LWW silently loses data — fine for caches, not for payments.” — Shows nuanced understanding of conflict resolution trade-offs.
- “Multi-region triples ops complexity, not just infra cost.” — Honest cost framing that interviewers respect.
Likely follow-ups
“How do you handle a user who travels to a different region?”
The user hits the nearest region's edge. The edge detects their home region from their auth token or account metadata and proxies writes to the home region. Reads can be served from a local replica where one exists; region-pinned PII has no local replica, so those reads are proxied as well. The cross-region proxy adds ~80–100 ms, which is acceptable for the <1% of users who are travelling. If travel is common (mobile workforce), consider sticky sessions or a lightweight region-local write buffer with async forwarding.
“What happens to in-flight writes during failover?”
With async replication, writes that were committed on the primary but not yet replicated to the standby are lost (RPO = replication lag, typically 0.1–5 seconds). The application should log write confirmations and, after failover, reconcile lost writes from application-level logs or let the old primary replay them when it recovers. For zero RPO, use synchronous replication — but accept the write latency penalty.
“How do you avoid split-brain in active-passive?”
Split-brain occurs when both regions think they are the primary. Prevention: (1) fencing — the promoted standby sends a fencing token to the old primary's storage layer, blocking its writes. (2) Lease-based leadership — the primary holds a lease that expires; the standby cannot promote until the lease expires. (3) Witness node — a third, lightweight node in a third AZ or region breaks ties. Cloud-managed databases (RDS, Aurora) handle this internally.
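The fencing mechanism can be sketched with a store that tracks the highest token it has seen. This is a minimal in-memory illustration, assuming tokens are issued in increasing order on each promotion; `FencedStore` is a hypothetical name, not a managed-database API.

```typescript
class FencedStore {
  private highestToken = 0;
  private readonly data = new Map<string, string>();

  // Accept a write only if its token is at least the highest token seen so far.
  write(token: number, key: string, value: string): boolean {
    if (token < this.highestToken) return false; // writer was superseded: reject
    this.highestToken = token;
    this.data.set(key, value);
    return true;
  }

  read(key: string): string | undefined {
    return this.data.get(key);
  }
}
```

Once the promoted standby writes with a newer token, any late write from the old primary carries a stale token and is rejected at the storage layer, even if the old primary still believes it is in charge.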
“How do you test multi-region?”
Three layers: (1) integration tests with simulated latency (tc netem or Toxiproxy) to validate that cross-region calls do not exceed latency budgets, (2) chaos testing — kill one region's app tier and verify failover completes within RTO, (3) quarterly production failover drill — actually move traffic to the standby and verify the entire stack works end-to-end. Test the failback too — recovering the old primary is often harder than the initial failover.
“What is the cost of multi-region?”
Roughly 2.5–3× single-region: full compute and storage in two regions, plus cross-region data transfer ($0.02/GB on AWS). But the real cost is operational: separate deploy pipelines, cross-region monitoring, 24/7 on-call for regional issues, replication lag management, failover runbooks. If the business needs multi-region, it is also committing to the SRE team to run it.
“How do you handle schema migrations across regions?”
Online schema migration tools (pt-online-schema-change, gh-ost for MySQL; pg_repack for Postgres) run in the primary region and replicate the schema change to the standby via the replication stream. For active-active, schema changes must be backward-compatible: add columns with defaults, never rename or drop columns in the same release. Use expand-contract migrations: expand (add new column), migrate (backfill data, update code), contract (remove old column in a later release).
Code snippets
Health check that verifies the full request path

```typescript
// Builds the health payload for this region. The HTTP handler should return
// 200 when status is 'healthy' and 503 when any dependency check fails.
import { Pool } from 'pg';
import Redis from 'ioredis';

const db = new Pool({ connectionString: process.env.DATABASE_URL });
const redis = new Redis(process.env.REDIS_URL!);

interface HealthStatus {
  status: 'healthy' | 'unhealthy';
  checks: Record<string, boolean>;
  region: string;
  timestamp: string;
}

export async function healthCheck(): Promise<HealthStatus> {
  const region = process.env.AWS_REGION ?? 'unknown';
  const checks: Record<string, boolean> = {};

  // Check database connectivity
  try {
    await db.query('SELECT 1');
    checks['database'] = true;
  } catch {
    checks['database'] = false;
  }

  // Check cache connectivity
  try {
    await redis.ping();
    checks['cache'] = true;
  } catch {
    checks['cache'] = false;
  }

  const allHealthy = Object.values(checks).every(Boolean);
  return {
    status: allHealthy ? 'healthy' : 'unhealthy',
    checks,
    region,
    timestamp: new Date().toISOString(),
  };
}
```

Last-write-wins conflict resolution using HLC timestamps
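```typescript
// An HLC timestamp is the max of the local wall clock and the highest
// timestamp seen so far, plus a logical counter for same-millisecond events.
interface HLCTimestamp {
  wallMs: number;  // wall-clock milliseconds
  logical: number; // logical counter for same-ms events
  nodeId: string;  // region identifier
}

// Advance the clock for a local event (a write originating on this node).
function hlcNow(nodeId: string, last: HLCTimestamp): HLCTimestamp {
  const wall = Date.now();
  if (wall > last.wallMs) {
    return { wallMs: wall, logical: 0, nodeId };
  }
  return { wallMs: last.wallMs, logical: last.logical + 1, nodeId };
}

// Advance the clock when a remote timestamp arrives, so later local
// writes always order after the remote write.
function hlcReceive(nodeId: string, last: HLCTimestamp, remote: HLCTimestamp): HLCTimestamp {
  const wallMs = Math.max(Date.now(), last.wallMs, remote.wallMs);
  let logical = 0;
  if (wallMs === last.wallMs && wallMs === remote.wallMs) {
    logical = Math.max(last.logical, remote.logical) + 1;
  } else if (wallMs === last.wallMs) {
    logical = last.logical + 1;
  } else if (wallMs === remote.wallMs) {
    logical = remote.logical + 1;
  }
  return { wallMs, logical, nodeId };
}

function hlcCompare(a: HLCTimestamp, b: HLCTimestamp): number {
  if (a.wallMs !== b.wallMs) return a.wallMs - b.wallMs;
  if (a.logical !== b.logical) return a.logical - b.logical;
  // Deterministic tiebreak: plain string comparison, independent of locale.
  return a.nodeId < b.nodeId ? -1 : a.nodeId > b.nodeId ? 1 : 0;
}

// Resolve conflict: pick the write with the higher HLC
function lwwResolve<T>(
  localWrite: { value: T; ts: HLCTimestamp },
  remoteWrite: { value: T; ts: HLCTimestamp },
): T {
  return hlcCompare(localWrite.ts, remoteWrite.ts) >= 0
    ? localWrite.value
    : remoteWrite.value;
}
```

AWS Route 53 health check + failover routing policy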
```yaml
# Primary: us-east-1. Secondary: eu-west-1.
# HostedZone, USLBZone, and EULBZone are referenced here but must be
# defined elsewhere in the template (for example as parameters).
Resources:
  PrimaryHealthCheck:
    Type: AWS::Route53::HealthCheck
    Properties:
      HealthCheckConfig:
        FullyQualifiedDomainName: api-us.example.com
        Port: 443
        Type: HTTPS
        ResourcePath: /health
        RequestInterval: 10   # check every 10 seconds
        FailureThreshold: 3   # 3 failures = unhealthy
        EnableSNI: true
  PrimaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: api.example.com
      Type: A
      AliasTarget:
        DNSName: us-east-1-lb.example.com
        HostedZoneId: !Ref USLBZone
      SetIdentifier: primary
      Failover: PRIMARY
      HealthCheckId: !Ref PrimaryHealthCheck
  SecondaryRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneId: !Ref HostedZone
      Name: api.example.com
      Type: A
      AliasTarget:
        DNSName: eu-west-1-lb.example.com
        HostedZoneId: !Ref EULBZone
      SetIdentifier: secondary
      Failover: SECONDARY
```

Monitor replication lag between primary and replica
```typescript
// Alert when lag exceeds thresholds
import { Pool } from 'pg';

interface LagMetrics {
  region: string;
  lagSeconds: number;
  status: 'ok' | 'warning' | 'critical';
  measuredAt: string;
}

const LAG_WARNING_THRESHOLD = 2;   // seconds
const LAG_CRITICAL_THRESHOLD = 10; // seconds

export async function checkReplicationLag(
  replicaPool: Pool,
  region: string,
): Promise<LagMetrics> {
  // pg_last_xact_replay_timestamp() is only meaningful on a standby, so the
  // query runs against the replica (PostgreSQL 10+ function names). When the
  // replica has replayed everything it received, report zero so that an idle
  // primary is not mistaken for growing lag.
  const result = await replicaPool.query(`
    SELECT CASE
      WHEN pg_last_wal_receive_lsn() = pg_last_wal_replay_lsn() THEN 0
      ELSE EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
    END AS lag_seconds
  `);
  // The value is NULL when the server is not a standby; treat that as zero lag.
  const lagSeconds = Number(result.rows[0]?.lag_seconds ?? 0);

  let status: LagMetrics['status'] = 'ok';
  if (lagSeconds > LAG_CRITICAL_THRESHOLD) status = 'critical';
  else if (lagSeconds > LAG_WARNING_THRESHOLD) status = 'warning';

  const metrics: LagMetrics = {
    region,
    lagSeconds,
    status,
    measuredAt: new Date().toISOString(),
  };

  if (status !== 'ok') {
    console.error(
      `[REPLICATION LAG] ${region}: ${lagSeconds.toFixed(1)}s (${status})`
    );
  }
  return metrics;
}
```

Route writes to the user's home region based on data residency
```typescript
// Home region determined at account creation from user geography
type Region = 'us-east-1' | 'eu-west-1' | 'ap-southeast-1';

interface UserRegionMapping {
  userId: string;
  homeRegion: Region;
}

const REGION_ENDPOINTS: Record<Region, string> = {
  'us-east-1': 'https://api-us.example.com',
  'eu-west-1': 'https://api-eu.example.com',
  'ap-southeast-1': 'https://api-ap.example.com',
};

// Look up user's home region from account metadata
async function getUserHomeRegion(userId: string): Promise<Region> {
  // In production: cached in Redis with fallback to account DB
  const resp = await fetch(
    `https://accounts.internal/users/${encodeURIComponent(userId)}/region`
  );
  if (!resp.ok) {
    // Fail loudly rather than guess a region and write PII to the wrong place.
    throw new Error(`Region lookup failed for user ${userId}: HTTP ${resp.status}`);
  }
  const data = (await resp.json()) as UserRegionMapping;
  return data.homeRegion;
}

// Route a write to the correct region
export async function routeWrite(
  userId: string,
  path: string,
  body: unknown,
  currentRegion: Region,
): Promise<Response> {
  const homeRegion = await getUserHomeRegion(userId);

  // If already in the home region, write locally
  if (homeRegion === currentRegion) {
    return fetch(`http://localhost:8080${path}`, {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(body),
    });
  }

  // Proxy write to the home region
  const endpoint = REGION_ENDPOINTS[homeRegion];
  return fetch(`${endpoint}${path}`, {
    method: 'POST',
    headers: {
      'Content-Type': 'application/json',
      'X-Proxied-From': currentRegion,
    },
    body: JSON.stringify(body),
  });
}
```

Drills
Active-passive with async replication has 3-second lag. Primary fails. How much data is lost?
RPO = replication lag at the moment of failure = ~3 seconds of committed transactions. These transactions were confirmed to the client but not yet replicated to the standby. They are lost unless the old primary can be recovered and its WAL replayed. This is why RPO is measured in seconds for async replication, not zero.
Your cloud DB claims "multi-master." Is conflict resolution solved?
No. Multi-master handles replication and low-level conflict detection (e.g., row-level version vectors). It does not define business-level policy: "two regions both updated order #42 — which update wins?" is your application's decision. The DB's default (usually LWW) may silently lose the important write.
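The difference between the database default and an application policy can be sketched as follows. The order shape, the status ranking, and the merge rule ("status never moves backwards, address follows the newest write") are assumptions chosen for illustration, not a policy any particular database provides.

```typescript
type OrderStatus = 'paid' | 'shipped' | 'delivered';

interface OrderVersion {
  orderId: number;
  status: OrderStatus;
  shippingAddress: string;
  updatedAtMs: number; // wall-clock timestamp from the writing region
}

// Database-style default: the newest timestamp wins for the whole row.
function lastWriteWins(a: OrderVersion, b: OrderVersion): OrderVersion {
  return a.updatedAtMs >= b.updatedAtMs ? a : b;
}

const STATUS_RANK: Record<OrderStatus, number> = { paid: 0, shipped: 1, delivered: 2 };

// Assumed application policy: status never moves backwards;
// every other field follows the newest write.
function mergeOrder(a: OrderVersion, b: OrderVersion): OrderVersion {
  const newest = lastWriteWins(a, b);
  const furthest = STATUS_RANK[a.status] >= STATUS_RANK[b.status] ? a : b;
  return { ...newest, status: furthest.status };
}

// US marks order #42 shipped; 1 ms later EU edits the address on a stale copy.
const fromUs: OrderVersion = { orderId: 42, status: 'shipped', shippingAddress: '1 Old Street', updatedAtMs: 1000 };
const fromEu: OrderVersion = { orderId: 42, status: 'paid', shippingAddress: '2 New Street', updatedAtMs: 1001 };

const lwwResult = lastWriteWins(fromUs, fromEu); // status 'paid': the shipment update is silently lost
const mergedResult = mergeOrder(fromUs, fromEu); // status 'shipped', address '2 New Street'
```

Whether this merge rule is right depends on the business; the point is that someone has to decide it, and the database will not.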
A US user sends a message in a Slack workspace homed in EU. What happens?
The user's client connects to the nearest edge (US). The edge detects the workspace's home region (EU) and proxies the write to the EU region. The message is stored in the EU database. The US user sees ~100 ms additional latency on the write (cross-region RTT). Reads are served from the EU region's data, possibly via a US read replica with seconds of async lag.
How do you prevent split-brain in active-passive failover?
Three mechanisms: (1) fencing tokens — the promoted standby sends a token to storage that blocks the old primary's writes, (2) lease-based leadership — the primary holds a time-limited lease; standby cannot promote until the lease expires, (3) witness node in a third location that breaks ties. Cloud-managed databases (RDS Multi-AZ, Aurora) implement fencing internally.
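Mechanism (1) can be illustrated with a small in-memory sketch. The `FencedStore` class is a hypothetical stand-in for a storage layer that checks tokens; it assumes tokens are issued monotonically by some lock or coordination service, which is not shown.

```typescript
// Hypothetical storage layer that enforces fencing tokens.
class FencedStore {
  private highestToken = 0;
  private data = new Map<string, string>();

  // Accept a write only if its token is at least as new as any token seen so far.
  write(token: number, key: string, value: string): boolean {
    if (token < this.highestToken) {
      return false; // stale leader: reject the write
    }
    this.highestToken = token;
    this.data.set(key, value);
    return true;
  }

  read(key: string): string | undefined {
    return this.data.get(key);
  }
}

const store = new FencedStore();
const oldPrimaryToken = 1;
const promotedStandbyToken = 2; // assumed to be issued on promotion, strictly larger

const beforeFailover = store.write(oldPrimaryToken, 'order:42', 'paid');         // accepted
const afterPromotion = store.write(promotedStandbyToken, 'order:42', 'shipped'); // accepted
const zombieWrite = store.write(oldPrimaryToken, 'order:42', 'cancelled');       // rejected
```

Once the store has seen the standby's token, the old primary cannot overwrite data even if it still believes it is the leader, which is the property that prevents split-brain writes.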
GDPR requires you to delete an EU user's data. They have cached data in the US region. How?
Publish a deletion event to a cross-region message bus. Each region's subscriber: (1) deletes the user's rows from the database, (2) invalidates all cache keys associated with the user, (3) removes the user from search indexes, (4) marks backups for selective redaction (or waits for backup rotation). Confirm deletion across all regions before acknowledging the request. This is a pipeline, not a single DELETE statement.
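The pipeline can be sketched with in-memory stand-ins for each region's stores. Everything here is hypothetical: the `RegionStores` shape, the `user:<id>:` cache-key prefix, and the synchronous fan-out that replaces the real message bus.

```typescript
type RegionName = 'us-east-1' | 'eu-west-1' | 'ap-southeast-1';

// Hypothetical per-region stores; real ones would be a database, cache, and search index.
interface RegionStores {
  rows: Map<string, string>;      // userId -> row
  cache: Map<string, string>;     // keys assumed to be prefixed with `user:<id>:`
  searchIndex: Set<string>;       // indexed userIds
  backupRedactions: Set<string>;  // userIds flagged for redaction in backups
}

function emptyStores(): RegionStores {
  return { rows: new Map(), cache: new Map(), searchIndex: new Set(), backupRedactions: new Set() };
}

// Subscriber logic each region runs when it receives the deletion event.
function handleDeletion(stores: RegionStores, userId: string): boolean {
  stores.rows.delete(userId);                       // (1) database rows
  const prefix = `user:${userId}:`;
  const staleKeys: string[] = [];
  stores.cache.forEach((_value, key) => {
    if (key.startsWith(prefix)) staleKeys.push(key);
  });
  staleKeys.forEach((key) => stores.cache.delete(key)); // (2) cache keys
  stores.searchIndex.delete(userId);                // (3) search index
  stores.backupRedactions.add(userId);              // (4) flag backups
  // Confirm only if no trace remains in the live stores.
  return !stores.rows.has(userId) && !stores.searchIndex.has(userId);
}

// Coordinator: acknowledge the request only when every region has confirmed.
function deleteEverywhere(regions: Map<RegionName, RegionStores>, userId: string): boolean {
  let allConfirmed = true;
  regions.forEach((stores) => {
    if (!handleDeletion(stores, userId)) allConfirmed = false;
  });
  return allConfirmed;
}

// An EU user with a row at home and a cached profile in the US region.
const eu = emptyStores();
eu.rows.set('u1', 'profile row');
eu.searchIndex.add('u1');
const us = emptyStores();
us.cache.set('user:u1:profile', 'cached profile');
us.cache.set('user:u2:profile', 'someone else');

const regions = new Map<RegionName, RegionStores>();
regions.set('eu-west-1', eu);
regions.set('us-east-1', us);
const acknowledged = deleteEverywhere(regions, 'u1');
```

In a real system each region would confirm asynchronously and the coordinator would retry or alert on a missing confirmation; the sketch only shows the shape of "fan out, clean every store, then acknowledge".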
Why might you choose active-passive over active-active even with a global user base?
If the latency SLO is achievable from one region (e.g., 200 ms is acceptable for all users), or if the data model does not support clean partitioning (everything is shared state), or if the team lacks the SRE capacity to operate active-active. Active-passive is half the complexity and still provides DR. The latency cost for distant users may be acceptable given the operational savings.