Replication & durability
Leader/follower, sync vs async replication, write quorum, RPO/RTO.
Replication is how you survive a node death; durability is how you survive a bad deploy. Candidates confuse the two and end up with a design that's highly available but cheerfully corrupt.
Read this if your last attempt…
- You said "we'll replicate the DB" without naming sync vs async
- You can't state your RPO and RTO as numbers
- You don't have a backup-and-restore story separate from replication
- You confuse "highly available" with "durable"
The concept
Replication gives you more than one copy of the live data so a hardware or process failure doesn't take the service down. Durability is the guarantee that once you've acknowledged a write, it will survive any subsequent failure — including bugs, fat-fingered admins, and catastrophic deploys.
Different problems, different tools.
Sync replica = no data loss on leader failure (RPO=0). Async replicas = scale reads cheaply. Pick the mix by cost tolerance.
Replication shapes — when each wins.
| Leader/follower | Multi-leader | Leaderless quorum | |
|---|---|---|---|
| Writes | Single leader | Many leaders | Any N of replicas |
| Failover | Promote follower | Continue — any leader alive | Transparent — no leader |
| Conflict handling | None (single writer) | App-level or CRDTs | Last-write-wins / vector clocks |
| Best for | OLTP, strong consistency | Multi-region active-active | AP stores at planet scale |
| Worst for | Cross-region writes | Strongly-consistent workloads | Relational queries |
| Example systems | Postgres, MySQL, MongoDB | CouchDB, Active-active MySQL | Cassandra, Dynamo, Riak |
How interviewers grade this
- You name RPO and RTO for each data class (e.g. "payments: RPO=0, RTO<30s; analytics: RPO=15min, RTO=1h").
- You distinguish sync from async replicas and say why you chose the split.
- You have a backup story separate from replication — PITR, off-region snapshots, tested restore drill.
- If you mention quorum, you state the N/R/W triple and its strong-read condition (R+W>N).
- You name how leader election / failover happens and its blast radius.
Variants
Synchronous replication
Leader waits for at least one replica to ack before acking the client.
Zero-data-loss on leader failure. Cost: every write pays an extra network round-trip; if all sync replicas are down, writes block. Practical shape: one sync replica in the same AZ plus N async replicas elsewhere.
Pros
- +RPO = 0 for acknowledged writes
- +Safe for financial / regulatory workloads
- +Enables zero-downtime leader failover
Cons
- −Extra write latency (~1–5 ms intra-AZ)
- −If all sync replicas down, writes halt
- −Cost scales linearly with sync replica count
Choose this variant when
- Payments, ledgers, auth
- Anywhere data loss = business loss
Asynchronous replication
Leader acks immediately; replica catches up on its own clock.
Cheap and fast — writes don't pay for replication latency. On leader failure you may lose the last N milliseconds of writes that never reached the follower. Perfect for analytics and read scaling; unsafe for "must not lose".
Pros
- +Write latency unaffected
- +Cheap — no cross-AZ round-trip per write
- +Read replicas scale horizontally
Cons
- −RPO > 0 — you will lose some recent writes on failover
- −Replicas can lag arbitrarily under load
- −Stale reads are a fact of life
Choose this variant when
- Read scale-out
- Analytics replicas
- Cross-region replicas for disaster recovery
Quorum (Dynamo-style)
Write to W replicas, read from R; choose R + W > N for strong consistency.
N=5 replicas. Write waits for W=3 acks; read queries R=3 replicas and picks the latest version. 3+3>5 means every read overlaps every write by at least one replica, so stale reads are impossible.
No leader, no failover. Every node is equal; the coordinator fans out writes to N replicas and waits for W acks. Can be tuned per-query: R=1,W=N for fast reads + safe writes, or R=N,W=1 for fast writes + safe reads. R+W > N guarantees read-your-writes.
Pros
- +No leader election — survives any minority loss
- +Per-query consistency knob
- +Scales horizontally without a single-leader bottleneck
Cons
- −Conflict resolution is the app's problem
- −Complex operational mental model
- −Poor fit for strongly-consistent range queries
Choose this variant when
- Huge write throughput with relaxed consistency
- Always-writable globally-distributed stores
Worked example
Scenario: Designing the database layer for a payments service.
Data classes:
ledger— immutable double-entry records. Must be durable. RPO = 0.accounts— current balance, updated per transaction. RPO = 0.audit_logs— append-only event trail. RPO ≤ 1 min.analytics_rollups— derived, rebuildable. RPO = hours (can recompute).
Replication topology for ledger + accounts (Postgres):
- Leader in AZ-A.
- One synchronous replica in AZ-B (same region) — protects against AZ loss without data loss. Write latency +1 ms.
- Two asynchronous replicas: one in AZ-C for read scale-out, one in the DR region for multi-region disaster recovery.
- RTO < 30 s with managed failover (Patroni, RDS Multi-AZ, Aurora).
Backup topology:
- Continuous WAL archival to object storage. PITR to any second in the last 7 days.
- Daily full snapshots retained 30 days, cross-region replicated.
- Monthly restore drill: spin up a new cluster from backup, run validation queries, tear down.
Durability-vs-replication incident examples:
- Pod OOMs → failover to sync replica. Zero data loss, RTO ~20 s. Replication handles it.
- Deploy includes
DELETE FROM accounts WHERE status='inactive'with a bug → runs on every replica. Only backups save you. PITR to 5 min before deploy, reconcile the legitimate writes between then and now.
State the separation explicitly. "Replication defends against hardware failure; backups defend against software failure. You need both."
Good vs bad answer
Interviewer probe
“How do you make the database highly available?”
Weak answer
"We'll put the DB behind a load balancer and have a few replicas. If one goes down the LB routes to another."
Strong answer
"One primary with synchronous replication to a same-region replica (RPO=0, +1 ms write latency), plus async replicas for read scale-out. Automatic failover via managed service or Patroni — RTO under 30 s. That handles HA. For durability against bad code or operator error, continuous WAL archival gives point-in-time restore for 7 days; daily full snapshots cross-region for 30 days. We run a restore drill monthly so we know the RTO isn't a lie. I don't put the DB behind a general LB — writes go only to the leader, reads split based on staleness tolerance."
Why it wins: Names sync vs async replication, quantifies RPO/RTO, separates replication (HA) from backups (durability), and calls out the operational discipline of tested restores. Also corrects the LB misconception — DB writes need leader awareness, not round-robin.
When it comes up
- Whenever "high availability", "uptime", or "99.9%" enters the design
- When the interviewer asks "what happens when this database fails?"
- In multi-region design discussions — replication topology is the hardest question
- When you propose a stateful component (DB, cache, queue) — how is it replicated?
- When durability, backups, or disaster recovery is probed
Order of reveal
- 1Separate replication from durability. "Replication defends against hardware failure; backups defend against software failure. Two different problems, two different tools, both required."
- 2State RPO and RTO as numbers. "For payments: RPO = 0, RTO < 30 seconds. For analytics: RPO = 15 minutes, RTO = 1 hour. Different classes of data, different tolerances."
- 3Name the replication topology. "One leader, one synchronous replica same-region for zero data loss, two async replicas for read scaling and DR. Primary is Postgres; failover is managed by Patroni or the cloud."
- 4Call out the sync replica tradeoff explicitly. "Sync replication adds ~1 ms write latency but guarantees RPO=0. If all sync replicas are down, writes block — a feature, not a bug, for financial data."
- 5Describe the backup story. "Continuous WAL archival to S3 gives PITR for 7 days. Daily snapshots cross-region for 30 days. Monthly tested restore drill so the RTO isn't a rumour."
- 6Address read routing. "Writes go to the leader only. Read-your-writes queries go to the leader. Stale-tolerant reads (analytics, history) go to async replicas. Session-sticky reads use replication LSN tracking."
Signature phrases
- “Replication survives hardware failure; backups survive operator failure” — The single most important distinction — reveals maturity instantly.
- “RPO and RTO are numbers, not adjectives” — Forces concrete commitments instead of "highly available".
- “Sync same-region, async cross-region” — The standard senior topology in one phrase.
- “An untested backup is a rumour” — Memorable operational discipline.
- “Writes to the leader, reads split by staleness tolerance” — Correct mental model for routing.
- “R + W > N for strong quorum consistency” — Precise formula, instantly signals Dynamo-family knowledge.
Likely follow-ups
?“Walk me through failover when the leader dies. What's the timeline, and what can go wrong?”Reveal
Timeline (happy path, managed failover):
- 1t=0 Leader crashes or becomes unreachable.
- 2t=0 to t=5s Coordinator (Patroni, RDS, Aurora controller) detects missed heartbeats. Detection threshold is usually 3-5 seconds.
- 3t=5s to t=15s Coordinator fences the old leader (prevent it from accepting writes if it comes back — STONITH or lease revocation) and elects a new leader from the sync replicas. The sync replica with the highest LSN (last committed log position) wins.
- 4t=15s to t=25s Promote the chosen replica: replay any in-flight transactions, open for writes.
- 5t=25s to t=30s Update the DNS record or connection endpoint. Clients reconnect.
Total RTO: ~20-30 seconds for a well-tuned setup.
Coordinator detects missed heartbeats, fences the old leader to prevent split-brain, promotes the sync replica with the highest LSN, then swaps the connection endpoint. RTO: ~20-30s for a well-tuned setup.
What goes wrong:
- 1Split-brain — old leader comes back before fencing completes, accepts writes. Now two leaders, diverging state. Fix: fencing must be synchronous before promotion, not parallel.
- 1All sync replicas also down — you can't promote without either accepting data loss or waiting for a sync replica to return. The choice: downtime or RPO > 0. Document which your runbook prefers.
- 1Promotion fails — the chosen replica can't replay its log (disk corruption, schema mismatch). Fall back to the next candidate. RTO balloons.
- 1Clients cache old DNS — some clients cache DNS beyond the TTL. A 60-second TTL means a long tail of clients still hitting the dead leader for minutes. Use an endpoint with short TTL (30-60 s) or a connection manager that handles failover.
- 1Connection pool thundering herd — 1000 app instances reconnect simultaneously on the new leader, overwhelming it. Stagger reconnects with jittered backoff; warm connection pools.
The production rule: rehearse failover monthly in a non-prod environment. Measure your RTO. If it's not < your SLO, tune or change topology.
?“Someone runs `UPDATE users SET email=NULL` on the primary. Replication completes in seconds. Now what?”Reveal
The bad news: replication did its job perfectly. The damage is propagated to every replica. Neither sync nor async replicas help here — they all faithfully replayed the destructive write.
The recovery, in order of preference:
1. Point-in-time restore (PITR) — the right answer.
- Target: the moment just before the bad command (say, 14:32:17).
- Restore the DB to a side instance at that timestamp (most managed services: ~10-30 minutes for a TB-scale DB).
- Extract the
emailcolumn for affected users. - Forward-merge back into the production DB, preserving any legitimate updates that happened between 14:32:17 and now (use
updated_atto tie-break). - Total recovery: 30-60 minutes depending on data size and merge complexity.
2. If PITR window has expired: fall back to the latest daily snapshot (say 24 hours old). Restore, extract, merge. You lose legitimate updates between the snapshot and the bad command; reconcile from audit logs if you have them.
3. If no backups at all: reconstruction from application logs, event streams, or external systems (identity provider, CRM). This is the "we learned why backups matter" scenario.
The harder question — prevention:
- Migration review — no DML on prod without a reviewed plan, tested on staging.
- Logical backups of critical columns — append-only snapshot of email/PII columns daily, retained separately.
- Audit log of DDL/DML — who ran what, when. Required for forensics.
- Slow-down rules —
UPDATEwithout aWHEREclause blocked by default in prod. DB-level safeguards like thepg_dangerouspattern.
The interview takeaway: replication and durability defend against different failures. An answer that leans only on replication misses the most common real-world data loss scenario.
?“You have N=5, W=3, R=3 in a Dynamo-style store. Three nodes go offline. Can you read? Can you write?”Reveal
Writes: blocked. W=3 requires 3 ack'd writes; only 2 nodes are reachable. Clients see write timeouts or errors.
Reads: also blocked with strict quorum. R=3 requires 3 read responses; only 2 are reachable.
Is R + W > N still satisfied? Yes (3+3>5), so the guarantee holds — but only when you have enough nodes to form a quorum at all.
What options the system has:
1. Sloppy quorum (Dynamo, Cassandra). The coordinator writes to any 3 available nodes (even if they're not the "correct" replicas). Hinted handoff stores the write with a note to replay it to the canonical replicas when they return. Writes continue; you've traded strict quorum for availability.
- Cost: reads may not see recent sloppy-quorum writes until the hints are replayed. Read-your-writes becomes eventual.
- When: Cassandra's default behavior. Good for AP workloads.
2. Strict quorum — writes block, system is unavailable for writes. This is what Spanner-style systems do. Maintains strong consistency even under failure; trades availability.
- When: systems where losing a quorum should halt writes (financial ledgers, consensus-backed state).
3. Quorum adjustment. If you anticipate this failure shape, you can tune: N=5, W=2, R=4 gives you write-availability under 3-node failure, strong reads require 4 (rare during incidents — probably unavailable too).
The key insight the interviewer wants: quorum is a tradeoff between consistency (higher R+W) and availability (lower W or R). No setting survives "most of the cluster is down" gracefully; you must pick which property degrades first.
?“How do you handle cross-region replication? What are the real latency and consistency costs?”Reveal
Cost 1: write latency. Cross-region RTT is typically 60-150 ms. Synchronous replication across regions would turn every write into a cross-continental round trip.
- Implication: sync replication is region-local. Cross-region replication is always async.
- Recovery consequence: on total region loss, RPO > 0 — typically seconds to minutes of unreplicated writes are lost.
Cost 2: stale reads cross-region. An async replica in another region lags by the network latency + replication pipeline — typically hundreds of milliseconds to seconds under normal load, and can grow to minutes under replication lag.
- Implication: reads from the DR region may see data that's behind. Read-your-writes across regions is not free.
Cost 3: split-brain risk. If the primary region is partitioned from the DR region, both sides may think the other is dead and start accepting writes. Requires a coordinator (zookeeper, etcd) outside both regions or a human-in-the-loop for failover.
Common topologies:
1. Active-passive (single-writer). One region is the primary; others are async read-only replicas. Simple, predictable. RPO > 0 on failover. Users globally still write to the primary — cross-region write latency is unavoidable.
2. Active-active with partitioned writes. Each region is primary for its own users (e.g. US users write to us-east, EU users write to eu-west). Async replication between regions makes the other region's data queryable but not writable.
3. Active-active multi-leader. All regions accept all writes. Conflict resolution required (CRDTs, LWW, app-level). Cassandra, DynamoDB Global Tables, CockroachDB offer this. Complex.
4. Paxos/Raft across regions. Spanner-style. Each shard's leaders are in 3+ regions; commit requires a majority. Write latency ~100 ms (cross-region RTT). Strong consistency, high latency. Google Spanner is the reference.
The honest answer in an interview: cross-region replication is always async; if the prompt requires strongly consistent multi-region writes, cost it at ~100 ms write latency minimum and accept that regional failover loses data.
?“Your read replica is lagging 30 seconds behind the leader. What should you do?”Reveal
First — is this normal or abnormal?
Normal lag causes (don't panic, but monitor):
- Write burst on the leader temporarily exceeding the replica's apply rate.
- Long-running query on the replica holding locks (for systems where replicas can serve mixed workloads).
- Replica hardware weaker than the leader.
Abnormal causes (investigate immediately):
- Replica falling permanently behind — apply rate < commit rate. Replica will never catch up without intervention.
- Network partition between leader and replica.
- Corrupt WAL entry stopping replication.
Responses in order:
1. Route traffic away from the lagging replica. Either the LB / connection manager detects lag > threshold and stops routing, or you manually remove it from the read pool. Any read-your-writes or staleness-sensitive query going to this replica is wrong.
2. Diagnose the lag cause.
- Check
pg_stat_replication/ equivalent — is itstreaming(good),catchup(recovering from a gap), or stuck? - Is the replica CPU-bound? Disk I/O bound? Network saturated?
- Is a long-running query holding a lock?
3. Mitigate the symptom.
- Kill long-running read queries that are blocking apply.
- Increase replica resources if it's structurally underpowered.
- Temporarily reduce leader write rate (throttle batch jobs) if the replica is legitimately overwhelmed.
4. Decide whether to wait or rebuild.
- If the replica is behind by minutes and catching up, wait.
- If it's permanently behind (apply rate < commit rate), rebuild: either take a fresh base backup + start replication from there, or spin up a new replica and retire the lagging one.
5. Update alerting.
- Alert on replication lag > threshold (e.g., 5 seconds warning, 30 seconds page).
- Alert on apply-rate vs commit-rate divergence — catches the "falling behind" case before it becomes visible lag.
The broader lesson: async replicas have inherent lag. Design for it: route stale-tolerant reads to replicas, send RYW queries to the leader, and monitor lag as a first-class metric.
Code examples
# patroni.yml
scope: payments
name: pg-node-a
restapi:
listen: 0.0.0.0:8008
connect_address: pg-node-a:8008
etcd:
hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379
bootstrap:
dcs:
ttl: 30
loop_wait: 10
retry_timeout: 10
maximum_lag_on_failover: 1048576 # 1 MB — cap promoted replica lag
synchronous_mode: true # require sync replica before acking
synchronous_mode_strict: true # refuse writes if no sync replica
postgresql:
use_pg_rewind: true
parameters:
wal_level: replica
synchronous_commit: on
synchronous_standby_names: 'ANY 1 (pg-node-b, pg-node-c)'
archive_mode: on
archive_command: 'wal-g wal-push %p' # continuous WAL to S3 (PITR)
max_wal_senders: 10
# Monthly restore drill verifies the RTO isn't a rumour.
# wal-g backup-push /var/lib/postgresql/data → daily full snapshots
# wal-g backup-fetch /tmp/restore LATEST → tested recovery pathimport psycopg2
from dataclasses import dataclass
@dataclass
class PgPool:
leader: psycopg2.extensions.connection
replicas: list # [(conn, region, last_seen_lsn)]
def current_lsn(conn) -> int:
with conn.cursor() as c:
c.execute("SELECT pg_current_wal_lsn()::pg_lsn - '0/0'::pg_lsn")
return c.fetchone()[0]
def pick_read(pool: PgPool, max_lag_bytes: int = 1_048_576):
"""Return a connection whose replication lag fits the staleness budget."""
leader_lsn = current_lsn(pool.leader)
for conn, region, _ in pool.replicas:
with conn.cursor() as c:
c.execute("SELECT pg_last_wal_replay_lsn()::pg_lsn - '0/0'::pg_lsn")
replica_lsn = c.fetchone()[0]
if leader_lsn - replica_lsn <= max_lag_bytes:
return conn # fresh enough
return pool.leader # fall back to leader for RYW correctness
# Usage:
# reads after a write → pick_read(pool, max_lag_bytes=0) forces leader
# analytics reads → pick_read(pool, max_lag_bytes=10 * 1_048_576)Common mistakes
Replication copies all state to replicas, including the bad state. A DROP TABLE on the leader replicates to every follower in seconds. Backups with point-in-time restore are the only defence against logical errors.
Hardware failure: replicas save you (leader dies, promote sync). Software failure: only backups save you (bad DELETE replicates in seconds). You need both — they defend against different classes of failure.
"Highly available" is a marketing word. Name the RPO (how much can we lose?) and RTO (how long are we down?) per data class. If you can't, the design hasn't been made.
Routing a read-your-writes query to an async replica returns stale data. Either send those reads to the leader, or use per-session read consistency (e.g. Postgres's replication slot + LSN comparison).
Two leaders accept conflicting writes; later they reconcile. "The last one wins" loses data silently. Either pick CRDTs, or accept that multi-leader is for workloads where conflicts are rare and resolvable (e.g. per-user state partitioned by user).
Backups you've never restored are liabilities, not assets. Schedule a monthly drill: spin up from backup, run validation, measure RTO. Discover corruption or process gaps in peacetime.
Practice drills
Your leader dies in AZ-A. Sync replica is in AZ-B, async replicas in AZ-C and DR region. What's the failover sequence?Reveal
(1) Coordinator detects leader failure (missed heartbeats > threshold). (2) Promote AZ-B sync replica — it has the last committed write (RPO=0). (3) Re-point the DNS or connection string; open writes on the new leader. (4) AZ-C async and DR async re-subscribe to the new leader. (5) Provision a new replica in AZ-A to restore the 3-replica state. RTO target < 30 s. If the sync replica is also unreachable, promote an async replica and accept RPO > 0 — note the data-loss window in the incident report.
Interviewer: "your PITR window is 7 days. Someone deletes important rows 9 days ago but nobody noticed till now. What now?"Reveal
PITR is gone — we're past the window. Fall back to the daily full snapshots (retained 30 days): restore the relevant snapshot to a side cluster, identify the affected rows, and forward-merge them into the live DB with care around causality. Lesson: the PITR window is a business decision, not a technical one. If 7 days isn't enough for your failure-detection culture, extend it — or add logical append-only archival of critical tables.
A quorum system has N=5, W=3, R=3. Three of the five nodes just went offline. Can the system accept writes?Reveal
No — W=3 and only 2 nodes are up. Writes block until at least 3 nodes are reachable. Reads with R=3 also block. This is the availability cost of strong quorum. Some systems let you relax (sloppy quorum / hinted handoff) — writes get temporarily stored elsewhere and replayed when nodes recover. Cite whether your chosen system offers that and whether you accept the consistency trade-off.
Cheat sheet
- •Replication = survive hardware failure. Backup = survive software/operator failure. Both required.
- •Always state RPO (data loss tolerance) and RTO (downtime tolerance) as numbers.
- •Sync replica = RPO 0, +1–5 ms write latency. Put one same-region.
- •Async replicas = read scale + DR. Cheap but lag is real.
- •Quorum: R + W > N ⇒ strong consistency. Tune per query.
- •PITR (point-in-time restore) > daily snapshots. Archive WAL continuously.
- •Rehearse restores monthly. An untested backup is a rumour.