intermediatereliability

Replication & durability

Leader/follower, sync vs async replication, write quorum, RPO/RTO.

Replication is how you survive a node death; durability is how you survive a bad deploy. Candidates confuse the two and end up with a design that's highly available but cheerfully corrupt.

Read this if your last attempt…

You said "we'll replicate the DB" without naming sync vs async
You can't state your RPO and RTO as numbers
You don't have a backup-and-restore story separate from replication
You confuse "highly available" with "durable"

The concept

Replication gives you more than one copy of the live data so a hardware or process failure doesn't take the service down. Durability is the guarantee that once you've acknowledged a write, it will survive any subsequent failure — including bugs, fat-fingered admins, and catastrophic deploys.

Different problems, different tools.

Architecture diagram· Leader/follower with sync + async replicas

Sync replica = no data loss on leader failure (RPO=0). Async replicas = scale reads cheaply. Pick the mix by cost tolerance.

Replication shapes — when each wins.

Leader/follower	Multi-leader	Leaderless quorum
Writes	Single leader	Many leaders	Any N of replicas
Failover	Promote follower	Continue — any leader alive	Transparent — no leader
Conflict handling	None (single writer)	App-level or CRDTs	Last-write-wins / vector clocks
Best for	OLTP, strong consistency	Multi-region active-active	AP stores at planet scale
Worst for	Cross-region writes	Strongly-consistent workloads	Relational queries
Example systems	Postgres, MySQL, MongoDB	CouchDB, Active-active MySQL	Cassandra, Dynamo, Riak

How interviewers grade this

You name RPO and RTO for each data class (e.g. "payments: RPO=0, RTO<30s; analytics: RPO=15min, RTO=1h").
You distinguish sync from async replicas and say why you chose the split.
You have a backup story separate from replication — PITR, off-region snapshots, tested restore drill.
If you mention quorum, you state the N/R/W triple and its strong-read condition (R+W>N).
You name how leader election / failover happens and its blast radius.

Variants

Synchronous replication

Leader waits for at least one replica to ack before acking the client.

Zero-data-loss on leader failure. Cost: every write pays an extra network round-trip; if all sync replicas are down, writes block. Practical shape: one sync replica in the same AZ plus N async replicas elsewhere.

Pros

+RPO = 0 for acknowledged writes
+Safe for financial / regulatory workloads
+Enables zero-downtime leader failover

Cons

−Extra write latency (~1–5 ms intra-AZ)
−If all sync replicas down, writes halt
−Cost scales linearly with sync replica count

Choose this variant when

Payments, ledgers, auth
Anywhere data loss = business loss

Asynchronous replication

Leader acks immediately; replica catches up on its own clock.

Cheap and fast — writes don't pay for replication latency. On leader failure you may lose the last N milliseconds of writes that never reached the follower. Perfect for analytics and read scaling; unsafe for "must not lose".

Pros

+Write latency unaffected
+Cheap — no cross-AZ round-trip per write
+Read replicas scale horizontally

Cons

−RPO > 0 — you will lose some recent writes on failover
−Replicas can lag arbitrarily under load
−Stale reads are a fact of life

Choose this variant when

Read scale-out
Analytics replicas
Cross-region replicas for disaster recovery

Quorum (Dynamo-style)

Write to W replicas, read from R; choose R + W > N for strong consistency.

Architecture diagram· Quorum: R + W > N guarantees strong consistency

N=5 replicas. Write waits for W=3 acks; read queries R=3 replicas and picks the latest version. 3+3>5 means every read overlaps every write by at least one replica, so stale reads are impossible.

No leader, no failover. Every node is equal; the coordinator fans out writes to N replicas and waits for W acks. Can be tuned per-query: R=1,W=N for fast reads + safe writes, or R=N,W=1 for fast writes + safe reads. R+W > N guarantees read-your-writes.

Pros

+No leader election — survives any minority loss
+Per-query consistency knob
+Scales horizontally without a single-leader bottleneck

Cons

−Conflict resolution is the app's problem
−Complex operational mental model
−Poor fit for strongly-consistent range queries

Choose this variant when

Huge write throughput with relaxed consistency
Always-writable globally-distributed stores

Worked example

Scenario: Designing the database layer for a payments service.

Data classes:

ledger — immutable double-entry records. Must be durable. RPO = 0.
accounts — current balance, updated per transaction. RPO = 0.
audit_logs — append-only event trail. RPO ≤ 1 min.
analytics_rollups — derived, rebuildable. RPO = hours (can recompute).

Replication topology for ledger + accounts (Postgres):

Leader in AZ-A.
One synchronous replica in AZ-B (same region) — protects against AZ loss without data loss. Write latency +1 ms.
Two asynchronous replicas: one in AZ-C for read scale-out, one in the DR region for multi-region disaster recovery.
RTO < 30 s with managed failover (Patroni, RDS Multi-AZ, Aurora).

Backup topology:

Continuous WAL archival to object storage. PITR to any second in the last 7 days.
Daily full snapshots retained 30 days, cross-region replicated.
Monthly restore drill: spin up a new cluster from backup, run validation queries, tear down.

Durability-vs-replication incident examples:

Pod OOMs → failover to sync replica. Zero data loss, RTO ~20 s. Replication handles it.
Deploy includes DELETE FROM accounts WHERE status='inactive' with a bug → runs on every replica. Only backups save you. PITR to 5 min before deploy, reconcile the legitimate writes between then and now.

State the separation explicitly. "Replication defends against hardware failure; backups defend against software failure. You need both."

Good vs bad answer

Interviewer probe

“How do you make the database highly available?”

Weak answer

"We'll put the DB behind a load balancer and have a few replicas. If one goes down the LB routes to another."

Strong answer

"One primary with synchronous replication to a same-region replica (RPO=0, +1 ms write latency), plus async replicas for read scale-out. Automatic failover via managed service or Patroni — RTO under 30 s. That handles HA. For durability against bad code or operator error, continuous WAL archival gives point-in-time restore for 7 days; daily full snapshots cross-region for 30 days. We run a restore drill monthly so we know the RTO isn't a lie. I don't put the DB behind a general LB — writes go only to the leader, reads split based on staleness tolerance."

Why it wins: Names sync vs async replication, quantifies RPO/RTO, separates replication (HA) from backups (durability), and calls out the operational discipline of tested restores. Also corrects the LB misconception — DB writes need leader awareness, not round-robin.

Interview playbook2 minutes in HLD when database layer is introduced; additional 2-3 minutes in deep-dive on durability, failover, or multi-region.

When it comes up

Whenever "high availability", "uptime", or "99.9%" enters the design
When the interviewer asks "what happens when this database fails?"
In multi-region design discussions — replication topology is the hardest question
When you propose a stateful component (DB, cache, queue) — how is it replicated?
When durability, backups, or disaster recovery is probed

Order of reveal

1
Separate replication from durability. "Replication defends against hardware failure; backups defend against software failure. Two different problems, two different tools, both required."
2
State RPO and RTO as numbers. "For payments: RPO = 0, RTO < 30 seconds. For analytics: RPO = 15 minutes, RTO = 1 hour. Different classes of data, different tolerances."
3
Name the replication topology. "One leader, one synchronous replica same-region for zero data loss, two async replicas for read scaling and DR. Primary is Postgres; failover is managed by Patroni or the cloud."
4
Call out the sync replica tradeoff explicitly. "Sync replication adds ~1 ms write latency but guarantees RPO=0. If all sync replicas are down, writes block — a feature, not a bug, for financial data."
5
Describe the backup story. "Continuous WAL archival to S3 gives PITR for 7 days. Daily snapshots cross-region for 30 days. Monthly tested restore drill so the RTO isn't a rumour."
6
Address read routing. "Writes go to the leader only. Read-your-writes queries go to the leader. Stale-tolerant reads (analytics, history) go to async replicas. Session-sticky reads use replication LSN tracking."

Signature phrases

“Replication survives hardware failure; backups survive operator failure”

“RPO and RTO are numbers, not adjectives”

“Sync same-region, async cross-region”

“An untested backup is a rumour”

“Writes to the leader, reads split by staleness tolerance”

“R + W > N for strong quorum consistency”

“Replication survives hardware failure; backups survive operator failure” — The single most important distinction — reveals maturity instantly.
“RPO and RTO are numbers, not adjectives” — Forces concrete commitments instead of "highly available".
“Sync same-region, async cross-region” — The standard senior topology in one phrase.
“An untested backup is a rumour” — Memorable operational discipline.
“Writes to the leader, reads split by staleness tolerance” — Correct mental model for routing.
“R + W > N for strong quorum consistency” — Precise formula, instantly signals Dynamo-family knowledge.

Likely follow-ups

?“Walk me through failover when the leader dies. What's the timeline, and what can go wrong?”Reveal

Timeline (happy path, managed failover):

1t=0 Leader crashes or becomes unreachable.
2t=0 to t=5s Coordinator (Patroni, RDS, Aurora controller) detects missed heartbeats. Detection threshold is usually 3-5 seconds.
3t=5s to t=15s Coordinator fences the old leader (prevent it from accepting writes if it comes back — STONITH or lease revocation) and elects a new leader from the sync replicas. The sync replica with the highest LSN (last committed log position) wins.
4t=15s to t=25s Promote the chosen replica: replay any in-flight transactions, open for writes.
5t=25s to t=30s Update the DNS record or connection endpoint. Clients reconnect.

Total RTO: ~20-30 seconds for a well-tuned setup.

Architecture diagram· Failover timeline — detect, fence, promote, redirect

Coordinator detects missed heartbeats, fences the old leader to prevent split-brain, promotes the sync replica with the highest LSN, then swaps the connection endpoint. RTO: ~20-30s for a well-tuned setup.

What goes wrong:

1Split-brain — old leader comes back before fencing completes, accepts writes. Now two leaders, diverging state. Fix: fencing must be synchronous before promotion, not parallel.

1All sync replicas also down — you can't promote without either accepting data loss or waiting for a sync replica to return. The choice: downtime or RPO > 0. Document which your runbook prefers.

1Promotion fails — the chosen replica can't replay its log (disk corruption, schema mismatch). Fall back to the next candidate. RTO balloons.

1Clients cache old DNS — some clients cache DNS beyond the TTL. A 60-second TTL means a long tail of clients still hitting the dead leader for minutes. Use an endpoint with short TTL (30-60 s) or a connection manager that handles failover.

1Connection pool thundering herd — 1000 app instances reconnect simultaneously on the new leader, overwhelming it. Stagger reconnects with jittered backoff; warm connection pools.

The production rule: rehearse failover monthly in a non-prod environment. Measure your RTO. If it's not < your SLO, tune or change topology.

?“Someone runs `UPDATE users SET email=NULL` on the primary. Replication completes in seconds. Now what?”Reveal

The bad news: replication did its job perfectly. The damage is propagated to every replica. Neither sync nor async replicas help here — they all faithfully replayed the destructive write.

The recovery, in order of preference:

1. Point-in-time restore (PITR) — the right answer.

Target: the moment just before the bad command (say, 14:32:17).
Restore the DB to a side instance at that timestamp (most managed services: ~10-30 minutes for a TB-scale DB).
Extract the email column for affected users.
Forward-merge back into the production DB, preserving any legitimate updates that happened between 14:32:17 and now (use updated_at to tie-break).
Total recovery: 30-60 minutes depending on data size and merge complexity.

2. If PITR window has expired: fall back to the latest daily snapshot (say 24 hours old). Restore, extract, merge. You lose legitimate updates between the snapshot and the bad command; reconcile from audit logs if you have them.

3. If no backups at all: reconstruction from application logs, event streams, or external systems (identity provider, CRM). This is the "we learned why backups matter" scenario.

The harder question — prevention:

Migration review — no DML on prod without a reviewed plan, tested on staging.
Logical backups of critical columns — append-only snapshot of email/PII columns daily, retained separately.
Audit log of DDL/DML — who ran what, when. Required for forensics.
Slow-down rules — UPDATE without a WHERE clause blocked by default in prod. DB-level safeguards like the pg_dangerous pattern.

The interview takeaway: replication and durability defend against different failures. An answer that leans only on replication misses the most common real-world data loss scenario.

?“You have N=5, W=3, R=3 in a Dynamo-style store. Three nodes go offline. Can you read? Can you write?”Reveal

Writes: blocked. W=3 requires 3 ack'd writes; only 2 nodes are reachable. Clients see write timeouts or errors.

Reads: also blocked with strict quorum. R=3 requires 3 read responses; only 2 are reachable.

Is R + W > N still satisfied? Yes (3+3>5), so the guarantee holds — but only when you have enough nodes to form a quorum at all.

What options the system has:

1. Sloppy quorum (Dynamo, Cassandra). The coordinator writes to any 3 available nodes (even if they're not the "correct" replicas). Hinted handoff stores the write with a note to replay it to the canonical replicas when they return. Writes continue; you've traded strict quorum for availability.

Cost: reads may not see recent sloppy-quorum writes until the hints are replayed. Read-your-writes becomes eventual.
When: Cassandra's default behavior. Good for AP workloads.

2. Strict quorum — writes block, system is unavailable for writes. This is what Spanner-style systems do. Maintains strong consistency even under failure; trades availability.

When: systems where losing a quorum should halt writes (financial ledgers, consensus-backed state).

3. Quorum adjustment. If you anticipate this failure shape, you can tune: N=5, W=2, R=4 gives you write-availability under 3-node failure, strong reads require 4 (rare during incidents — probably unavailable too).

The key insight the interviewer wants: quorum is a tradeoff between consistency (higher R+W) and availability (lower W or R). No setting survives "most of the cluster is down" gracefully; you must pick which property degrades first.

?“How do you handle cross-region replication? What are the real latency and consistency costs?”Reveal

Cost 1: write latency. Cross-region RTT is typically 60-150 ms. Synchronous replication across regions would turn every write into a cross-continental round trip.

Implication: sync replication is region-local. Cross-region replication is always async.
Recovery consequence: on total region loss, RPO > 0 — typically seconds to minutes of unreplicated writes are lost.

Cost 2: stale reads cross-region. An async replica in another region lags by the network latency + replication pipeline — typically hundreds of milliseconds to seconds under normal load, and can grow to minutes under replication lag.

Implication: reads from the DR region may see data that's behind. Read-your-writes across regions is not free.

Cost 3: split-brain risk. If the primary region is partitioned from the DR region, both sides may think the other is dead and start accepting writes. Requires a coordinator (zookeeper, etcd) outside both regions or a human-in-the-loop for failover.

Common topologies:

1. Active-passive (single-writer). One region is the primary; others are async read-only replicas. Simple, predictable. RPO > 0 on failover. Users globally still write to the primary — cross-region write latency is unavoidable.

2. Active-active with partitioned writes. Each region is primary for its own users (e.g. US users write to us-east, EU users write to eu-west). Async replication between regions makes the other region's data queryable but not writable.

3. Active-active multi-leader. All regions accept all writes. Conflict resolution required (CRDTs, LWW, app-level). Cassandra, DynamoDB Global Tables, CockroachDB offer this. Complex.

4. Paxos/Raft across regions. Spanner-style. Each shard's leaders are in 3+ regions; commit requires a majority. Write latency ~100 ms (cross-region RTT). Strong consistency, high latency. Google Spanner is the reference.

The honest answer in an interview: cross-region replication is always async; if the prompt requires strongly consistent multi-region writes, cost it at ~100 ms write latency minimum and accept that regional failover loses data.

?“Your read replica is lagging 30 seconds behind the leader. What should you do?”Reveal

First — is this normal or abnormal?

Normal lag causes (don't panic, but monitor):

Write burst on the leader temporarily exceeding the replica's apply rate.
Long-running query on the replica holding locks (for systems where replicas can serve mixed workloads).
Replica hardware weaker than the leader.

Abnormal causes (investigate immediately):

Replica falling permanently behind — apply rate < commit rate. Replica will never catch up without intervention.
Network partition between leader and replica.
Corrupt WAL entry stopping replication.

Responses in order:

1. Route traffic away from the lagging replica. Either the LB / connection manager detects lag > threshold and stops routing, or you manually remove it from the read pool. Any read-your-writes or staleness-sensitive query going to this replica is wrong.

2. Diagnose the lag cause.

Check pg_stat_replication / equivalent — is it streaming (good), catchup (recovering from a gap), or stuck?
Is the replica CPU-bound? Disk I/O bound? Network saturated?
Is a long-running query holding a lock?

3. Mitigate the symptom.

Kill long-running read queries that are blocking apply.
Increase replica resources if it's structurally underpowered.
Temporarily reduce leader write rate (throttle batch jobs) if the replica is legitimately overwhelmed.

4. Decide whether to wait or rebuild.

If the replica is behind by minutes and catching up, wait.
If it's permanently behind (apply rate < commit rate), rebuild: either take a fresh base backup + start replication from there, or spin up a new replica and retire the lagging one.

5. Update alerting.

Alert on replication lag > threshold (e.g., 5 seconds warning, 30 seconds page).
Alert on apply-rate vs commit-rate divergence — catches the "falling behind" case before it becomes visible lag.

The broader lesson: async replicas have inherent lag. Design for it: route stale-tolerant reads to replicas, send RYW queries to the leader, and monitor lag as a first-class metric.

Code examples

yamlPostgres + Patroni — synchronous commit with automatic failover

# patroni.yml
scope: payments
name: pg-node-a

restapi:
  listen: 0.0.0.0:8008
  connect_address: pg-node-a:8008

etcd:
  hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576   # 1 MB — cap promoted replica lag
    synchronous_mode: true              # require sync replica before acking
    synchronous_mode_strict: true       # refuse writes if no sync replica
    postgresql:
      use_pg_rewind: true
      parameters:
        wal_level: replica
        synchronous_commit: on
        synchronous_standby_names: 'ANY 1 (pg-node-b, pg-node-c)'
        archive_mode: on
        archive_command: 'wal-g wal-push %p'   # continuous WAL to S3 (PITR)
        max_wal_senders: 10

# Monthly restore drill verifies the RTO isn't a rumour.
# wal-g backup-push /var/lib/postgresql/data   → daily full snapshots
# wal-g backup-fetch /tmp/restore LATEST        → tested recovery path

pythonRoute reads by staleness tolerance using replication LSN

import psycopg2
from dataclasses import dataclass

@dataclass
class PgPool:
    leader: psycopg2.extensions.connection
    replicas: list                    # [(conn, region, last_seen_lsn)]

def current_lsn(conn) -> int:
    with conn.cursor() as c:
        c.execute("SELECT pg_current_wal_lsn()::pg_lsn - '0/0'::pg_lsn")
        return c.fetchone()[0]

def pick_read(pool: PgPool, max_lag_bytes: int = 1_048_576):
    """Return a connection whose replication lag fits the staleness budget."""
    leader_lsn = current_lsn(pool.leader)
    for conn, region, _ in pool.replicas:
        with conn.cursor() as c:
            c.execute("SELECT pg_last_wal_replay_lsn()::pg_lsn - '0/0'::pg_lsn")
            replica_lsn = c.fetchone()[0]
        if leader_lsn - replica_lsn <= max_lag_bytes:
            return conn          # fresh enough
    return pool.leader           # fall back to leader for RYW correctness

# Usage:
#   reads after a write → pick_read(pool, max_lag_bytes=0) forces leader
#   analytics reads     → pick_read(pool, max_lag_bytes=10 * 1_048_576)

Common mistakes

Confusing replication with backup

Replication copies all state to replicas, including the bad state. A DROP TABLE on the leader replicates to every follower in seconds. Backups with point-in-time restore are the only defence against logical errors.

Architecture diagram· Replication vs backup — different failures, different tools

Hardware failure: replicas save you (leader dies, promote sync). Software failure: only backups save you (bad DELETE replicates in seconds). You need both — they defend against different classes of failure.

Stating "HA" without numbers

"Highly available" is a marketing word. Name the RPO (how much can we lose?) and RTO (how long are we down?) per data class. If you can't, the design hasn't been made.

Async replicas treated as strongly consistent

Routing a read-your-writes query to an async replica returns stale data. Either send those reads to the leader, or use per-session read consistency (e.g. Postgres's replication slot + LSN comparison).

Multi-leader without a conflict planAdvanced

Two leaders accept conflicting writes; later they reconcile. "The last one wins" loses data silently. Either pick CRDTs, or accept that multi-leader is for workloads where conflicts are rare and resolvable (e.g. per-user state partitioned by user).

Never rehearsing restoresAdvanced

Backups you've never restored are liabilities, not assets. Schedule a monthly drill: spin up from backup, run validation, measure RTO. Discover corruption or process gaps in peacetime.

Practice drills

Your leader dies in AZ-A. Sync replica is in AZ-B, async replicas in AZ-C and DR region. What's the failover sequence?Reveal

(1) Coordinator detects leader failure (missed heartbeats > threshold). (2) Promote AZ-B sync replica — it has the last committed write (RPO=0). (3) Re-point the DNS or connection string; open writes on the new leader. (4) AZ-C async and DR async re-subscribe to the new leader. (5) Provision a new replica in AZ-A to restore the 3-replica state. RTO target < 30 s. If the sync replica is also unreachable, promote an async replica and accept RPO > 0 — note the data-loss window in the incident report.

Interviewer: "your PITR window is 7 days. Someone deletes important rows 9 days ago but nobody noticed till now. What now?"Reveal

PITR is gone — we're past the window. Fall back to the daily full snapshots (retained 30 days): restore the relevant snapshot to a side cluster, identify the affected rows, and forward-merge them into the live DB with care around causality. Lesson: the PITR window is a business decision, not a technical one. If 7 days isn't enough for your failure-detection culture, extend it — or add logical append-only archival of critical tables.

A quorum system has N=5, W=3, R=3. Three of the five nodes just went offline. Can the system accept writes?Reveal

No — W=3 and only 2 nodes are up. Writes block until at least 3 nodes are reachable. Reads with R=3 also block. This is the availability cost of strong quorum. Some systems let you relax (sloppy quorum / hinted handoff) — writes get temporarily stored elsewhere and replayed when nodes recover. Cite whether your chosen system offers that and whether you accept the consistency trade-off.

Cheat sheet

•Replication = survive hardware failure. Backup = survive software/operator failure. Both required.
•Always state RPO (data loss tolerance) and RTO (downtime tolerance) as numbers.
•Sync replica = RPO 0, +1–5 ms write latency. Put one same-region.
•Async replicas = read scale + DR. Cheap but lag is real.
•Quorum: R + W > N ⇒ strong consistency. Tune per query.
•PITR (point-in-time restore) > daily snapshots. Archive WAL continuously.
•Rehearse restores monthly. An untested backup is a rumour.

8% complete

Current

Read this if

Step 1 of 13

The concept

Jump to next

Leader/follower

Multi-leader

Leaderless quorum

Writes

Single leader

Many leaders

Any N of replicas

Failover

Promote follower

Continue — any leader alive

Transparent — no leader

Conflict handling

None (single writer)

App-level or CRDTs

Last-write-wins / vector clocks

Best for

OLTP, strong consistency

Multi-region active-active

AP stores at planet scale

Worst for

Cross-region writes

Strongly-consistent workloads

Relational queries

Example systems

Postgres, MySQL, MongoDB

CouchDB, Active-active MySQL

Cassandra, Dynamo, Riak

# patroni.yml scope: payments name: pg-node-a restapi: listen: 0.0.0.0:8008 connect_address: pg-node-a:8008 etcd: hosts: etcd-1:2379,etcd-2:2379,etcd-3:2379 bootstrap: dcs: ttl: 30 loop_wait: 10 retry_timeout: 10 maximum_lag_on_failover: 1048576 # 1 MB — cap promoted replica lag synchronous_mode: true # require sync replica before acking synchronous_mode_strict: true # refuse writes if no sync replica postgresql: use_pg_rewind: true parameters: wal_level: replica synchronous_commit: on synchronous_standby_names: 'ANY 1 (pg-node-b, pg-node-c)' archive_mode: on archive_command: 'wal-g wal-push %p' # continuous WAL to S3 (PITR) max_wal_senders: 10 # Monthly restore drill verifies the RTO isn't a rumour. # wal-g backup-push /var/lib/postgresql/data → daily full snapshots # wal-g backup-fetch /tmp/restore LATEST → tested recovery path

import psycopg2 from dataclasses import dataclass @dataclass class PgPool: leader: psycopg2.extensions.connection replicas: list # [(conn, region, last_seen_lsn)] def current_lsn(conn) -> int: with conn.cursor() as c: c.execute("SELECT pg_current_wal_lsn()::pg_lsn - '0/0'::pg_lsn") return c.fetchone()[0] def pick_read(pool: PgPool, max_lag_bytes: int = 1_048_576): """Return a connection whose replication lag fits the staleness budget.""" leader_lsn = current_lsn(pool.leader) for conn, region, _ in pool.replicas: with conn.cursor() as c: c.execute("SELECT pg_last_wal_replay_lsn()::pg_lsn - '0/0'::pg_lsn") replica_lsn = c.fetchone()[0] if leader_lsn - replica_lsn <= max_lag_bytes: return conn # fresh enough return pool.leader # fall back to leader for RYW correctness # Usage: # reads after a write → pick_read(pool, max_lag_bytes=0) forces leader # analytics reads → pick_read(pool, max_lag_bytes=10 * 1_048_576)