Reading path · 6 stops · ~95 min

Weak on reliability

Every senior round pressure-tests "what fails". This path builds a per-component failure vocabulary and the HA playbook to match.

For: Engineers whose feedback cites "hand-wavy on failure" or "no DR story"

After this path

Name a failure mode for each component, a mitigation for each, and an availability target + topology that matches.

1
Skill
Failure mode analysis
Systems don't fail because you didn't think they could. They fail in the ways you failed to think about. Failure-mode analysis is structured paranoia, and interviewers grade on whether you can produce it on demand.
Why this, here: The framework. "What if this dies?" per component.
2
Skill
Replication & durability
Replication is how you survive a node death; durability (backups, point-in-time recovery) is how you survive a bad deploy, because replicas faithfully copy a bad write too. Candidates confuse the two and end up with a design that's highly available but cheerfully corrupt.
Why this, here: Quorum math — what actually survives N failures.
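The quorum math can be sketched in a few lines. This is a minimal illustration, assuming a majority-quorum cluster (as in Raft) and, for the read/write case, Dynamo-style tunable quorums; the function names are hypothetical and not taken from any library.

```python
def majority(n: int) -> int:
    """Smallest group size that is a strict majority of n nodes."""
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Node failures an n-node majority-quorum cluster survives while still making progress."""
    return n - majority(n)


def quorums_overlap(n: int, r: int, w: int) -> bool:
    """True when every read quorum shares at least one node with every write quorum (R + W > N)."""
    return r + w > n


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"n={n}: majority={majority(n)}, survives {tolerated_failures(n)} failure(s)")
```

Note that a 4-node cluster survives the same single failure as a 3-node cluster, which is why odd cluster sizes are the usual recommendation.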
3
Skill
Consensus & leader election
Raft and Paxos aren't trivia; they're the reason your leader-election design either works or deadlocks. Most interview failures here sound like "we'll elect a leader somehow", with no quorum story and no fencing.
Why this, here: When exactly-one-of-N must do the thing. Raft / etcd, not DIY.
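Fencing is the part that most often goes unexplained, so here is a minimal sketch of the idea. It assumes the election service hands each new leader a strictly increasing token (a Raft term number would serve) and that the downstream resource checks it; `FencedStore` and its methods are hypothetical names for illustration, not a real API.

```python
class FencedStore:
    """Hypothetical downstream resource that rejects writes from stale leaders."""

    def __init__(self) -> None:
        self.highest_token = 0  # largest fencing token seen so far
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen means a newer leader exists,
        # so the caller is a deposed leader that has not noticed yet.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


if __name__ == "__main__":
    store = FencedStore()
    # The old leader (token 33) stalls; a new leader is elected with token 34.
    print(store.write(34, "job", "run by new leader"))  # accepted
    # The old leader wakes up still believing it holds the lease.
    print(store.write(33, "job", "run by old leader"))  # rejected
```

The design point is that the check lives in the resource being protected, not in the leader: a paused or partitioned leader cannot be trusted to know it has been replaced.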
Checkpoint
Rehearsal: for a system that needs exactly-one-of-N to do a scheduled job, which primitive do you reach for and why? If your answer is “I’ll elect a leader somehow”, re-read — the name is the point.
4
Pattern
High Availability
Redundancy + graceful degradation + operational discipline. You don't buy 99.99% — you earn it.
Why this, here: Redundancy + degradation + discipline — the three pillars.
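To make the "earn it" point concrete, the arithmetic behind an availability target can be sketched as follows. This is a back-of-the-envelope model that assumes component failures are independent, which real systems often violate (shared racks, shared deploys); the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def downtime_minutes_per_year(availability: float) -> float:
    """Downtime allowed per year at a given availability target."""
    return (1 - availability) * MINUTES_PER_YEAR


def serial(*parts: float) -> float:
    """Availability of a request path where every component must be up."""
    result = 1.0
    for a in parts:
        result *= a
    return result


def redundant(a: float, copies: int) -> float:
    """Availability when at least one of `copies` independent replicas must be up."""
    return 1 - (1 - a) ** copies


if __name__ == "__main__":
    print(f"99.99% allows {downtime_minutes_per_year(0.9999):.1f} min/year of downtime")
    print(f"three 99.9% components in series: {serial(0.999, 0.999, 0.999):.5f}")
    print(f"two redundant 99% replicas: {redundant(0.99, 2):.4f}")
```

At 99.99% the budget is roughly 52.6 minutes a year, and chaining dependencies in series always lowers the total, so each hop on the request path spends some of that budget.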
5
Pattern
Multi-region active-passive / active-active
Geographic distribution for latency, DR, and compliance. Active-passive is operationally sane; active-active is a conflict-resolution project.
Why this, here: Active-passive vs active-active. RTO/RPO framing.
Checkpoint
Pick one: active-active or active-passive for a global URL shortener. Name the RTO, the RPO, and the one thing that breaks first when a region dies.
6
Skill
Observability & operations
You cannot operate what you cannot see; you cannot page on what you cannot measure. Candidates who design beautiful systems with no metrics, no logs, and no alerts are designing systems their on-call team will hate.
Why this, here: You can't fix what you can't see. SLOs, error budgets, the four golden signals.
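An error budget is just the SLO turned around, and a burn rate says how fast it is being spent. The sketch below assumes a request-based SLO (good requests divided by total requests); the function names and example numbers are made up for illustration.

```python
def error_budget(slo: float, total_requests: int) -> float:
    """Number of requests allowed to fail in the window without breaching the SLO."""
    return (1 - slo) * total_requests


def burn_rate(slo: float, failed: int, total: int) -> float:
    """Observed error rate divided by the allowed error rate.

    1.0 means the budget runs out exactly at the end of the SLO window;
    anything higher means it runs out early.
    """
    if total == 0:
        return 0.0
    return (failed / total) / (1 - slo)


if __name__ == "__main__":
    slo = 0.999
    print(f"budget over 1M requests: {error_budget(slo, 1_000_000):.0f} failures")
    print(f"burn rate at 50 failures in 10k requests: {burn_rate(slo, 50, 10_000):.1f}x")
```

Paging on burn rate rather than on raw error counts ties the alert to the SLO: a sustained 5x burn exhausts a 30-day budget in about 6 days.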
