High Availability
When to reach for this
Reach for this when…
- Availability target >= 99.95% (about 4.4 hours of downtime per year or less)
- Revenue or safety impact from any downtime
- Multiple failure modes to defend against simultaneously
- Business expects the system to survive a full AZ outage
- Payment, auth, or healthcare systems where downtime = liability
Not really this pattern when…
- Internal tools where 99% uptime is acceptable
- Batch workloads with retry-all-night semantics
- Early-stage product where simplicity > availability
- Stateless preview environments torn down every night
Good vs bad answer
Interviewer probe
“Your SLO is 99.99%. How do you achieve that?”
Weak answer
"We have redundant servers and a load balancer. We also back up the database."
Strong answer
"99.99% is 52 minutes of downtime per year, so our MTTR must be under 5 minutes — which means automated detection and failover, not human SSH.
Topology: multi-AZ within a region. Dual load balancers (managed, multi-AZ). N+2 stateless app instances across 3 AZs. Quorum-replicated PostgreSQL cluster (Patroni) with synchronous replication within region — leader election completes in <10 seconds.
Every dependency has three things: a timeout (tuned to its latency budget), a circuit breaker (50% failure rate over 30s opens for 60s), and a documented degradation path. Search down shows 'search unavailable.' Cache down falls back to DB with higher latency. Analytics down buffers to local disk.
Blast-radius containment: bulkhead thread pools per dependency. A slow recommendation engine cannot consume threads needed for checkout.
Operational discipline: weekly game days exercise one failure mode (instance kill, AZ blackout, dependency latency injection). Quarterly full-failover drill in production. Error budget tracked monthly — 4.38 minutes/month. When budget is burned, feature freeze until reliability improves.
Control plane is itself multi-AZ: etcd cluster for service discovery, Vault for secrets, both 3-node Raft across AZs."
Why it wins: Names specific techniques at every layer, calls out control-plane and MTTR as the real limiters, treats HA as ongoing discipline (game days, error budgets) not one-time setup.
Cheat sheet
- Pick a target number. Each nine costs ~10x.
- Serial dependencies multiply availabilities, so the chain gets weaker. Parallel redundancy multiplies failure probabilities, so the tier gets stronger.
- Redundancy at every layer: LB, app, data, cache, control plane.
- Multi-AZ for 99.95%. Multi-region for 99.99%+.
- Every dependency: timeout + circuit breaker + degradation path.
- Liveness = process alive (simple). Readiness = can serve traffic (check deps).
- Bulkheads isolate failure. Circuit breakers prevent cascading.
- Cell architecture: independent cells, blast radius = one cell.
- Untested failover = no failover. Drill quarterly.
- Error budgets turn availability into a negotiable resource.
- Graceful degradation: shed features, keep the core.
- Chaos engineering: break it on purpose before it breaks by accident.
Core concept
What high availability really means
High availability (HA) is not a feature you toggle on — it is the disciplined composition of redundancy, failure detection, graceful degradation, and blast-radius containment. A system that merely "has two servers" is not highly available; it is slightly less fragile.
Two LBs front N app servers spread across AZs. Primary DB with synchronous replicas and automatic failover. Health checks at every hop.
Availability math — the nines
Availability is expressed as a percentage of uptime per year:
| Target | Downtime / year | Downtime / month | Practical meaning |
|---|---|---|---|
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools |
| 99.9% ("three nines") | 8.76 hours | 43.8 min | Standard SaaS |
| 99.95% | 4.38 hours | 21.9 min | Business-critical |
| 99.99% ("four nines") | 52.6 min | 4.38 min | Payments, auth |
| 99.999% ("five nines") | 5.26 min | 26.3 sec | Telecom, 911 |
Each additional nine costs roughly 10x more in infrastructure and operational discipline. The gap between 99.9% and 99.99% is not one server — it is automation, multi-AZ, < 5-minute MTTR, and tested failover.
Serial composition: when a request needs every dependency in a chain, the chain is up only when all of them are up, so their availabilities multiply. Three 99.9% services in series cap you at 99.9%^3 = 99.7%. Every synchronous dependency lowers the ceiling. To improve: remove dependencies from the critical path, make non-critical calls optional or asynchronous, add caching short-circuits, or accept the math and set your SLO below the chain.
Parallel composition: with redundant instances, the tier is down only when every instance is down, so their failure probabilities multiply. Two 99.9% nodes in parallel yield 1 - (0.001 x 0.001) = 99.9999%, assuming the nodes fail independently. This is the engine behind HA: put two of everything and the combined availability soars.
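The two composition rules above are easy to check numerically. The following is a minimal sketch, assuming independent failures; the helper names (`serial`, `parallel`, `downtime_minutes_per_year`) are illustrative and not taken from any library.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; ignores leap years


def serial(availabilities):
    """Every component must be up, so availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one component suffices, so failure probabilities multiply."""
    return 1 - prod(1 - a for a in availabilities)


def downtime_minutes_per_year(availability):
    """Convert an availability fraction into allowed downtime."""
    return (1 - availability) * MINUTES_PER_YEAR


chain = serial([0.999, 0.999, 0.999])   # three 99.9% services in a chain
pair = parallel([0.999, 0.999])         # two 99.9% nodes, either can serve
print(f"three in series: {chain:.4%}")
print(f"two in parallel: {pair:.4%}")
print(f"99.99% allows {downtime_minutes_per_year(0.9999):.1f} min/year")
```

Real components rarely fail fully independently (shared AZ, shared deploy pipeline), so the parallel figure is best read as an upper bound.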
Redundancy at every layer
HA requires eliminating every single point of failure (SPOF):
- Load balancers: dual LBs with health-checked failover (managed cloud LBs such as AWS NLB can span multiple AZs once enabled in each zone; on-prem use VRRP/keepalived pairs).
- App tier: N+2 instances across AZs. Stateless so any instance can serve any request. Session data in Redis or a cookie, never pinned to a process.
- Data tier: quorum-replicated primaries (3-node Raft/Paxos clusters), synchronous or semi-synchronous replicas across AZs. Automated leader election.
- Cache tier: Redis Sentinel or Cluster with automatic failover. Cache loss is a performance event, not a correctness event — the system must survive a cold cache.
- Control plane: DNS, config store (Consul, etcd), service discovery — these must be multi-AZ too. A single-AZ control plane is a hidden SPOF that kills the entire fleet.
Failure detection — health checks
Your system can only fail over as fast as it can detect failure. Two kinds of probes:
- Liveness: "is the process alive?" Checks that the HTTP server responds to /healthz. Failure = restart the container.
- Readiness: "can this instance serve traffic?" Checks downstream dependencies (DB reachable, cache warm, recent deployment healthy). Failure = remove from LB pool but don't kill.
Health-check intervals, thresholds, and timeouts must be tuned: too aggressive = flapping; too slow = long detection time = long outage.
Graceful degradation
When a non-critical dependency fails, the system should continue with reduced functionality rather than failing entirely. Build a degradation matrix mapping every dependency to what the user sees when it is down:
- Search down → show "search temporarily unavailable," keep the rest of the site running.
- Recommendation engine down → show popular/trending instead of personalised.
- Analytics pipeline down → buffer events to local disk, backfill when recovered.
- Payment processor slow → fail the checkout with a clear retry message; never hang.
This matrix must be designed in advance, not improvised during an incident.
Blast-radius containment
Even with redundancy, a correlated failure (bad deployment, poison message, cascade) can take down everything. Containment strategies:
- Cell architecture: partition traffic into independent cells. Each cell has its own app + DB + cache. A failure in Cell 1 affects only its shard; Cells 2-N continue. AWS uses this for Route 53, S3, and DynamoDB.
- Bulkheads: isolate thread pools / connection pools per dependency. Service A failing exhausts only its pool; Service B continues on its own pool.
- Circuit breakers: after N failures to a dependency in window W, the breaker opens and fails fast for cooldown period C. Half-open state probes recovery. Prevents cascading failures.
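Of the containment strategies above, the bulkhead is the simplest to sketch. The version below caps concurrent calls per dependency with a semaphore and rejects immediately when the cap is reached. The class name, the exception, and the pool sizes are assumptions for illustration only.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's pool has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject at once instead of queueing, so a
        # slow dependency cannot tie up the caller's threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency; the sizes here are hypothetical.
recommendations = Bulkhead("recommendations", max_concurrent=2)
checkout = Bulkhead("checkout", max_concurrent=10)
```

If the recommendation calls hang, at most two threads are stuck in them; checkout keeps its own ten slots regardless.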
Chaos engineering — trust but verify
Untested HA is wishful thinking. Chaos engineering proactively injects failures to validate that redundancy, failover, and degradation work as designed:
- Instance kill: terminate a random production instance. Does the LB route around it? How fast?
- AZ failure: block all traffic to one AZ. Does the app survive? Does the DB failover?
- Dependency failure: inject latency or errors into a downstream call. Does the circuit breaker open? Does graceful degradation activate?
- Clock skew / network partition: simulate split-brain. Does the quorum hold?
Netflix Chaos Monkey pioneered this; AWS Fault Injection Simulator and Gremlin are the modern tools. The goal is to find weaknesses in staging before customers find them in production.
Canonical examples
- Payment processing (Stripe, Square)
- Authentication / identity services
- Healthcare electronic records
- Emergency dispatch / 911 routing
- Stock exchange order matching
Variants
Single server (no HA)
One server, one DB. Any failure = total outage. Where every system starts.
One server, one database. Any single failure causes total outage. Simplest possible deployment.
The simplest deployment: a single application server talking to a single database. There are no load balancers, no replicas, no health checks — every component is a single point of failure.
When this is fine: prototypes, internal tools with < 100 users, hobby projects, anything where "restart the server" is an acceptable incident response.
Why it breaks: a server reboot, a bad deployment, a DB disk full, a network blip — any single event causes total downtime. There is no automatic recovery. MTTR is "however long it takes a human to SSH in and fix it," which at 3 AM is measured in hours.
Availability math: if the server is 99.5% and the DB is 99.5%, combined availability is 99.5% x 99.5% = 99.0%, or about 3.65 days of downtime per year. Acceptable for a team wiki; disqualifying for anything customer-facing.
Upgrading from here: the first step is always separating the app from the DB (so you can restart one without the other) and adding a health-checked load balancer in front of at least two app instances. This moves you to the redundant app tier variant.
Pros
- Simplest to build and operate
- Lowest cost
- No distributed-systems complexity
Cons
- Any single failure = total outage
- MTTR depends entirely on human response time
- Cannot survive deployments without downtime
Choose this variant when
- Prototype or MVP
- Internal tool with < 100 users
- Budget for infrastructure is near zero
Redundant app tier
LB → N app servers → primary DB + read replica. App tier survives instance loss; DB is still SPOF.
Load balancer fans out to N app servers. DB is still a single point of failure but app tier survives instance loss.
Add a load balancer and multiple app instances. The app tier is now redundant — losing one instance does not cause an outage. The DB has a read replica for read scaling and a warm standby for manual failover.
What changes: the load balancer performs health checks every 10-30 seconds. An unhealthy instance is removed from the pool once it fails the configured number of consecutive checks, so detection takes roughly the check interval multiplied by the failure threshold. The remaining instances absorb the traffic. Deployments can use rolling updates with zero downtime.
What remains fragile: the database. The primary is still a SPOF. If it crashes, the replica can be promoted manually, but this takes 5-15 minutes and may lose the last few seconds of writes (async replication lag). The load balancer itself may be a SPOF if using a single instance (use a managed LB or VRRP pair).
Availability math: app tier with 3 instances at 99.5% each in parallel: 1 - (0.005)^3 = 99.9999875%. But the DB at 99.5% caps the system at 99.5%. The weakest link dominates.
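The weakest-link effect can be reproduced in a few lines. This is a sketch using the figures quoted above and assuming independent instance failures; the function name is illustrative.

```python
def parallel(availability, instances):
    """Availability of N independent, interchangeable instances."""
    return 1 - (1 - availability) ** instances


app_tier = parallel(0.995, instances=3)  # redundant app servers
database = 0.995                         # single primary, no automatic failover
system = app_tier * database             # a request needs both tiers

print(f"app tier: {app_tier:.7%}")
print(f"system:   {system:.4%}")
```

However many app instances are added, `system` can never exceed `database`, which is why the next upgrade targets the data tier.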
Upgrading from here: add synchronous replication and automatic DB failover (PostgreSQL Patroni, MySQL Group Replication, or managed RDS Multi-AZ). Add a second load balancer. This moves you to the fully redundant variant.
Pros
- App tier survives instance failures
- Zero-downtime deployments via rolling update
- Read replicas for read scaling
Cons
- DB primary is still SPOF
- LB may be single point of failure
- Manual DB failover takes minutes
Choose this variant when
- Standard SaaS with 99.9% target
- Team wants zero-downtime deploys
- Read-heavy workload benefiting from replicas
Fully redundant
Dual LBs, N+2 apps, quorum DB, cache cluster. No SPOF anywhere.
No single point of failure. Dual LBs, N+2 app instances, quorum-replicated DB across AZs, cache cluster with failover.
Every layer has redundancy. Dual load balancers (active-passive or active-active). N+2 app instances across three AZs. Quorum-replicated database cluster with automatic leader election. Redis Sentinel or Cluster for cache failover.
Load balancers: managed cloud LBs (AWS ALB/NLB, GCP GLB) are built to span multiple AZs; confirm that every zone you rely on is actually enabled. On-prem, use a VRRP pair with keepalived: one active, one standby, with a virtual IP that floats between them. Health checks at the LB level detect app failures in < 30 seconds.
App tier: N+2 means you can lose two instances simultaneously and still handle peak load. Stateless design: no in-memory sessions. Deploy across at least 3 AZs so that losing an entire AZ still leaves you with > 50% capacity.
Data tier: three-node quorum cluster (Raft-based: etcd, CockroachDB; or Paxos-based: Google Spanner). Synchronous replication within a region; leader election completes in < 10 seconds on failure. RPO = 0 (no data loss within region).
Cache tier: Redis Sentinel watches the primary and promotes a replica on failure. Cache loss causes a thundering herd to the DB — pre-warm caches during failover and use request coalescing to absorb the spike.
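Request coalescing, mentioned above as a defence against the thundering herd, can be sketched as a "single flight" helper: concurrent misses for the same key share one backend load. The class and the simulated `load_user` function are assumptions for illustration, not a specific library's API.

```python
import threading
import time


class SingleFlight:
    """Collapse concurrent loads of the same key into one backend call."""

    def __init__(self):
        self._lock = threading.Lock()
        self._in_flight = {}  # key -> (done_event, result_holder)

    def do(self, key, load):
        with self._lock:
            entry = self._in_flight.get(key)
            leader = entry is None
            if leader:
                entry = (threading.Event(), {})
                self._in_flight[key] = entry
        done, holder = entry

        if leader:
            try:
                holder["value"] = load()      # only the leader hits the backend
            except Exception as exc:
                holder["error"] = exc
            finally:
                with self._lock:
                    del self._in_flight[key]
                done.set()                    # wake every waiting follower
        else:
            done.wait()                       # followers reuse the leader's result

        if "error" in holder:
            raise holder["error"]
        return holder["value"]


calls = []


def load_user():
    calls.append(1)       # count backend hits
    time.sleep(0.2)       # simulate a slow database read
    return {"id": 42}


flight = SingleFlight()
results = []
threads = [
    threading.Thread(target=lambda: results.append(flight.do("user:42", load_user)))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"{len(results)} callers served by {len(calls)} backend call(s)")
```

Results are not cached here; coalescing only deduplicates loads that overlap in time, so it complements cache pre-warming and does not replace it.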
Availability: with every component at 99.95% and all layers redundant, the combined system approaches 99.99% — 52 minutes of downtime per year. Achieving this requires not just infrastructure but operational discipline: automated runbooks, sub-5-minute MTTR, and quarterly failover drills.
Pros
- No single point of failure
- Automatic failover at every layer
- 99.99% achievable with operational discipline
Cons
- 2-3x infrastructure cost
- Operational complexity requires dedicated SRE
- Quorum writes add latency (sync replication)
Choose this variant when
- Payment / auth / healthcare systems
- 99.99% SLA requirement
- Team has SRE capacity for operational overhead
Cell architecture
Traffic partitioned into independent, self-contained cells. Blast radius limited to one cell.
Traffic is partitioned by shard key into independent cells. Each cell contains its own app tier + DB + cache. A cell failure affects only its shard.
Cell architecture divides the system into independent, self-contained units called cells. Each cell has its own app tier, database, cache, and message queue. A cell router at the edge hashes the request (typically by user ID or tenant ID) and routes to the correct cell.
Why cells? In a monolithic HA deployment, a correlated failure (bad config push, poison message, cascading retry storm) can take down the entire system. Cells contain the blast radius: a failure in Cell 1 affects only its shard of users. Cells 2-N continue serving their users as if nothing happened.
Cell sizing: each cell serves 5-10% of total traffic. This means a cell failure affects at most 10% of users. Cells are sized identically for operational simplicity — same instance types, same DB schema, same deploy pipeline. You scale by adding cells, not by making cells bigger.
Cell router: the router is the one shared component and must be extremely simple and highly available itself. It does one thing: hash the shard key and route. No business logic. Typically implemented as a thin L7 proxy with a hash ring. The router itself is redundant (multi-AZ, health-checked).
Deployments: cells enable safe deployments. Deploy to one cell first (canary). Monitor error rates. If clean, roll to the next cell. A bad deployment affects only the canary cell. This is fundamentally safer than deploying to the entire fleet simultaneously.
Who uses this: AWS uses cell architecture for Route 53, S3, and DynamoDB. Azure uses it for Storage. These are the most available services on the internet — not by accident, but by design.
Pros
- Blast radius limited to one cell (5-10% of users)
- Safe canary deployments per cell
- Scales horizontally by adding cells
Cons
- Cross-cell queries are expensive or impossible
- Cell router is a shared dependency requiring extreme HA
- Operational overhead of managing N independent stacks
Choose this variant when
- Five-nines requirement (99.999%)
- Large-scale multi-tenant SaaS
- Need to limit blast radius of bad deployments
Scaling path
Step 1 — Single everything
Ship fast with minimal infrastructure
One server, one database. Every component is a SPOF. Acceptable for prototypes and internal tools. Availability target: ~99% (3.65 days downtime/year). Bottleneck: any single failure causes total outage.
What triggers the next iteration
- Server crash = total outage
- DB failure = total outage
- Deployments require downtime
Step 2 — Redundant app tier
Survive app instance failures and deploy without downtime
Add LB + multiple app instances + DB read replica. App tier is now redundant; DB primary is still SPOF. Availability target: 99.9%. Rolling deployments eliminate downtime. Health checks detect and remove failing instances.
What triggers the next iteration
- DB primary is SPOF
- LB may be single instance
- Manual DB failover takes minutes
Step 3 — Redundant everything
Eliminate all single points of failure
Dual LBs, N+2 app instances across 3 AZs, quorum-replicated DB, cache cluster. No SPOF. Automatic failover at every layer. Availability target: 99.99% (52 min/year). Requires operational discipline: automated runbooks, chaos testing, quarterly failover drills.
What triggers the next iteration
- Correlated failures (bad deploy, poison message) still take everything down
- 2-3x infrastructure cost
- Operational complexity
Step 4 — Cell architecture
Contain blast radius to a fraction of users
Partition into independent cells. Each cell is self-contained. Cell router hashes shard key. Blast radius limited to one cell (5-10% of users). Canary deployments per cell. Availability target: 99.999%. Used by AWS (S3, Route 53, DynamoDB) and Azure (Storage).
What triggers the next iteration
- Cross-cell queries require federation layer
- Cell router must be extremely simple and HA
- N independent stacks to operate
Deep dives
Availability math — nines, serial, parallel
Serial: 99.9% x 99.9% = 99.8%. Parallel: 1 - (0.001 x 0.001) = 99.9999%. Serial chains multiply availabilities; parallel redundancy multiplies failure probabilities.
Availability math is the foundation of every HA conversation. Two rules govern everything:
Rule 1: serial composition multiplies availabilities. If Service A is 99.9% and Service B is 99.9%, and a request must go through both, the combined availability is 99.9% x 99.9% = 99.8%. Three services in series: 99.9%^3 = 99.7%. The takeaway: every synchronous dependency you add to the request path drags down your overall availability. This is why microservice architectures with deep call chains struggle with availability: each hop adds its own failure probability to the chain.
Rule 2: parallel composition multiplies failure probabilities. Two independent instances at 99.9% each in active-active: 1 - (1 - 0.999) x (1 - 0.999) = 1 - 0.001 x 0.001 = 99.9999%. This is the engine behind redundancy: put two of everything and the combined availability soars. Three instances: 1 - (0.001)^3 = 99.9999999%.
SLA composition across dependencies. A real system calls multiple dependencies: DB, cache, message queue, third-party APIs. The overall SLA cannot exceed the product of all serial dependencies' SLAs. If your DB is 99.99% but your payment processor is 99.9%, your checkout flow is at most 99.9% x 99.99% = 99.89%.
Implications for design:
- Minimise the number of synchronous dependencies in the critical path.
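- Where a response can be assembled from partial results, fan out in parallel and tolerate individual failures. If every call must still succeed, running them concurrently improves latency but not availability.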
- Use caching to short-circuit dependencies: a cache hit avoids calling the dependency entirely, removing its failure rate from the calculation.
- For non-critical features, use async processing: if analytics fails, the user doesn't notice.
- Set your SLO honestly: if your dependency chain yields 99.7%, don't promise 99.9%. Either improve the chain or adjust the promise.
Error budgets (Google SRE). If your SLO is 99.95%, you have a budget of 0.05% — about 22 minutes/month. When you're within budget, ship features fast. When you're burning through it, freeze deployments and focus on reliability. This turns availability from a vague aspiration into a measurable resource that product and engineering negotiate over.
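The error-budget arithmetic can be sketched as follows. The 30-day rolling window and the freeze rule are assumptions chosen for illustration; a 30-day window gives 21.6 minutes for 99.95%, slightly below the calendar-month average quoted above.

```python
def error_budget_minutes(slo, window_days=30):
    """Allowed downtime, in minutes, for an SLO over a rolling window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the budget left; zero or negative means it is spent."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_policy(remaining):
    # Hypothetical policy: freeze feature work once the budget is spent.
    return "freeze" if remaining <= 0 else "ship"


print(f"budget: {error_budget_minutes(0.9995):.1f} min per 30 days")
print(release_policy(budget_remaining(0.9995, downtime_minutes=5)))
print(release_policy(budget_remaining(0.9995, downtime_minutes=25)))
```

Teams often add intermediate thresholds (for example, slowing releases at 50% burn); the single cut-off here just keeps the sketch short.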
Health checks — liveness vs readiness
LB sends periodic health probes. Healthy servers receive traffic. Unhealthy servers are removed from the pool until they recover.
Health checks are the nervous system of high availability. Without them, your load balancer sends traffic to dead instances, your orchestrator doesn't restart crashed containers, and your failover never triggers.
Liveness probes answer: "is this process alive?" A simple HTTP 200 from /healthz. If the liveness check fails, the orchestrator kills and restarts the container. Use this to detect deadlocked processes, OOM-killed workers, and stuck event loops. Keep liveness checks simple — they should not call external dependencies. A liveness check that queries the database will kill healthy instances when the database is slow.
Readiness probes answer: "can this instance serve traffic?" A readiness check verifies that the instance has completed startup, its caches are warm, its connection pools are established, and its downstream dependencies are reachable. If readiness fails, the load balancer removes the instance from the pool but does NOT restart it. This is critical for graceful deploys: a new instance that's still warming up should not receive traffic until ready.
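The split between the two probes can be sketched as a pair of handler functions returning an HTTP status and a message. This is framework-agnostic and illustrative: the function names and the `checks` mapping of dependency name to callable are assumptions.

```python
def liveness():
    """Process-local only: no external calls, so a slow DB never causes a restart."""
    return 200, "alive"


def readiness(checks, started):
    """Return 503 until startup has finished and every direct dependency responds."""
    if not started:
        return 503, "starting"
    failing = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False            # an exception counts as "not reachable"
        if not ok:
            failing.append(name)
    if failing:
        return 503, "not ready: " + ", ".join(sorted(failing))
    return 200, "ready"


# Example: cache is unreachable, so the instance leaves the LB pool
# but the liveness probe still passes and nothing is restarted.
status = readiness({"db": lambda: True, "cache": lambda: False}, started=True)
print(liveness(), status)
```

In a real service each check would carry its own short timeout so the readiness endpoint itself cannot hang.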
Tuning parameters:
- Interval: how often the probe fires. 10s is a good default. Too frequent = noise; too infrequent = slow detection.
- Timeout: how long to wait for a response. 3-5s. A timeout is a failure.
- Failure threshold: how many consecutive failures before action. 3 is typical. This prevents flapping on transient network blips.
- Success threshold: how many consecutive successes before marking healthy. 1 for liveness, 2-3 for readiness.
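How the thresholds above turn raw probe results into a healthy/unhealthy decision can be sketched as a small state tracker. The class name and the detection-time estimate are illustrative assumptions, not the behaviour of any particular load balancer or orchestrator.

```python
class ProbeTracker:
    """Applies consecutive failure/success thresholds to raw probe results."""

    def __init__(self, failure_threshold=3, success_threshold=2, healthy=True):
        self.failure_threshold = failure_threshold
        self.success_threshold = success_threshold
        self.healthy = healthy
        self._streak = 0  # consecutive results that contradict the current state

    def record(self, probe_ok):
        if probe_ok == self.healthy:
            self._streak = 0              # result agrees with state: reset streak
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if probe_ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = probe_ok       # enough consecutive results: flip state
            self._streak = 0
        return self.healthy


def worst_case_detection_seconds(interval, timeout, failure_threshold):
    # Rough estimate: the instance dies just after a passing probe, then
    # `failure_threshold` probes must fail, the last one by timing out.
    return interval * failure_threshold + timeout


print(worst_case_detection_seconds(interval=10, timeout=5, failure_threshold=3))
```

With the defaults suggested above (10s interval, 3 failures), detection alone consumes a noticeable share of a 5-minute MTTR budget, which is why these values deserve tuning.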
Deep health checks go beyond HTTP 200: they verify the DB connection is alive, the cache is reachable, and the last heartbeat from a critical dependency was recent. Use these for readiness probes only — never for liveness. A deep health check that fails because Redis is slow should remove the instance from the LB pool, not restart it.
Health check cascading: if Service A's readiness check calls Service B's readiness check, and Service B checks Service C, a failure at C cascades up and takes all services out of the LB pool. Design readiness checks to check only direct dependencies, not transitive ones.
Circuit breakers — fail fast, recover gracefully
Calls flow through in closed state. After N failures, breaker opens and fails fast. After cooldown, half-open lets one probe through to test recovery.
A circuit breaker prevents a failing dependency from dragging down the entire system. Without one, a slow or failing downstream causes threads/connections to pile up in the caller, eventually exhausting resources and cascading the failure upstream.
Three states:
1. Closed (normal): all requests pass through to the dependency. The breaker monitors failure rate. If the failure rate exceeds a threshold (e.g., 50% failures over a 30-second window), the breaker trips to Open.
2. Open (failing fast): all requests immediately fail (or return a fallback response) without calling the dependency. This gives the downstream time to recover. After a cooldown period (e.g., 60 seconds), the breaker transitions to Half-Open.
3. Half-Open (probing): the breaker allows one (or a few) probe requests through. If they succeed, the breaker resets to Closed. If they fail, it goes back to Open for another cooldown period.
Tuning the breaker:
- Failure rate threshold: 50% over 30 seconds is a reasonable starting point. Too low = false trips on normal error rates; too high = slow to detect real failures.
- Cooldown period: 30-60 seconds. Long enough for the downstream to recover; short enough that you're not failing fast for minutes after the issue resolves.
- Half-open probe count: 1-3 requests. Enough to get a signal; not so many that you flood a recovering dependency.
- Sliding window: use a rolling window (not fixed) to avoid edge-boundary artifacts.
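The three states and the tuning knobs above fit in a short state machine. This is a single-threaded sketch, not a substitute for the libraries named below: the class, the `min_calls` guard, and the injectable `clock` are assumptions added for illustration and testability.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised instead of calling the dependency while the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window_s=30.0, cooldown_s=60.0,
                 min_calls=10, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls   # assumption: ignore tiny samples
        self.clock = clock           # injectable so tests can control time
        self.state = "closed"
        self._opened_at = 0.0
        self._results = deque()      # (timestamp, succeeded), rolling window

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # cooldown elapsed: let a probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._record(now, succeeded=False)
            raise
        self._record(now, succeeded=True)
        return result

    def _record(self, now, succeeded):
        if self.state == "half_open":
            # The probe decides: success resets, failure reopens.
            if succeeded:
                self.state = "closed"
                self._results.clear()
            else:
                self._trip(now)
            return
        self._results.append((now, succeeded))
        while now - self._results[0][0] > self.window_s:
            self._results.popleft()   # drop samples older than the window
        failures = sum(1 for _, ok in self._results if not ok)
        if (len(self._results) >= self.min_calls
                and failures / len(self._results) >= self.failure_rate):
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self._opened_at = now
        self._results.clear()


breaker = CircuitBreaker()  # defaults: 50% over 30 s, 60 s cooldown
```

A production breaker also needs thread safety and a cap on concurrent half-open probes, both omitted here for brevity.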
Fallback strategies when the breaker is open:
- Return cached data (stale but available).
- Return a degraded response ("recommendations unavailable, showing popular items").
- Queue the request for retry when the breaker closes.
- Return a clear error with retry guidance.
Implementation: libraries like resilience4j (Java), Polly (.NET), and opossum (Node.js) provide battle-tested circuit breakers. In service meshes like Istio, circuit breaking is configured at the infrastructure level via DestinationRules.
Per-dependency breakers: each downstream dependency gets its own circuit breaker with independently tuned thresholds. A slow payment processor should not trip the breaker for the recommendation engine.
Graceful degradation — shed features, keep the core
Thread pools / connection pools are isolated per dependency. Service A failing exhausts only its own pool; Service B continues serving on its dedicated pool.
Graceful degradation is the principle that when a non-critical component fails, the system continues operating with reduced functionality rather than failing entirely. It requires advance planning — you must decide what to shed before the incident happens.
The degradation matrix: for every dependency, document what the user experience should be when it's unavailable:
| Dependency | Normal behaviour | Degraded behaviour |
|---|---|---|
| Search service | Full-text search | Show "search unavailable" banner |
| Recommendation engine | Personalised feed | Show trending/popular |
| Analytics pipeline | Real-time tracking | Buffer to local disk, backfill |
| Image processing | Generate thumbnails | Serve original image (larger) |
| Payment processor | Instant checkout | "Try again in 1 minute" message |
| Notification service | Push + email | Queue for later delivery |
Load shedding under pressure: when the system is under extreme load and cannot serve all requests at full quality, shed non-critical work:
1. Priority tiers: classify requests into tiers. Tier 1 (checkout, login) always served. Tier 2 (search, browse) served when capacity allows. Tier 3 (analytics, recommendations) shed first.
2. Feature flags: use feature flags to disable expensive features in real-time. "Disable personalised recommendations" reduces CPU load by 30% and lets the core survive.
3. Response quality: serve lower-quality responses: smaller images, fewer results, no auto-complete. Each reduction frees capacity for more requests.
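Priority-tier shedding reduces to an admission check at the front door. The sketch below uses the tier assignments from the list above; the utilisation thresholds and the `admit` function are hypothetical values chosen for illustration.

```python
TIER = {
    "checkout": 1, "login": 1,
    "search": 2, "browse": 2,
    "analytics": 3, "recommendations": 3,
}

# Hypothetical thresholds: utilisation (0.0-1.0) above which a tier is shed.
SHED_ABOVE = {1: float("inf"), 2: 0.90, 3: 0.75}


def admit(endpoint, utilisation):
    """Return True if the request should be served at the current load."""
    tier = TIER.get(endpoint, 3)  # unknown endpoints get the lowest priority
    return utilisation <= SHED_ABOVE[tier]


for endpoint in ("checkout", "search", "analytics"):
    print(endpoint, admit(endpoint, utilisation=0.80))
```

At 80% utilisation only Tier 3 is shed; Tier 2 follows above 90%, and Tier 1 is never rejected by this check.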
Implementation patterns:
- Fallback decorators: wrap each dependency call with a fallback that returns cached/default data when the call fails or times out.
- Bulkheads: isolate thread pools per dependency. The recommendation service getting slow should not consume threads needed for checkout.
- Feature flags: runtime toggles that disable non-critical features without a deployment.
- Read-only mode: if the write path fails, switch to read-only. Users can browse but not purchase. Better than nothing.
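A fallback decorator, the first pattern in the list above, might look like the sketch below. The decorator name and both example functions are illustrative; the recommendation call is hard-coded to fail so the fallback path is visible.

```python
import functools


def with_fallback(fallback):
    """Return `fallback(*args, **kwargs)` if the wrapped call raises."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            try:
                return fn(*args, **kwargs)
            except Exception:
                # Timeouts are assumed to surface as exceptions from the client.
                return fallback(*args, **kwargs)
        return wrapper
    return decorator


def trending_items(user_id):
    return ["top-1", "top-2", "top-3"]  # stand-in for cached popular items


@with_fallback(trending_items)
def personalised_feed(user_id):
    raise TimeoutError("recommendation engine unavailable")  # simulated outage


print(personalised_feed("user-123"))
```

Catching every exception is deliberate for a non-critical feature; a critical path such as checkout should surface the error instead of masking it.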
The key interview signal: candidates who proactively build the degradation matrix, rather than waiting to be asked "what if X fails," demonstrate senior-level thinking. The matrix should be a design artifact, not an afterthought.
Cell architecture — blast-radius containment at scale
Traffic is partitioned by shard key into independent cells. Each cell contains its own app tier + DB + cache. A cell failure affects only its shard.
Cell architecture is the ultimate blast-radius containment pattern. Instead of running one large system where a correlated failure takes down everyone, you partition into N independent cells, each serving a fraction of traffic.
How it works:
1. Cell router at the edge receives every request. It extracts a shard key (user ID, tenant ID, or account ID) and hashes it to determine which cell should handle the request.
2. Each cell is independent: its own app tier, its own database, its own cache, its own message queue. No shared state between cells except the routing table.
3. Cell failure is contained: if Cell 3 has a bad deployment, a DB failure, or a cascading retry storm, only Cell 3's users are affected. Cells 1, 2, and 4-N continue serving normally.
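The routing step in the list above comes down to a stable hash of the shard key. The following is a minimal sketch using plain modulo placement, which is an assumption made for brevity; the `cell_for` name is illustrative.

```python
import hashlib


def cell_for(shard_key, cell_count):
    """Map a shard key to a cell index with a stable hash."""
    # hashlib rather than built-in hash(): Python salts str hashes per
    # process, which would route the same user differently after a restart.
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % cell_count


for user in ("user-1", "user-2", "user-3"):
    print(user, "-> cell", cell_for(user, cell_count=16))
```

Modulo placement moves most keys whenever `cell_count` changes, so a real router would more likely use a hash ring or an explicit routing table to keep users pinned to their cell.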
Cell sizing and operations:
- Size each cell identically. Same instance types, same DB schema, same configuration. This simplifies operations enormously — you operate one cell template, not N snowflakes.
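- Each cell serves 5-10% of traffic. This means a cell failure affects at most 10% of users. If availability is measured as the absence of a system-wide outage, individual cells can run at a lower target (say 99.99%) than the five nines promised for the system as a whole; each user, however, still experiences the availability of their own cell.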
- Scale by adding cells, not by making cells bigger. This avoids the "blast radius grows with scale" problem.
Deployment safety:
Deploy to one cell first (canary cell). Monitor error rates, latency, and business metrics for 15-30 minutes. If clean, roll to the next batch of cells. If not, roll back only the canary cell; with cells sized at 5-10% of traffic, 90-95% of users never saw the bad code.
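The canary-then-batches procedure can be sketched as a small driver. The `deploy`, `healthy`, and `rollback` callables are hypothetical hooks supplied by the caller; `healthy` stands in for the 15-30 minute monitoring period.

```python
def rollout(cells, deploy, healthy, rollback, batch_size=2):
    """Deploy to a canary cell, then in batches; halt and revert on failure."""
    canary, rest = cells[0], cells[1:]
    batches = [[canary]] + [rest[i:i + batch_size]
                            for i in range(0, len(rest), batch_size)]
    done = []
    for batch in batches:
        for cell in batch:
            deploy(cell)
        if not all(healthy(cell) for cell in batch):
            for cell in batch:
                rollback(cell)  # only the failing batch is reverted
            return {"status": "halted", "deployed": done, "rolled_back": batch}
        done.extend(batch)
    return {"status": "complete", "deployed": done, "rolled_back": []}


log = []
result = rollout(
    cells=["cell-1", "cell-2", "cell-3", "cell-4", "cell-5"],
    deploy=lambda cell: log.append(("deploy", cell)),
    healthy=lambda cell: cell != "cell-4",   # simulate a regression in cell-4
    rollback=lambda cell: log.append(("rollback", cell)),
)
print(result)
```

Cells that already passed keep the new version in this sketch; whether to revert them as well is a policy decision the source does not specify.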
Cross-cell concerns:
The hard part of cell architecture is cross-cell queries. If a user in Cell 1 wants to see data about a user in Cell 3, the request must be routed cross-cell. Strategies:
- Avoid cross-cell reads by routing all data for a user/tenant to the same cell.
- Federation layer for the rare cross-cell queries (admin dashboards, global search).
- Async replication of aggregated data to a global read store for analytics.
Real-world usage: AWS Route 53 uses cells. Each cell handles DNS queries for a subset of hosted zones. A cell failure affects only those zones. This is how Route 53 can offer a 100% availability SLA, the only AWS service with that guarantee.
Chaos engineering — proactive failure injection
Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. Put simply: break things on purpose, before they break by accident.
The chaos engineering cycle:
1. Hypothesise: "If we kill a random app instance, the LB should route around it within 30 seconds and error rate should not spike above 0.1%."
2. Inject failure: kill the instance (or inject latency, corrupt a response, block a network path).
3. Observe: measure error rate, latency, and user impact.
4. Learn: did the system behave as hypothesised? If not, fix the gap and re-test.
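One pass of the cycle above can be expressed as a tiny harness. Everything here is illustrative: the `inject`, `restore`, and `measure_error_rate` hooks are assumptions, and the "system" is a dictionary standing in for real infrastructure and metrics.

```python
def run_experiment(hypothesis, inject, restore, measure_error_rate, max_error_rate):
    """One pass of the cycle: inject, observe, restore, compare to hypothesis."""
    baseline = measure_error_rate()
    inject()
    try:
        observed = measure_error_rate()
    finally:
        restore()  # always undo the fault, even if measurement fails
    return {
        "hypothesis": hypothesis,
        "baseline": baseline,
        "observed": observed,
        "held": observed <= max_error_rate,
    }


# Simulated system: killing one instance raises the error rate to 0.05%.
state = {"instance_killed": False}
result = run_experiment(
    hypothesis="killing one instance keeps the error rate at or below 0.1%",
    inject=lambda: state.update(instance_killed=True),
    restore=lambda: state.update(instance_killed=False),
    measure_error_rate=lambda: 0.0005 if state["instance_killed"] else 0.0,
    max_error_rate=0.001,
)
print(result["held"])
```

When `held` comes back false, the "learn" step is the output of the exercise: the gap gets fixed and the same experiment is re-run.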
Types of failure injection:
| Injection | What it tests | Tool |
|---|---|---|
| Instance kill | LB health checks, auto-scaling | Chaos Monkey |
| AZ blackout | Multi-AZ failover, data replication | AWS FIS |
| Latency injection | Timeouts, circuit breakers | Gremlin, Toxiproxy |
| DNS failure | DNS caching, fallback resolution | Custom iptables |
| Dependency error | Circuit breaker, graceful degradation | Istio fault injection |
| Clock skew | Certificate validation, token expiry | Chronos |
| Disk full | Log rotation, data storage failover | Custom scripts |
Principles:
- Start small. First chaos experiment: kill one instance in staging. Build confidence before touching production.
- Minimise blast radius. Use canary/cell architecture to limit impact. Netflix runs Chaos Monkey only during business hours when engineers are watching.
- Automate. Manual chaos testing doesn't scale. Schedule experiments to run continuously. AWS FIS and Gremlin support automated experiment plans.
- Measure everything. Without observability, chaos engineering is just breaking things. You need real-time dashboards showing error rates, latency percentiles, and business metrics.
Game days: quarterly exercises where the team simulates a major failure (region outage, DB corruption, DDoS) and practices the incident response. Game days validate not just the technology but the runbooks, communication channels, and escalation paths. If your last game day was "we discovered our runbook was wrong," that's a success — you found it before a real incident.
The interview signal: mentioning chaos engineering shows operational maturity. Saying "we'd run chaos experiments to validate our HA claims" is worth more than describing perfect architecture that's never been tested.
Case studies
Cell architecture in Route 53, S3, and DynamoDB
AWS is the canonical example of cell architecture at hyperscale. Route 53 (DNS) partitions hosted zones into independent cells. Each cell handles DNS resolution for its subset of zones. A cell failure affects only those zones — not the entire DNS service. This is how Route 53 offers a 100% SLA, the only AWS service with that guarantee.
S3 uses a similar pattern: each bucket is assigned to a partition (cell) that handles all operations for that bucket. A partition failure causes errors for the affected buckets but other buckets continue serving normally. DynamoDB partitions tables into cells by partition key. A hot partition affects only its cell, not the entire table.
Key design decisions: cells are sized small enough that a cell failure is a minor event (< 1% of traffic). The cell router is the simplest possible component — it maps a key to a cell and forwards. No business logic in the router. Cells share nothing except the routing table, which is itself replicated and cached.
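A router that only "maps a key to a cell and forwards" can be very small. The sketch below is an assumption-laden illustration, not how any AWS service implements it: the cell names, the `OVERRIDES` table, and the hash-modulo placement are all hypothetical.

```python
import hashlib

CELLS = ["cell-1", "cell-2", "cell-3", "cell-4"]   # hypothetical cell names

# Explicit overrides let an operator pin or migrate a tenant without
# changing the hash function.
OVERRIDES = {"tenant-big": "cell-4"}


def route(tenant_id: str) -> str:
    """Map a tenant to exactly one cell. No business logic lives here."""
    if tenant_id in OVERRIDES:
        return OVERRIDES[tenant_id]
    # A stable hash (not Python's per-process hash()) keeps routing
    # consistent across router instances and restarts.
    digest = hashlib.sha256(tenant_id.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(CELLS)
    return CELLS[index]
```

Hash-modulo placement reshuffles most tenants when the number of cells changes, so a production router would more likely use an explicit mapping table or consistent hashing. The point here is only that the router stays free of business logic.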
Numbers: Route 53 handles 100B+ DNS queries per day. S3 stores 100+ trillion objects. DynamoDB serves 89 trillion requests per day. These numbers are only possible because cell architecture limits the blast radius of any single failure.
Takeaway
Cell architecture enables hyperscale availability by limiting blast radius. Size cells small, keep the router simple, share nothing between cells.
Chaos engineering and the Simian Army
Netflix pioneered chaos engineering with Chaos Monkey (2011) — a tool that randomly kills production instances during business hours. The hypothesis: if our architecture is truly resilient, killing any single instance should have zero user impact. When an instance death causes a blip, Netflix fixes the gap and the system gets stronger.
The Simian Army expanded the concept: Latency Monkey injects delays into inter-service calls (testing timeouts and circuit breakers). Conformity Monkey checks that instances follow best practices (no root SSH keys, correct security groups). Chaos Kong simulates an entire AWS region going offline (testing multi-region failover).
Key insight: Netflix runs chaos experiments in production, not just staging. Staging environments are never realistic enough — they have different traffic patterns, different data volumes, and different configurations. Production chaos with blast-radius controls (cell architecture, traffic percentage limits) is the only way to validate HA for real.
Results: Netflix achieves ~99.99% availability for its streaming service — serving 260M+ subscribers across 190+ countries. When AWS us-east-1 experienced a major outage in 2017, Netflix continued streaming from other regions with minimal impact.
Tooling evolution: Netflix open-sourced Chaos Monkey and later contributed to the Chaos Engineering community. Modern successors include AWS Fault Injection Simulator (FIS), Gremlin, LitmusChaos (Kubernetes-native), and Steadybit.
Takeaway
Chaos engineering in production — not staging — is the only way to validate HA claims. Start with instance kills, graduate to region-level failures. Fix every gap you find.
SRE error budgets and availability targets
Google's Site Reliability Engineering (SRE) introduced the concept of error budgets — turning availability from a vague goal into a measurable, negotiable resource. The key insight: 100% availability is the wrong target. It's infinitely expensive and prevents any change (every deployment is a risk).
How error budgets work: if the SLO is 99.95%, the error budget is 0.05% — about 22 minutes of downtime per month. Product teams can "spend" this budget on deployments, experiments, and migrations. If the budget is healthy (few incidents), ship features aggressively. If the budget is burned (too many incidents), freeze features and focus on reliability.
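The budget arithmetic is easy to check in code. This is a minimal sketch assuming a 30-day window; the function names and the 50% "ship cautiously" cut-off are illustrative choices, not part of any SRE standard.

```python
def error_budget_minutes(slo: float, window_days: float = 30.0) -> float:
    """Allowed downtime in minutes for the window at the given SLO."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float,
                     window_days: float = 30.0) -> float:
    """Fraction of the budget left (negative once it is overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


def release_policy(remaining: float) -> str:
    """Translate remaining budget into a release posture (thresholds assumed)."""
    if remaining <= 0:
        return "freeze features"
    if remaining < 0.5:
        return "ship cautiously"
    return "ship normally"


print(round(error_budget_minutes(0.9995), 1))   # 21.6 minutes per 30 days
```

With 5 minutes of downtime already spent against a 99.95% SLO, `release_policy(budget_remaining(0.9995, 5))` returns `"ship normally"`; at 25 minutes the budget is overspent and the result is `"freeze features"`.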
Tiered SLOs: Google assigns different SLOs to different services based on user impact:
- Search: 99.99% (even seconds of downtime cost revenue)
- Gmail: 99.95% (users tolerate brief delays)
- Internal tools: 99.9% (employees can wait)
SLI selection matters: the Service Level Indicator must match what users care about. For a web service, SLI = "proportion of requests that return successfully within 300ms." For a batch pipeline, SLI = "proportion of jobs that complete within the deadline." A bad SLI (e.g., measuring CPU utilization instead of user-facing latency) makes the error budget meaningless.
Blameless post-mortems: when an incident burns error budget, Google conducts a blameless post-mortem focused on systemic fixes, not individual fault. Every post-mortem produces action items: better monitoring, improved runbooks, architecture changes. This culture of continuous improvement is what sustains high availability over years.
Numbers: Google operates 20+ data centres worldwide, serves 8.5B+ searches per day, and maintains 99.99%+ availability on its core services. This is achieved not through perfect technology but through the SRE framework: error budgets, SLOs, toil reduction, and blameless post-mortems.
Takeaway
Error budgets turn availability into a negotiable resource. Set SLOs by user impact, measure with user-facing SLIs, and use blameless post-mortems to continuously improve.
Decision levers
Availability target (nines)
Pick a concrete number: 99.9% / 99.95% / 99.99% / 99.999%. Each additional nine is roughly 10x the cost in infrastructure and operational discipline. 99.99% (52 min/year) demands multi-AZ, automated failover, < 5 min MTTR, and weekly chaos exercises. 99.999% (5.26 min/year) requires cell architecture and multi-region. Don't promise nines you can't measure — if you don't have SLIs tracking user-facing availability, your SLO is fiction.
Redundancy topology
Multi-AZ with automated failover is the baseline and can carry a service from 99.9% up to 99.99%. Go multi-region for 99.999%, or whenever a region-wide outage must be survived. Active-passive for DR with minutes-level failover. Active-active for zero-RTO but with conflict resolution complexity. Cell architecture for blast-radius containment at scale. The topology must cover every layer: LBs, app tier, data tier, cache, control plane. "Redundant app + single DB" is half-HA.
Graceful degradation strategy
Build a degradation matrix: for each dependency, define the user experience when it's down. This is a design decision, not an afterthought. Search down = show banner. Cache miss = fall through to DB (higher latency). Recommendations down = show popular items. Payment processor slow = show clear retry message. Feature flags enable runtime degradation without deployment.
Failure detection speed
Your MTTR is bounded by your detection time. Health check interval x failure threshold = detection time. With 10s interval and 3-failure threshold, detection takes 30 seconds. Add promotion time (5-10s for DB failover) and DNS TTL (30-60s). Total failover time: 1-2 minutes. To achieve < 1 min failover, reduce health check intervals (5s), lower TTLs, and use connection draining rather than hard cutover.
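The failover budget above is a simple sum, sketched here so the numbers can be varied. It assumes the worst case where detection, promotion, and DNS expiry happen strictly one after another; the function names are illustrative.

```python
def detection_seconds(check_interval_s: float, failure_threshold: int) -> float:
    """Time for the health checker to declare a target unhealthy."""
    return check_interval_s * failure_threshold


def failover_seconds(check_interval_s: float, failure_threshold: int,
                     promotion_s: float, dns_ttl_s: float) -> float:
    """Worst case: detect, promote, then wait for cached DNS records to expire."""
    return (detection_seconds(check_interval_s, failure_threshold)
            + promotion_s + dns_ttl_s)


slow = failover_seconds(10, 3, promotion_s=10, dns_ttl_s=60)   # 100 seconds
fast = failover_seconds(5, 3, promotion_s=10, dns_ttl_s=30)    # 55 seconds
print(slow, fast)
```

Halving the check interval and the TTL brings the worst case from 100 seconds to 55, which is what it takes to get under the one-minute mark.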
Blast-radius containment
Bulkheads isolate thread/connection pools per dependency. Circuit breakers fail fast and prevent cascading failures. Cell architecture limits the impact of correlated failures to one cell. The question is not whether to contain blast radius but how granularly: per-dependency (bulkhead), per-circuit (breaker), or per-shard (cell).
Failure modes
The claim "we'll fail over to the standby" has never been exercised. When the day comes, configs have drifted, credentials expired, replication broke silently. Drill quarterly or your HA is theoretical.
One slow dependency drags every request into its slowness. Without timeouts, threads pile up and the caller dies too. Every remote call needs a timeout tuned to the request's latency budget.
Downstream degrades -> upstream retries aggressively -> downstream collapses under retry load. Fix: exponential backoff with jitter, circuit breakers, and load-shedding at the edge. Never retry at every layer — pick one layer to retry.
Auth, config, or service-discovery running on a single-AZ control plane. The entire fleet dies when the control plane's AZ goes down. Replicate the control plane across AZs too.
"We have 3 replicas" — all in one AZ. Surviving an AZ outage requires distribution ACROSS AZs, not more instances within one. Check your AZ spread, not just your instance count.
Cache fails over -> all requests hit DB simultaneously -> DB collapses. Fix: stagger cache warming, use request coalescing (singleflight), and have circuit breakers on the DB path.
DNS TTL of 300s means clients continue hitting a dead endpoint for 5 minutes after failover. For HA systems, use TTLs of 30-60 seconds. Balance between failover speed and DNS infrastructure load.
Decision table
HA topology decision matrix
| Dimension | Single server | Redundant app | Fully redundant | Cell architecture |
|---|---|---|---|---|
| Target SLO | 99% | 99.9% | 99.99% | 99.999% |
| Annual downtime | 3.65 days | 8.76 hours | 52.6 min | 5.26 min |
| SPOF count | Everything | DB primary | None | Cell router only |
| Infra cost multiplier | 1x | 2x | 3x | 5-10x |
| Failover | Manual SSH | Manual DB promote | Automated | Automated per-cell |
| Blast radius | 100% | 100% (DB fail) | 100% (correlated) | 5-10% per cell |
| Team requirement | One dev | Small team | Dedicated SRE | SRE + platform team |
| Deployment safety | Downtime deploy | Rolling update | Blue-green | Cell-by-cell canary |
- Each nine is roughly 10x the infrastructure and operational cost.
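- Cell architecture pushes toward five nines by keeping each cell independently at four nines and confining every failure to one cell's share of traffic, so an incident in a 10% cell costs roughly a tenth of the fleet-wide downtime it would otherwise cause.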
Worked example
Worked example: Payment processing system HA
A payment processing service handles 50,000 transactions per minute. The business requires 99.99% availability — 52 minutes of downtime per year — because every minute of outage means lost revenue and customer trust.
Two LBs front N app servers spread across AZs. Primary DB with synchronous replicas and automatic failover. Health checks at every hop.
Step 1 — Define the availability budget
99.99% = 4.38 minutes of downtime per month. This means MTTR must be under 3 minutes (leaving margin for detection). Human-driven failover is too slow; everything must be automated.
Step 2 — Eliminate every SPOF
Load balancers: AWS NLB (multi-AZ, no single instance to fail). DNS with 30-second TTL pointing to the NLB.
App tier: 6 stateless instances (N+2, where N = 4 covers peak load) across 3 AZs, 2 per AZ. Losing an entire AZ leaves 4 instances, which is still enough for peak load. Session state lives in Redis, not in process memory. Health checks every 10s with a 3-failure threshold give 30s detection.
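The AZ-loss claim can be checked mechanically. A minimal sketch, assuming identical instances and that "capacity" is just an instance count; `survives_az_loss` and the AZ names are hypothetical.

```python
def survives_az_loss(instances_per_az: dict, required_at_peak: int) -> bool:
    """True if losing any single AZ still leaves enough instances for peak load."""
    total = sum(instances_per_az.values())
    return all(total - count >= required_at_peak
               for count in instances_per_az.values())


spread = {"az-a": 2, "az-b": 2, "az-c": 2}    # the layout described above
stacked = {"az-a": 6, "az-b": 0, "az-c": 0}   # same instance count, one AZ

print(survives_az_loss(spread, required_at_peak=4))
print(survives_az_loss(stacked, required_at_peak=4))
```

Both layouts run six instances, but only the spread layout passes: instance count alone says nothing about surviving an AZ outage.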
Database: PostgreSQL managed by Patroni, which delegates leader election to a consensus store such as etcd (Raft-based). 3-node cluster across 3 AZs. Synchronous replication to one standby (RPO = 0). Leader failure triggers an automatic election, targeted at < 10 seconds with tuned timeouts. Connection pooler (PgBouncer) per AZ to handle connection storms during failover.
Cache: Redis Sentinel cluster (3 sentinels, 1 primary + 2 replicas). Sentinel promotes replica on primary failure. The app survives a cold cache — queries hit the DB directly with higher latency but correct results. Request coalescing (singleflight) prevents thundering herd.
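Request coalescing is named above but not shown. The sketch below is a Python approximation of the idea behind Go's `singleflight` package, written for illustration: `SingleFlight`, `_Call`, `waiting`, and `load_user` are hypothetical names, and the "database" is a list that records queries.

```python
import threading
import time


class _Call:
    """Shared state for one in-flight load."""
    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None
        self.waiters = 0


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}                      # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            is_leader = call is None
            if is_leader:
                call = self._inflight[key] = _Call()
            else:
                call.waiters += 1
        if is_leader:
            try:
                call.result = fn()               # only the leader hits the database
            except Exception as exc:
                call.error = exc                 # followers see the same failure
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()
        else:
            call.done.wait()
        if call.error is not None:
            raise call.error
        return call.result

    def waiting(self, key) -> int:
        """Number of followers currently parked behind the leader for this key."""
        with self._lock:
            call = self._inflight.get(key)
            return call.waiters if call else 0


db_queries = []
release = threading.Event()


def load_user():
    db_queries.append("SELECT ...")             # stand-in for the real query
    release.wait(timeout=5)                      # simulate a slow database
    return {"id": 42}


flight = SingleFlight()
results = []
threads = [threading.Thread(target=lambda: results.append(flight.do("user:42", load_user)))
           for _ in range(5)]
for t in threads:
    t.start()
deadline = time.monotonic() + 5
while flight.waiting("user:42") < 4 and time.monotonic() < deadline:
    time.sleep(0.01)                             # wait until four followers have joined
release.set()
for t in threads:
    t.join()
print(len(db_queries), len(results))             # one query, five results
```

Five concurrent cache misses for the same key produce one database query. Calls that arrive after the leader finishes start a fresh load, so this deduplicates bursts rather than acting as a cache.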
Step 3 — Circuit breakers and degradation
Every external dependency has its own circuit breaker:
| Dependency | Timeout | Breaker threshold | Degradation |
|---|---|---|---|
| Card network (Visa/MC) | 5s | 40% over 20s | Queue for retry in 60s |
| Fraud detection | 2s | 50% over 30s | Allow with flag for async review |
| Notification service | 3s | 50% over 30s | Queue notification for later |
| Analytics pipeline | 1s | Any failure | Buffer to local disk |
Fraud detection degradation is a business decision: when the fraud service is down, do you block all transactions (safe but revenue-killing) or allow them with a flag for async review (risky but revenue-preserving)? We chose the latter because the fraud rate is < 0.1% and async review catches 95% of fraud within 5 minutes.
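The thresholds in the table are failure rates over a time window (for example "40% over 20s"), which a simple consecutive-failure counter cannot express. The sketch below shows one way to track such a rate. It is an illustration under assumptions: `FailureRateWindow` is a hypothetical class, the `min_calls` guard is an added safeguard not stated in the table, and the class is not thread-safe as written.

```python
import time
from collections import deque


class FailureRateWindow:
    """Track call outcomes over a sliding time window."""

    def __init__(self, window_s: float, threshold: float,
                 min_calls: int = 10, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold
        self.min_calls = min_calls      # avoid tripping on 1 failure out of 1 call
        self._clock = clock             # injectable so tests can control time
        self._events = deque()          # (timestamp, failed)

    def record(self, failed: bool) -> None:
        self._events.append((self._clock(), failed))
        self._evict()

    def _evict(self) -> None:
        cutoff = self._clock() - self.window_s
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def failure_rate(self) -> float:
        self._evict()
        if not self._events:
            return 0.0
        failures = sum(1 for _, failed in self._events if failed)
        return failures / len(self._events)

    def should_open(self) -> bool:
        self._evict()
        return (len(self._events) >= self.min_calls
                and self.failure_rate() >= self.threshold)


now = [0.0]                                      # fake clock for the demo
card_network = FailureRateWindow(window_s=20, threshold=0.40,
                                 clock=lambda: now[0])
for i in range(10):
    card_network.record(failed=(i < 4))          # 4 failures out of 10 calls
tripped = card_network.should_open()             # 40% meets the threshold
now[0] = 25.0                                    # 25 s later the samples have aged out
recovered = not card_network.should_open()
print(tripped, recovered)
```

A breaker would consult `should_open()` before each call and switch to the row's degradation path while it returns true.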
Step 4 — Blast-radius containment
Bulkhead pools: the card-network HTTP client has its own thread pool (20 threads). If Visa is slow, only those 20 threads block. The remaining thread pool (80 threads) continues serving fraud checks, notifications, and health checks.
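One lightweight way to get the same isolation is a concurrency cap per dependency. Note that this sketch swaps the dedicated thread pool described above for a semaphore-based bulkhead, which is a related but different technique; `Bulkhead` and its parameters are hypothetical.

```python
import threading


class Bulkhead:
    """Cap concurrent calls to one dependency; reject the overflow immediately."""

    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, fallback=None, **kwargs):
        if not self._slots.acquire(blocking=False):
            # All slots busy: fail fast instead of queueing behind a slow dependency.
            return fallback() if fallback else None
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


card_network = Bulkhead("card-network", max_concurrent=20)

# Demo with a single slot: the nested call finds the slot taken and is rejected.
tiny = Bulkhead("demo", max_concurrent=1)
nested = tiny.call(lambda: tiny.call(lambda: "ran", fallback=lambda: "rejected"))
after = tiny.call(lambda: "ran")                 # the slot was released, so this runs
print(nested, after)
```

If the card network stalls, at most 20 callers block on it. The 21st gets the fallback at once, and every other code path keeps its threads.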
Idempotency: every transaction has a client-generated idempotency key. Retries are safe — the system detects duplicate keys and returns the original response. This is non-negotiable for payments: a retry must never double-charge a customer.
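The duplicate-key check can be sketched as follows. This is an in-memory illustration only: `IdempotentProcessor` and `fake_charge` are hypothetical, and a real payment system would persist keys durably (for example behind a unique constraint) rather than in a dictionary guarded by one lock.

```python
import threading


class IdempotentProcessor:
    """Return the stored response for a repeated idempotency key."""

    def __init__(self, charge_fn):
        self._charge_fn = charge_fn
        self._responses = {}            # idempotency key -> original response
        self._lock = threading.Lock()

    def process(self, idempotency_key: str, amount_cents: int) -> dict:
        # Holding the lock across the charge is simplistic (it serialises all
        # charges), but it guarantees concurrent duplicates cannot both charge.
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]
            response = self._charge_fn(amount_cents)
            self._responses[idempotency_key] = response
            return response


charges = []


def fake_charge(amount_cents: int) -> dict:
    charges.append(amount_cents)        # stand-in for the card-network call
    return {"status": "charged", "amount_cents": amount_cents,
            "charge_id": len(charges)}


processor = IdempotentProcessor(fake_charge)
first = processor.process("key-123", 5000)
retry = processor.process("key-123", 5000)       # client retried after a timeout
print(first == retry, len(charges))
```

The retry returns the original response and the customer is charged once.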
Step 5 — Operational discipline
Weekly game days: Week 1: kill a random app instance. Week 2: inject 3s latency into the card network. Week 3: block traffic to one AZ. Week 4: fail the Redis primary.
Quarterly AZ failover drill: simulate a full AZ outage in production. Validate that the remaining 2 AZs absorb load, the DB fails over, and the error rate stays below 0.01%.
Error budget tracking: dashboard shows remaining budget (4.38 min/month). When budget drops below 50%, pager alerts on-call SRE. When budget is exhausted, feature freeze until reliability improves.
Post-incident reviews: every incident that burns > 30 seconds of error budget gets a blameless post-mortem. Action items are tracked to completion. Recurring themes drive architecture investments (e.g., "we keep having DB failover issues" → invest in Patroni tuning and automated testing).
The result
This architecture achieves 99.99% availability because: (1) no single component failure causes an outage, (2) automated detection + failover keeps MTTR under 3 minutes, (3) circuit breakers and degradation prevent cascading failures, (4) operational discipline (game days, error budgets, post-mortems) continuously validates and improves the system.
Interview playbook
When it comes up
- Prompt mentions "high availability," "99.9%," "nines," or "SLA"
- System handles payments, auth, or healthcare — downtime = liability
- Interviewer asks "what happens when X fails?"
- "Design a system that survives an AZ outage"
- Any mention of "failover," "redundancy," or "disaster recovery"
Order of reveal
1. Name the target. "Let me start by setting a concrete availability target. For a payment system, I'd target 99.99%, which is 52 minutes of downtime per year. That constrains everything else."
2. Availability math. "99.99% means MTTR under 5 minutes. Human failover is too slow, so everything must be automated. And serial dependencies multiply failure rates: three 99.9% services in series cap us at 99.7%."
3. Redundancy at every layer. "Multi-AZ LBs, N+2 stateless app instances, quorum-replicated DB with automatic leader election. No single point of failure, including the control plane."
4. Circuit breakers + degradation. "Every dependency gets a timeout, a circuit breaker, and a documented degradation path. Search down shows a banner; cache down falls through to DB; analytics buffers to disk."
5. Blast-radius containment. "Bulkhead thread pools per dependency. For five-nines, cell architecture: partition into independent cells so a failure affects only one shard of users."
6. Failover mechanics. "Health checks every 10s, 3-failure threshold = 30s detection. DB failover via Raft election in <10s. DNS TTL 30s. Total failover time: ~1 minute."
7. Operational discipline. "Untested HA isn't HA. Weekly game days, quarterly region-failover drills, error budgets tracked monthly. When budget burns, feature freeze until reliability improves."
Signature phrases
- “Each nine costs 10x” — Shows you understand the economics, not just the technology.
- “Serial dependencies multiply failure rates” — Demonstrates availability math fluency.
- “Untested failover is no failover” — Signals operational maturity and chaos engineering awareness.
- “Degrade gracefully, fail fast” — Captures the two key HA strategies in four words.
- “Blast radius of one cell” — Shows knowledge of cell architecture and advanced HA patterns.
- “Error budget, not error prayer” — Signals SRE discipline and measurable reliability.
Likely follow-ups
“How do you handle a full AZ outage?”
"App tier: N+2 across 3 AZs means losing one AZ still leaves enough capacity for peak load. DB: quorum cluster across 3 AZs, so losing one AZ leaves 2 of 3 nodes, which is still a quorum. If the failed AZ held the leader, an election runs and writes stall for a few seconds. Cache: Redis Sentinel promotes a replica. LB: managed LBs are multi-AZ by default. DNS TTL 30s. Total impact: a brief spike in latency and errors as connections re-establish, but no sustained outage."
“What if your DB failover takes too long?”
"Leader election (Patroni backed by a Raft store such as etcd) should complete in <10 seconds when tuned. If that's too slow: (1) connection pooler (PgBouncer) queues requests during election rather than failing them; (2) read traffic continues on surviving replicas; (3) write traffic gets a brief 503 with Retry-After header. For true zero-downtime, use a multi-primary setup like CockroachDB, but accept the complexity tax."
“How do you test your HA?”
"Four types of chaos: (1) Instance kill: weekly, automated, validates LB routing. (2) Dependency fault injection: inject latency/errors, validate circuit breakers. (3) AZ blackout: quarterly, block all traffic to one AZ. (4) Full region failover: quarterly in production. Every experiment has a hypothesis, a blast-radius limit, and a rollback plan. We also track MTTR from real incidents and game days."
“How do you handle cascading failures?”
"Three layers of defence: (1) Timeouts on every remote call — a slow dependency cannot hold our threads indefinitely. (2) Circuit breakers — after N failures, fail fast and return fallback/cached response. (3) Bulkheads — isolated thread pools per dependency so one slow path can't consume all resources. And at the edge, load-shedding: if we're overwhelmed, reject low-priority traffic (Tier 3) to protect checkout and auth (Tier 1)."
“What about data consistency during failover?”
"With synchronous replication (RPO = 0), the standby has every committed transaction. Failover promotes it with zero data loss. With async replication, there's a window of transactions that were committed on the primary but not yet replicated (typically < 1 second), and those are lost on failover. For payments, we use synchronous replication and accept the latency cost. For less critical data (analytics, logs), async is fine because we accept brief data loss."
“How do you decide what to degrade?”
"The degradation matrix is a design artifact, not an incident improvisation. For each dependency, we document: (1) what breaks when it's down, (2) what the user sees instead, (3) whether it's automated or requires manual intervention. Priority: protect the revenue path (checkout, login) at all costs. Shed non-critical features (recommendations, analytics) first. Feature flags let us toggle degradation in real-time without deployment."
Code snippets
# Circuit breaker: closed -> open after repeated failures -> half-open probe
import time
from enum import Enum
from threading import Lock


class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"


class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=1):
        self.failure_threshold = failure_threshold  # consecutive failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
        self.half_open_max = half_open_max          # trial calls allowed while half-open
        self.state = State.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0.0
        self.half_open_calls = 0
        self._lock = Lock()

    def call(self, func, *args, fallback=None, **kwargs):
        with self._lock:
            if self.state == State.OPEN:
                if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    return fallback() if fallback else None  # fail fast
            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max:
                    return fallback() if fallback else None
                self.half_open_calls += 1  # count this trial call
        try:
            result = func(*args, **kwargs)
            with self._lock:
                if self.state == State.HALF_OPEN:
                    self.state = State.CLOSED  # recovered
                self.failure_count = 0  # any success resets the consecutive count
            return result
        except Exception:
            with self._lock:
                self.failure_count += 1
                self.last_failure_time = time.monotonic()
                if self.state == State.HALF_OPEN:
                    self.state = State.OPEN  # trial call failed: reopen
                elif self.failure_count >= self.failure_threshold:
                    self.state = State.OPEN
            return fallback() if fallback else None


# Liveness and readiness endpoints
from flask import Flask, jsonify
from psycopg2 import pool  # the pool submodule must be imported explicitly
import redis

app = Flask(__name__)
db_pool = pool.ThreadedConnectionPool(1, 10, dsn="...")
redis_client = redis.Redis(host="cache", port=6379)


@app.route("/healthz")  # liveness: is the process alive?
def liveness():
    return jsonify({"status": "alive"}), 200


@app.route("/readyz")  # readiness: can we serve traffic?
def readiness():
    checks = {}
    try:
        conn = db_pool.getconn()
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT 1")
        finally:
            db_pool.putconn(conn)  # always return the connection, even on error
        checks["db"] = "ok"
    except Exception as e:
        checks["db"] = str(e)
        return jsonify({"status": "not_ready", "checks": checks}), 503
    # NOTE: failing readiness on a cache outage removes every instance at once.
    # If the app can fall back to the DB, consider reporting the cache as
    # degraded instead of returning 503.
    try:
        redis_client.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = str(e)
        return jsonify({"status": "not_ready", "checks": checks}), 503
    return jsonify({"status": "ready", "checks": checks}), 200


# Graceful degradation: map each dependency to a fallback response
from functools import wraps
import logging

logger = logging.getLogger(__name__)

# search_service, recommendation_service, and cache are clients assumed to be
# defined elsewhere in the application.

# Degradation matrix: dependency -> fallback behaviour
DEGRADATION_MATRIX = {
    "search": lambda: {"results": [], "degraded": True, "message": "Search temporarily unavailable"},
    "recommendations": lambda: {"items": get_popular_items(), "degraded": True},
    "analytics": lambda: None,  # silently skip
}


def with_degradation(dependency_name):
    """Decorator: if the wrapped call fails, return degraded response."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                logger.warning("%s failed: %s, degrading gracefully", dependency_name, e)
                fallback = DEGRADATION_MATRIX.get(dependency_name)
                if fallback:
                    return fallback()
                raise  # no fallback defined -> propagate
        return wrapper
    return decorator


@with_degradation("search")
def search_products(query):
    return search_service.query(query)


@with_degradation("recommendations")
def get_recommendations(user_id):
    return recommendation_service.get(user_id)


def get_popular_items():
    # Cached popular items, refreshed hourly
    return cache.get("popular_items", default=[])

# Retry with exponential backoff and jitter
import time
import random
from functools import wraps


def retry_with_backoff(max_retries=3, base_delay=0.5, max_delay=30.0, jitter=True):
    """Retry a function with exponential backoff and optional jitter.

    Jitter prevents thundering herd when many clients retry simultaneously.
    Without jitter, all clients retry at exactly the same intervals,
    creating periodic traffic spikes that can re-overload a recovering service.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == max_retries:
                        raise  # out of retries: surface the last error
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    if jitter:
                        delay = delay * (0.5 + random.random())  # 50-150% of delay
                    time.sleep(delay)
        return wrapper
    return decorator


# payment_api is a client assumed to be defined elsewhere. Retrying a charge is
# only safe when the request carries an idempotency key.
@retry_with_backoff(max_retries=3, base_delay=1.0)
def call_payment_processor(transaction):
    return payment_api.charge(transaction)

# Availability math: serial and parallel composition
def serial_availability(*components: float) -> float:
    """Serial: multiply availabilities. Weakest link dominates."""
    result = 1.0
    for a in components:
        result *= a
    return result


def parallel_availability(*components: float) -> float:
    """Parallel: multiply downtimes, then subtract from 1.

    Assumes the components fail independently of each other.
    """
    downtime = 1.0
    for a in components:
        downtime *= (1.0 - a)
    return 1.0 - downtime


def downtime_per_year(availability: float) -> str:
    """Convert availability to human-readable annual downtime."""
    minutes = (1 - availability) * 365.25 * 24 * 60
    if minutes < 60:
        return f"{minutes:.1f} min/year"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hours/year"
    return f"{hours / 24:.1f} days/year"


# Example: three services in series, each 99.9%
serial = serial_availability(0.999, 0.999, 0.999)
print(f"Serial (3x 99.9%): {serial*100:.4f}% = {downtime_per_year(serial)}")
# -> Serial (3x 99.9%): 99.7003% = 1.1 days/year (about 26 hours)

# Two instances in parallel, each 99.9%
parallel = parallel_availability(0.999, 0.999)
print(f"Parallel (2x 99.9%): {parallel*100:.6f}% = {downtime_per_year(parallel)}")
# -> Parallel (2x 99.9%): 99.999900% = 0.5 min/year

# Real system: serial chain of parallel groups
app_tier = parallel_availability(0.999, 0.999, 0.999)  # 3 app nodes
db_tier = parallel_availability(0.999, 0.999)          # primary + replica
system = serial_availability(app_tier, db_tier)
print(f"System: {system*100:.6f}% = {downtime_per_year(system)}")
Drills
Your system has three 99.9% dependencies called in series. What is the best-case combined SLO?
99.9%^3 = 99.7% — about 26 hours of downtime per year. You cannot be more available than your dependency chain. Fixes: parallelise calls where possible, add caching to short-circuit dependencies, use async processing for non-critical calls, or accept the math and set your SLO below the chain.
An interviewer says "prove your failover works." What do you show?
Three things: (1) Game-day logs from the last quarter — timestamps, failure modes injected, MTTR measured. (2) Chaos engineering coverage: which failure modes have been injected (instance kill, AZ blackout, dependency failure) and what gaps were found. (3) Real incident MTTR metrics. If none of these exist, the honest answer is "we don't know — and that's the first thing I'd change."
What is the difference between liveness and readiness health checks?
Liveness: "is the process alive?" Simple HTTP 200 from /healthz. Failure = restart the container. Keep it simple — never check external dependencies. Readiness: "can this instance serve traffic?" Checks DB connectivity, cache warmth, deployment health. Failure = remove from LB pool but don't restart. Critical for zero-downtime deployments.
Your cache cluster fails over. What happens to the database?
Thundering herd: all cached reads suddenly hit the DB. Mitigation: (1) request coalescing (singleflight) — deduplicate concurrent requests for the same key. (2) Pre-warm the new cache primary with hot keys before promoting. (3) Circuit breaker on the DB path — if DB is overwhelmed, return degraded response. (4) Gradually shift traffic to the new primary rather than instant cutover.
How does cell architecture help with deployments?
Deploy to one cell (canary) first. Monitor error rate, latency, and business metrics for 15-30 minutes. If clean, roll to the next batch. If bad, roll back only the canary — 90-95% of users never saw the bad code. This is fundamentally safer than deploying to the entire fleet. It also limits the blast radius of bad config changes, data migrations, and feature flag rollouts.
Your SLO is 99.99% but your payment processor only guarantees 99.9%. How do you reconcile?
You cannot exceed the SLA of a serial dependency. Options: (1) Add redundancy — use two payment processors and failover between them (parallel composition). (2) Queue and retry — if the processor is down, queue the transaction and retry when it recovers (async decoupling). (3) Negotiate — pay for a higher-tier SLA from the processor. (4) Accept — set your checkout-flow SLO at 99.9% and document that the payment processor is the bottleneck.