Pattern

High Availability

When to reach for this

Reach for this when…

Availability target >= 99.95% (about 4.4 hours of downtime per year or less)
Revenue or safety impact from any downtime
Multiple failure modes to defend against simultaneously
Business expects the system to survive a full AZ outage
Payment, auth, or healthcare systems where downtime = liability

Not really this pattern when…

Internal tools where 99% uptime is acceptable
Batch workloads with retry-all-night semantics
Early-stage product where simplicity > availability
Stateless preview environments torn down every night

Good vs bad answer

Interviewer probe

“Your SLO is 99.99%. How do you achieve that?”

Weak answer

"We have redundant servers and a load balancer. We also back up the database."

Strong answer

"99.99% is 52 minutes of downtime per year, so our MTTR must be under 5 minutes — which means automated detection and failover, not human SSH.

Topology: multi-AZ within a region. Dual load balancers (managed, multi-AZ). N+2 stateless app instances across 3 AZs. Quorum-replicated PostgreSQL cluster (Patroni) with synchronous replication within region — leader election completes in <10 seconds.

Every dependency has three things: a timeout (tuned to its latency budget), a circuit breaker (50% failure rate over 30s opens for 60s), and a documented degradation path. Search down shows 'search unavailable.' Cache down falls back to DB with higher latency. Analytics down buffers to local disk.

Blast-radius containment: bulkhead thread pools per dependency. A slow recommendation engine cannot consume threads needed for checkout.

Operational discipline: weekly game days exercise one failure mode (instance kill, AZ blackout, dependency latency injection). Quarterly full-failover drill in production. Error budget tracked monthly — 4.38 minutes/month. When budget is burned, feature freeze until reliability improves.

Control plane is itself multi-AZ: etcd cluster for service discovery, Vault for secrets, both 3-node Raft across AZs."

Why it wins: Names specific techniques at every layer, calls out control-plane and MTTR as the real limiters, treats HA as ongoing discipline (game days, error budgets) not one-time setup.

Cheat sheet

•Pick a target number. Each nine costs ~10x.
•Serial dependencies multiply availabilities, so their downtime adds up. Parallel redundancy multiplies failure probabilities, so uptime climbs.
•Redundancy at every layer: LB, app, data, cache, control plane.
•Multi-AZ for 99.95%. Multi-region for 99.99%+.
•Every dependency: timeout + circuit breaker + degradation path.
•Liveness = process alive (simple). Readiness = can serve traffic (check deps).
•Bulkheads isolate failure. Circuit breakers prevent cascading.
•Cell architecture: independent cells, blast radius = one cell.
•Untested failover = no failover. Drill quarterly.
•Error budgets turn availability into a negotiable resource.
•Graceful degradation: shed features, keep the core.
•Chaos engineering: break it on purpose before it breaks by accident.

Core concept

What high availability really means

High availability (HA) is not a feature you toggle on — it is the disciplined composition of redundancy, failure detection, graceful degradation, and blast-radius containment. A system that merely "has two servers" is not highly available; it is slightly less fragile.

Architecture diagram· Redundancy at every tier — no single point of failure

Two LBs front N app servers spread across AZs. Primary DB with synchronous replicas and automatic failover. Health checks at every hop.

Availability math — the nines

Availability is expressed as a percentage of uptime per year:

Target	Downtime / year	Downtime / month	Practical meaning
99% ("two nines")	3.65 days	7.3 hours	Internal tools
99.9% ("three nines")	8.76 hours	43.8 min	Standard SaaS
99.95%	4.38 hours	21.9 min	Business-critical
99.99% ("four nines")	52.6 min	4.38 min	Payments, auth
99.999% ("five nines")	5.26 min	26.3 sec	Telecom, 911

Each additional nine costs roughly 10x more in infrastructure and operational discipline. The gap between 99.9% and 99.99% is not one server — it is automation, multi-AZ, < 5-minute MTTR, and tested failover.
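
The table above can be reproduced with a few lines of arithmetic. The sketch below assumes a 365-day year and a month of one twelfth of a year; the function names `downtime_budget` and `max_incidents` are illustrative and do not come from any library.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years are ignored


def downtime_budget(availability_pct: float) -> dict:
    """Return the allowed downtime in minutes per year and per month."""
    unavailable = 1.0 - availability_pct / 100.0
    per_year = unavailable * MINUTES_PER_YEAR
    return {"per_year_min": per_year, "per_month_min": per_year / 12}


def max_incidents(availability_pct: float, mttr_min: float) -> float:
    """How many incidents of a given MTTR fit into one year's budget."""
    return downtime_budget(availability_pct)["per_year_min"] / mttr_min


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.95, 99.99, 99.999):
        budget = downtime_budget(target)
        print(f"{target}%: {budget['per_year_min']:.1f} min/year, "
              f"{budget['per_month_min']:.2f} min/month")
```

Under these assumptions, a 99.99% target with a 5-minute MTTR leaves room for roughly ten incidents a year, which is why detection and failover have to be automated at that level.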

Serial composition: when a request needs every dependency in a chain, their availabilities multiply, so their failure probabilities roughly add. Three 99.9% services in series cap you at 0.999^3 = 99.7%. Every sync dependency adds its downtime to yours. To improve: take dependencies off the critical path, add caching short-circuits, make non-critical calls optional or asynchronous, or accept the math and set your SLO below the chain.

Parallel composition: redundant instances in parallel multiply their failure probabilities. Two 99.9% nodes in parallel yield 1 - (0.001 x 0.001) = 99.9999%, provided their failures are independent. This is the engine behind HA: put two of everything and the combined availability soars, as long as the copies do not share a failure domain.
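
Both composition rules fit in two small functions. This is a minimal sketch that assumes independent failures, which correlated outages (a bad deploy, a shared AZ) violate; `serial` and `parallel` are illustrative names.

```python
from math import prod


def serial(availabilities):
    """Every component is required, so availabilities multiply."""
    return prod(availabilities)


def parallel(availabilities):
    """Any one component suffices, so failure probabilities multiply.

    Assumes failures are independent of each other.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


if __name__ == "__main__":
    print(f"three 99.9% services in series: {serial([0.999] * 3):.4%}")
    print(f"two 99.9% nodes in parallel:    {parallel([0.999, 0.999]):.4%}")
    # A redundant app tier in front of a single database:
    # the non-redundant tier dominates the result.
    system = serial([parallel([0.995] * 3), 0.995])
    print(f"3 redundant apps + 1 DB:        {system:.4%}")
```

The last case shows why redundancy has to reach every tier: the redundant app tier is almost perfect, yet the system cannot exceed the availability of the single database behind it.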

Redundancy at every layer

HA requires eliminating every single point of failure (SPOF):

Load balancers: dual LBs with health-checked failover (a managed LB such as AWS NLB spans every AZ you enable for it; on-prem use VRRP/keepalived pairs).
App tier: N+2 instances across AZs. Stateless so any instance can serve any request. Session data in Redis or a cookie, never pinned to a process.
Data tier: quorum-replicated primaries (3-node Raft/Paxos clusters), synchronous or semi-synchronous replicas across AZs. Automated leader election.
Cache tier: Redis Sentinel or Cluster with automatic failover. Cache loss is a performance event, not a correctness event — the system must survive a cold cache.
Control plane: DNS, config store (Consul, etcd), service discovery — these must be multi-AZ too. A single-AZ control plane is a hidden SPOF that kills the entire fleet.

Failure detection — health checks

Your system can only fail over as fast as it can detect failure. Two kinds of probes:

Liveness: "is the process alive?" Checks that the HTTP server responds to /healthz. Failure = restart the container.
Readiness: "can this instance serve traffic?" Checks downstream dependencies (DB reachable, cache warm, recent deployment healthy). Failure = remove from LB pool but don't kill.

Health-check intervals, thresholds, and timeouts must be tuned: too aggressive = flapping; too slow = long detection time = long outage.

Graceful degradation

When a non-critical dependency fails, the system should continue with reduced functionality rather than failing entirely. Build a degradation matrix mapping every dependency to what the user sees when it is down:

Search down → show "search temporarily unavailable," keep the rest of the site running.
Recommendation engine down → show popular/trending instead of personalised.
Analytics pipeline down → buffer events to local disk, backfill when recovered.
Payment processor slow → fail the checkout with a clear retry message; never hang.

This matrix must be designed in advance, not improvised during an incident.

Blast-radius containment

Even with redundancy, a correlated failure (bad deployment, poison message, cascade) can take down everything. Containment strategies:

Cell architecture: partition traffic into independent cells. Each cell has its own app + DB + cache. A failure in Cell 1 affects only its shard; Cells 2-N continue. AWS uses this for Route 53, S3, and DynamoDB.
Bulkheads: isolate thread pools / connection pools per dependency. Service A failing exhausts only its pool; Service B continues on its own pool.
Circuit breakers: after N failures to a dependency in window W, the breaker opens and fails fast for cooldown period C. Half-open state probes recovery. Prevents cascading failures.

Chaos engineering — trust but verify

Untested HA is wishful thinking. Chaos engineering proactively injects failures to validate that redundancy, failover, and degradation work as designed:

Instance kill: terminate a random production instance. Does the LB route around it? How fast?
AZ failure: block all traffic to one AZ. Does the app survive? Does the DB failover?
Dependency failure: inject latency or errors into a downstream call. Does the circuit breaker open? Does graceful degradation activate?
Clock skew / network partition: simulate split-brain. Does the quorum hold?

Netflix Chaos Monkey pioneered this; AWS Fault Injection Service (FIS, formerly Fault Injection Simulator) and Gremlin are the modern tools. The goal is to find weaknesses in staging before customers find them in production.

Canonical examples

→Payment processing (Stripe, Square)
→Authentication / identity services
→Healthcare electronic records
→Emergency dispatch / 911 routing
→Stock exchange order matching

Variants

Single server (no HA)

One server, one DB. Any failure = total outage. Where every system starts.

Architecture diagram· v1 — Single everything (SPOF at every layer)

One server, one database. Any single failure causes total outage. Simplest possible deployment.

The simplest deployment: a single application server talking to a single database. There are no load balancers, no replicas, no health checks — every component is a single point of failure.

When this is fine: prototypes, internal tools with < 100 users, hobby projects, anything where "restart the server" is an acceptable incident response.

Why it breaks: a server reboot, a bad deployment, a DB disk full, a network blip — any single event causes total downtime. There is no automatic recovery. MTTR is "however long it takes a human to SSH in and fix it," which at 3 AM is measured in hours.

Availability math: if the server is 99.5% and the DB is 99.5%, combined availability is 99.5% x 99.5% = 99.0%, which is about 3.6 days of downtime per year. Acceptable for a team wiki; disqualifying for anything customer-facing.

Upgrading from here: the first step is always separating the app from the DB (so you can restart one without the other) and adding a health-checked load balancer in front of at least two app instances. This moves you to v2.

Pros

+Simplest to build and operate
+Lowest cost
+No distributed-systems complexity

Cons

−Any single failure = total outage
−MTTR depends entirely on human response time
−Cannot survive deployments without downtime

Choose this variant when

Prototype or MVP
Internal tool with < 100 users
Budget for infrastructure is near zero

Redundant app tier

LB → N app servers → primary DB + read replica. App tier survives instance loss; DB is still SPOF.

Architecture diagram· v2 — Redundant app tier, single DB

Load balancer fans out to N app servers. DB is still a single point of failure but app tier survives instance loss.

Add a load balancer and multiple app instances. The app tier is now redundant — losing one instance does not cause an outage. The DB has a read replica for read scaling and a warm standby for manual failover.

What changes: the load balancer performs health checks every 10-30 seconds. An unhealthy instance is removed from the pool once it fails a few consecutive checks (the failure threshold), so detection takes a small multiple of the check interval. The remaining instances absorb the traffic. Deployments can use rolling updates with zero downtime.

What remains fragile: the database. The primary is still a SPOF. If it crashes, the replica can be promoted manually, but this takes 5-15 minutes and may lose the last few seconds of writes (async replication lag). The load balancer itself may be a SPOF if using a single instance (use a managed LB or VRRP pair).

Availability math: app tier with 3 instances at 99.5% each in parallel: 1 - (0.005)^3 = 99.999987%. But the DB at 99.5% caps the system at 99.5%. The weakest link dominates.

Upgrading from here: add synchronous replication and automatic DB failover (PostgreSQL Patroni, MySQL Group Replication, or managed RDS Multi-AZ). Add a second load balancer. This moves you to v3.

Pros

+App tier survives instance failures
+Zero-downtime deployments via rolling update
+Read replicas for read scaling

Cons

−DB primary is still SPOF
−LB may be single point of failure
−Manual DB failover takes minutes

Choose this variant when

Standard SaaS with 99.9% target
Team wants zero-downtime deploys
Read-heavy workload benefiting from replicas

Fully redundant

Dual LBs, N+2 apps, quorum DB, cache cluster. No SPOF anywhere.

Architecture diagram· v3 — Redundant everything: LBs, apps, DB cluster, cache cluster

No single point of failure. Dual LBs, N+2 app instances, quorum-replicated DB across AZs, cache cluster with failover.

Every layer has redundancy. Dual load balancers (active-passive or active-active). N+2 app instances across three AZs. Quorum-replicated database cluster with automatic leader election. Redis Sentinel or Cluster for cache failover.

Load balancers: managed cloud LBs (AWS ALB/NLB, GCP GLB) are multi-AZ by default. On-prem, use a VRRP pair with keepalived — one active, one standby, virtual IP floats between them. Health checks at the LB level detect app failures in < 30 seconds.

App tier: N+2 means you can lose two instances simultaneously and still handle peak load. Stateless design: no in-memory sessions. Deploy across at least 3 AZs so that losing an entire AZ still leaves you with > 50% capacity.
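
N+2 sizing and AZ spreading can be checked with simple arithmetic. The sketch below uses hypothetical traffic figures and illustrative function names; it also shows that N+2 covers the loss of two instances but does not by itself guarantee surviving a whole AZ.

```python
import math


def instances_needed(peak_rps: float, per_instance_rps: float, spare: int = 2) -> int:
    """N instances to carry peak load, plus `spare` extra (N+2 by default)."""
    return math.ceil(peak_rps / per_instance_rps) + spare


def spread_across_azs(total: int, az_count: int = 3) -> list:
    """Distribute instances as evenly as possible across AZs."""
    base, extra = divmod(total, az_count)
    return [base + (1 if i < extra else 0) for i in range(az_count)]


def survives_az_loss(peak_rps: float, per_instance_rps: float, layout: list) -> bool:
    """True if peak load is still covered after losing the largest AZ."""
    remaining = sum(layout) - max(layout)
    return remaining * per_instance_rps >= peak_rps


if __name__ == "__main__":
    for peak in (1000, 2000):  # hypothetical peak loads, 250 rps per instance
        layout = spread_across_azs(instances_needed(peak, 250))
        print(peak, layout, survives_az_loss(peak, 250, layout))
```

In the second case the N+2 fleet of ten instances loses four when its largest AZ goes dark, leaving less than peak capacity. If surviving an AZ outage is a requirement, size the fleet against that scenario rather than against N+2 alone.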

Data tier: three-node quorum cluster (Raft-based: etcd, CockroachDB; or Paxos-based: Google Spanner). Synchronous replication within a region; leader election completes in < 10 seconds on failure. RPO = 0 (no data loss within region).

Cache tier: Redis Sentinel watches the primary and promotes a replica on failure. Cache loss causes a thundering herd to the DB — pre-warm caches during failover and use request coalescing to absorb the spike.

Availability: with every component at 99.95% and all layers redundant, the combined system approaches 99.99% — 52 minutes of downtime per year. Achieving this requires not just infrastructure but operational discipline: automated runbooks, sub-5-minute MTTR, and quarterly failover drills.

Pros

+No single point of failure
+Automatic failover at every layer
+99.99% achievable with operational discipline

Cons

−2-3x infrastructure cost
−Operational complexity requires dedicated SRE
−Quorum writes add latency (sync replication)

Choose this variant when

Payment / auth / healthcare systems
99.99% SLA requirement
Team has SRE capacity for operational overhead

Cell architecture

Traffic partitioned into independent, self-contained cells. Blast radius limited to one cell.

Architecture diagram· v4 — Cell architecture: independent cells behind a cell router

Traffic is partitioned by shard key into independent cells. Each cell contains its own app tier + DB + cache. A cell failure affects only its shard.

Cell architecture divides the system into independent, self-contained units called cells. Each cell has its own app tier, database, cache, and message queue. A cell router at the edge hashes the request (typically by user ID or tenant ID) and routes to the correct cell.

Why cells? In a monolithic HA deployment, a correlated failure (bad config push, poison message, cascading retry storm) can take down the entire system. Cells contain the blast radius: a failure in Cell 1 affects only its shard of users. Cells 2-N continue serving their users as if nothing happened.

Cell sizing: each cell serves 5-10% of total traffic. This means a cell failure affects at most 10% of users. Cells are sized identically for operational simplicity — same instance types, same DB schema, same deploy pipeline. You scale by adding cells, not by making cells bigger.

Cell router: the router is the one shared component and must be extremely simple and highly available itself. It does one thing: hash the shard key and route. No business logic. Typically implemented as a thin L7 proxy with a hash ring. The router itself is redundant (multi-AZ, health-checked).

Deployments: cells enable safe deployments. Deploy to one cell first (canary). Monitor error rates. If clean, roll to the next cell. A bad deployment affects only the canary cell. This is fundamentally safer than deploying to the entire fleet simultaneously.

Who uses this: AWS uses cell architecture for Route 53, S3, and DynamoDB. Azure uses it for Storage. These are the most available services on the internet — not by accident, but by design.

Pros

+Blast radius limited to one cell (5-10% of users)
+Safe canary deployments per cell
+Scales horizontally by adding cells

Cons

−Cross-cell queries are expensive or impossible
−Cell router is a shared dependency requiring extreme HA
−Operational overhead of managing N independent stacks

Choose this variant when

Five-nines requirement (99.999%)
Large-scale multi-tenant SaaS
Need to limit blast radius of bad deployments

Scaling path

Step 1 — Single everything

Ship fast with minimal infrastructure

Architecture diagram· v1 — Single everything (SPOF at every layer)

One server, one database. Any single failure causes total outage. Simplest possible deployment.

One server, one database. Every component is a SPOF. Acceptable for prototypes and internal tools. Availability target: ~99% (3.65 days downtime/year). Bottleneck: any single failure causes total outage.

What triggers the next iteration

Server crash = total outage
DB failure = total outage
Deployments require downtime

Step 2 — Redundant app tier

Survive app instance failures and deploy without downtime

Architecture diagram· v2 — Redundant app tier, single DB

Load balancer fans out to N app servers. DB is still a single point of failure but app tier survives instance loss.

Add LB + multiple app instances + DB read replica. App tier is now redundant; DB primary is still SPOF. Availability target: 99.9%. Rolling deployments eliminate downtime. Health checks detect and remove failing instances.

What triggers the next iteration

DB primary is SPOF
LB may be single instance
Manual DB failover takes minutes

Step 3 — Redundant everything

Eliminate all single points of failure

Architecture diagram· v3 — Redundant everything: LBs, apps, DB cluster, cache cluster

No single point of failure. Dual LBs, N+2 app instances, quorum-replicated DB across AZs, cache cluster with failover.

Dual LBs, N+2 app instances across 3 AZs, quorum-replicated DB, cache cluster. No SPOF. Automatic failover at every layer. Availability target: 99.99% (52 min/year). Requires operational discipline: automated runbooks, chaos testing, quarterly failover drills.

What triggers the next iteration

Correlated failures (bad deploy, poison message) still take everything down
2-3x infrastructure cost
Operational complexity

Step 4 — Cell architecture

Contain blast radius to a fraction of users

Architecture diagram· v4 — Cell architecture: independent cells behind a cell router

Traffic is partitioned by shard key into independent cells. Each cell contains its own app tier + DB + cache. A cell failure affects only its shard.

Partition into independent cells. Each cell is self-contained. Cell router hashes shard key. Blast radius limited to one cell (5-10% of users). Canary deployments per cell. Availability target: 99.999%. Used by AWS (S3, Route 53, DynamoDB) and Azure (Storage).

What triggers the next iteration

Cross-cell queries require federation layer
Cell router must be extremely simple and HA
N independent stacks to operate

Deep dives

Availability math — nines, serial, parallel

Architecture diagram· Availability math — serial vs parallel composition

Serial: 99.9% x 99.9% = 99.8%. Parallel: 1 - (0.001 x 0.001) = 99.9999%. Serial multiplies availabilities, so downtime adds up; parallel multiplies failure probabilities, so uptime climbs.

Availability math is the foundation of every HA conversation. Two rules govern everything:

Rule 1: serial composition multiplies availabilities. If Service A is 99.9% and Service B is 99.9%, and a request must go through both, the combined availability is 99.9% x 99.9% = 99.8%. Three services in series: 0.999^3 = 99.7%. Put differently, the failure probabilities roughly add (0.1% + 0.1% + 0.1% = 0.3%). The takeaway: every synchronous dependency you add to the request path drags down your overall availability. This is why microservice architectures with deep call chains struggle with availability: each hop adds its own failure probability.

Rule 2: parallel composition multiplies failure probabilities. Two instances at 99.9% each in active-active: 1 - (1 - 0.999) x (1 - 0.999) = 1 - 0.001 x 0.001 = 99.9999%. This is the engine behind redundancy: put two of everything and the combined availability soars. Three instances: 1 - (0.001)^3 = 99.9999999%. These figures assume the instances fail independently; a shared failure domain (same AZ, same bad deploy) erodes the gain.

SLA composition across dependencies. A real system calls multiple dependencies: DB, cache, message queue, third-party APIs. The overall SLA cannot exceed the product of all serial dependencies' SLAs. If your DB is 99.99% but your payment processor is 99.9%, your checkout flow is at most 99.9% x 99.99% = 99.89%.

Implications for design:

Minimise the number of synchronous dependencies in the critical path.
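Where possible, make calls parallel (fan-out) rather than serial (chain). This cuts latency; it improves availability only when the request can succeed without every branch responding.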
Use caching to short-circuit dependencies: a cache hit avoids calling the dependency entirely, removing its failure rate from the calculation.
For non-critical features, use async processing: if analytics fails, the user doesn't notice.
Set your SLO honestly: if your dependency chain yields 99.7%, don't promise 99.9%. Either improve the chain or adjust the promise.

Error budgets (Google SRE). If your SLO is 99.95%, you have a budget of 0.05% — about 22 minutes/month. When you're within budget, ship features fast. When you're burning through it, freeze deployments and focus on reliability. This turns availability from a vague aspiration into a measurable resource that product and engineering negotiate over.
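
An error budget is easy to track in code. The sketch below assumes a rolling 30-day window (which gives 21.6 minutes for a 99.95% SLO, slightly under the calendar-month figure) and a simple ship-or-freeze rule; the function names are illustrative.

```python
def error_budget_minutes(slo_pct: float, window_days: float = 30.0) -> float:
    """Total allowed downtime in the window, in minutes."""
    return (1.0 - slo_pct / 100.0) * window_days * 24 * 60


def budget_remaining(slo_pct: float, downtime_min: float, window_days: float = 30.0) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo_pct, window_days)
    return (budget - downtime_min) / budget


def release_policy(slo_pct: float, downtime_min: float, window_days: float = 30.0) -> str:
    """Ship while budget remains, freeze once it is exhausted."""
    return "ship" if budget_remaining(slo_pct, downtime_min, window_days) > 0 else "freeze"


if __name__ == "__main__":
    print(f"budget: {error_budget_minutes(99.95):.1f} min per 30 days")
    print(release_policy(99.95, downtime_min=10))
    print(release_policy(99.95, downtime_min=25))
```

Real policies are usually more graded than this (for example, slowing releases as the burn rate rises), but the core negotiation is the same: downtime already spent is subtracted from a fixed allowance.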

Health checks — liveness vs readiness

Architecture diagram· Health check flow — liveness vs readiness probes

LB sends periodic health probes. Healthy servers receive traffic. Unhealthy servers are removed from the pool until they recover.

Health checks are the nervous system of high availability. Without them, your load balancer sends traffic to dead instances, your orchestrator doesn't restart crashed containers, and your failover never triggers.

Liveness probes answer: "is this process alive?" A simple HTTP 200 from /healthz. If the liveness check fails, the orchestrator kills and restarts the container. Use this to detect deadlocked processes, OOM-killed workers, and stuck event loops. Keep liveness checks simple — they should not call external dependencies. A liveness check that queries the database will kill healthy instances when the database is slow.

Readiness probes answer: "can this instance serve traffic?" A readiness check verifies that the instance has completed startup, its caches are warm, its connection pools are established, and its downstream dependencies are reachable. If readiness fails, the load balancer removes the instance from the pool but does NOT restart it. This is critical for graceful deploys: a new instance that's still warming up should not receive traffic until ready.

Tuning parameters:

Interval: how often the probe fires. 10s is a good default. Too frequent = noise; too infrequent = slow detection.
Timeout: how long to wait for a response. 3-5s. A timeout is a failure.
Failure threshold: how many consecutive failures before action. 3 is typical. This prevents flapping on transient network blips.
Success threshold: how many consecutive successes before marking healthy. 1 for liveness, 2-3 for readiness.
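
The thresholds above amount to a small state machine. This is a minimal sketch of how a prober might apply them, not the logic of any particular load balancer or orchestrator; `ProbeTracker` and `worst_case_detection_s` are illustrative names.

```python
from dataclasses import dataclass


@dataclass
class ProbeTracker:
    failure_threshold: int = 3   # consecutive failures before marking unhealthy
    success_threshold: int = 2   # consecutive successes before marking healthy
    healthy: bool = True
    _streak: int = 0             # consecutive results that disagree with `healthy`

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result and return the current health state."""
        if probe_ok == self.healthy:
            self._streak = 0     # result agrees with the state: reset the streak
            return self.healthy
        self._streak += 1
        needed = self.success_threshold if probe_ok else self.failure_threshold
        if self._streak >= needed:
            self.healthy = probe_ok
            self._streak = 0
        return self.healthy


def worst_case_detection_s(interval_s: float, timeout_s: float, failure_threshold: int) -> float:
    """Rough upper bound: the instance dies right after a passing probe."""
    return interval_s * failure_threshold + timeout_s


if __name__ == "__main__":
    tracker = ProbeTracker()
    for result in (False, True, False, False, False):
        print(result, "->", tracker.record(result))
    print(worst_case_detection_s(10, 3, 3), "seconds")
```

A single failed probe followed by a success does not flip the state, which is the anti-flapping behaviour the failure threshold exists for. The detection estimate shows the cost: with a 10s interval, 3s timeout, and threshold of 3, a dead instance can keep receiving traffic for about half a minute.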

Deep health checks go beyond HTTP 200: they verify the DB connection is alive, the cache is reachable, and the last heartbeat from a critical dependency was recent. Use these for readiness probes only — never for liveness. A deep health check that fails because Redis is slow should remove the instance from the LB pool, not restart it.

Health check cascading: if Service A's readiness check calls Service B's readiness check, and Service B checks Service C, a failure at C cascades up and takes all services out of the LB pool. Design readiness checks to check only direct dependencies, not transitive ones.

Circuit breakers — fail fast, recover gracefully

Architecture diagram· Circuit breaker states: closed → open → half-open

Calls flow through in closed state. After N failures, breaker opens and fails fast. After cooldown, half-open lets one probe through to test recovery.

A circuit breaker prevents a failing dependency from dragging down the entire system. Without one, a slow or failing downstream causes threads/connections to pile up in the caller, eventually exhausting resources and cascading the failure upstream.

Three states:

1. Closed (normal): all requests pass through to the dependency. The breaker monitors failure rate. If the failure rate exceeds a threshold (e.g., 50% failures over a 30-second window), the breaker trips to Open.

2. Open (failing fast): all requests immediately fail (or return a fallback response) without calling the dependency. This gives the downstream time to recover. After a cooldown period (e.g., 60 seconds), the breaker transitions to Half-Open.

3. Half-Open (probing): the breaker allows one (or a few) probe requests through. If they succeed, the breaker resets to Closed. If they fail, it goes back to Open for another cooldown period.
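
The three states can be sketched as a small class. This is a simplified, single-threaded illustration rather than a substitute for a production library: the `min_calls` guard and the injectable `clock` are assumptions added here so that the breaker does not trip on a tiny sample and can be tested without sleeping.

```python
import time
from collections import deque


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is not attempted."""


class CircuitBreaker:
    def __init__(self, failure_rate=0.5, window_s=30.0, cooldown_s=60.0,
                 min_calls=10, clock=time.monotonic):
        self.failure_rate = failure_rate
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls   # assumption: ignore very small samples
        self.clock = clock
        self.state = "closed"
        self.opened_at = 0.0
        self.results = deque()       # (timestamp, succeeded) in the rolling window

    def call(self, fn, *args, **kwargs):
        now = self.clock()
        if self.state == "open":
            if now - self.opened_at < self.cooldown_s:
                raise CircuitOpenError("failing fast")
            self.state = "half_open"  # cooldown elapsed: let one probe through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure(now)
            raise
        self._on_success(now)
        return result

    def _record(self, now, ok):
        self.results.append((now, ok))
        while now - self.results[0][0] > self.window_s:
            self.results.popleft()    # drop results older than the window

    def _on_success(self, now):
        if self.state == "half_open":
            self.state = "closed"     # probe succeeded: resume normal traffic
            self.results.clear()
        self._record(now, True)

    def _on_failure(self, now):
        if self.state == "half_open":
            self._trip(now)           # probe failed: back to open
            return
        self._record(now, False)
        failures = sum(1 for _, ok in self.results if not ok)
        if (len(self.results) >= self.min_calls
                and failures / len(self.results) >= self.failure_rate):
            self._trip(now)

    def _trip(self, now):
        self.state = "open"
        self.opened_at = now
        self.results.clear()
```

A caller would wrap each dependency call as `breaker.call(fetch_recommendations, user_id)` (a hypothetical function) and catch `CircuitOpenError` to serve a fallback. A concurrent version would need a lock around the state transitions and a counter to cap half-open probes.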

Tuning the breaker:

Failure rate threshold: 50% over 30 seconds is a reasonable starting point. Too low = false trips on normal error rates; too high = slow to detect real failures.
Cooldown period: 30-60 seconds. Long enough for the downstream to recover; short enough that you're not failing fast for minutes after the issue resolves.
Half-open probe count: 1-3 requests. Enough to get a signal; not so many that you flood a recovering dependency.
Sliding window: use a rolling window (not fixed) to avoid edge-boundary artifacts.

Fallback strategies when the breaker is open:

Return cached data (stale but available).
Return a degraded response ("recommendations unavailable, showing popular items").
Queue the request for retry when the breaker closes.
Return a clear error with retry guidance.

Implementation: libraries like resilience4j (Java), Polly (.NET), and opossum (Node.js) provide battle-tested circuit breakers. In service meshes like Istio, circuit breaking is configured at the infrastructure level via DestinationRules.

Per-dependency breakers: each downstream dependency gets its own circuit breaker with independently tuned thresholds. A slow payment processor should not trip the breaker for the recommendation engine.

Graceful degradation — shed features, keep the core

Architecture diagram· Bulkhead isolation — Service A failure cannot cascade to Service B

Thread pools / connection pools are isolated per dependency. Service A failing exhausts only its own pool; Service B continues serving on its dedicated pool.

Graceful degradation is the principle that when a non-critical component fails, the system continues operating with reduced functionality rather than failing entirely. It requires advance planning — you must decide what to shed before the incident happens.

The degradation matrix: for every dependency, document what the user experience should be when it's unavailable:

Dependency	Normal behaviour	Degraded behaviour
Search service	Full-text search	Show "search unavailable" banner
Recommendation engine	Personalised feed	Show trending/popular
Analytics pipeline	Real-time tracking	Buffer to local disk, backfill
Image processing	Generate thumbnails	Serve original image (larger)
Payment processor	Instant checkout	"Try again in 1 minute" message
Notification service	Push + email	Queue for later delivery
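
One way to keep the matrix from becoming shelfware is to encode it as data that the call path consults. The sketch below is illustrative: the dictionary keys, the shape of the fallback values, and `call_with_fallback` are all hypothetical.

```python
# Hypothetical degradation matrix: dependency name -> degraded response.
DEGRADATION_MATRIX = {
    "search": {"banner": "search unavailable", "results": []},
    "recommendations": {"source": "trending", "items": []},
    "image_processing": {"variant": "original"},
}


def call_with_fallback(dependency, fn, matrix=DEGRADATION_MATRIX):
    """Run fn(); on failure return the documented degraded response.

    Returns (value, degraded). Dependencies without a matrix entry are
    treated as critical, so their errors propagate to the caller.
    """
    try:
        return fn(), False
    except Exception:
        if dependency not in matrix:
            raise
        return matrix[dependency], True


if __name__ == "__main__":
    def search_backend():
        raise TimeoutError("search timed out")  # simulated outage

    value, degraded = call_with_fallback("search", search_backend)
    print(degraded, value["banner"])
```

Treating a missing entry as an error is a deliberate choice in this sketch: it forces a decision about every new dependency instead of silently degrading something that should fail loudly, such as payments.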

Load shedding under pressure: when the system is under extreme load and cannot serve all requests at full quality, shed non-critical work:

1. Priority tiers: classify requests into tiers. Tier 1 (checkout, login) always served. Tier 2 (search, browse) served when capacity allows. Tier 3 (analytics, recommendations) shed first.
2. Feature flags: use feature flags to disable expensive features in real-time. If "disable personalised recommendations" cuts CPU load by, say, 30%, flipping it lets the core survive.
3. Response quality: serve lower-quality responses: smaller images, fewer results, no auto-complete. Each reduction frees capacity for more requests.
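
Priority tiers reduce to an admission check at the edge. In this sketch the route-to-tier mapping and the utilisation thresholds are made-up values for illustration; real thresholds would come from load testing.

```python
# Hypothetical mapping of routes to priority tiers.
TIER_BY_ROUTE = {
    "/checkout": 1, "/login": 1,
    "/search": 2, "/browse": 2,
    "/recommendations": 3, "/analytics": 3,
}

# Hypothetical utilisation (0.0-1.0) at which each tier starts being shed.
SHED_AT = {1: float("inf"), 2: 0.90, 3: 0.75}


def admit(route: str, utilisation: float) -> bool:
    """Decide whether to serve a request at the current utilisation."""
    tier = TIER_BY_ROUTE.get(route, 3)  # assumption: unknown routes are lowest priority
    return utilisation < SHED_AT[tier]


if __name__ == "__main__":
    for route in ("/checkout", "/search", "/analytics"):
        print(route, admit(route, utilisation=0.8))
```

At 80% utilisation this sheds tier 3 while tiers 1 and 2 are still served; tier 1 is never shed by this check.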

Implementation patterns:

Fallback decorators: wrap each dependency call with a fallback that returns cached/default data when the call fails or times out.
Bulkheads: isolate thread pools per dependency. The recommendation service getting slow should not consume threads needed for checkout.
Feature flags: runtime toggles that disable non-critical features without a deployment.
Read-only mode: if the write path fails, switch to read-only. Users can browse but not purchase. Better than nothing.
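
A bulkhead can be as small as a semaphore per dependency. The following sketch rejects work immediately when a dependency's slots are full instead of queueing; the class names and slot counts are illustrative.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when all slots for a dependency are in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn, *args, **kwargs):
        # Non-blocking acquire: a saturated dependency rejects new work
        # instead of tying up more caller threads.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


if __name__ == "__main__":
    recommendations = Bulkhead("recommendations", max_concurrent=1)
    checkout = Bulkhead("checkout", max_concurrent=1)

    def while_recommendations_is_busy():
        # The only recommendations slot is held by this call, yet
        # checkout still has its own capacity.
        return checkout.run(lambda: "order placed")

    print(recommendations.run(while_recommendations_is_busy))
```

Rejecting rather than waiting is a design choice here: a blocked caller still holds a thread, which is the resource the bulkhead is meant to protect.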

The key interview signal: candidates who proactively build the degradation matrix, rather than waiting to be asked "what if X fails," demonstrate senior-level thinking. The matrix should be a design artifact, not an afterthought.

Cell architecture — blast-radius containment at scale

Architecture diagram· v4 — Cell architecture: independent cells behind a cell router

Traffic is partitioned by shard key into independent cells. Each cell contains its own app tier + DB + cache. A cell failure affects only its shard.

Cell architecture is the ultimate blast-radius containment pattern. Instead of running one large system where a correlated failure takes down everyone, you partition into N independent cells, each serving a fraction of traffic.

How it works:

1. Cell router at the edge receives every request. It extracts a shard key (user ID, tenant ID, or account ID) and hashes it to determine which cell should handle the request.
2. Each cell is independent: its own app tier, its own database, its own cache, its own message queue. No shared state between cells except the routing table.
3. Cell failure is contained: if Cell 3 has a bad deployment, a DB failure, or a cascading retry storm, only Cell 3's users are affected. Cells 1, 2, and 4-N continue serving normally.
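
The routing step can be sketched as a consistent-hash ring. This is a minimal illustration, assuming string shard keys and SHA-256 for placement; `CellRouter`, the cell names, and the `vnodes` count are hypothetical, and a real router would also need health awareness and a plan for migrating data when cells change.

```python
import bisect
import hashlib


def _point(value: str) -> int:
    """Map a string to a position on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")


class CellRouter:
    def __init__(self, cells, vnodes: int = 100):
        # Each cell gets `vnodes` positions on the ring to even out the load.
        self._ring = sorted(
            (_point(f"{cell}#{i}"), cell) for cell in cells for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def route(self, shard_key: str) -> str:
        """Return the cell that owns the next ring position after the key."""
        idx = bisect.bisect_right(self._points, _point(shard_key)) % len(self._ring)
        return self._ring[idx][1]


if __name__ == "__main__":
    router = CellRouter([f"cell-{i}" for i in range(1, 5)])
    for user in ("user-17", "user-42", "tenant-acme"):
        print(user, "->", router.route(user))
```

The ring matters when the set of cells changes: unlike `hash(key) % N`, adding or removing a cell only remaps the keys that belonged to that cell's ring positions, so most users stay where their data lives.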

Cell sizing and operations:

Size each cell identically. Same instance types, same DB schema, same configuration. This simplifies operations enormously — you operate one cell template, not N snowflakes.
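Each cell serves 5-10% of traffic. This means a cell failure affects at most 10% of users. Cells shrink the impact of an outage rather than making any single cell more reliable: a user's availability is still that of their own cell, but a one-hour outage in a cell carrying 10% of traffic costs the system roughly six user-weighted minutes instead of sixty.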
Scale by adding cells, not by making cells bigger. This avoids the "blast radius grows with scale" problem.

Deployment safety:

Deploy to one cell first (canary cell). Monitor error rates, latency, and business metrics for 15-30 minutes. If clean, roll to the next batch of cells. If not, roll back only the canary cell; with cells sized at 5-10% of traffic, 90-95% of users never saw the bad code.
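
The cell-by-cell rollout reduces to a loop that stops at the first unhealthy cell. In this sketch `deploy`, `is_healthy`, and `rollback` are hypothetical callbacks supplied by the deployment tooling; the 15-30 minute soak is assumed to happen inside `is_healthy`.

```python
def rolling_deploy(cells, deploy, is_healthy, rollback):
    """Deploy one cell at a time; roll back only the cell that fails.

    The first cell in `cells` acts as the canary.
    """
    done = []
    for cell in cells:
        deploy(cell)
        if not is_healthy(cell):   # assumed to include the monitoring soak
            rollback(cell)
            return {"deployed": done, "rolled_back": cell}
        done.append(cell)
    return {"deployed": done, "rolled_back": None}


if __name__ == "__main__":
    result = rolling_deploy(
        ["cell-1", "cell-2", "cell-3"],
        deploy=lambda cell: print("deploying", cell),
        is_healthy=lambda cell: cell != "cell-1",   # simulate a bad canary
        rollback=lambda cell: print("rolling back", cell),
    )
    print(result)
```

Because the loop returns as soon as the canary fails, the remaining cells are never touched. Batching later cells (rather than going strictly one by one) is a straightforward extension.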

Cross-cell concerns:

The hard part of cell architecture is cross-cell queries. If a user in Cell 1 wants to see data about a user in Cell 3, the request must be routed cross-cell. Strategies:

Avoid cross-cell reads by routing all data for a user/tenant to the same cell.
Federation layer for the rare cross-cell queries (admin dashboards, global search).
Async replication of aggregated data to a global read store for analytics.

Real-world usage: AWS Route 53 uses cells. Each cell handles DNS queries for a subset of hosted zones. A cell failure affects only those zones. This containment is part of how Route 53 can offer a 100% availability SLA for DNS queries, a commitment very few services make.

Chaos engineering — proactive failure injection

Architecture diagram· Health check flow — liveness vs readiness probes

LB sends periodic health probes. Healthy servers receive traffic. Unhealthy servers are removed from the pool until they recover.

Chaos engineering is the discipline of experimenting on a production system to build confidence in its ability to withstand turbulent conditions. Put simply: break things on purpose, before they break by accident.

The chaos engineering cycle:

1. Hypothesise: "If we kill a random app instance, the LB should route around it within 30 seconds and error rate should not spike above 0.1%."
2. Inject failure: kill the instance (or inject latency, corrupt a response, block a network path).
3. Observe: measure error rate, latency, and user impact.
4. Learn: did the system behave as hypothesised? If not, fix the gap and re-test.
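The cycle above can be sketched as a small harness. This assumes the fault injection and the metric query are provided as callables; `ChaosExperiment` and `run_experiment` are hypothetical names, not the API of Chaos Monkey, FIS, or Gremlin.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChaosExperiment:
    hypothesis: str                          # what we expect to stay true
    inject: Callable[[], None]               # start the failure
    revert: Callable[[], None]               # stop the failure (always called)
    measure_error_rate: Callable[[], float]  # observed error rate, 0.0 to 1.0
    max_error_rate: float                    # threshold stated in the hypothesis


def run_experiment(exp: ChaosExperiment) -> dict:
    """Run one hypothesise -> inject -> observe -> learn cycle."""
    baseline = exp.measure_error_rate()
    exp.inject()
    try:
        observed = exp.measure_error_rate()
    finally:
        exp.revert()  # never leave the fault in place, even if measuring fails
    return {
        "hypothesis": exp.hypothesis,
        "baseline": baseline,
        "observed": observed,
        "held": observed <= exp.max_error_rate,
    }
```

A result with `held` set to False is the useful outcome: it names a gap to fix before re-running the same experiment.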

Types of failure injection:

Injection	What it tests	Tool
Instance kill	LB health checks, auto-scaling	Chaos Monkey
AZ blackout	Multi-AZ failover, data replication	AWS FIS
Latency injection	Timeouts, circuit breakers	Gremlin, Toxiproxy
DNS failure	DNS caching, fallback resolution	Custom iptables
Dependency error	Circuit breaker, graceful degradation	Istio fault injection
Clock skew	Certificate validation, token expiry	Chronos
Disk full	Log rotation, data storage failover	Custom scripts

Principles:

Start small. First chaos experiment: kill one instance in staging. Build confidence before touching production.
Minimise blast radius. Use canary/cell architecture to limit impact. Netflix runs Chaos Monkey only during business hours when engineers are watching.
Automate. Manual chaos testing doesn't scale. Schedule experiments to run continuously. AWS FIS and Gremlin support automated experiment plans.
Measure everything. Without observability, chaos engineering is just breaking things. You need real-time dashboards showing error rates, latency percentiles, and business metrics.

Game days: quarterly exercises where the team simulates a major failure (region outage, DB corruption, DDoS) and practices the incident response. Game days validate not just the technology but the runbooks, communication channels, and escalation paths. If your last game day was "we discovered our runbook was wrong," that's a success — you found it before a real incident.

The interview signal: mentioning chaos engineering shows operational maturity. Saying "we'd run chaos experiments to validate our HA claims" is worth more than describing perfect architecture that's never been tested.

Case studies

AWS

Cell architecture in Route 53, S3, and DynamoDB

AWS is the canonical example of cell architecture at hyperscale. Route 53 (DNS) partitions hosted zones into independent cells. Each cell handles DNS resolution for its subset of zones. A cell failure affects only those zones — not the entire DNS service. This is how Route 53 offers a 100% SLA, the only AWS service with that guarantee.

S3 uses a similar pattern: each bucket is assigned to a partition (cell) that handles all operations for that bucket. A partition failure causes errors for the affected buckets but other buckets continue serving normally. DynamoDB partitions tables into cells by partition key. A hot partition affects only its cell, not the entire table.

Key design decisions: cells are sized small enough that a cell failure is a minor event (< 1% of traffic). The cell router is the simplest possible component — it maps a key to a cell and forwards. No business logic in the router. Cells share nothing except the routing table, which is itself replicated and cached.

Numbers: Route 53 handles 100B+ DNS queries per day. S3 stores 100+ trillion objects. DynamoDB serves 89 trillion requests per day. These numbers are only possible because cell architecture limits the blast radius of any single failure.

Takeaway

Cell architecture enables hyperscale availability by limiting blast radius. Size cells small, keep the router simple, share nothing between cells.

Netflix

Chaos engineering and the Simian Army

Netflix pioneered chaos engineering with Chaos Monkey (2011) — a tool that randomly kills production instances during business hours. The hypothesis: if our architecture is truly resilient, killing any single instance should have zero user impact. When an instance death causes a blip, Netflix fixes the gap and the system gets stronger.

The Simian Army expanded the concept: Latency Monkey injects delays into inter-service calls (testing timeouts and circuit breakers). Conformity Monkey checks that instances follow best practices (no root SSH keys, correct security groups). Chaos Kong simulates an entire AWS region going offline (testing multi-region failover).

Key insight: Netflix runs chaos experiments in production, not just staging. Staging environments are never realistic enough — they have different traffic patterns, different data volumes, and different configurations. Production chaos with blast-radius controls (cell architecture, traffic percentage limits) is the only way to validate HA for real.

Results: Netflix achieves ~99.99% availability for its streaming service — serving 260M+ subscribers across 190+ countries. When AWS us-east-1 experienced a major outage in 2017, Netflix continued streaming from other regions with minimal impact.

Tooling evolution: Netflix open-sourced Chaos Monkey and later contributed to the Chaos Engineering community. Modern successors include AWS Fault Injection Simulator (FIS), Gremlin, LitmusChaos (Kubernetes-native), and Steadybit.

Takeaway

Chaos engineering in production — not staging — is the only way to validate HA claims. Start with instance kills, graduate to region-level failures. Fix every gap you find.

Google

SRE error budgets and availability targets

Google's Site Reliability Engineering (SRE) introduced the concept of error budgets — turning availability from a vague goal into a measurable, negotiable resource. The key insight: 100% availability is the wrong target. It's infinitely expensive and prevents any change (every deployment is a risk).

How error budgets work: if the SLO is 99.95%, the error budget is 0.05% — about 22 minutes of downtime per month. Product teams can "spend" this budget on deployments, experiments, and migrations. If the budget is healthy (few incidents), ship features aggressively. If the budget is burned (too many incidents), freeze features and focus on reliability.
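The budget arithmetic is short enough to write down. A sketch, assuming an average month of 365.25 / 12 days; the function names and the ship/freeze policy labels are illustrative.

```python
MINUTES_PER_MONTH = 365.25 * 24 * 60 / 12  # average month: 43,830 minutes


def error_budget_minutes(slo: float, period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Allowed downtime for the period, given an SLO such as 0.9995."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * period_minutes


def budget_remaining(slo: float, downtime_minutes: float,
                     period_minutes: float = MINUTES_PER_MONTH) -> float:
    """Fraction of the budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo, period_minutes)
    return (budget - downtime_minutes) / budget


def release_policy(remaining: float) -> str:
    """Ship while budget remains, freeze features once it is burned."""
    return "ship" if remaining > 0 else "freeze"


print(f"99.95% budget: {error_budget_minutes(0.9995):.1f} min/month")
print(f"99.99% budget: {error_budget_minutes(0.9999):.2f} min/month")
```

For 99.95% this gives about 21.9 minutes a month, and for 99.99% about 4.38 minutes a month.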

Tiered SLOs: Google assigns different SLOs to different services based on user impact:

Search: 99.99% (revenue impact is seconds)
Gmail: 99.95% (users tolerate brief delays)
Internal tools: 99.9% (employees can wait)

SLI selection matters: the Service Level Indicator must match what users care about. For a web service, SLI = "proportion of requests that return successfully within 300ms." For a batch pipeline, SLI = "proportion of jobs that complete within the deadline." A bad SLI (e.g., measuring CPU utilization instead of user-facing latency) makes the error budget meaningless.
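A sketch of the web-service SLI described above, computed from a batch of request records. It assumes "successfully" means any non-5xx status, which is a common convention but still a choice; `Request` and `good_request_ratio` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # server-side latency


def good_request_ratio(requests: list[Request], latency_slo_ms: float = 300.0) -> float:
    """Share of requests that succeeded (non-5xx) within the latency threshold."""
    if not requests:
        return 1.0  # no traffic means nothing failed; another convention
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms <= latency_slo_ms
    )
    return good / len(requests)
```

A slow 200 counts against the SLI just like a 503 does, which is the point: the indicator follows what the user experienced, not what the server reported.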

Blameless post-mortems: when an incident burns error budget, Google conducts a blameless post-mortem focused on systemic fixes, not individual fault. Every post-mortem produces action items: better monitoring, improved runbooks, architecture changes. This culture of continuous improvement is what sustains high availability over years.

Numbers: Google operates 20+ data centres worldwide, serves 8.5B+ searches per day, and maintains 99.99%+ availability on its core services. This is achieved not through perfect technology but through the SRE framework: error budgets, SLOs, toil reduction, and blameless post-mortems.

Takeaway

Error budgets turn availability into a negotiable resource. Set SLOs by user impact, measure with user-facing SLIs, and use blameless post-mortems to continuously improve.

Decision levers

Availability target (nines)

Pick a concrete number: 99.9% / 99.95% / 99.99% / 99.999%. Each additional nine is roughly 10x the cost in infrastructure and operational discipline. 99.99% (52 min/year) demands multi-AZ, automated failover, < 5 min MTTR, and weekly chaos exercises. 99.999% (5.26 min/year) requires cell architecture and multi-region. Don't promise nines you can't measure — if you don't have SLIs tracking user-facing availability, your SLO is fiction.

Redundancy topology

Multi-AZ is the baseline for 99.9-99.99%. Add multi-region for 99.999%, or whenever a full-region outage is in scope. Active-passive for DR with minutes-level failover. Active-active for zero-RTO but with conflict resolution complexity. Cell architecture for blast-radius containment at scale. The topology must cover every layer: LBs, app tier, data tier, cache, control plane. "Redundant app + single DB" is half-HA.

Graceful degradation strategy

Build a degradation matrix: for each dependency, define the user experience when it's down. This is a design decision, not an afterthought. Search down = show banner. Cache miss = fall through to DB (higher latency). Recommendations down = show popular items. Payment processor slow = show clear retry message. Feature flags enable runtime degradation without deployment.

Failure detection speed

Your MTTR is bounded by your detection time. Health check interval x failure threshold = detection time. With 10s interval and 3-failure threshold, detection takes 30 seconds. Add promotion time (5-10s for DB failover) and DNS TTL (30-60s). Total failover time: 1-2 minutes. To achieve < 1 min failover, reduce health check intervals (5s), lower TTLs, and use connection draining rather than hard cutover.
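The failover estimate above is a sum of three terms. A sketch, treating the DNS TTL as a worst-case wait for clients holding a cached record; the function names are illustrative.

```python
def detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Time for health checks to declare a target unhealthy."""
    return interval_s * failure_threshold


def failover_seconds(interval_s: float, failure_threshold: int,
                     promotion_s: float, dns_ttl_s: float) -> float:
    """Worst-case estimate: detect, promote, then wait out cached DNS records."""
    return detection_seconds(interval_s, failure_threshold) + promotion_s + dns_ttl_s


# 10s checks, 3 failures, 5-10s promotion, 30-60s TTL
print(failover_seconds(10, 3, 5, 30))    # best case of the range
print(failover_seconds(10, 3, 10, 60))   # worst case of the range
# Tighter settings: 5s checks, 30s TTL
print(failover_seconds(5, 3, 5, 30))
```

The first two calls give 65 and 100 seconds, the 1-2 minute range quoted above. Halving the check interval brings the estimate to 50 seconds, at which point the DNS TTL is the largest remaining term.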

Blast-radius containment

Bulkheads isolate thread/connection pools per dependency. Circuit breakers fail fast and prevent cascading failures. Cell architecture limits the impact of correlated failures to one cell. The question is not whether to contain blast radius but how granularly: per-dependency (bulkhead), per-circuit (breaker), or per-shard (cell).

Failure modes

Untested failover

The claim "we'll fail over to the standby" has never been exercised. When the day comes, configs have drifted, credentials expired, replication broke silently. Drill quarterly or your HA is theoretical.

No timeouts on remote calls

One slow dependency drags every request into its slowness. Without timeouts, threads pile up and the caller dies too. Every remote call needs a timeout tuned to the request's latency budget.

Cascading retry storms (Advanced)

Downstream degrades -> upstream retries aggressively -> downstream collapses under retry load. Fix: exponential backoff with jitter, circuit breakers, and load-shedding at the edge. Never retry at every layer — pick one layer to retry.
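Why "pick one layer to retry" matters can be shown with one multiplication. A sketch, assuming every layer retries independently and every attempt fails; `worst_case_amplification` is a hypothetical helper.

```python
def worst_case_amplification(retries_per_layer: list[int]) -> int:
    """Attempts that can reach the bottom dependency for one user request.

    Each layer makes 1 original call plus its retries, and every one of
    those calls triggers the full retry behaviour of the layer below.
    """
    attempts = 1
    for retries in retries_per_layer:
        if retries < 0:
            raise ValueError("retries must be non-negative")
        attempts *= 1 + retries
    return attempts


print(worst_case_amplification([3, 3, 3]))  # every layer retries 3 times
print(worst_case_amplification([0, 0, 3]))  # only the layer nearest the dependency retries
```

Three layers with three retries each can turn one request into 64 calls against a dependency that is already struggling. Retrying at a single layer caps it at 4.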

Single-AZ control plane (Advanced)

Auth, config, or service-discovery running on a single-AZ control plane. The entire fleet dies when the control plane's AZ goes down. Replicate the control plane across AZs too.

False-sense redundancy

"We have 3 replicas" — all in one AZ. Surviving an AZ outage requires distribution ACROSS AZs, not more instances within one. Check your AZ spread, not just your instance count.

Thundering herd on failover (Advanced)

Cache fails over -> all requests hit DB simultaneously -> DB collapses. Fix: stagger cache warming, use request coalescing (singleflight), and have circuit breakers on the DB path.
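Request coalescing is named after Go's `singleflight` package. Python's standard library has no equivalent, so the following is a minimal thread-based sketch of the idea: concurrent callers asking for the same key share one execution. `SingleFlight` and `_Call` are hypothetical names, and results are not cached after the call finishes.

```python
import threading


class _Call:
    """Shared state for one in-flight execution."""

    def __init__(self):
        self.done = threading.Event()
        self.result = None
        self.error = None


class SingleFlight:
    """Coalesce concurrent calls for the same key into one execution."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> _Call

    def do(self, key, fn):
        with self._lock:
            call = self._inflight.get(key)
            is_leader = call is None
            if is_leader:
                call = _Call()
                self._inflight[key] = call

        if not is_leader:
            call.done.wait()  # someone else is already loading this key
        else:
            try:
                call.result = fn()
            except Exception as exc:
                call.error = exc
            finally:
                with self._lock:
                    del self._inflight[key]
                call.done.set()  # wake every waiter

        if call.error is not None:
            raise call.error
        return call.result
```

On a cold cache, a thousand concurrent misses for one hot key become a single DB query. Misses for different keys still run in parallel, so this complements a circuit breaker on the DB path rather than replacing it.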

DNS TTL too long

DNS TTL of 300s means clients continue hitting a dead endpoint for 5 minutes after failover. For HA systems, use TTLs of 30-60 seconds. Balance between failover speed and DNS infrastructure load.

Decision table

HA topology decision matrix

Dimension	Single server	Redundant app	Fully redundant	Cell architecture
Target SLO	99%	99.9%	99.99%	99.999%
Annual downtime	3.65 days	8.76 hours	52.6 min	5.26 min
SPOF count	Everything	DB primary	None	Cell router only
Infra cost multiplier	1x	2x	3x	5-10x
Failover	Manual SSH	Manual DB promote	Automated	Automated per-cell
Blast radius	100%	100% (DB fail)	100% (correlated)	5-10% per cell
Team requirement	One dev	Small team	Dedicated SRE	SRE + platform team
Deployment safety	Downtime deploy	Rolling update	Blue-green	Cell-by-cell canary

Each nine is roughly 10x the infrastructure and operational cost.
Cell architecture achieves five nines by making each cell independently four nines.

Worked example

Worked example: Payment processing system HA

A payment processing service handles 50,000 transactions per minute. The business requires 99.99% availability — 52 minutes of downtime per year — because every minute of outage means lost revenue and customer trust.

Architecture diagram· Redundancy at every tier — no single point of failure

Two LBs front N app servers spread across AZs. Primary DB with synchronous replicas and automatic failover. Health checks at every hop.

Step 1 — Define the availability budget

99.99% = 4.38 minutes of downtime per month. This means MTTR must be under 3 minutes (leaving margin for detection). Human-driven failover is too slow; everything must be automated.

Step 2 — Eliminate every SPOF

Load balancers: AWS NLB (multi-AZ, no single instance to fail). DNS with 30-second TTL pointing to the NLB.

App tier: 6 stateless instances (N+2) across 3 AZs. Each AZ has 2 instances. Losing an entire AZ leaves 4 instances, which still covers peak load (N = 4). Session state in Redis, not in process memory. Health checks every 10s with 3-failure threshold = 30s detection.

Database: PostgreSQL with Patroni. Patroni delegates leader election to a distributed consensus store (etcd, Consul, or ZooKeeper; etcd is Raft-based). 3-node cluster across 3 AZs. Synchronous replication to one standby (RPO = 0). Leader failure triggers automatic election in < 10 seconds. Connection pooler (PgBouncer) per AZ to handle connection storms during failover.

Cache: Redis Sentinel cluster (3 sentinels, 1 primary + 2 replicas). Sentinel promotes replica on primary failure. The app survives a cold cache — queries hit the DB directly with higher latency but correct results. Request coalescing (singleflight) prevents thundering herd.

Step 3 — Circuit breakers and degradation

Every external dependency has its own circuit breaker:

Dependency	Timeout	Breaker threshold	Degradation
Card network (Visa/MC)	5s	40% over 20s	Queue for retry in 60s
Fraud detection	2s	50% over 30s	Allow with flag for async review
Notification service	3s	50% over 30s	Queue notification for later
Analytics pipeline	1s	Any failure	Buffer to local disk

Fraud detection degradation is a business decision: when the fraud service is down, do you block all transactions (safe but revenue-killing) or allow them with a flag for async review (risky but revenue-preserving)? We chose the latter because the fraud rate is < 0.1% and async review catches 95% of fraud within 5 minutes.
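The "Breaker threshold" column in the table above is a failure rate over a time window (for example 40% over 20s), not a count of consecutive failures. A sketch of that trip condition, assuming a sliding window and a minimum call volume before the rate is trusted; `FailureRateWindow`, `min_calls`, and the injectable `clock` are illustrative choices.

```python
import time
from collections import deque


class FailureRateWindow:
    """Track call outcomes over a sliding window and decide when to trip."""

    def __init__(self, window_s: float, threshold: float,
                 min_calls: int = 10, clock=time.monotonic):
        self.window_s = window_s
        self.threshold = threshold   # e.g. 0.4 for "40% over 20s"
        self.min_calls = min_calls   # avoid tripping on 1 failure out of 2 calls
        self._clock = clock
        self._events = deque()       # (timestamp, failed)

    def record(self, failed: bool) -> None:
        now = self._clock()
        self._events.append((now, failed))
        self._evict(now)

    def _evict(self, now: float) -> None:
        # Drop outcomes that have aged out of the window.
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()

    def should_open(self) -> bool:
        self._evict(self._clock())
        total = len(self._events)
        if total < self.min_calls:
            return False
        failures = sum(1 for _, failed in self._events if failed)
        return failures / total >= self.threshold
```

A breaker would call `record()` after each request and check `should_open()` before deciding whether to let the next one through.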

Step 4 — Blast-radius containment

Bulkhead pools: the card-network HTTP client has its own thread pool (20 threads). If Visa is slow, only those 20 threads block. The remaining thread pool (80 threads) continues serving fraud checks, notifications, and health checks.
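The text describes dedicated thread pools. The sketch below uses a semaphore instead, which is a simpler variant of the same bulkhead idea: cap how many callers can be inside one dependency at once and reject the rest immediately. `Bulkhead` and `BulkheadFull` are hypothetical names.

```python
import threading


class BulkheadFull(Exception):
    """Raised when a dependency's concurrency limit is already in use."""


class Bulkhead:
    def __init__(self, name: str, max_concurrent: int):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: fail fast rather than queue behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull(f"{self.name}: all slots busy")
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# Mirrors the sizing in the text: at most 20 in-flight card-network calls.
card_network = Bulkhead("card-network", max_concurrent=20)
```

If the card network stalls, the 21st concurrent caller gets `BulkheadFull` at once and can take its degradation path, while threads serving other dependencies are untouched.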

Idempotency: every transaction has a client-generated idempotency key. Retries are safe — the system detects duplicate keys and returns the original response. This is non-negotiable for payments: a retry must never double-charge a customer.
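A sketch of the duplicate-key check, with an in-memory dict standing in for what would need to be a durable, shared store in a real payment system. `IdempotentProcessor` and the injected `charge` callable are hypothetical.

```python
import threading


class IdempotentProcessor:
    """Return the original response when an idempotency key is seen again."""

    def __init__(self, charge):
        self._charge = charge   # callable that performs the real charge
        self._responses = {}    # idempotency key -> stored response
        self._lock = threading.Lock()

    def process(self, idempotency_key: str, transaction: dict):
        # One lock for everything keeps the sketch obviously correct;
        # a real system would lock or insert per key instead.
        with self._lock:
            if idempotency_key in self._responses:
                return self._responses[idempotency_key]  # retry: do not charge again
            response = self._charge(transaction)
            self._responses[idempotency_key] = response
            return response
```

If `charge` raises, nothing is stored, so the client's retry with the same key attempts the charge again. Whether that is safe depends on whether the failure happened before or after the processor accepted the charge, which is why the downstream call needs its own idempotency guarantee too.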

Step 5 — Operational discipline

Weekly game days: Week 1: kill a random app instance. Week 2: inject 3s latency into the card network. Week 3: block traffic to one AZ. Week 4: fail the Redis primary.

Quarterly AZ failover drill: simulate a full AZ outage in production. Validate that the remaining 2 AZs absorb load, DB fails over, and error rate stays below 0.01%.

Error budget tracking: dashboard shows remaining budget (4.38 min/month). When budget drops below 50%, pager alerts on-call SRE. When budget is exhausted, feature freeze until reliability improves.

Post-incident reviews: every incident that burns > 30 seconds of error budget gets a blameless post-mortem. Action items are tracked to completion. Recurring themes drive architecture investments (e.g., "we keep having DB failover issues" → invest in Patroni tuning and automated testing).

The result

This architecture achieves 99.99% availability because: (1) no single component failure causes an outage, (2) automated detection + failover keeps MTTR under 3 minutes, (3) circuit breakers and degradation prevent cascading failures, (4) operational discipline (game days, error budgets, post-mortems) continuously validates and improves the system.

Interview playbook

8-12 minutes in a 45-minute interview. Lead with the target number and math, then layer in redundancy, circuit breakers, and operational discipline.

When it comes up

Prompt mentions "high availability," "99.9%," "nines," or "SLA"
System handles payments, auth, or healthcare — downtime = liability
Interviewer asks "what happens when X fails?"
"Design a system that survives an AZ outage"
Any mention of "failover," "redundancy," or "disaster recovery"

Order of reveal

1. Name the target. "Let me start by setting a concrete availability target. For a payment system, I'd target 99.99%, which is 52 minutes of downtime per year. That constrains everything else."
2. Availability math. "99.99% means MTTR under 5 minutes. Human failover is too slow, so everything must be automated. And serial dependencies multiply failure rates: three 99.9% services in series cap us at 99.7%."
3. Redundancy at every layer. "Multi-AZ LBs, N+2 stateless app instances, quorum-replicated DB with automatic leader election. No single point of failure, including the control plane."
4. Circuit breakers + degradation. "Every dependency gets a timeout, a circuit breaker, and a documented degradation path. Search down shows a banner; cache down falls through to DB; analytics buffers to disk."
5. Blast-radius containment. "Bulkhead thread pools per dependency. For five-nines, cell architecture: partition into independent cells so a failure affects only one shard of users."
6. Failover mechanics. "Health checks every 10s, 3-failure threshold = 30s detection. DB failover via Raft election in <10s. DNS TTL 30s. Total failover time: ~1 minute."
7. Operational discipline. "Untested HA isn't HA. Weekly game days, quarterly region-failover drills, error budgets tracked monthly. When budget burns, feature freeze until reliability improves."

Signature phrases

“Each nine costs 10x” — Shows you understand the economics, not just the technology.
“Serial dependencies multiply failure rates” — Demonstrates availability math fluency.
“Untested failover is no failover” — Signals operational maturity and chaos engineering awareness.
“Degrade gracefully, fail fast” — Captures the two key HA strategies in four words.
“Blast radius of one cell” — Shows knowledge of cell architecture and advanced HA patterns.
“Error budget, not error prayer” — Signals SRE discipline and measurable reliability.

Likely follow-ups

“How do you handle a full AZ outage?”

"App tier: N+2 across 3 AZs means losing one AZ leaves us above peak capacity. DB: quorum cluster across 3 AZs — losing one AZ means 2/3 nodes survive, which is still a quorum. Leader election if the failed AZ had the leader. Cache: Redis Sentinel promotes a replica. LB: managed LBs are multi-AZ by default. DNS TTL 30s. Total impact: a brief spike in latency as connections re-establish, but zero downtime."

“What if your DB failover takes too long?”

"Raft-based leader election (Patroni, etcd) completes in <10 seconds. If that's too slow: (1) connection pooler (PgBouncer) queues requests during election rather than failing them; (2) read traffic continues on surviving replicas; (3) write traffic gets a brief 503 with Retry-After header. For true zero-downtime, use a multi-primary setup like CockroachDB, but accept the complexity tax."

“How do you test your HA?”

"Four types of chaos: (1) Instance kill: weekly, automated, validates LB routing. (2) Dependency fault injection: inject latency/errors, validate circuit breakers. (3) AZ blackout: quarterly, block all traffic to one AZ. (4) Full region failover: quarterly in production. Every experiment has a hypothesis, a blast-radius limit, and a rollback plan. We also track MTTR from real incidents and game days."

“How do you handle cascading failures?”

"Three layers of defence: (1) Timeouts on every remote call — a slow dependency cannot hold our threads indefinitely. (2) Circuit breakers — after N failures, fail fast and return fallback/cached response. (3) Bulkheads — isolated thread pools per dependency so one slow path can't consume all resources. And at the edge, load-shedding: if we're overwhelmed, reject low-priority traffic (Tier 3) to protect checkout and auth (Tier 1)."

“What about data consistency during failover?”

"With synchronous replication (RPO = 0), the standby has every committed transaction. Failover promotes it with zero data loss. With async replication, there's a window of transactions that committed on the primary but had not yet reached the standby (typically < 1 second of writes); those are lost on failover. For payments, we use synchronous replication and accept the latency cost. For less critical data (analytics, logs), async is fine and we accept brief data loss."

“How do you decide what to degrade?”

"The degradation matrix is a design artifact, not an incident improvisation. For each dependency, we document: (1) what breaks when it's down, (2) what the user sees instead, (3) whether it's automated or requires manual intervention. Priority: protect the revenue path (checkout, login) at all costs. Shed non-critical features (recommendations, analytics) first. Feature flags let us toggle degradation in real-time without deployment."

Code snippets

Circuit breaker implementation (Python)

import time
from enum import Enum
from threading import Lock

class State(Enum):
    CLOSED = "closed"
    OPEN = "open"
    HALF_OPEN = "half_open"

class CircuitBreaker:
    def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=1):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.half_open_max = half_open_max
        self.state = State.CLOSED
        self.failure_count = 0
        self.last_failure_time = 0
        self.half_open_calls = 0
        self._lock = Lock()

    def call(self, func, *args, fallback=None, **kwargs):
        with self._lock:
            if self.state == State.OPEN:
                if time.time() - self.last_failure_time > self.recovery_timeout:
                    self.state = State.HALF_OPEN
                    self.half_open_calls = 0
                else:
                    return fallback() if fallback else None  # fail fast

            if self.state == State.HALF_OPEN:
                if self.half_open_calls >= self.half_open_max:
                    return fallback() if fallback else None
                self.half_open_calls += 1  # count this call as a trial probe

        try:
            result = func(*args, **kwargs)
            with self._lock:
                if self.state == State.HALF_OPEN:
                    self.state = State.CLOSED  # recovered
                self.failure_count = 0
            return result
        except Exception:
            with self._lock:
                self.failure_count += 1
                self.last_failure_time = time.time()
                if self.state == State.HALF_OPEN:
                    self.state = State.OPEN
                elif self.failure_count >= self.failure_threshold:
                    self.state = State.OPEN
            return fallback() if fallback else None

Health check endpoint, liveness + readiness (Python)

from flask import Flask, jsonify
import psycopg2.pool  # "import psycopg2" alone does not load the pool submodule
import redis

app = Flask(__name__)
db_pool = psycopg2.pool.ThreadedConnectionPool(1, 10, dsn="...")
redis_client = redis.Redis(host="cache", port=6379)

@app.route("/healthz")  # liveness: is the process alive?
def liveness():
    return jsonify({"status": "alive"}), 200

@app.route("/readyz")  # readiness: can we serve traffic?
def readiness():
    checks = {}
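    conn = None
    try:
        conn = db_pool.getconn()
        with conn.cursor() as cur:
            cur.execute("SELECT 1")
        checks["db"] = "ok"
    except Exception as e:
        checks["db"] = str(e)
        return jsonify({"status": "not_ready", "checks": checks}), 503
    finally:
        if conn is not None:
            db_pool.putconn(conn)  # always return the connection, even on failure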

    try:
        redis_client.ping()
        checks["cache"] = "ok"
    except Exception as e:
        checks["cache"] = str(e)
        return jsonify({"status": "not_ready", "checks": checks}), 503

    return jsonify({"status": "ready", "checks": checks}), 200

Graceful degradation middleware (Python)

from functools import wraps
from flask import jsonify
import logging

logger = logging.getLogger(__name__)

# Degradation matrix: dependency -> fallback behaviour
DEGRADATION_MATRIX = {
    "search": lambda: {"results": [], "degraded": True, "message": "Search temporarily unavailable"},
    "recommendations": lambda: {"items": get_popular_items(), "degraded": True},
    "analytics": lambda: None,  # silently skip
}

def with_degradation(dependency_name):
    """Decorator: if the wrapped call fails, return degraded response."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as e:
                logger.warning(f"{dependency_name} failed: {e}, degrading gracefully")
                fallback = DEGRADATION_MATRIX.get(dependency_name)
                if fallback:
                    return fallback()
                raise  # no fallback defined -> propagate
        return wrapper
    return decorator

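# search_service, recommendation_service and cache below are placeholders for
# the application's own clients; they are not defined in this snippet.
@with_degradation("search")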
def search_products(query):
    return search_service.query(query)

@with_degradation("recommendations")
def get_recommendations(user_id):
    return recommendation_service.get(user_id)

def get_popular_items():
    # Cached popular items, refreshed hourly
    return cache.get("popular_items", default=[])

Retry with exponential backoff and jitter (Python)

import time
import random
from functools import wraps

def retry_with_backoff(max_retries=3, base_delay=0.5, max_delay=30.0, jitter=True):
    """Retry a function with exponential backoff and optional jitter.
    
    Jitter prevents thundering herd when many clients retry simultaneously.
    Without jitter, all clients retry at exactly the same intervals,
    creating periodic traffic spikes that can re-overload a recovering service.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            last_exception = None
            for attempt in range(max_retries + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    last_exception = e
                    if attempt == max_retries:
                        raise
                    delay = min(base_delay * (2 ** attempt), max_delay)
                    if jitter:
                        delay = delay * (0.5 + random.random())  # 50-150% of delay
                    time.sleep(delay)
            raise last_exception
        return wrapper
    return decorator

@retry_with_backoff(max_retries=3, base_delay=1.0)
def call_payment_processor(transaction):
    return payment_api.charge(transaction)

Availability SLA calculator (Python)

def serial_availability(*components: float) -> float:
    """Serial: multiply availabilities. Weakest link dominates."""
    result = 1.0
    for a in components:
        result *= a
    return result

def parallel_availability(*components: float) -> float:
    """Parallel: multiply downtimes, then subtract from 1."""
    downtime = 1.0
    for a in components:
        downtime *= (1.0 - a)
    return 1.0 - downtime

def downtime_per_year(availability: float) -> str:
    """Convert availability to human-readable annual downtime."""
    minutes = (1 - availability) * 365.25 * 24 * 60
    if minutes < 60:
        return f"{minutes:.1f} min/year"
    hours = minutes / 60
    if hours < 24:
        return f"{hours:.1f} hours/year"
    return f"{hours / 24:.1f} days/year"

# Example: three services in series, each 99.9%
serial = serial_availability(0.999, 0.999, 0.999)
print(f"Serial (3x 99.9%):   {serial*100:.4f}% = {downtime_per_year(serial)}")
# -> Serial (3x 99.9%):   99.7003% = 1.1 days/year (about 26 hours)

# Two instances in parallel, each 99.9%
parallel = parallel_availability(0.999, 0.999)
print(f"Parallel (2x 99.9%): {parallel*100:.6f}% = {downtime_per_year(parallel)}")
# -> Parallel (2x 99.9%): 99.999900% = 0.5 min/year

# Real system: serial chain of parallel groups
app_tier = parallel_availability(0.999, 0.999, 0.999)  # 3 app nodes
db_tier = parallel_availability(0.999, 0.999)            # primary + replica
system = serial_availability(app_tier, db_tier)
print(f"System:              {system*100:.6f}% = {downtime_per_year(system)}")

Drills

Your system has three 99.9% dependencies called in series. What is the best-case combined SLO?

99.9%^3 = 99.7% — about 26 hours of downtime per year. You cannot be more available than your dependency chain. Fixes: parallelise calls where possible, add caching to short-circuit dependencies, use async processing for non-critical calls, or accept the math and set your SLO below the chain.

An interviewer says "prove your failover works." What do you show?

Three things: (1) Game-day logs from the last quarter — timestamps, failure modes injected, MTTR measured. (2) Chaos engineering coverage: which failure modes have been injected (instance kill, AZ blackout, dependency failure) and what gaps were found. (3) Real incident MTTR metrics. If none of these exist, the honest answer is "we don't know — and that's the first thing I'd change."

What is the difference between liveness and readiness health checks?

Liveness: "is the process alive?" Simple HTTP 200 from /healthz. Failure = restart the container. Keep it simple — never check external dependencies. Readiness: "can this instance serve traffic?" Checks DB connectivity, cache warmth, deployment health. Failure = remove from LB pool but don't restart. Critical for zero-downtime deployments.

Your cache cluster fails over. What happens to the database?

Thundering herd: all cached reads suddenly hit the DB. Mitigation: (1) request coalescing (singleflight) — deduplicate concurrent requests for the same key. (2) Pre-warm the new cache primary with hot keys before promoting. (3) Circuit breaker on the DB path — if DB is overwhelmed, return degraded response. (4) Gradually shift traffic to the new primary rather than instant cutover.

How does cell architecture help with deployments?

Deploy to one cell (canary) first. Monitor error rate, latency, and business metrics for 15-30 minutes. If clean, roll to the next batch. If bad, roll back only the canary — 90-95% of users never saw the bad code. This is fundamentally safer than deploying to the entire fleet. It also limits the blast radius of bad config changes, data migrations, and feature flag rollouts.

Your SLO is 99.99% but your payment processor only guarantees 99.9%. How do you reconcile?

You cannot exceed the SLA of a serial dependency. Options: (1) Add redundancy — use two payment processors and failover between them (parallel composition). (2) Queue and retry — if the processor is down, queue the transaction and retry when it recovers (async decoupling). (3) Negotiate — pay for a higher-tier SLA from the processor. (4) Accept — set your checkout-flow SLO at 99.9% and document that the payment processor is the bottleneck.

| Target | Downtime / year | Downtime / month | Practical meaning |
| --- | --- | --- | --- |
| 99% ("two nines") | 3.65 days | 7.3 hours | Internal tools |
| 99.9% ("three nines") | 8.76 hours | 43.8 min | Standard SaaS |
| 99.95% | 4.38 hours | 21.9 min | Business-critical |
| 99.99% ("four nines") | 52.6 min | 4.38 min | Payments, auth |
| 99.999% ("five nines") | 5.26 min | 26.3 sec | Telecom, 911 |
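The figures in this table are plain arithmetic on a 365-day year, and the same arithmetic gives a monthly error budget. A small sketch, with hypothetical function names, that reproduces the table and reports how much of the month's budget is left:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; the table uses a 365-day year


def downtime_budget_minutes(slo: float) -> tuple[float, float]:
    """Return (minutes per year, minutes per month) of downtime an SLO allows."""
    per_year = (1.0 - slo) * MINUTES_PER_YEAR
    return per_year, per_year / 12


def budget_remaining(slo: float, downtime_minutes_this_month: float) -> float:
    """Fraction of the monthly error budget still unspent (negative = overspent)."""
    _, per_month = downtime_budget_minutes(slo)
    return 1.0 - downtime_minutes_this_month / per_month


if __name__ == "__main__":
    for slo in (0.99, 0.999, 0.9995, 0.9999, 0.99999):
        per_year, per_month = downtime_budget_minutes(slo)
        print(f"{slo:.3%}: {per_year:9.2f} min/year, {per_month:7.2f} min/month")
```

At 99.99%, a single 2.19-minute incident spends half of the month's budget, which is why a four-nines target leaves no room for manual failover.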

| Dependency | Normal behaviour | Degraded behaviour |
| --- | --- | --- |
| Search service | Full-text search | Show "search unavailable" banner |
| Recommendation engine | Personalised feed | Show trending/popular |
| Analytics pipeline | Real-time tracking | Buffer to local disk, backfill |
| Image processing | Generate thumbnails | Serve original image (larger) |
| Payment processor | Instant checkout | "Try again in 1 minute" message |
| Notification service | Push + email | Queue for later delivery |

| Injection | What it tests | Tool |
| --- | --- | --- |
| Instance kill | LB health checks, auto-scaling | Chaos Monkey |
| AZ blackout | Multi-AZ failover, data replication | AWS FIS |
| Latency injection | Timeouts, circuit breakers | Gremlin, Toxiproxy |
| DNS failure | DNS caching, fallback resolution | Custom iptables |
| Dependency error | Circuit breaker, graceful degradation | Istio fault injection |
| Clock skew | Certificate validation, token expiry | Chronos |
| Disk full | Log rotation, data storage failover | Custom scripts |
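The latency-injection row can be rehearsed in a unit test before reaching for Gremlin or Toxiproxy. The sketch below is an in-process stand-in, not how those tools work: `with_injected_latency` and `call_with_timeout` are hypothetical helpers, and the timeout is enforced with a thread pool, so the slow call keeps running in the background after the caller has given up.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

_executor = ThreadPoolExecutor(max_workers=4)


def with_injected_latency(func, delay_s):
    """Wrap a dependency call so every invocation is delayed by delay_s seconds."""
    def slowed(*args, **kwargs):
        time.sleep(delay_s)
        return func(*args, **kwargs)
    return slowed


def call_with_timeout(func, timeout_s, fallback):
    """Run func in a worker thread; return fallback if it exceeds timeout_s."""
    future = _executor.submit(func)
    try:
        return future.result(timeout=timeout_s)
    except FutureTimeout:
        return fallback  # degraded response instead of a hung request


def personalised_feed():
    return ["personalised"]


if __name__ == "__main__":
    # Healthy dependency: the real answer comes back inside the budget.
    print(call_with_timeout(personalised_feed, 0.5, ["trending"]))
    # Injected 300 ms of latency against a 50 ms budget: the caller degrades.
    slow_feed = with_injected_latency(personalised_feed, 0.3)
    print(call_with_timeout(slow_feed, 0.05, ["trending"]))
```

The useful assertion in a game day is the second one: under injected latency the caller returns the degraded response within its own budget instead of waiting on the dependency.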

| Dimension | Single server | Redundant app | Fully redundant | Cell architecture |
| --- | --- | --- | --- | --- |
| Target SLO | 99% | 99.9% | 99.99% | 99.999% |
| Annual downtime | 3.65 days | 8.76 hours | 52.6 min | 5.26 min |
| SPOF count | Everything | DB primary | None | Cell router only |
| Infra cost multiplier | | | | 5-10x |
| Failover | Manual SSH | Manual DB promote | Automated | Automated per-cell |
| Blast radius | 100% | 100% (DB fail) | 100% (correlated) | 5-10% per cell |
| Team requirement | One dev | Small team | Dedicated SRE | SRE + platform team |
| Deployment safety | Downtime deploy | Rolling update | Blue-green | Cell-by-cell canary |

| Dependency | Timeout | Breaker threshold | Degradation |
| --- | --- | --- | --- |
| Card network (Visa/MC) | | 40% over 20s | Queue for retry in 60s |
| Fraud detection | | 50% over 30s | Allow with flag for async review |
| Notification service | | 50% over 30s | Queue notification for later |
| Analytics pipeline | | Any failure | Buffer to local disk |

    import time
    from enum import Enum
    from threading import Lock


    class State(Enum):
        CLOSED = "closed"
        OPEN = "open"
        HALF_OPEN = "half_open"


    class CircuitBreaker:
        def __init__(self, failure_threshold=5, recovery_timeout=30, half_open_max=1):
            self.failure_threshold = failure_threshold  # consecutive failures before opening
            self.recovery_timeout = recovery_timeout    # seconds to stay open before probing
            self.half_open_max = half_open_max          # probe calls allowed while half-open
            self.state = State.CLOSED
            self.failure_count = 0
            self.last_failure_time = 0
            self.half_open_calls = 0
            self._lock = Lock()

        def call(self, func, *args, fallback=None, **kwargs):
            with self._lock:
                if self.state == State.OPEN:
                    # monotonic() is immune to wall-clock adjustments
                    if time.monotonic() - self.last_failure_time > self.recovery_timeout:
                        self.state = State.HALF_OPEN
                        self.half_open_calls = 0
                    else:
                        return fallback() if fallback else None  # fail fast
                if self.state == State.HALF_OPEN:
                    if self.half_open_calls >= self.half_open_max:
                        return fallback() if fallback else None
                    self.half_open_calls += 1  # count this probe so the limit is enforced
            try:
                result = func(*args, **kwargs)
                with self._lock:
                    if self.state == State.HALF_OPEN:
                        self.state = State.CLOSED  # recovered
                    self.failure_count = 0  # any success resets the consecutive-failure count
                return result
            except Exception:
                with self._lock:
                    self.failure_count += 1
                    self.last_failure_time = time.monotonic()
                    if self.state == State.HALF_OPEN:
                        self.state = State.OPEN  # probe failed: reopen
                    elif self.failure_count >= self.failure_threshold:
                        self.state = State.OPEN
                return fallback() if fallback else None
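The breaker above trips on a count of consecutive failures, whereas the thresholds in the dependency table ("40% over 20s", "50% over 30s") are failure rates over a time window. A rate-based variant could be sketched as follows. `RateBasedBreaker`, its parameters, and the `min_calls` guard are assumptions made for illustration; the sketch has no half-open probing and is not thread-safe.

```python
import time
from collections import deque


class RateBasedBreaker:
    """Open when the failure rate over a sliding time window crosses a threshold."""

    def __init__(self, failure_rate=0.5, window_s=30.0, open_s=60.0,
                 min_calls=10, clock=time.monotonic):
        self.failure_rate = failure_rate  # e.g. 0.5 for "50% over 30s"
        self.window_s = window_s          # length of the sliding window
        self.open_s = open_s              # how long to stay open once tripped
        self.min_calls = min_calls        # assumed guard: ignore tiny samples
        self._clock = clock               # injectable so tests can control time
        self._events = deque()            # (timestamp, succeeded) pairs
        self._opened_at = None

    def allow(self):
        """Return True if a call may proceed."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.open_s:
            self._opened_at = None  # cool-down over: close with a fresh window
            self._events.clear()
            return True
        return False

    def record(self, succeeded):
        """Record one call outcome and trip the breaker if the rate is too high."""
        now = self._clock()
        self._events.append((now, succeeded))
        while self._events and now - self._events[0][0] > self.window_s:
            self._events.popleft()  # drop outcomes older than the window
        failures = sum(1 for _, ok in self._events if not ok)
        total = len(self._events)
        if total >= self.min_calls and failures / total >= self.failure_rate:
            self._opened_at = now
```

A caller checks `allow()` before each request, serves the degraded response when it returns `False`, and calls `record()` with the outcome otherwise. The card-network row would be `RateBasedBreaker(failure_rate=0.4, window_s=20)`.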

    from flask import Flask, jsonify
    import psycopg2.pool  # the pool submodule is not loaded by a bare "import psycopg2"
    import redis

    app = Flask(__name__)
    db_pool = psycopg2.pool.ThreadedConnectionPool(1, 10, dsn="...")
    redis_client = redis.Redis(host="cache", port=6379)


    @app.route("/healthz")  # liveness: is the process alive?
    def liveness():
        # No dependency checks here: a DB outage must not restart healthy instances.
        return jsonify({"status": "alive"}), 200


    @app.route("/readyz")  # readiness: can we serve traffic?
    def readiness():
        checks = {}
        try:
            conn = db_pool.getconn()
            try:
                with conn.cursor() as cur:
                    cur.execute("SELECT 1")
            finally:
                db_pool.putconn(conn)  # always return the connection to the pool
            checks["db"] = "ok"
        except Exception as e:
            checks["db"] = str(e)
            return jsonify({"status": "not_ready", "checks": checks}), 503
        try:
            redis_client.ping()
            checks["cache"] = "ok"
        except Exception as e:
            checks["cache"] = str(e)
            return jsonify({"status": "not_ready", "checks": checks}), 503
        return jsonify({"status": "ready", "checks": checks}), 200

    from functools import wraps
    import logging

    logger = logging.getLogger(__name__)

    # search_service, recommendation_service and cache are assumed to be
    # client objects created elsewhere in the application.

    # Degradation matrix: dependency -> fallback behaviour
    DEGRADATION_MATRIX = {
        "search": lambda: {"results": [], "degraded": True,
                           "message": "Search temporarily unavailable"},
        "recommendations": lambda: {"items": get_popular_items(), "degraded": True},
        "analytics": lambda: None,  # silently skip
    }


    def with_degradation(dependency_name):
        """Decorator: if the wrapped call fails, return degraded response."""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    logger.warning(f"{dependency_name} failed: {e}, degrading gracefully")
                    fallback = DEGRADATION_MATRIX.get(dependency_name)
                    if fallback:
                        return fallback()
                    raise  # no fallback defined -> propagate
            return wrapper
        return decorator


    @with_degradation("search")
    def search_products(query):
        return search_service.query(query)


    @with_degradation("recommendations")
    def get_recommendations(user_id):
        return recommendation_service.get(user_id)


    def get_popular_items():
        # Cached popular items, refreshed hourly
        return cache.get("popular_items", default=[])

    import time
    import random
    from functools import wraps


    def retry_with_backoff(max_retries=3, base_delay=0.5, max_delay=30.0, jitter=True):
        """Retry a function with exponential backoff and optional jitter.

        Jitter prevents thundering herd when many clients retry simultaneously.
        Without jitter, all clients retry at exactly the same intervals, creating
        periodic traffic spikes that can re-overload a recovering service.
        """
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                for attempt in range(max_retries + 1):
                    try:
                        return func(*args, **kwargs)
                    except Exception:
                        if attempt == max_retries:
                            raise  # out of retries: surface the last error
                        delay = min(base_delay * (2 ** attempt), max_delay)
                        if jitter:
                            delay = delay * (0.5 + random.random())  # 50-150% of delay
                        time.sleep(delay)
            return wrapper
        return decorator


    # payment_api is assumed to be defined elsewhere; retrying a charge is only
    # safe if the call is idempotent (for example, keyed by a transaction ID).
    @retry_with_backoff(max_retries=3, base_delay=1.0)
    def call_payment_processor(transaction):
        return payment_api.charge(transaction)

    def serial_availability(*components: float) -> float:
        """Serial: multiply availabilities. Weakest link dominates."""
        result = 1.0
        for a in components:
            result *= a
        return result


    def parallel_availability(*components: float) -> float:
        """Parallel: multiply downtimes, then subtract from 1."""
        downtime = 1.0
        for a in components:
            downtime *= (1.0 - a)
        return 1.0 - downtime


    def downtime_per_year(availability: float) -> str:
        """Convert availability to human-readable annual downtime."""
        minutes = (1 - availability) * 365.25 * 24 * 60
        if minutes < 60:
            return f"{minutes:.1f} min/year"
        hours = minutes / 60
        if hours < 24:
            return f"{hours:.1f} hours/year"
        return f"{hours / 24:.1f} days/year"


    # Example: three services in series, each 99.9%
    serial = serial_availability(0.999, 0.999, 0.999)
    print(f"Serial (3x 99.9%): {serial*100:.4f}% = {downtime_per_year(serial)}")
    # -> Serial (3x 99.9%): 99.7003% = 1.1 days/year (about 26 hours)

    # Two instances in parallel, each 99.9%
    parallel = parallel_availability(0.999, 0.999)
    print(f"Parallel (2x 99.9%): {parallel*100:.6f}% = {downtime_per_year(parallel)}")
    # -> Parallel (2x 99.9%): 99.999900% = 0.5 min/year

    # Real system: serial chain of parallel groups
    app_tier = parallel_availability(0.999, 0.999, 0.999)  # 3 app nodes
    db_tier = parallel_availability(0.999, 0.999)          # primary + replica
    system = serial_availability(app_tier, db_tier)
    print(f"System: {system*100:.6f}% = {downtime_per_year(system)}")