Failure mode analysis
What fails, blast radius, graceful degradation, retries, circuit breakers.
Systems don't fail because you didn't think they could. They fail the way you failed to think about. Failure-mode analysis is structured paranoia — and interviewers grade on whether you can produce it on demand.
Read this if your last attempt…
- You said "it's highly available" without naming what fails and what users see
- Your interviewer asked "what if X goes down?" and you froze
- You can't explain the difference between fail-open and fail-closed
- You haven't heard of circuit breakers or bulkheads
The concept
The four questions, per component
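For every component in your diagram, ask:
- What fails? Name the specific failure, not "it might go down".
- What is the blast radius? State it in terms of what users see.
- How is it detected? Health check, alert, or synthetic probe.
- How does it degrade? Name the fallback that keeps the path serving.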
Without circuit breakers and timeouts, a slow DB causes thread-pool exhaustion in Service B, which backs up Service A, which saturates the gateway. Every user sees 503s.
Resilience patterns — what each protects against.
| Pattern | Protects against | How it works | When to use |
|---|---|---|---|
| Timeout | Slow dependencies | Cap wait time; fail fast | Every remote call — no exceptions |
| Retry + backoff + jitter | Transient failures | Retry N times with exponential delay + random jitter | Idempotent operations only |
| Circuit breaker | Sustained downstream failure | After N failures in window, stop calling; probe periodically to recover | Slow or flapping dependencies |
| Bulkhead | Cascade / thread exhaustion | Isolate thread pools per dependency so one slow service can't starve others | Critical services calling multiple backends |
| Fallback / graceful degradation | Any dependency outage | Serve stale cache, return partial response, show degraded UX | User-facing paths where some answer beats no answer |
| Rate limiting / load shedding | Overload | Reject excess traffic with 429; protect downstream capacity | Inbound edge; expensive endpoints |
| Health check + drain | Dead or unhealthy instances | Periodic probe; remove failed instances from LB pool | Every load-balanced tier |
How interviewers grade this
- You name specific failures per component, not "the DB might go down".
- You give blast radius in user-facing terms ("users see stale data for 30s", not "the replica lags").
- You have a retry strategy that bounds amplification (jitter, cap, circuit breaker).
- You distinguish between fail-open (let traffic through) and fail-closed (reject).
- You name at least one graceful-degradation fallback for each critical path.
- You raise failure modes proactively, not just when asked.
Variants
Circuit breaker
Stop calling a failing service; probe periodically to detect recovery.
Closed (normal) → Open (fail-fast after threshold) → Half-open (let one probe through) → back to Closed on success, or back to Open on failure.
A circuit breaker has three states:
1. Closed (normal): requests flow through. Failures are counted.
2. Open: after the failure threshold is exceeded (e.g. >50% errors in the last 10s), all requests are immediately rejected or served from fallback. No load reaches the failing service.
3. Half-open: after a cooldown period (e.g. 30s), let one probe request through. If it succeeds, close the circuit. If it fails, reopen.
The key insight: without a circuit breaker, retries accelerate the cascade. If 100 callers each make up to 3 attempts against a slow service, 100 req/s becomes as much as 300 req/s on a backend that is already overwhelmed. An open circuit breaker cuts that to zero.
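The amplification arithmetic compounds when several tiers retry independently. A minimal sketch, assuming every call to the backend fails and every tier makes a fixed number of attempts; `amplifiedLoad` is a hypothetical helper name, not part of any library.

```typescript
// Worst-case load on a failing backend when every tier retries.
// attemptsPerTier[i] is the total attempts (first try + retries) made by tier i.
function amplifiedLoad(baseRps: number, attemptsPerTier: number[]): number {
  return attemptsPerTier.reduce((rps, attempts) => rps * attempts, baseRps);
}

// One tier making 3 attempts per request: 100 req/s becomes 300 req/s.
console.log(amplifiedLoad(100, [3])); // 300

// Three tiers, each making 3 attempts: 100 * 3 * 3 * 3.
console.log(amplifiedLoad(100, [3, 3, 3])); // 2700
```

The multiplication across tiers is why retries are usually placed at one layer only, or bounded by a shared retry budget.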
Implementation: most languages have libraries (Resilience4j for Java, or the older Hystrix, which is now in maintenance mode; Polly for .NET; opossum for Node). In service meshes, Envoy/Istio can do it at the sidecar level without app code.
Tuning: failure threshold (too low = flapping; too high = slow detection), cooldown window (too short = hammering; too long = slow recovery), and the scope (per-host vs per-service circuit).
Pros
- Prevents cascade failures
- Gives failing services breathing room to recover
- Fast fail: users see a degraded response immediately instead of waiting for a timeout
Cons
- Must define fallback behaviour for every circuit
- Tuning thresholds is an art: too sensitive means false trips
- Per-host circuits need per-host health tracking
Choose this variant when
- Calling any remote dependency on the critical path
- Dependency has a history of degraded performance
- You need fast-fail rather than slow-timeout behaviour
Bulkhead isolation
Isolate thread/connection pools so one bad dependency can't starve others.
One caller, separate thread pools per downstream. A slow payments backend fills only its own pool; DB and cache threads keep serving their traffic.
Named after ship bulkheads — watertight compartments that prevent a single breach from sinking the whole vessel.
In software: if your service calls three backends (DB, cache, recommendations API) on a shared thread pool of 200 threads, a slow recommendations API can consume all 200 threads waiting on timeouts. Now DB and cache calls — which are healthy — also fail because there are no threads to serve them.
Fix: dedicate separate thread pools (or connection pools, or semaphores) per dependency. If recommendations is slow, its pool of 50 threads fills up and requests to it fail fast, but the other 150 threads continue serving DB and cache calls normally.
Real-world: Hystrix pioneered this in Netflix's architecture. Modern service meshes implement it via connection limits and circuit breakers per upstream. Kubernetes resource limits are a form of bulkhead at the container level.
Sizing: too small = starve healthy traffic under normal load; too large = defeat the purpose. Monitor pool utilisation and tune.
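The isolation idea can be sketched with a per-dependency concurrency limit. This is a minimal illustration, not a production implementation: the `Bulkhead` class and the pool names are hypothetical, and because Node runs on a single thread the limit caps in-flight calls rather than threads.

```typescript
// A counting bulkhead: at most `maxConcurrent` calls in flight per dependency.
// Excess calls are rejected immediately instead of queueing behind a slow backend.
class Bulkhead {
  private inFlight = 0;

  constructor(private readonly maxConcurrent: number) {}

  async run<T>(fn: () => Promise<T>): Promise<T> {
    if (this.inFlight >= this.maxConcurrent) {
      throw new Error("bulkhead full");
    }
    this.inFlight++;
    try {
      return await fn();
    } finally {
      this.inFlight--; // always release the slot, on success or failure
    }
  }

  get active(): number {
    return this.inFlight;
  }
}

// One bulkhead per dependency, sized independently (numbers are illustrative).
const recommendationsPool = new Bulkhead(50);
const dbPool = new Bulkhead(150);
```

Rejecting when full, rather than queueing, is the design choice that makes a slow dependency fail fast instead of accumulating waiters.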
Pros
- Prevents cross-contamination between dependencies
- A slow API can't block unrelated fast paths
- Precise per-dependency capacity management
Cons
- More complex than a shared pool
- Must size each pool correctly
- Underutilised pools waste resources under normal load
Choose this variant when
- Service calls 3+ backend dependencies
- Criticality differs across dependencies (payments vs recommendations)
- You've seen cross-dependency contamination in production
Graceful degradation
Serve partial/stale results instead of total failure when a dependency is down.
When the primary path fails, fall through the hierarchy in order — each tier is cheaper and less fresh than the one above. Never serve the user a 500 if a degraded answer exists.
The hierarchy of degraded responses, from best to worst:
1. Serve stale from cache: user gets slightly old data but doesn't notice. Best UX. Example: search recommendations from a 10-minute-old cache instead of calling the ML service.
2. Partial response: omit the broken section. Example: show the tweet timeline but hide "who to follow" when the recommendation service is down.
3. Static fallback: return a generic/default response. Example: show trending topics from a static list when the personalisation engine is down.
4. Honest error with ETA: "This feature is temporarily unavailable. We're working on it." Better than a 500 page.
5. 503 with Retry-After: for API consumers, signal when to try again.
The key: decide the fallback before the outage, not during it. Every dependency in your critical path should have a documented degradation strategy.
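One way to make the hierarchy explicit in code is an ordered list of tiers that is walked until one succeeds. A minimal sketch under assumptions: `Tier`, `firstAvailable`, and the three stub tiers are hypothetical names for illustration.

```typescript
type Tier<T> = { name: string; fetch: () => Promise<T> };

// Try each tier in order and return the first that succeeds, together with its
// name so the caller can mark the response as degraded.
async function firstAvailable<T>(
  tiers: Tier<T>[],
): Promise<{ source: string; value: T }> {
  let lastErr: unknown = new Error("no tiers configured");
  for (const tier of tiers) {
    try {
      return { source: tier.name, value: await tier.fetch() };
    } catch (err) {
      lastErr = err; // fall through to the next, more degraded tier
    }
  }
  throw lastErr; // every tier failed: surface an honest error or a 503
}

// Hypothetical tiers for a recommendations widget.
const tiers: Tier<string[]>[] = [
  { name: "live", fetch: async () => { throw new Error("ML service down"); } },
  { name: "stale-cache", fetch: async () => ["item-1", "item-2"] },
  { name: "static-default", fetch: async () => ["trending-1"] },
];

firstAvailable(tiers).then((r) => console.log(r.source)); // stale-cache
```

Returning the tier name alongside the value lets the client decide whether to show a "results may be out of date" hint.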
Real-world examples:
- Netflix: when the recommendation engine is down, serve most-popular lists instead. Users barely notice.
- Amazon: during checkout service degradation, disable the "customers also bought" widget but keep the purchase flow working.
- Google Search: when the spell-check service is slow, skip spell correction rather than delaying all results.
Pros
- Users see a working (if reduced) experience
- Prevents cascade from a non-critical dependency taking down a critical path
- Reduces on-call pressure: partial outage instead of total
Cons
- Must design and test every fallback path
- Stale data can confuse users if it's too old
- Partial responses need client-side awareness
Choose this variant when
- User-facing paths where some data is better than no data
- Non-critical dependencies on the hot path (recommendations, ads, analytics)
- You can cache or pre-compute reasonable defaults
Timeout budgeting
Every hop gets a timeout shorter than its caller's timeout — no exceptions.
End-to-end SLA decomposed across hops. Each hop's timeout is strictly shorter than its caller's remaining budget, so a slow hop fails fast and the caller still has time to respond.
The rule: your timeout for calling dependency X must be shorter than your caller's timeout for calling you. If Service A gives you 3 seconds and you call Service B with a 5-second timeout, A gives up at 3 seconds while you are still waiting on B. Everything B does after that point is wasted work, because nobody is left to receive the result.
Timeout budget: decompose the end-to-end SLA into per-hop budgets. If the user expects 500ms, and you have 3 serial calls, each gets ~150ms (with slack for compute). If one call can't fit in 150ms, you have a design problem — not a timeout-tuning problem.
Common mistakes:
- No timeout at all — the call hangs forever, consuming a thread, connection, and eventually cascading.
- Timeout = infinity "just in case" — same as no timeout.
- Same timeout everywhere — one-size-fits-all ignores the call graph depth. Inner services need tighter timeouts.
- Timeout without retry budget: a 200ms timeout with 3 attempts is 600ms in the worst case, before any backoff delay. Your budget is blown.
Deadline propagation: pass the remaining deadline as a header/context (gRPC does this natively with deadlines). Each service in the chain knows how much time is left and can fail fast if there's not enough.
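The budgeting arithmetic is simple enough to check mechanically. A minimal sketch, assuming an even split across serial hops; `perHopBudgetMs` and `worstCaseHopMs` are hypothetical helper names.

```typescript
// Split an end-to-end budget evenly across serial hops, reserving slack
// for local compute.
function perHopBudgetMs(totalMs: number, serialHops: number, slackMs: number): number {
  if (serialHops <= 0) throw new Error("serialHops must be positive");
  return Math.max(0, Math.floor((totalMs - slackMs) / serialHops));
}

// Worst-case time spent on one hop when retrying: every attempt times out,
// plus the backoff waits between attempts.
function worstCaseHopMs(timeoutMs: number, attempts: number, backoffMs: number[] = []): number {
  const waits = backoffMs
    .slice(0, Math.max(0, attempts - 1))
    .reduce((sum, wait) => sum + wait, 0);
  return timeoutMs * attempts + waits;
}

// 500ms SLA, 3 serial calls, 50ms reserved for compute.
console.log(perHopBudgetMs(500, 3, 50)); // 150

// 200ms timeout, 3 attempts, no backoff: the retry loop alone exceeds a 500ms SLA.
console.log(worstCaseHopMs(200, 3)); // 600
```

Comparing `worstCaseHopMs` against `perHopBudgetMs` for each hop is a quick way to spot a retry policy that cannot fit inside the SLA.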
Pros
- Prevents unbounded waits
- Forces explicit latency budgeting
- gRPC deadlines carry the remaining budget across hops, provided each service passes the incoming context or deadline on to its outgoing calls
Cons
- Must tune per call: too tight means false failures
- Timeout plus retry multiplies worst-case latency
- Requires understanding the full call graph
Choose this variant when
- Every remote call — literally all of them
- Multi-hop service chains
- You have a user-facing latency SLA
Worked example
Scenario: you've designed a URL shortener and the interviewer asks "what fails?"
Component-by-component walkthrough:
API Gateway: crashes → all users in that AZ affected. Detection: health check fails, LB drains in 10s. Degradation: multiple gateways behind ALB, auto-replace.
Write service: slow under burst writes → new short URLs fail. Detection: p99 > 500ms alert. Degradation: queue writes, return 202 with pending status.
Redis cache: memory full, eviction storm → all reads hit DB, latency jumps 50x. Detection: cache miss rate spikes > 80%. Degradation: DB serves reads at reduced QPS, cache-aside backfills on recovery.
Primary DB: disk full, replication lag, leader crash → writes fail, stale reads. Detection: pg_stat_replication lag, disk alerts. Degradation: auto-failover to replica (RDS multi-AZ ~30s), stale reads from replica during promotion.
DNS / short URL resolution: DNS provider outage → no one can resolve short URLs. Detection: synthetic monitors from multiple regions. Degradation: multi-provider DNS (Route 53 + Cloudflare), TTL ensures cached resolutions survive ~5min.
Retry strategy: exponential backoff with jitter, cap at 3 retries, circuit-breaker on the DB call (open after 5 failures in 10s, half-open probe every 30s).
What the interviewer hears: "The most dangerous failure isn't the DB crashing — it's the DB being slow. A crash triggers failover in 30 seconds. A slow DB holds connections open, exhausts the pool, and cascades to the gateway. That's why I have a 200ms timeout on DB reads, a circuit breaker that opens after 5 timeouts in 10s, and a cache-first fallback that serves stale data during the brownout."
Good vs bad answer
Interviewer probe
“What happens when your database goes down?”
Weak answer
"We have replicas, so it'll failover. The system is highly available."
Strong answer
"Depends on the failure mode. If it's a crash, our RDS multi-AZ failover promotes the standby in ~30 seconds. During those 30 seconds, writes return 503 with Retry-After: 30, and reads serve from the Redis cache (stale by at most the last write window). If it's a slow degradation — which is worse — the circuit breaker on the DB call opens after 5 timeouts in 10 seconds, and all reads fall back to cache immediately while we investigate. The blast radius of a total DB loss is: no new short URLs can be created, but all existing redirects keep working from cache. We alert on replication lag > 1s and error rate > 5% at the DB client."
Why it wins: Distinguishes crash from slow failure, names the specific failover mechanism and timing, describes user-visible impact during the window, and identifies the circuit-breaker trigger.
When it comes up
- Always — interviewers will ask "what happens when X fails?" for at least one component
- Right after the HLD is drawn, before or during deep-dive
- When any dependency, external service, or database is introduced
- In senior rounds, expected to be raised proactively without prompting
- Whenever "high availability" or "reliability" becomes the topic
Order of reveal
1. Apply the four-question frame per component. "For each component I ask: what fails, blast radius, detection, degradation. Let me walk through them on this diagram."
2. Distinguish crash from slow failure. "The dangerous failure isn't the process crashing. Failover handles that in 30 seconds. It's the component being slow but appearing alive. That's what cascades."
3. State specific fallbacks, not "we have redundancy". "When the cache is down, we shed 90% of read traffic and serve critical auth from a dedicated DB pool. When the DB is slow, circuit breaker opens after 5 timeouts in 10s and reads fall back to cache."
4. Name timeout + retry discipline. "Every remote call has a timeout. Inner timeout < outer timeout. Retries only on idempotent operations, with exponential backoff, jitter, and capped at 3 attempts behind a circuit breaker."
5. Call out user-visible impact. "During DB failover, no new URLs can be created (503 + Retry-After 30), but all existing redirects keep working from cache. Creation pauses ~30 seconds; reads unaffected."
6. Bring up fail-open vs fail-closed. "Auth and authz fail-closed: security over availability. Rate limits, recommendations, feature flags fail-open: availability over precision."
7. Close with chaos discipline. "We don't wait for production to test failure modes. We inject faults in staging, then progressively in production with blast-radius caps."
Signature phrases
- “Slow failures cascade; crashes are detected fast” — The single most important insight — reveals real operational experience.
- “What fails, blast radius, detection, degradation” — The four-question frame in one phrase.
- “Inner timeout shorter than outer timeout” — Concrete discipline that prevents cascade.
- “Decide the fallback before the outage, not during it” — Pushes back on "we'll figure it out" thinking.
- “Fail-closed on security, fail-open on features” — Sharp heuristic for tough decisions.
- “Retries without jitter amplify the cascade” — Catches a common anti-pattern.
- “"We have redundancy" is a claim, not a design” — Forces specifics over platitudes.
Likely follow-ups
“Walk me through a specific cascade failure. Start with one component slowing and end with a full outage.”
Scenario: social feed service, 4 tiers — gateway → feed-service → recommendation-service → ML-model-server.
t=0: ML model server starts GC-pausing for 2 seconds at a time (memory pressure from a new model version). It's still "alive" — responds to health checks, just slowly.
t=5s: Recommendation service is supposed to call ML with a 500 ms timeout, but this new call path shipped without one, so nothing is enforced. Requests pile up waiting on ML. Recommendation's thread pool (200 threads) fills within 30 seconds as requests don't return.
t=35s: Recommendation service can't accept new requests. New calls from feed-service hit a full thread pool and either block or error.
t=40s: Feed-service, which calls recommendation with a 1 s timeout, starts seeing timeouts on 80% of requests. Its own thread pool starts filling (same mistake: some calls have no timeout). Feed-service error rate jumps to 50%.
t=60s: Gateway times out on feed-service at 2 s. Client error rate hits 40%. Circuit breaker at the gateway would trip here if one existed — but this codepath isn't wrapped. All threads on the gateway's feed pool are waiting.
t=90s: Gateway thread exhaustion. Health checks from the LB start timing out (same thread pool as user requests). LB marks gateway instances unhealthy and removes them. Full outage.
What would have prevented it:
1. Mandatory timeouts on every remote call, enforced by framework or sidecar. Not optional.
2. Circuit breaker on ML calls: after 5 timeouts in 10 s, stop calling; serve recommendations from a cached popular-items list.
3. Bulkhead: recommendation service's ML calls run on a separate 50-thread pool, not the main 200. If ML is slow, only those 50 threads block; the other 150 keep serving cached or simpler paths.
4. Dedicated health-check thread pool on the gateway: user-request exhaustion shouldn't prevent health checks from succeeding.
5. Graceful degradation: feed still works without fresh recommendations; serve the last successful recommendation list from a 5-minute cache.
The pattern: slow downstream → thread pool exhaustion → caller pool exhaustion → gateway exhaustion → LB marks healthy instances unhealthy → full outage. Each step is preventable with one of the four classic patterns (timeout, circuit breaker, bulkhead, fallback).
“How do you decide circuit breaker thresholds? Too sensitive vs too slow?”
The tradeoff:
- Too sensitive (threshold 3 failures in 30s) → flaps. Single transient hiccup trips the breaker, fallback serves for 30s while service was actually fine.
- Too slow (threshold 50 failures in 5 min) → cascade starts before breaker trips. Users see errors for minutes before protection kicks in.
Three numbers to set:
1. Failure threshold — when to open.
- Expressed as a rate (>50% errors in a sliding window), not an absolute count.
- Rate accounts for traffic volume; absolute counts false-trip at low QPS and slow-trip at high QPS.
- Typical: 50% error rate over a 10-second sliding window with a minimum of 20 requests (avoid tripping on 2/3 at low volume).
2. Cooldown — how long to stay open.
- Short (5-10s) → fast recovery, but risk hammering a still-broken service.
- Long (1-2 min) → safe recovery, but users see fallback longer than needed.
- Typical: 30 seconds. Long enough for transient blips to clear, short enough for fast recovery.
3. Half-open probe rate — how aggressively to test recovery.
- Let 1 request through at a time (strict). Slow to detect recovery, safe.
- Let N% of requests through. Faster detection, more load on a recovering service.
- Typical: 1 probe per cooldown cycle. If it succeeds, close fully. If not, reopen.
Error classification matters:
- 5xx from server → count as failure.
- 4xx from server → do NOT count; client error, server is healthy.
- Timeout → count as failure (and this is usually the dominant one).
- Connection refused → count as failure.
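The classification rules above map directly onto a small predicate. A minimal sketch: the `CallOutcome` type and `countsAsFailure` name are hypothetical, and a real client would derive the outcome from its own error types.

```typescript
type CallOutcome =
  | { kind: "response"; status: number }
  | { kind: "timeout" }
  | { kind: "connection-refused" };

// Decide whether an outcome should count toward the breaker's failure rate.
function countsAsFailure(outcome: CallOutcome): boolean {
  switch (outcome.kind) {
    case "timeout":
    case "connection-refused":
      return true; // the dependency did not answer in time, or at all
    case "response":
      // 5xx means the server is unhealthy; 4xx is the caller's mistake.
      return outcome.status >= 500;
  }
}

console.log(countsAsFailure({ kind: "response", status: 503 })); // true
console.log(countsAsFailure({ kind: "response", status: 404 })); // false
console.log(countsAsFailure({ kind: "timeout" })); // true
```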
Per-host vs per-service circuit:
- Per-service: one breaker for all instances. Simpler. Trips when the service is broadly sick.
- Per-host: one breaker per upstream instance. More granular — one bad host gets isolated while the rest of the fleet serves normally. Preferred in modern service meshes.
Production tuning: start with defaults (50% / 10s window / 30s cooldown / 1 probe), monitor trip rates, tune. False trips (breaker opens but service was healthy) → raise threshold. Slow trips (users see errors before breaker acts) → lower threshold or shorten window.
“Your health check is returning 200 but users are seeing errors. What's going on?”
Classic shallow health check problem: the endpoint returns 200 as long as the process is running, but the process can't actually serve users.
Common causes:
1. The check doesn't exercise dependencies.
- `/health` returns 200 immediately without checking DB, cache, or downstream services.
- Process is alive; its dependencies are broken; users see errors.
- Fix: a `/ready` endpoint that actually checks the dependencies it uses for user-serving traffic.
2. The check uses a different code path than user requests.
- Health check hits a dedicated endpoint that uses its own connection. User requests share the pool that's exhausted.
- Health check succeeds (its connection is fine); users fail (their pool is empty).
- Fix: health check should use the same pool / threadpool as user traffic, so pool exhaustion surfaces in the check.
3. The check isn't comprehensive.
- `/health` returns 200 if DB is reachable. But the cache is down, and all reads are now 10x slower and timing out.
- Fix: check all critical dependencies, or use a deeper "canary" request that mirrors real user traffic.
4. Degraded mode not detected.
- Service is running but in a known-bad state (e.g., serving stale data because replication broke). Health check doesn't know this.
- Fix: health check includes semantic sanity checks (data is less than N seconds old, feature flags match expected, etc.).
The layered approach (Google SRE style):
- `/healthz` (liveness): process is running. Kubernetes uses this to decide whether to restart the container.
- `/readyz` (readiness): process can serve traffic. Load balancer uses this to decide whether to route. Checks dependencies.
- `/status` (detailed): human-readable breakdown of each subsystem's health.
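A readiness check that exercises dependencies might look like the following. This is a minimal sketch, assuming each dependency exposes a cheap probe (a ping or trivial query); `Check`, `withTimeout`, and `readiness` are hypothetical names, and the 200ms default is illustrative.

```typescript
type Check = { name: string; probe: () => Promise<void> };

// Reject if `p` does not settle within `ms`, so a hung dependency
// cannot hang the health endpoint itself.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Readiness: run every dependency probe with a timeout; ready only if all pass.
async function readiness(checks: Check[], timeoutMs = 200) {
  const results = await Promise.all(
    checks.map(async (c) => {
      try {
        await withTimeout(c.probe(), timeoutMs);
        return { name: c.name, ok: true };
      } catch {
        return { name: c.name, ok: false };
      }
    }),
  );
  const ready = results.every((r) => r.ok);
  return { status: ready ? 200 : 503, results };
}
```

The per-check `results` array doubles as the payload for a detailed status endpoint, while the numeric `status` is what the load balancer acts on.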
Plus:
- Synthetic probes from outside — hit the real user-facing endpoints every 30 seconds from a remote monitor (Datadog, Pingdom, in-house). Catches "looks healthy from inside, broken from outside" cases.
- Real-user monitoring (RUM) — client-side metrics that tell you what users actually saw. If RUM errors diverge from server-side errors, your health checks are lying.
Interview answer: "Shallow checks lie. Readiness checks exercise the same dependencies as user traffic, and synthetic probes from outside validate the full path. The lesson is: if health checks never fail in ways correlated with user complaints, the checks are wrong."
“How do you test failure modes before they happen in production?”
Chaos engineering, applied with discipline.
Progression — build up from safe to scary:
Level 1: Local and staging injection.
- Tools: Toxiproxy (latency/drop injection), Istio/Envoy fault injection (percentage-based 500s and delays), language-level chaos libs.
- What: inject a 500 ms delay on the recommendation call; verify the circuit breaker trips and feed falls back to cached popular items within 3 seconds.
- Cost: cheap, fast iteration. No user impact.
Level 2: Staging with realistic load.
- Replay a slice of production traffic to staging.
- Inject: slow DB, kill a replica, saturate a queue, partition a network segment.
- Verify: the fallbacks and circuit breakers actually fire in the right sequence.
- Catches: config drift between dev and staging, fallbacks that work in unit tests but don't compose under load.
Level 3: Production with blast-radius caps.
- Netflix's approach: Chaos Monkey kills one instance in production during business hours. Other instances handle it.
- Scope: 1% of traffic, one AZ, one shard. Never the full fleet.
- Verify: fallbacks work under real traffic; no customer-visible impact.
Level 4: Full chaos exercises (game days).
- Scheduled, coordinated: "today we will simulate an entire AZ being lost."
- Run for 1-2 hours; verify RTO, failover, and user impact match the runbook.
- Post-mortem even though nothing "broke."
What to inject, in priority:
1. Process kill: covers crash failures. Easy to handle well.
2. Latency injection: covers slow failures. This is where most systems fail.
3. Network partition: covers split-brain and coordination failures.
4. Disk full / OOM: covers resource exhaustion. Often unhandled.
5. Dependency unavailable: covers upstream failure.
The honest truth: most teams test crashes and declare themselves resilient. Then a slow failure hits in production and they discover their timeouts, circuit breakers, and bulkheads don't actually work as expected. The discipline is to specifically test slow failures — they're harder to handle and more common in practice.
Metrics to track during chaos:
- User-facing error rate (should stay near zero).
- Latency p95/p99 (should degrade gracefully, not spike catastrophically).
- Fallback hit rate (should increase during injection — proves fallbacks fire).
- Time to detect (should match your SLO).
- Time to recover (should match your RTO).
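Latency injection at the language level, as mentioned under Level 1, can be as small as a wrapper around the call. A minimal sketch for local or staging use: `withInjectedLatency` is a hypothetical helper, and the random source is injectable so tests can be deterministic.

```typescript
// Wrap a call so that a fraction of invocations are delayed before running.
// `probability` is in [0, 1]; `random` defaults to Math.random.
function withInjectedLatency<T>(
  fn: () => Promise<T>,
  delayMs: number,
  probability: number,
  random: () => number = Math.random,
): () => Promise<T> {
  return async () => {
    if (random() < probability) {
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
    return fn();
  };
}

// Example: delay half of the calls to a stubbed recommendation fetch by 500ms.
const fetchRecommendations = async () => ["item-1", "item-2"];
const flakyFetch = withInjectedLatency(fetchRecommendations, 500, 0.5);
```

Pointing a timeout or circuit breaker at `flakyFetch` and watching the fallback hit rate is one way to confirm the protection fires on slow calls, not only on crashes.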
“You're handling auth. Fail-open or fail-closed if the auth service is down?”
Fail-closed. Always. No exceptions on auth.
Why:
- Fail-open on auth means an unauthenticated request is treated as authenticated. That's a security vulnerability, not a resilience feature.
- Attackers will notice. "The auth service has a 30-minute outage window every month" becomes "let's exploit that to get admin access."
- Compliance regimes (SOC 2, PCI DSS, HIPAA) require enforced access controls. An auth path that fails open is likely to be flagged as an audit finding.
The "but availability!" argument — why it's wrong:
- "But users can't do anything without auth, so we might as well let them through." No: a user seeing 503 is a bad experience; a user whose account gets compromised because auth failed-open is a much worse experience plus a lawsuit.
- "But my auth service is unreliable." Then fix the auth service. Don't compensate for its unreliability by compromising security.
What fail-closed actually looks like:
- Auth service unreachable → return 503 with retry-after.
- Auth token validation fails (expired, signature mismatch) → reject with 401.
- Authorization check can't be made → reject with 403 (or 503 if the authz service is down; user can retry).
Making auth resilient so fail-closed is tolerable:
1. Replicate aggressively. Auth is critical: 3+ replicas per region, multi-AZ, strict HA.
2. Cache auth decisions in the gateway or service. With a 60-second cache on "user X is valid", an auth outage affects a user only once their cached decision expires, so brief blips go unnoticed.
3. Signed tokens (JWT) that don't require the auth service for validation. Token + public key = validate locally. The auth service is only needed for issuing tokens, not validating them.
4. Short auth service outages tolerated via token TTL: existing tokens stay valid until expiry (typically 15 min to 1 hr). Only new logins are blocked.
What CAN fail-open (non-security):
- Rate limiters (see the rate-limiting lesson).
- Feature flags (fall back to defaults).
- Recommendations, personalization (serve generic).
- Analytics collection (drop events, don't block the user path).
The rule: if failing-open creates a security, correctness, or compliance bug, it's fail-closed. If it degrades UX but keeps the user experience safe, it's fail-open.
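The rule can be encoded so that the choice is explicit at each call site rather than implied by whatever the error handler happens to do. A minimal sketch: `FailPolicy` and `gate` are hypothetical names, and the check functions stand in for real auth or rate-limit clients.

```typescript
type FailPolicy = "fail-open" | "fail-closed";

// Run a gate check. If the check itself cannot be evaluated (dependency down,
// timeout), the policy decides: fail-open allows the request, fail-closed rejects it.
// A check that evaluates and returns false always rejects, under either policy.
async function gate(
  check: () => Promise<boolean>,
  policy: FailPolicy,
): Promise<boolean> {
  try {
    return await check();
  } catch {
    return policy === "fail-open";
  }
}

// Stubs standing in for unavailable services.
const authDown = async (): Promise<boolean> => { throw new Error("auth unreachable"); };
const limiterDown = async (): Promise<boolean> => { throw new Error("limiter unreachable"); };

gate(authDown, "fail-closed").then((allowed) => console.log(allowed)); // false
gate(limiterDown, "fail-open").then((allowed) => console.log(allowed)); // true
```

Requiring the policy as an argument means every new gate forces the author to write the decision down.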
Code examples
    // Circuit breaker: rate-based threshold over a tumbling window.
    type State = 'closed' | 'open' | 'half-open';

    class CircuitBreaker {
      private state: State = 'closed';
      private failures = 0;
      private total = 0;
      private windowStart = Date.now();
      private openedAt = 0;
      private probeInFlight = false;

      constructor(
        private readonly threshold = 0.5,     // open at 50% error rate
        private readonly windowMs = 10_000,   // over a 10s window
        private readonly cooldownMs = 30_000, // stay open 30s before probing
        private readonly minRequests = 20,    // ignore low-volume noise
      ) {}

      async call<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
        if (this.state === 'open') {
          if (Date.now() - this.openedAt < this.cooldownMs) return fallback(); // fail fast
          this.state = 'half-open'; // cooldown elapsed: allow a probe
        }
        if (this.state === 'half-open') {
          if (this.probeInFlight) return fallback(); // only one probe at a time
          this.probeInFlight = true;
          try {
            const result = await fn();
            this.reset(); // probe succeeded: close the circuit
            return result;
          } catch {
            this.trip(); // probe failed: reopen
            return fallback();
          } finally {
            this.probeInFlight = false;
          }
        }
        try {
          const result = await fn();
          this.record(false);
          return result;
        } catch {
          this.record(true);
          return fallback();
        }
      }

      private record(failed: boolean) {
        // Ignore stragglers that finish after the breaker has already tripped.
        if (this.state !== 'closed') return;
        const now = Date.now();
        if (now - this.windowStart > this.windowMs) {
          // Start a fresh window so old traffic does not dilute the error rate.
          this.windowStart = now;
          this.failures = 0;
          this.total = 0;
        }
        this.total++;
        if (failed) this.failures++;
        if (this.total >= this.minRequests &&
            this.failures / this.total > this.threshold) {
          this.trip();
        }
      }

      private trip() { this.state = 'open'; this.openedAt = Date.now(); }
      private reset() {
        this.state = 'closed';
        this.failures = 0;
        this.total = 0;
        this.windowStart = Date.now();
      }
    }

    // Retry with capped exponential backoff and full jitter.
    async function retryWithJitter<T>(
      fn: () => Promise<T>,
      {
        maxAttempts = 3,
        baseDelayMs = 100,
        maxDelayMs = 2_000,
        isRetryable = (_err: unknown): boolean => true,
      } = {},
    ): Promise<T> {
      let lastErr: unknown;
      for (let attempt = 0; attempt < maxAttempts; attempt++) {
        try {
          return await fn();
        } catch (err) {
          lastErr = err;
          if (!isRetryable(err) || attempt === maxAttempts - 1) throw err;
          // Full jitter: random between 0 and capped exponential delay.
          // Prevents thundering-herd retry spikes from synchronized clients.
          const capped = Math.min(maxDelayMs, baseDelayMs * 2 ** attempt);
          const delay = Math.random() * capped;
          await new Promise((r) => setTimeout(r, delay));
        }
      }
      throw lastErr; // only reached if maxAttempts < 1
    }

    // Usage: only retry idempotent reads. Never a raw POST /charge.
    // httpGet, isTimeout and is5xx are placeholders for your HTTP client's helpers.
    await retryWithJitter(() => httpGet('/inventory/123'), {
      isRetryable: (e) => isTimeout(e) || is5xx(e),
    });

    // Deadline propagation: pass the absolute deadline to each downstream call.
    interface Deadline { remainingMs(): number }

    function deadlineFromHeader(req: Request): Deadline {
      // Custom `x-deadline-ms` header carrying an absolute epoch-ms deadline.
      // (gRPC's `grpc-timeout` header carries a relative duration instead.)
      // Absolute deadlines assume roughly synchronized clocks between services.
      const parsed = Number(req.headers.get('x-deadline-ms'));
      // Missing or malformed header: fall back to a 1s default budget.
      const expiresAt = Number.isFinite(parsed) && parsed > 0 ? parsed : Date.now() + 1000;
      return { remainingMs: () => Math.max(0, expiresAt - Date.now()) };
    }

    async function callDownstream<T>(
      url: string,
      deadline: Deadline,
      // Reserve slack so the caller still has time to respond after this call.
      slackMs = 50,
    ): Promise<T> {
      const budget = deadline.remainingMs() - slackMs;
      if (budget <= 0) throw new Error('deadline exceeded before call');
      const ac = new AbortController();
      const t = setTimeout(() => ac.abort(), budget);
      try {
        const res = await fetch(url, {
          signal: ac.signal,
          headers: { 'x-deadline-ms': String(Date.now() + budget) },
        });
        if (!res.ok) throw new Error(`${res.status}`);
        return (await res.json()) as T;
      } finally {
        clearTimeout(t);
      }
    }

Common mistakes
Every service retrying failing dependencies 3× turns one slow endpoint into a traffic amplifier. The dependency falls over for good. Fix: exponential backoff with jitter, cap total retries, circuit-break on sustained failure.
"We have redundancy" is a claim, not a design. Which node holds the primary role? How do we detect the failure and trigger failover? How long until writes resume? What if a network partition makes us fail over while the original primary is still accepting writes? Name these.
Your auth service is down; do you let requests through (fail-open) or reject (fail-closed)? Default to fail-closed on anything security-critical. The exceptions are explicit and written down.
Crash failures are the easy case — you detect them fast and failover. Slow failures (GC pauses, lock contention, disk I/O saturation) are the killers because the component appears "alive" to health checks while being useless. Your timeout + circuit-breaker strategy is the only defence.
A remote call without a timeout is an unbounded resource lease. It holds a thread, a connection, and a slot in the caller's concurrency budget, potentially forever. Every remote call gets a timeout. No exceptions. Reasonable starting points are 200ms for cache, 1s for DB reads, and 5s for external APIs; tighten each against the measured p99 and the caller's budget.
Retrying a POST /charge without an idempotency key means the customer gets charged twice. Only retry operations that are safe to repeat. For non-idempotent operations, fail fast and return an error the client can act on.
Practice drills
Your interviewer draws a system with 4 microservices in a chain (A → B → C → D). Service D becomes slow. Describe the cascade and how to prevent it.
Without protection: D is slow → C waits on D, exhausts its thread pool → B waits on C, exhausts its pool → A waits on B → gateway times out → all users see 503s. The fix: (1) Timeout at every hop, tighter the deeper you go: the call to D gets 200ms, the call to C 500ms, the call to B 1s, the call to A 2s. (2) Circuit breaker on the C→D call: after 5 timeouts in 10s, C stops calling D and returns a fallback. (3) Bulkhead: C has separate thread pools for its D-calls vs its other work, so a slow D can't starve C's other endpoints.
What's the difference between fail-open and fail-closed? Give an example where each is correct.
Fail-open: when the check fails, let the request through. Correct for non-security features like rate limiting (better to let some extra traffic through than DoS yourself), feature flags (fall back to the default experience), and recommendations (show popular items). Fail-closed: when the check fails, reject the request. Correct for auth/authz (don't serve unauthenticated requests), payment verification (don't process unverified charges), and compliance gates.
Your cache (Redis) goes down. The DB is healthy but can handle only 10% of the normal read QPS. What do you do?
Immediate: (1) Load-shed at the gateway — reject ~90% of read traffic with 503 + Retry-After so the DB survives. (2) Prioritise critical reads (auth lookups, payment verification) on a dedicated connection pool. (3) Serve stale data from any local/in-process caches. Medium-term: bring up a new Redis instance, warm it from the DB with a background job, then re-enable traffic gradually. The anti-pattern: letting all traffic hit the DB — it dies within seconds under the full load, and now you have two outages.
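The shed-and-prioritise step can be sketched as an admission decision at the gateway. This is an illustration under assumptions: `Priority` and `shouldAdmit` are hypothetical names, and probabilistic shedding is only one of several possible strategies.

```typescript
type Priority = "critical" | "normal";

// Admit every critical request; admit normal requests with probability
// `admitFraction`. `random` is injectable so the decision can be tested.
function shouldAdmit(
  priority: Priority,
  admitFraction: number,
  random: () => number = Math.random,
): boolean {
  if (priority === "critical") return true;
  return random() < admitFraction;
}

// With the DB at 10% capacity, admit roughly 1 in 10 normal reads.
// Rejected requests would receive 503 + Retry-After.
const admitted = shouldAdmit("normal", 0.1);
console.log(typeof admitted); // boolean
```

Critical traffic still consumes DB capacity, so in practice the fraction for normal traffic would be set below the DB's headroom after critical reads are accounted for.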
How do you test failure modes before they happen in production?
Chaos engineering — inject failures in a controlled way: (1) Kill instances (Chaos Monkey). (2) Inject latency on specific calls (Toxiproxy, Istio fault injection). (3) Simulate network partitions between services. (4) Fill disks, exhaust memory, saturate CPU. (5) Fail an entire AZ. Run these in staging first, then in production with a blast radius cap (e.g. 5% of traffic). The goal: prove your circuit breakers, timeouts, and fallbacks actually work under real failure conditions.
Cheat sheet
- Every component: what fails, blast radius, detection, degradation.
- Slow > crash: slow failures cascade; crashes are detected fast.
- Retries: bounded, with jitter, behind a circuit breaker. Only for idempotent ops.
- Timeouts: every remote call. Inner timeout < outer timeout.
- Circuit breaker: closed → open (after threshold) → half-open (probe) → closed.
- Bulkhead: isolate thread/connection pools per dependency.
- Fallbacks: serve stale > partial response > static default > honest error > 503.
- Health checks: synthetic traffic, not just TCP. Check the dependency, not just the process.
- Fail-closed on security (auth, authz). Fail-open on non-critical features.
- Decide the fallback before the outage, not during it.