Latency budgeting
p50/p99 targets, per-hop budgets, tail latency mitigation.
A budget you don't compute is a budget you'll blow. Every synchronous hop costs milliseconds you don't get back — and tail latency isn't the average plus a bit, it's a different animal.
Read this if your last attempt…
- You didn't compute a per-hop latency budget
- You said "p99 is fine" without stating the p99
- You added network hops without budgeting them
- Your design sends serial calls when parallel would save time
The concept
A latency budget is the end-to-end target broken into per-hop allowances. Pick the user-visible target first (e.g. 200 ms p95 for a feed load), then subtract the unavoidable (client RTT, TLS handshake, edge → origin) and allocate what's left across your internal hops.
The three distinctions that matter
- p50 vs p99 vs p99.9 — p50 is what most users see; p99 is what power users feel; p99.9 is where the tail problems hide. Sizing a distributed system on p50 guarantees an unhappy p99.
- Serial vs parallel fan-out — N serial calls cost
sum(latencies); N parallel calls costmax(latencies). Any time you can parallelise independent calls, you should. - Tail amplification — if a request fans out to N backends in parallel, its latency is the max of N p99s. At N=100 with each p99 = 100 ms, overall p99 is dominated by tail stragglers and trends to p99.9 or worse.
User target → subtract client & edge → allocate across app → storage → return. Every hop needs a number, not a vibe.
Tail-latency techniques — cost vs savings.
| Technique | Saves | Costs |
|---|---|---|
| Hedged requests (fire second at p95) | p99 / p99.9 | ~5% extra load |
| Request cancellation | Steady-state load from hedging | Cancellation token plumbing |
| Parallel fan-out (fanning independent calls) | Serial latency sum | More concurrent connections |
| Timeouts at every hop | Unbounded stragglers | Must tune carefully — too tight = false fails |
| Circuit breakers | Latency under partial failure | Complexity; must exercise in chaos testing |
How interviewers grade this
- You state the user-visible target and the percentile (e.g. "p95 < 200 ms").
- Every hop in your diagram has a latency number.
- You identify the serial critical path and propose parallelisation where possible.
- You distinguish p50 sizing from p99 sizing — and size infrastructure for the higher percentile.
- You have a tail-latency plan (hedging, cancellation, circuit-breaking) when fan-out is wide.
Variants
Budget-first design
Pick user target, subtract fixed costs, allocate the rest to hops.
The discipline: start with a number, end with an allocation. Any hop that can't fit its budget is a design problem you see before launch, not after.
Pros
- +Forces a concrete constraint
- +Surfaces overbudget hops early
Cons
- −Takes 5 minutes of interview time
Choose this variant when
- Any new design
- Any performance-critical workflow
Parallel fan-out
Fire independent calls concurrently; wait for all (or quorum).
Cuts the sum to the max. But the max is a p99-on-p99 problem — the more backends, the worse the tail unless you add hedging.
tail-amplificationThat is why wide fan-out without hedging is a p99 trap: at N=100 you are effectively asking each backend to meet its p99.99, not its p99.
Pros
- +Linear latency → log(N) or constant
- +Essential for feeds, search, aggregation
Cons
- −Amplifies tail latency
- −Wastes work on cancelled calls
Choose this variant when
- Independent calls
- Aggregation / fan-out queries
Hedged requests
After p95 elapses, fire a duplicate; first to return wins.
The Jeff Dean / Google "Tail at Scale" technique. Costs ~5% extra load; saves dramatic p99 improvements in systems with a slow tail.
hedged-requestCap the hedge budget (e.g. never more than 5% of in-flight hedged) so you do not amplify the incident when every replica is slow together.
Pros
- +p99/p99.9 reduction without capacity changes
- +Works behind any replicated backend
Cons
- −Slightly higher steady load
- −Need idempotent backends or cancellation
Choose this variant when
- Replicated read path
- Fan-out > 10
- Tail is dominated by a few slow shards
Worked example
Target: feed-load p95 < 300 ms end-to-end.
- Client RTT: 50 ms (fixed).
- TLS + edge: 30 ms.
- Remaining server budget: 220 ms.
Server path (serial):
- Auth check: 5 ms (cached token).
- Feed-compose service: ~100 ms (4 parallel backend calls, max of them).
- Timeline (user tweets): p99 80 ms. - Follow graph: p99 40 ms. - Ads: p99 60 ms. - Feed rank ML: p99 100 ms.
- Hydration (parallel): 40 ms.
- Serialization + response: 10 ms.
Total server: 5 + 100 + 40 + 10 = 155 ms. Under budget with 65 ms of slack.
At wide fan-out to 4 backends, overall p99 approaches max(each p99), not their sum. If one backend regresses to 200 ms p99, the whole request p99 jumps. Add hedging on the Feed rank ML call (slowest tail) to pull p99 back.
Good vs bad answer
Interviewer probe
“What's your latency target and how did you allocate it?”
Weak answer
"It'll be fast. Redis is fast, gRPC is fast. Should be under a second."
Strong answer
"p95 < 300 ms end-to-end. After 80 ms of client + edge, server has 220 ms. Auth 5, feed compose 100 (4 parallel calls, max of them), hydration 40, serialize 10. Total 155 — 65 ms slack. Feed rank is the tail offender at p99 100 ms; I'd hedge it after 80 ms elapsed. Everything else is well-behaved."
Why it wins: Names a percentile, allocates per hop, identifies the tail offender, proposes a specific mitigation.
When it comes up
- Right after non-functional requirements — whenever you agree to a latency SLO
- During deep-dive on a read path with any fan-out or remote calls
- When the interviewer asks "how fast does this need to be?"
- When you propose adding a service, cache layer, or remote hop
- Whenever "p99" or "tail latency" enters the conversation
Order of reveal
- 1Commit to a target AND a percentile. "Let's target p95 < 300 ms for the feed load. I'll allocate a per-hop budget so we can see if the design fits before we get deep into components."
- 2Subtract the unavoidable first. "Client RTT ~50 ms and edge/TLS ~30 ms are fixed. That leaves ~220 ms for the server path."
- 3Allocate per hop and point at the slack. "Auth 5, feed compose 100, hydration 40, serialize 10 — that's 155 ms with 65 ms of slack. Every hop has a number, not a vibe."
- 4Identify the critical path and parallelise. "These three calls are independent, so they run in parallel — the cost is max(80, 40, 60) = 80, not the 180 sum."
- 5Call out the tail explicitly. "At 4-way fan-out the overall p99 is closer to the max of four p99s than to any one of them. Feed rank is the slow backend, so I'd hedge that one after 80 ms."
- 6Add timeouts and a fallback. "Every remote hop gets a timeout tighter than its budget, plus a fallback: stale cache on the feed service, degraded result on ads."
Signature phrases
- “Pick a target AND a percentile” — Prevents the "it'll be fast" hand-wave that weak candidates fall into.
- “Serial is sum, parallel is max” — One-line mental model for restructuring the call graph.
- “Wide fan-out amplifies tails” — Shows you know the Tail-at-Scale result, not just the average case.
- “Hedge the slow backend, not everything” — Signals calibrated use — hedging isn't free.
- “A hop without a timeout is a p99 disaster waiting to happen” — Concrete operational discipline.
- “Size for the percentile you committed to” — Catches the common p50-sizing mistake.
Likely follow-ups
?“Walk me through a hedged request in detail — when does the second fire, what happens if both return, and what's the downside?”Reveal
Trigger: start a timer when the first request goes out. If it hasn't returned by some quantile of the latency distribution (typically p95 of that backend), fire a second request to a different replica.
Resolution: take whichever response arrives first. Cancel the other in-flight request (tied requests take this further — the second backend cancels itself if it sees the first one is already executing).
Effect on load: in steady state only ~5% of requests spawn a hedge (since only ~5% exceed p95). The backend sees about 5% extra traffic, not double.
Downside 1 — duplicate work if backends aren't idempotent. Reads are fine; writes need careful thought or idempotency keys.
Downside 2 — correlated hedging. If all backends are slow together (GC pause, thundering herd), hedging fires everywhere at once and amplifies the incident. The fix is a cap: never more than X% of in-flight requests hedged, back off under load.
Downside 3 — metric pollution. Your "request latency" histogram now counts whichever response arrived first, which makes the underlying backend p99 harder to see. Instrument both.
?“Why does a 100-way fan-out have a p99 much worse than any individual backend's p99?”Reveal
The overall latency is max(L₁, L₂, ..., L₁₀₀) where each Lᵢ is drawn from the backend's latency distribution.
P(max < T) = P(L < T)¹⁰⁰. So for the overall p99 = max < T₉₉, we need P(L < T)¹⁰⁰ = 0.99, which means P(L < T) ≈ 0.9999 — we need each backend at its p99.99, not its p99, to get an overall p99.
Concretely: if each backend is p99 = 100 ms but p99.99 = 500 ms, the overall p99 of a 100-way fan-out is ~500 ms, not 100 ms.
Three mitigations:
- 1Reduce fan-out — aggregate at an intermediate tier so each request only spreads to e.g. 10 backends, then 10 of those aggregators merge.
- 2Hedge the slow tail — fire a duplicate to a replica once you hit the p95 of the individual distribution.
- 3Tolerate partial results — return after N-K backends respond (quorum). Works only if the missing K are tolerable in the result (e.g., search ranking).
?“How do you set the timeout for a hop? What's the right policy?”Reveal
Two constraints compete:
- Too tight → false failures on the slow tail; users see errors when the backend would have eventually answered.
- Too loose → the hop drags the whole request into its worst tail, defeating the budget.
Starting rule: timeout = backend p99.9 × 1.2, clamped to the remaining budget. Measure the p99.9 in production, don't guess.
Propagate a deadline, not just a timeout. Every request carries "you have X ms left". Each hop computes its timeout from the remaining deadline and the rest of the call graph. That way a slow early hop tightens the later hops automatically and the request either completes or fails fast.
Layer with retries carefully. Retry + timeout can multiply: a 1 s timeout with 2 retries can block up to 3 s. Either use a global deadline that caps all attempts, or set the retry budget as a percentage of the remaining deadline.
Fallback plan: every hop needs one. Stale cache, degraded result, or a cached default. A hard failure on a non-critical hop should never fail the whole request.
?“Your p95 is fine but p99 is terrible. Where do you look first?”Reveal
Signal that something happens to ~1% of requests but not the rest. Five usual suspects in order:
- 1GC pauses / JIT compilation / cold caches. Check GC logs; long pauses spike p99 without touching p50. Mitigation: tune GC (ZGC/Shenandoah for JVM), pre-warm caches on deploy.
- 1One slow shard or replica. The 1% of requests that hash to a degraded node. Check per-shard latency histograms, not just global. Fix: hedge, remove the bad replica, or rebalance.
- 1Lock / connection pool contention. Check pool wait times. If p99 wait >> p50 wait, the pool is undersized or there's a slow query holding a connection.
- 1Tail of a dependency. Your p99 is often someone else's p99.9. Drill down to the slowest hop in tracing; the bad actor is usually one call.
- 1Request size outliers. A few requests do much more work (large result sets, fat payloads). Segment the latency histogram by payload size.
Tool: distributed tracing with per-span percentiles is the fastest way to find the culprit. Global histograms only tell you p99 is bad — per-span tells you where.
?“How much budget should I spend on the database?”Reveal
A useful rule: cache hits ~1 ms, primary-key reads ~5-10 ms, indexed queries ~20-30 ms, unindexed or aggregations ~100 ms+. Anything you can push to a cache gets ~1 ms; the DB budget should cover only the cache misses.
For a 220 ms server budget with a 90% cache hit ratio:
- Hot path: 1-2 ms (cache hit, ~90% of requests).
- Cold path: 20-50 ms (one indexed DB query, ~10% of requests).
- Amortised per-request DB cost: ~0.9×1 + 0.1×30 = ~4 ms.
If your design needs more than one DB query on the hot path, either:
- 1Pre-compute and cache the composite result.
- 2Materialise a denormalised view you read in one query.
- 3Fan out to read replicas in parallel — p99 becomes max, not sum.
And if you're doing writes: writes are strictly more expensive than reads. A p95 < 100 ms on a write path usually means write-ahead log append + async materialisation, not synchronous commit-through to every index.
Code examples
// Caller sets a single end-to-end deadline; every hop reads it,
// computes its own timeout from the remaining slack, and never
// waits longer than the overall request was promised.
func HandleFeed(w http.ResponseWriter, r *http.Request) {
ctx, cancel := context.WithTimeout(r.Context(), 300*time.Millisecond)
defer cancel()
// Parallel fan-out: three independent calls share the deadline.
// Whichever finishes last determines the overall latency (max, not sum).
g, gctx := errgroup.WithContext(ctx)
var tweets, graph, ads Result
g.Go(func() error { return fetchTweets(gctx, &tweets) })
g.Go(func() error { return fetchGraph(gctx, &graph) })
g.Go(func() error { return fetchAds(gctx, &ads) })
if err := g.Wait(); err != nil {
// Deadline or subcall error: return what we have + a 206 Partial.
writePartial(w, tweets, graph, ads); return
}
writeFull(w, merge(tweets, graph, ads))
}
// Inside fetchTweets: timeout tight to the remaining slack, not a wall-clock constant.
func fetchTweets(ctx context.Context, out *Result) error {
remaining := deadlineRemaining(ctx) // e.g. 180ms
perHop := time.Duration(float64(remaining) * 0.6) // leave headroom
c, cancel := context.WithTimeout(ctx, perHop)
defer cancel()
return grpcClient.ListTimeline(c, req).Scan(out)
}// Based on "The Tail at Scale" (Dean, Barroso, CACM 2013).
// Call two replicas; second fires only if first hasn't returned by
// the backend's p95 latency. First response wins; losers cancel.
func HedgedGet(ctx context.Context, key string) (Result, error) {
const hedgeAfter = 80 * time.Millisecond // backend p95
out := make(chan Result, 2)
errs := make(chan error, 2)
c1, cancel1 := context.WithCancel(ctx); defer cancel1()
c2, cancel2 := context.WithCancel(ctx); defer cancel2()
go func() {
r, err := replicaA.Get(c1, key)
if err != nil { errs <- err; return }
out <- r
}()
timer := time.NewTimer(hedgeAfter); defer timer.Stop()
select {
case r := <-out:
// First replica won before the hedge window.
return r, nil
case <-timer.C:
// Fire the hedge. Hedge budget: cap at 5% of in-flight to
// avoid amplifying incidents when every replica is slow.
if !hedgeBudget.Allow() {
return <-waitOne(out, errs), nil
}
go func() {
r, err := replicaB.Get(c2, key)
if err != nil { errs <- err; return }
out <- r
}()
case err := <-errs:
return Result{}, err
}
r := <-out
cancel1(); cancel2() // loser is cancelled in its own RPC layer
return r, nil
}Common mistakes
The average user's experience is not the problem. Size on the percentile you've committed to — usually p95 or p99.
Three serial 50 ms calls is 150 ms; three parallel is 50 ms. Fan out anything independent.
A hop without a timeout drags the whole request into the slow tail. Every remote call needs a timeout that fits the hop's budget.
Overall p99 of a 100-way fan-out approaches the max of 100 p99s — effectively p99.99 of each backend. Hedge, cap fan-out, or pre-aggregate.
Practice drills
User target is 500 ms p99. Client RTT 100 ms, TLS 50 ms. You have 3 serial calls server-side at 80, 120, 60 ms p99. Over or under budget?Reveal
Over. 100 + 50 + 80 + 120 + 60 = 410 ms sums p99s, but serial p99 is worse than the sum of p99s (tail coincidence is rare). Assume ~410 as a lower bound; add ~20% for jitter → ~490 ms. Right at the edge. Fix: parallelise where possible (max of 120 instead of sum), or cache the slowest call.
Interviewer: "you have a 50-way fan-out; each backend is p99 = 50 ms. What's the overall p99?"Reveal
Not 50 ms. You're taking the max of 50 independent p99s, which approximately equals the p(1 − 0.01^(1/50)) ≈ p(0.02) of a single backend — way into the upper tail. Empirically the overall p99 is ~2–3× the individual p99, so 100–150 ms. Mitigations: hedge, reduce fan-out, pre-aggregate.
You add a new hop costing 20 ms p99. Your budget was already tight. What do you do?Reveal
Options in order: (1) see if it can be parallelised with an existing hop (free); (2) make it async if the result isn't needed in the response (free); (3) cache it if its inputs are stable (cheap); (4) push back on the feature until the budget is renegotiated (political, often correct).
Cheat sheet
- •Pick a target + percentile. "p95 < 300 ms", not "it'll be fast".
- •Every hop gets a budget number.
- •Serial = sum. Parallel = max. Prefer parallel for independent work.
- •Wide fan-out amplifies tails. Hedge the slow backends.
- •Timeout at every hop. Default to < budget so failures are bounded.
- •Watch p99 and p99.9 separately — they are different problems.