Observability & operations
Metrics, logs, traces, SLOs, alerting on symptoms not causes.
You cannot operate what you cannot see; you cannot page on what you cannot measure. Candidates who design beautiful systems with no metrics, no logs, and no alerts are designing systems their on-call team will hate.
Read this if your last attempt…
- Your design ended without mentioning how anyone would know it's broken
- You said "we'll have monitoring" but didn't say what
- You can't name what you'd alert on
- You confuse SLI, SLO, and SLA
The concept
Observability is three pillars and one driver.
The three pillars
- Metrics (aggregated numbers over time) — request rate, error rate, duration, queue depth. Cheap to store at scale. The foundation of alerting.
- Logs (per-event records) — a narrative of what happened. Expensive at scale — log only what you'll actually query. Prefer structured over free-text.
- Traces (causally-linked spans across services) — how did a request flow? Where was the latency? Indispensable for distributed debugging; sample aggressively (1–10%) to keep cost bounded.
Metrics aggregate, logs narrate, traces connect. SLOs are the driver: they decide when the pager goes off.
Three pillars — what each is for.
| | Metrics | Logs | Traces |
|---|---|---|---|
| Purpose | Alert + trend | Narrative + audit | Causality + latency breakdown |
| Cardinality | Low (labels bounded) | Unbounded (but sampled) | Sampled (1–10%) |
| Cost profile | Cheapest | Most expensive at scale | Medium (sampling-controlled) |
| Query shape | Time-series aggregation | Full-text + structured | Span tree per trace id |
| When to reach for it | Dashboards, alerts | Post-incident forensics | "Where did my 2 seconds go?" |
How interviewers grade this
- Every stateful component in your design has at least the four golden signals instrumented.
- You name one SLO per user journey (read, write) with a concrete target (p99 < 200 ms, availability 99.9%).
- You page on burn rate, not on raw thresholds: fast-burn alerts for acute issues, slow-burn for chronic.
- You distinguish metrics (cheap, aggregated), logs (structured, sampled in volume), and traces (sampled at ~1%).
- You have a "what do I look at first at 3 AM" dashboard — the USE/RED/4-signals view for the service.
Variants
RED (request-driven services)
Rate, Errors, Duration — instrument these per endpoint.
The default instrumentation for a request-serving service. Rate = QPS, Errors = error ratio or rate, Duration = latency distribution (p50/p95/p99). Dashboard these per-endpoint and per-service.
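As a rough illustration of what "RED per endpoint" means numerically, the sketch below computes the three signals from an in-memory list of request records for one endpoint and one window. The `Request` type, the `red_summary` helper, and the choice to count only 5xx responses as errors are assumptions made for the example; a real service would emit counters and histograms rather than keep raw requests.

```python
from dataclasses import dataclass


@dataclass
class Request:
    endpoint: str
    status: int         # HTTP status code
    duration_ms: float  # server-side latency


def percentile(sorted_values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, (pct * len(sorted_values) + 99) // 100)  # integer ceiling
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Rate, Errors, Duration for one endpoint over one window."""
    if not requests:
        raise ValueError("no requests in window")
    durations = sorted(r.duration_ms for r in requests)
    # Assumption for this sketch: only 5xx responses count as errors.
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_qps": len(requests) / window_seconds,
        "error_ratio": errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

Computing percentiles from raw samples is only practical at toy scale; in production the Duration signal usually comes from a bucketed histogram.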
Pros
- Directly maps to user experience
- Three numbers cover 80% of diagnostic questions
- Standard across every service
Cons
- Says nothing about resource saturation
- Needs a per-endpoint histogram (costs cardinality)
Choose this variant when
- Request-serving HTTP/gRPC services
- Anything user-facing
USE (resources / infra)
Utilization, Saturation, Errors — instrument these per resource.
The complement to RED for infrastructure: CPU, memory, disk, network, pools. Utilization = % busy, Saturation = extent of queued work (load avg, queue depth), Errors = device-level errors.
Pros
- Surfaces capacity problems before they become user-visible
- Standard across every resource
Cons
- Not directly tied to user impact
- Alerting on raw saturation leads to noisy alerts
Choose this variant when
- Databases, caches, queues
- VMs, containers, nodes
Four golden signals
Latency, traffic, errors, saturation — Google SRE's canonical set.
A synthesis of RED + USE. The four things any service must emit: latency of successful requests, traffic volume, error rate, and how "full" the service is. Treat this as the minimum viable dashboard for every service you design.
Pair the four signals with multi-window burn-rate alerts: the signals are what you dashboard, the burn rates are what you page on. Use two windows and two severities: fast burn catches acute incidents, slow burn catches chronic drift. No paging on raw CPU or error counts.
Pros
- Universal
- Covers both request-path and resource-path concerns
- Aligns with SLO definitions
Cons
- Requires some judgment to define "saturation" per service
Choose this variant when
- Any service
- Default starting point
Worked example
Scenario: A payments API. You've designed the HLD; now make it operable.
SLOs (per user-visible journey):
- POST /charge — availability 99.95% (≈ 21 min/month budget), latency p99 < 500 ms.
- GET /charge/:id — availability 99.99%, latency p99 < 100 ms.
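The budget figures attached to these targets are plain arithmetic: the allowed failure fraction times the length of the window. A minimal sketch, assuming a 30-day window and treating the budget as full-outage-equivalent minutes (the function name is made up for the example):

```python
def error_budget_minutes(slo_target, window_days):
    """Full-outage-equivalent minutes allowed by an availability SLO."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Targets from the scenario above, over an assumed 30-day window.
post_charge_budget = error_budget_minutes(0.9995, 30)  # about 21.6 minutes
get_charge_budget = error_budget_minutes(0.9999, 30)   # about 4.3 minutes
```

Each extra nine divides the budget by ten, which is why the read path's 99.99% leaves only a few minutes per month.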
Metrics (RED per endpoint, USE per resource):
- rate: http_requests_total{endpoint, status}
- errors: http_requests_errors_total / http_requests_total
- duration: http_request_duration_seconds_bucket (histogram)
- DB: db_connections_in_use, db_query_duration_seconds, db_replication_lag_seconds
- Queue: queue_depth, queue_consumer_lag_seconds
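The duration metric is a bucketed histogram, so a p99 is estimated from cumulative bucket counts rather than read off raw samples. The sketch below follows the linear-interpolation idea behind PromQL's `histogram_quantile`, in simplified form. The input shape (a sorted list of `(upper_bound, cumulative_count)` pairs ending in an infinite bucket) is an assumption for the example, not the Prometheus wire format.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile (0 < q <= 1) from cumulative buckets.

    buckets: (upper_bound_seconds, cumulative_count) pairs sorted by
    upper bound; the last upper bound is expected to be math.inf.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, cumulative in buckets:
        if cumulative >= rank:
            if math.isinf(upper_bound):
                # Quantile lies in the overflow bucket: the best we can
                # report is the highest finite bound.
                return lower_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - lower_count) / (cumulative - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, cumulative
    raise ValueError("buckets must be cumulative and cover all observations")
```

The estimate is only as fine as the bucket boundaries, so it helps to place a boundary at the SLO threshold (for example 0.5 s for a 500 ms target).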
Alerts (burn-rate-based):
- Fast burn: 2% of monthly budget consumed in 1h → page immediately.
- Slow burn: 5% of monthly budget consumed in 6h → ticket, investigate same business day.
- No "CPU > 80%" page — that's a cause, not a symptom. Page on SLO burn.
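The thresholds above translate into burn rates, meaning how many times faster than "exactly on budget" the service is failing. A small sketch of that conversion, assuming a 30-day SLO window (function names are illustrative):

```python
def burn_rate(budget_fraction_consumed, alert_window_hours, slo_window_days=30):
    """Burn-rate multiple implied by consuming a budget fraction in a window."""
    slo_window_hours = slo_window_days * 24
    return budget_fraction_consumed * slo_window_hours / alert_window_hours


def error_ratio_threshold(slo_target, rate):
    """Error ratio over the alert window that corresponds to a burn rate."""
    return rate * (1.0 - slo_target)


def hours_to_exhaustion(rate, slo_window_days=30):
    """Time until the whole budget is gone if the burn rate is sustained."""
    return slo_window_days * 24 / rate


fast = burn_rate(0.02, 1)  # about 14.4x
slow = burn_rate(0.05, 6)  # about 6x
# For the 99.95% POST /charge SLO, the fast-burn page fires when the
# 1h error ratio exceeds roughly 0.72%.
fast_threshold = error_ratio_threshold(0.9995, fast)
```

A sustained 14.4× burn empties a 30-day budget in about 50 hours, which is why it pages rather than tickets.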
Logs (structured, sampled at high volume):
- One structured log per request: {trace_id, endpoint, status, duration_ms, user_id, amount}.
- Error logs un-sampled; 2xx logs sampled at 1% at high QPS.
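One possible shape for this logging policy is sketched below. The field names come from the list above. The helper names, the injectable `rng` parameter, and the decision to keep every non-2xx line (the text only says errors are unsampled) are assumptions for the example.

```python
import json
import random

SUCCESS_SAMPLE_RATE = 0.01  # keep 1% of 2xx request logs


def should_log(status, rng=random.random):
    """Keep every non-2xx log line; sample 2xx lines."""
    if not 200 <= status < 300:
        return True
    return rng() < SUCCESS_SAMPLE_RATE


def request_log_line(trace_id, endpoint, status, duration_ms, user_id, amount):
    """Render one structured (JSON) log line for a finished request."""
    record = {
        "trace_id": trace_id,
        "endpoint": endpoint,
        "status": status,
        "duration_ms": duration_ms,
        "user_id": user_id,  # high-cardinality: fine in logs, not in metric labels
        "amount": amount,
    }
    return json.dumps(record, sort_keys=True)


if should_log(502):
    print(request_log_line("t-123", "POST /charge", 502, 840, "u-42", 1999))
```

Because successes are sampled, request counts should come from metrics, not from counting log lines.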
Traces:
- OpenTelemetry, 1% sampling, always-sample on errors.
- Span attributes: db.statement (hashed), queue.key, external_call.target.
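A 1% probabilistic baseline keeps cost bounded. Tail sampling at the collector decides after a trace completes, so it can always keep error traces and slow-tail traces, the ones you actually need at 3 AM. Pure head sampling decides before the outcome is known, so on its own it cannot guarantee that.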
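The keep-or-drop decision for a finished trace can be sketched as a small policy function. This illustrates the idea only, not the collector's implementation: the `CompletedTrace` type is hypothetical, and the 500 ms threshold is an assumed value chosen to match the POST /charge latency target.

```python
import random
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    trace_id: str
    has_error: bool     # any span in the trace ended with an error status
    duration_ms: float  # end-to-end duration of the root span


def keep_trace(trace, latency_threshold_ms=500.0, baseline_rate=0.01,
               rng=random.random):
    """Tail-sampling decision, made only after the trace has completed."""
    if trace.has_error:
        return True  # always keep errors
    if trace.duration_ms > latency_threshold_ms:
        return True  # always keep the slow tail
    return rng() < baseline_rate  # 1% of everything else
```

The price of deciding late is that the collector must buffer every span until the trace finishes, which costs memory in proportion to traffic.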
The 3 AM dashboard:
1. Four-signals panel for POST /charge + GET /charge/:id.
2. Error-budget burn for both SLOs.
3. DB replication lag + queue depth, the two non-obvious leading indicators.
4. Recent deploys overlaid on the error-rate graph.
Good vs bad answer
Interviewer probe
“How will you know the system is broken?”
Weak answer
"We'll have Datadog or Prometheus, and we'll set up alerts on CPU, memory, and errors."
Strong answer
"Every service emits RED (rate, errors, duration) per endpoint and USE (utilization, saturation, errors) per resource. The user-facing SLO is p99 < 200 ms at 99.9% availability, so we alert on SLO burn rate: fast-burn (2% of 30-day budget in 1h) pages immediately; slow-burn (5% in 6h) opens a ticket. Saturation metrics are on the dashboard but not paged directly — they cause issues that show up as SLO burns, and we look at them during triage. Traces are sampled at 1% with error-sampling always on, so when we hit a latency spike at 3 AM, we can pick a slow trace and see where the time went."
Why it wins: Names the instrumentation discipline (RED/USE), the SLO with concrete numbers, the alerting philosophy (symptoms over causes), and how the three pillars fit together in practice.
When it comes up
- When the interviewer asks "how will you know it's broken?"
- Right after you finish the HLD — "how do we operate this?"
- When SLOs, SLAs, availability, or uptime are discussed
- When a failure scenario is proposed and you need to detect it
- In senior/staff rounds where operability is a core signal
Order of reveal
1. Name the three pillars and the driver. "Three pillars (metrics, logs, traces) and one driver: SLOs. Metrics aggregate and alert, logs narrate, traces connect, SLOs decide when the pager fires."
2. Pick an instrumentation pattern. "RED per request-driven service (rate, errors, duration), USE per resource (utilization, saturation, errors). Together they cover user experience and capacity."
3. Define an SLO with numbers. "One user-visible SLO per journey. For the checkout API: 99.9% of POST /charge requests return 2xx with duration < 500 ms, measured over a rolling 28 days."
4. Commit to burn-rate alerting. "Alerts fire on error-budget burn, not raw error counts. Fast-burn (2% of monthly budget in 1h) pages; slow-burn (5% in 6h) opens a ticket. No paging on CPU."
5. Control cardinality and cost. "Metric labels stay low-cardinality (endpoint, status, region). High-cardinality attributes like user_id go in logs and traces. Traces sampled at 1% with error-always-on."
6. Describe the 3 AM dashboard. "One dashboard per service: four golden signals, error-budget burn, the top 2-3 leading indicators (DB replication lag, queue consumer lag), recent deploys overlaid on the error graph."
Signature phrases
- “Page on symptoms, not causes” — The single most important modern alerting principle.
- “SLI, SLO, SLA: measurement, target, contract, in that order” — Prevents the common conflation.
- “Metric labels are low-cardinality; logs carry the user_id” — Shows cost-awareness at scale.
- “Every service, four golden signals” — One-line definition of the instrumentation floor.
- “Traces sampled at 1%, always on errors” — Hits the cost vs coverage tradeoff precisely.
- “Error budget is permission to deploy, not a failure mode” — Shows you understand the SLO culture, not just the math.
Likely follow-ups
“Write an SLO for a checkout API from scratch. Name every part.”
Service: POST /v1/charge — customer-initiated payment endpoint.
SLI (Service Level Indicator — the measurement): Fraction of valid POST /v1/charge requests that:
- Return a 2xx or 3xx response, AND
- Complete server-side in under 500 ms.
"Valid" excludes requests rejected for auth/fraud (those are successful rejections) and requests the client aborted before we could respond. Measured at the load balancer, not inside the service, so failures the service never sees (crashed instances, refused connections, timeouts) still count.
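Written as code, the SLI is a ratio of good events to valid events. The sketch below assumes load-balancer records have already been tagged with why a request was excluded. The `LBRecord` type and its flags are hypothetical, since real LB logs would need that classification derived from status codes and connection state.

```python
from dataclasses import dataclass


@dataclass
class LBRecord:
    status: int
    duration_ms: float
    rejected_auth_or_fraud: bool = False  # a "successful rejection"
    client_aborted: bool = False          # client left before we responded


def is_valid(record):
    """Valid requests are the denominator of the SLI."""
    return not (record.rejected_auth_or_fraud or record.client_aborted)


def is_good(record, latency_ms=500.0):
    """Good = 2xx/3xx response AND completed under the latency bound."""
    return 200 <= record.status < 400 and record.duration_ms < latency_ms


def sli(records):
    """Fraction of valid requests that were good; None if there were none."""
    valid = [r for r in records if is_valid(r)]
    if not valid:
        return None
    return sum(1 for r in valid if is_good(r)) / len(valid)
```

Note that a slow 200 and a fast 500 both count against the SLI, while an auth rejection counts for nothing either way.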
SLO (Service Level Objective — the target): ≥ 99.9% of valid requests meet the SLI over a rolling 28-day window.
Error budget: 0.1% × 28 days = ~40 minutes of full-outage-equivalent budget per 28-day window.
Alerts (multi-window burn rate, Google SRE style):
- Fast burn: consuming > 2% of the budget in 1h → page immediately. That is a ~14.4× burn rate on a 30-day window (~13.4× on 28 days); if sustained, the budget is exhausted in about 2 days (50 hours).
- Slow burn: > 5% in 6h → ticket, investigate same business day. ~6× burn.
- Chronic burn: > 10% in 3 days → design review; system is structurally over budget.
Why this form:
- Duration in the SLI — "available" doesn't help if p99 is 10 seconds.
- Rolling window, not calendar month — prevents gaming (deploy at end of month when budget is fresh).
- Multi-window alerting — catches both acute incidents (fast burn) and slow degradation (chronic burn), reducing false pages.
SLA (separate from SLO): if there's an external commitment (e.g. B2B contract with refund credits), set it looser than the SLO — 99.5% is typical with a 99.9% internal SLO, so you have internal margin before customers invoke the SLA.
“The service is green but users are complaining. Where do you look?”
Five common blind spots, in order of frequency:
1. Server-side success, client-side failure. You're measuring 2xx responses from the LB; the client saw a timeout, TLS error, or TCP reset before the response arrived. Fix: add real-user monitoring (RUM — Sentry, Datadog RUM) or synthetic probes from outside your VPC. Compare server SLI to client SLI.
2. p99 looks OK but p99.9 is terrible. A small cohort hits a slow path (hot key, specific shard, flaky downstream). If that cohort is under 1% of traffic, the global p99 never sees it. Fix: segment latency histograms by user tier, tenant, or shard. Look at per-segment p99.
3. "Successful" responses that are actually errors. Your API returns 200 with {"error": "insufficient_funds"}. The SLI counts these as success; the user sees a failure. Fix: classify semantically, not by HTTP status. Define success/failure in application terms.
4. Downstream dependency is slow-but-succeeding. Your service is healthy; it's waiting on a 3rd-party payment processor that takes 10 s instead of 500 ms. Metrics at your service look fine because they're scoped to your code. Fix: track downstream-call latency separately; add it as an SLI for flows that depend on it.
5. Recent deploy. The dashboard doesn't overlay deploys. Something shipped 10 minutes before the complaints started. Fix: annotate dashboards with deploy markers. First question in any incident: "what just changed?"
The meta-fix: run a quarterly "blind spot drill." Take one real user complaint, reproduce the signal from server-side metrics alone. If you can't, that's your next investment.
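Blind spot 3 in the list above (a 200 response carrying an error payload) is the easiest to show in code. A minimal sketch, assuming JSON bodies in which a top-level `error` key signals failure; that convention, and treating an unparseable body as a failure, are assumptions about the API rather than anything the text specifies.

```python
import json


def is_semantic_success(status, body_text):
    """Classify by application outcome, not by HTTP status alone."""
    if not 200 <= status < 300:
        return False
    try:
        body = json.loads(body_text) if body_text else {}
    except json.JSONDecodeError:
        return False  # assumption: an unparseable body counts as a failure
    # Assumption: a top-level "error" key means the operation failed.
    return not (isinstance(body, dict) and "error" in body)
```

Feeding this classification into the SLI, instead of the raw status code, makes the dashboard agree with what the user saw.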
“Explain the cost difference between metrics, logs, and traces at scale. How do you control each?”
At 100K QPS, rough monthly cost at typical SaaS pricing:
Metrics: cheapest. Time-series with bounded label cardinality. Cost is proportional to (series count × retention). Per-endpoint p99 at 100 endpoints × 5 services × 10 label-value combinations = ~5K series, a few $100/mo.
- Controls: keep labels low-cardinality (no user_id in a label, ever). Pre-aggregate at the agent. Drop unused series.
Logs — most expensive. Per-event storage + full-text indexing. At 100K QPS with 1 KB logs × 30-day retention = ~260 TB × index overhead. Tens of $K/mo on managed platforms.
- Controls: structured logs only. Sample 2xx logs at 1-5%, keep errors at 100%. Short retention for verbose logs (7 d) vs audit logs (years). Separate hot (queryable) from cold (S3) tiers.
Traces: medium, controllable. Per-span storage, sampled. At 1% sampling × 10 spans per request × 100K QPS = 10K spans/sec × 500 B = ~5 MB/sec (roughly 13 TB over 30 days), less with compression. A few $K/mo at 1% sampling.
- Controls: head-based sampling (sample at the edge, commit from there) + tail-based sampling (always keep traces where any span errored, duration > threshold, or specific labels).
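The volume estimates behind these cost tiers are back-of-envelope multiplications, reproduced below so the inputs can be varied. The sketch assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and ignores compression and index overhead; the function names are illustrative.

```python
from math import prod

SECONDS_PER_DAY = 86_400


def metric_series(*label_cardinalities):
    """Series count is the product of the label cardinalities."""
    return prod(label_cardinalities)


def log_storage_tb(qps, bytes_per_log, retention_days, sample_rate=1.0):
    """Raw log bytes retained, in decimal terabytes."""
    total = qps * sample_rate * bytes_per_log * SECONDS_PER_DAY * retention_days
    return total / 1e12


def trace_ingest_mb_per_sec(qps, sample_rate, spans_per_request, bytes_per_span):
    """Raw span bytes ingested per second, in decimal megabytes."""
    return qps * sample_rate * spans_per_request * bytes_per_span / 1e6


series = metric_series(100, 5, 10)                            # 5,000 series
logs_tb = log_storage_tb(100_000, 1_000, 30)                  # about 259 TB
traces_mb_s = trace_ingest_mb_per_sec(100_000, 0.01, 10, 500)  # about 5 MB/s
```

Setting `sample_rate=0.01` in `log_storage_tb` shows why sampling 2xx logs is the single biggest lever: the same traffic drops to under 3 TB.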
Rule of thumb:
- If you'd chart it on a dashboard → metric.
- If you'd query it with "what happened?" → log.
- If you'd ask "where did the time go?" → trace.
Common anti-pattern: using logs as metrics. Grepping for "error" in 260 TB of logs is 100-1000× more expensive than a counter. If you query it regularly, promote it to a metric.
“What's wrong with alerting on "CPU > 80% for 5 minutes"?”
Three problems:
1. CPU high isn't necessarily user-impacting. A batch job, a warmup, a cache rehydration can all spike CPU without affecting latency or error rate. Paging on this wakes on-call for something the user never saw. After a few of those, on-call mutes the alert — and misses the real one.
2. CPU normal isn't necessarily user-healthy. You can be at 30% CPU while a downstream dependency times out and your users see 10-second latencies. CPU alerts don't catch this class of problem at all.
3. The threshold is a guess. Why 80%? Why not 70%? Why not 90%? The right threshold depends on workload, which changes over time. Every CPU alert I've seen in production was set once and never re-tuned.
The better pattern — symptoms over causes:
- Page on SLO burn (user-visible latency or error rate). That's the symptom the user experiences.
- Use CPU / memory / saturation in triage dashboards — when the SLO alert fires, the on-call looks at saturation to understand why.
- Alert on saturation only for capacity forecasting, not paging. If a resource will exhaust in N hours given current growth, ticket the capacity team.
In the spirit of the Google SRE guidance: metrics that tell you a user experienced a problem page; metrics that help you figure out why get dashboarded.
The one exception: hard ceilings that will cause a user-visible outage soon (disk > 90% full, connection pool > 95%). Page on these because the symptom hasn't materialised yet but will. Treat them as leading indicators, not resource utilisation.
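A leading-indicator alert of this kind is usually phrased as "time until the ceiling is hit" rather than as a raw percentage. The sketch below uses a straight-line extrapolation from two samples. The 4-hour page and 72-hour ticket cut-offs are hypothetical values for illustration, not recommendations from the text.

```python
def hours_until_full(used_now, used_earlier, hours_between, ceiling=1.0):
    """Linear extrapolation of time until a resource reaches its ceiling.

    used_now / used_earlier are fractions of capacity (0.0 to 1.0).
    Returns None when usage is flat or shrinking.
    """
    growth_per_hour = (used_now - used_earlier) / hours_between
    if growth_per_hour <= 0:
        return None
    return (ceiling - used_now) / growth_per_hour


def capacity_action(hours_left, page_within_hours=4.0, ticket_within_hours=72.0):
    """Map a forecast to an action; thresholds are illustrative only."""
    if hours_left is None:
        return "none"
    if hours_left <= page_within_hours:
        return "page"
    if hours_left <= ticket_within_hours:
        return "ticket"
    return "none"


# Disk went from 85% to 90% over the last 5 hours: roughly 10 hours left.
forecast = hours_until_full(0.90, 0.85, 5)
action = capacity_action(forecast)
```

The same disk at 90% full and not growing produces no alert at all, which is the behaviour a fixed "disk > 90%" threshold cannot give you.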
“Your team doesn't have SLOs yet. Where do you start?”
Phase 1 — measure what exists (1-2 weeks). Before setting a target, instrument:
- Per-endpoint request rate, error rate, latency histogram. RED method.
- The three or four most-user-impactful journeys (checkout, login, main feed).
- Client-side signals if possible (RUM). Server-side lies about what users see.
Phase 2 — establish baseline (2-4 weeks). Look at current p50/p95/p99 and error ratios. Don't set SLOs yet. Understand the shape of the distribution — especially the tails.
Phase 3 — draft candidate SLOs. For each top journey, propose:
- An SLI definition (what counts as success, with duration).
- A target based on current performance minus a small margin. If today's p99 is 300 ms at 99.95%, propose 99.9% p99 < 500 ms. Tighter than reality breaks on day 1; much looser than reality wastes budget.
- A 28-day rolling window.
Phase 4 — socialise. Show the SLO to product, on-call, and leadership. Every SLO involves a tradeoff (tighter SLO = more engineering, less feature velocity). Get explicit buy-in.
Phase 5 — ship alerting. Multi-window burn-rate alerts. Start noisy; tune down once false-positive rate is understood.
Phase 6 — iterate. Quarterly SLO review. If budget is always 100% consumed → SLO is too tight or service is structurally under-invested. If budget is always 100% remaining → SLO is too loose, customers are happier than you're measuring.
Anti-pattern to avoid: setting an SLO before you can measure it. "99.99%" on a system with no histograms is aspiration, not operations.
Code examples

    # Prometheus alerting rules: SLO burn-rate alerts.
    # Google SRE workbook ch. 5 "Alerting on SLOs".
    # SLO: 99.9% success. Error budget = 0.1%.
    groups:
      - name: checkout-slo
        rules:
          # Fast-burn: 2% of 30-day budget consumed in 1h -> page.
          - alert: CheckoutSLOFastBurn
            expr: |
              (
                sum(rate(http_requests_errors_total{service="checkout"}[1h]))
                /
                sum(rate(http_requests_total{service="checkout"}[1h]))
              ) > (14.4 * 0.001)
            for: 2m
            labels: { severity: page }
            annotations:
              summary: "Checkout burning SLO budget fast (1h window)"
          # Slow-burn: 5% of budget in 6h -> ticket, investigate same day.
          - alert: CheckoutSLOSlowBurn
            expr: |
              (
                sum(rate(http_requests_errors_total{service="checkout"}[6h]))
                /
                sum(rate(http_requests_total{service="checkout"}[6h]))
              ) > (6 * 0.001)
            for: 15m
            labels: { severity: ticket }
            annotations:
              summary: "Checkout burning SLO budget slowly (6h window)"

    # OpenTelemetry Collector: tail_sampling processor.
    # Keeps cost bounded at 1% baseline while ALWAYS retaining the
    # traces you actually need at 3 AM: errors and slow-tail latency.
    processors:
      tail_sampling:
        decision_wait: 10s
        num_traces: 100000
        expected_new_traces_per_sec: 10
        policies:
          - name: errors-always
            type: status_code
            status_code: { status_codes: [ERROR] }
          - name: slow-tail
            type: latency
            latency: { threshold_ms: 500 }
          - name: baseline
            type: probabilistic
            probabilistic: { sampling_percentage: 1 }

Common mistakes
"CPU > 80%" fires at 2 AM and turns out to be a benign batch job. The on-call wakes up, sees no user impact, mutes the alert — and then misses the real one. Page on user-visible symptoms (SLO burn, error rate); use resource metrics for triage, not for paging.
Putting per-request identifiers such as user_id or request_id in metric labels explodes Prometheus cardinality. Keep metric labels low-cardinality (endpoint, status, region); put high-cardinality attributes in logs and traces.
"99.9% uptime" doesn't tell you anything without a definition of "up". Define an SLI (a measurable indicator — e.g. fraction of 2xx responses with duration < 200 ms) and an SLO (target over a window) before the system ships. "Uptime" without this is vibes.
Grepping logs to count errors is fine at low scale and catastrophic at high scale. If you'd query it often, it's a metric — a counter or histogram. Logs are for narrative, not aggregation.
Keeping every trace at 1M QPS melts the trace backend and your storage bill. Sample at 1–10%, with tail sampling to always keep traces of errors and slow requests.
Practice drills
Write an SLO for a checkout API. Include SLI, target, window.
SLI: fraction of POST /checkout requests returning 2xx/3xx with server-side duration < 500 ms, measured from the load balancer. SLO: ≥ 99.9% of qualifying requests meet the SLI over a rolling 28-day window. Error budget: 0.1% × 28d × traffic ≈ 40 minutes of full outage equivalent. Fast-burn alert: consuming > 2% of budget in 1h → page. Slow-burn: > 10% in 72h → ticket.
Interviewer: "what's the difference between an SLI, an SLO, and an SLA?"
SLI (Service Level Indicator) is the measurement — a metric you compute, e.g. success ratio. SLO (Service Level Objective) is the internal target you promise your own org — e.g. SLI ≥ 99.9% over 28 days. SLA (Service Level Agreement) is the external commitment with contractual teeth — usually looser than the SLO (e.g. SLO 99.9%, SLA 99.5%) so you have internal margin before you owe customers credits.
Your service is up but users are complaining. Metrics are green. What are the likely blind spots?
(1) You're measuring server-side success but client-side retries or network errors are invisible — add RUM or synthetic checks. (2) p99 looks fine but p99.9 is terrible — a minority of users hit a slow path (hot key, specific shard). (3) Downstream dependency is slow-but-succeeding — your SLI should include duration, not just error rate. (4) Errors are buried in 200-with-JSON-error responses — classify semantically, not by HTTP status alone.
Cheat sheet
- Three pillars: metrics (aggregate), logs (narrate), traces (connect).
- Four golden signals: latency, traffic, errors, saturation.
- RED per request-driven service; USE per resource.
- SLO = target on an SLI; error budget = 1 − SLO, over a window (28–30 days).
- Alert on burn rate (fast + slow), not raw error counts or CPU.
- Metric labels = low cardinality. High-cardinality data → logs / traces.
- Trace sampling: 1–10%, always-sample on errors and slow tails.
Practice this skill
These problems exercise Observability & operations. Try one now to apply what you just learned.