advanced · reliability

Observability & operations

Metrics, logs, traces, SLOs, alerting on symptoms not causes.

You cannot operate what you cannot see; you cannot page on what you cannot measure. Candidates who design beautiful systems with no metrics, no logs, and no alerts are designing systems their on-call team will hate.

Read this if your last attempt…

Your design ended without mentioning how anyone would know it's broken
You said "we'll have monitoring" but didn't say what
You can't name what you'd alert on
You confuse SLI, SLO, and SLA

The concept

Observability is three pillars (metrics, logs, traces) and one driver: the SLO, which decides when the pager fires.

The three pillars

Metrics (aggregated numbers over time) — request rate, error rate, duration, queue depth. Cheap to store at scale. The foundation of alerting.
Logs (per-event records) — a narrative of what happened. Expensive at scale — log only what you'll actually query. Prefer structured over free-text.
Traces (causally-linked spans across services) — how did a request flow? Where was the latency? Indispensable for distributed debugging; sample aggressively (1–10%) to keep cost bounded.

Architecture diagram· The three pillars + one driver

Metrics aggregate, logs narrate, traces connect. SLOs decide which pager goes off.

Three pillars — what each is for.

Aspect	Metrics	Logs	Traces
Purpose	Alert + trend	Narrative + audit	Causality + latency breakdown
Cardinality	Low (labels bounded)	Unbounded (but sampled)	Sampled (1–10%)
Cost profile	Cheapest	Most expensive at scale	Medium (sampling-controlled)
Query shape	Time-series aggregation	Full-text + structured	Span tree per trace id
When to reach for it	Dashboards, alerts	Post-incident forensics	"Where did my 2 seconds go?"

How interviewers grade this

Every stateful component in your design has at least the four golden signals instrumented.
You name one SLO per user journey (read, write) with a concrete target (p99 < 200 ms, availability 99.9%).
You page on burn rate, not on raw alerts — fast-burn alerts for acute issues, slow-burn for chronic.
You distinguish metrics (cheap, aggregated), logs (structured, sampled in volume), and traces (sampled at ~1%).
You have a "what do I look at first at 3 AM" dashboard — the USE/RED/4-signals view for the service.

Variants

RED (request-driven services)

Rate, Errors, Duration — instrument these per endpoint.

The default instrumentation for a request-serving service. Rate = QPS, Errors = error ratio or rate, Duration = latency distribution (p50/p95/p99). Dashboard these per-endpoint and per-service.
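As a rough illustration of what RED means for a single endpoint, the sketch below derives the three numbers from one window of request records. The `Request` tuple, the `red_summary` helper, and the nearest-rank percentile are assumptions made for this example; a real service would emit counters and histograms through its metrics library rather than keep raw samples in memory.

```python
import math
from typing import NamedTuple


class Request(NamedTuple):
    status: int          # HTTP status code
    duration_s: float    # server-side latency in seconds


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Rate, Errors, Duration for one endpoint over one non-empty window."""
    durations = [r.duration_s for r in requests]
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_qps": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p50_s": percentile(durations, 50),
        "p99_s": percentile(durations, 99),
    }


# 100 requests in a 10 s window: 98 fast successes, 2 slow server errors.
window = [Request(200, 0.05)] * 98 + [Request(503, 1.2)] * 2
summary = red_summary(window, window_s=10.0)
print(summary)
```

Note how p50 stays at 50 ms while p99 jumps to 1.2 s: the distribution, not the average, is what exposes the two bad requests.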

Pros

+ Directly maps to user experience
+ Three numbers cover 80% of diagnostic questions
+ Standard across every service

Cons

− Says nothing about resource saturation
− Needs a per-endpoint histogram (costs cardinality)

Choose this variant when

Request-serving HTTP/gRPC services
Anything user-facing

USE (resources / infra)

Utilization, Saturation, Errors — instrument these per resource.

The complement to RED for infrastructure: CPU, memory, disk, network, pools. Utilization = % busy, Saturation = extent of queued work (load avg, queue depth), Errors = device-level errors.

Pros

+ Surfaces capacity problems before they become user-visible
+ Standard across every resource

Cons

− Not directly tied to user impact
− Alerting on raw saturation leads to noisy alerts

Choose this variant when

Databases, caches, queues
VMs, containers, nodes

Four golden signals

Latency, traffic, errors, saturation — Google SRE's canonical set.

Architecture diagram· Multi-window burn-rate alerting

Two windows, two severities. Fast burn catches acute incidents; slow burn catches chronic drift. No paging on raw CPU or error counts.

A synthesis of RED + USE. The four things any service must emit: latency of successful requests, traffic volume, error rate, and how "full" the service is. Treat this as the minimum viable dashboard for every service you design.

Pair the four signals with multi-window burn-rate alerts above — the signals are what you dashboard, the burn rates are what you page on.

Pros

+ Universal
+ Covers both request-path and resource-path concerns
+ Aligns with SLO definitions

Cons

− Requires some judgment to define "saturation" per service

Choose this variant when

Any service
Default starting point

Worked example

Scenario: A payments API. You've designed the HLD; now make it operable.

SLOs (per user-visible journey):

POST /charge — availability 99.95% (≈ 21 min/month budget), latency p99 < 500 ms.
GET /charge/:id — availability 99.99%, latency p99 < 100 ms.
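The budget figures follow directly from the availability target. A minimal sketch of the arithmetic, assuming a 30-day window and a hypothetical helper name:

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of full-outage-equivalent downtime the SLO tolerates per window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


for endpoint, target in [("POST /charge", 0.9995), ("GET /charge/:id", 0.9999)]:
    print(f"{endpoint}: {error_budget_minutes(target):.1f} min per 30 days")
# POST /charge: 21.6 min per 30 days
# GET /charge/:id: 4.3 min per 30 days
```

One extra nine on the read path shrinks the budget fivefold, which is why the target should be chosen per journey rather than copied across endpoints.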

Metrics (RED per endpoint, USE per resource):

rate: http_requests_total{endpoint, status}
errors: http_requests_errors_total / http_requests_total
duration: http_request_duration_seconds_bucket (histogram)
DB: db_connections_in_use, db_query_duration_seconds, db_replication_lag_seconds
Queue: queue_depth, queue_consumer_lag_seconds

Alerts (burn-rate-based):

Fast burn: 2% of monthly budget consumed in 1h → page immediately.
Slow burn: 5% of monthly budget consumed in 6h → ticket, investigate same business day.
No "CPU > 80%" page — that's a cause, not a symptom. Page on SLO burn.
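To see what these percentages mean as burn rates, the sketch below converts "X% of budget in Y hours" into a multiplier and into time-to-exhaustion. The function names are illustrative and a 30-day budget window is assumed.

```python
def burn_rate(budget_fraction: float, alert_window_h: float,
              slo_window_days: int = 30) -> float:
    """Multiple of the 'exactly on budget' error rate implied by the alert."""
    return budget_fraction * (slo_window_days * 24) / alert_window_h


def hours_to_exhaustion(rate: float, slo_window_days: int = 30) -> float:
    """Hours until the whole budget is gone if this burn rate is sustained."""
    return slo_window_days * 24 / rate


fast = burn_rate(0.02, alert_window_h=1)   # 2% of budget in 1 h
slow = burn_rate(0.05, alert_window_h=6)   # 5% of budget in 6 h
print(f"fast: {fast:.1f}x, budget gone in {hours_to_exhaustion(fast):.0f} h")
print(f"slow: {slow:.1f}x, budget gone in {hours_to_exhaustion(slow):.0f} h")

# Error ratio over 1 h that trips the fast-burn page for the 99.95% SLO.
print(f"fast-burn threshold: {fast * (1 - 0.9995):.2%} errors")
```

The fast tier works out to a 14.4× burn (budget gone in about 50 hours) and the slow tier to 6× (about 5 days), which is why one pages and the other can wait for a ticket.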

Logs (structured, sampled at high volume):

One structured log per request: {trace_id, endpoint, status, duration_ms, user_id, amount}.
Error logs un-sampled; 2xx logs sampled at 1% at high QPS.
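A minimal sketch of that logging policy using only the standard library. The helper names, the 1% rate, and the sample field values are assumptions for illustration; in practice the sampling decision usually lives in the logging pipeline or agent.

```python
import json
import random


def should_log(status: int, sample_rate: float, rng: random.Random) -> bool:
    """Keep every non-2xx log; keep 2xx logs with probability sample_rate."""
    if not 200 <= status < 300:
        return True
    return rng.random() < sample_rate


def format_log(trace_id: str, endpoint: str, status: int,
               duration_ms: float, user_id: str, amount: int) -> str:
    """One structured (JSON) log line per request."""
    return json.dumps({
        "trace_id": trace_id, "endpoint": endpoint, "status": status,
        "duration_ms": duration_ms, "user_id": user_id, "amount": amount,
    })


rng = random.Random(42)  # seeded only to make the sketch reproducible
kept_ok = sum(should_log(200, 0.01, rng) for _ in range(100_000))
kept_err = sum(should_log(502, 0.01, rng) for _ in range(1_000))
print(f"kept {kept_ok} of 100000 successes, {kept_err} of 1000 errors")
print(format_log("abc123", "POST /charge", 502, 812.4, "u_42", 1999))
```

Because sampled logs undercount successes, request counts should come from metrics, not from counting log lines.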

Traces:

OpenTelemetry, 1% sampling, always-sample on errors.
Span attributes: db.statement (hashed), queue.key, external_call.target.

Architecture diagram· Head + tail trace sampling

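Head sampling keeps cost bounded (1% of all traces). Tail sampling at the collector keeps the error traces and slow-tail traces that reach it, the ones you actually need at 3 AM. Traces already dropped by head sampling cannot be recovered, so the always-keep guarantee holds only if the collector sees every trace.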
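The tail-sampling decision itself is simple once a whole trace has been buffered. The sketch below is a plain-Python approximation, not the OpenTelemetry Collector's implementation: the `Span` type, the hash-based baseline, and the use of the longest span as a stand-in for trace duration are all assumptions.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    duration_ms: float
    is_error: bool


def baseline_keep(trace_id: str, percentage: float) -> bool:
    """Deterministic probabilistic sampling: hash the trace id into [0, 10000)."""
    digest = hashlib.sha256(trace_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percentage * 100


def keep_trace(spans: list[Span], slow_ms: float = 500,
               baseline_pct: float = 1.0) -> bool:
    """Decide on a fully buffered trace: errors, then slow tail, then baseline."""
    if any(s.is_error for s in spans):
        return True
    if max(s.duration_ms for s in spans) > slow_ms:
        return True
    return baseline_keep(spans[0].trace_id, baseline_pct)


print(keep_trace([Span("t-err", 20, True)]))     # True: error traces are always kept
print(keep_trace([Span("t-slow", 900, False)]))  # True: slower than 500 ms
```

Hashing the trace id instead of rolling a random number means every collector instance makes the same decision for the same trace.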

The 3 AM dashboard:

1. Four-signals panel for POST /charge + GET /charge/:id.
2. Error-budget burn for both SLOs.
3. DB replication lag + queue depth: the two non-obvious leading indicators.
4. Recent deploys overlaid on the error-rate graph.

Good vs bad answer

Interviewer probe

“How will you know the system is broken?”

Weak answer

"We'll have Datadog or Prometheus, and we'll set up alerts on CPU, memory, and errors."

Strong answer

"Every service emits RED (rate, errors, duration) per endpoint and USE (utilization, saturation, errors) per resource. The user-facing SLO is p99 < 200 ms at 99.9% availability, so we alert on SLO burn rate: fast-burn (2% of 30-day budget in 1h) pages immediately; slow-burn (5% in 6h) opens a ticket. Saturation metrics are on the dashboard but not paged directly — they cause issues that show up as SLO burns, and we look at them during triage. Traces are sampled at 1% with error-sampling always on, so when we hit a latency spike at 3 AM, we can pick a slow trace and see where the time went."

Why it wins: Names the instrumentation discipline (RED/USE), the SLO with concrete numbers, the alerting philosophy (symptoms over causes), and how the three pillars fit together in practice.

Interview playbook

1-2 minutes at the end of the HLD; ~2 minutes in any failure-mode deep-dive to address "how do you detect this?"

When it comes up

When the interviewer asks "how will you know it's broken?"
Right after you finish the HLD — "how do we operate this?"
When SLOs, SLAs, availability, or uptime are discussed
When a failure scenario is proposed and you need to detect it
In senior/staff rounds where operability is a core signal

Order of reveal

1. Name the three pillars and the driver. "Three pillars (metrics, logs, traces) and one driver: SLOs. Metrics aggregate and alert, logs narrate, traces connect, SLOs decide when the pager fires."
2. Pick an instrumentation pattern. "RED per request-driven service (rate, errors, duration), USE per resource (utilization, saturation, errors). Together they cover user experience and capacity."
3. Define an SLO with numbers. "One user-visible SLO per journey. For the checkout API: 99.9% of POST /charge requests return 2xx with duration < 500 ms, measured over a rolling 28-day window."
4. Commit to burn-rate alerting. "Alerts fire on error-budget burn, not raw error counts. Fast-burn (2% of monthly budget in 1h) pages; slow-burn (5% in 6h) opens a ticket. No paging on CPU."
5. Control cardinality and cost. "Metric labels stay low-cardinality (endpoint, status, region). High-cardinality attributes like user_id go in logs and traces. Traces sampled at 1% with error-always-on."
6. Describe the 3 AM dashboard. "One dashboard per service: four golden signals, error-budget burn, the top 2-3 leading indicators (DB replication lag, queue consumer lag), recent deploys overlaid on the error graph."

Signature phrases

“Page on symptoms, not causes” — The single most important modern alerting principle.
“SLO, SLI, SLA — different things, in that order of precision” — Prevents the common conflation.
“Metric labels are low-cardinality; logs carry the user_id” — Shows cost-awareness at scale.
“Every service, four golden signals” — One-line definition of the instrumentation floor.
“Traces sampled at 1%, always on errors” — Hits the cost vs coverage tradeoff precisely.
“Error budget is permission to deploy, not a failure mode” — Shows you understand the SLO culture, not just the math.

Likely follow-ups

“Write an SLO for a checkout API from scratch. Name every part.”

Service: POST /v1/charge — customer-initiated payment endpoint.

SLI (Service Level Indicator — the measurement): Fraction of valid POST /v1/charge requests that:

Return a 2xx or 3xx response, AND
Complete server-side in under 500 ms.

"Valid" excludes: requests that are rejected for auth/fraud (those are successful rejections), and requests that the client aborted before we could respond. Measured at the load balancer, not inside the service, so network failures count.

SLO (Service Level Objective — the target): ≥ 99.9% of valid requests meet the SLI over a rolling 28-day window.
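The SLI definition above translates almost line for line into code. A sketch under stated assumptions: the `LbRecord` fields are hypothetical names for what a load-balancer log might expose, and treating a window with no valid traffic as compliant is a choice, not a rule.

```python
from dataclasses import dataclass


@dataclass
class LbRecord:
    status: int                 # status code observed at the load balancer
    duration_ms: float
    rejected_auth_or_fraud: bool = False
    client_aborted: bool = False


def is_valid(r: LbRecord) -> bool:
    """Valid requests form the SLI denominator."""
    return not (r.rejected_auth_or_fraud or r.client_aborted)


def is_good(r: LbRecord, threshold_ms: float = 500) -> bool:
    """Good requests form the numerator: 2xx/3xx AND under the threshold."""
    return 200 <= r.status < 400 and r.duration_ms < threshold_ms


def sli(records: list[LbRecord]) -> float:
    valid = [r for r in records if is_valid(r)]
    if not valid:
        return 1.0  # assumption: no valid traffic counts as meeting the SLI
    return sum(is_good(r) for r in valid) / len(valid)


records = [
    LbRecord(200, 120),                               # good
    LbRecord(200, 750),                               # too slow: bad
    LbRecord(503, 40),                                # server error: bad
    LbRecord(403, 15, rejected_auth_or_fraud=True),   # excluded from the SLI
    LbRecord(200, 90),                                # good
]
print(sli(records))            # 0.5
print(sli(records) >= 0.999)   # False: this window misses the SLO
```

Keeping `is_valid` and `is_good` as separate predicates makes the exclusions reviewable on their own, which matters because the denominator is where SLIs are most often quietly gamed.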

Error budget: 0.1% × 28 days ≈ 40 minutes of full-outage-equivalent budget per 28-day window.

Alerts (multi-window burn rate, Google SRE style):

Fast burn: consuming > 2% of monthly budget in 1h → page immediately. That is a 14.4× burn rate on a 30-day budget (about 13.4× on a 28-day window); if sustained, it exhausts the budget in roughly 2 days.
Slow burn: > 5% in 6h → ticket, investigate same business day. ~6× burn.
Chronic burn: > 10% in 3 days → design review; system is structurally over budget.

Why this form:

Duration in the SLI — "available" doesn't help if p99 is 10 seconds.
Rolling window, not calendar month: prevents gaming (shipping risky deploys right after the monthly reset, when the budget is fresh).
Multi-window alerting — catches both acute incidents (fast burn) and slow degradation (chronic burn), reducing false pages.
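The three tiers can be evaluated with one loop. This is an illustrative sketch, not a replacement for alert rules in your monitoring system: the `TIERS` table mirrors the thresholds above, a 30-day budget is assumed, and the measured error ratios per window are made-up inputs.

```python
# (tier name, alert window in hours, fraction of budget consumed, action)
TIERS = [
    ("fast", 1, 0.02, "page"),
    ("slow", 6, 0.05, "ticket"),
    ("chronic", 72, 0.10, "design review"),
]


def fired_actions(error_ratio_by_window_h: dict[int, float], slo_target: float,
                  slo_window_days: int = 30) -> list[str]:
    """Return the action of every tier whose burn-rate threshold is exceeded."""
    budget = 1.0 - slo_target
    actions = []
    for _name, window_h, fraction, action in TIERS:
        rate = fraction * slo_window_days * 24 / window_h
        if error_ratio_by_window_h[window_h] > rate * budget:
            actions.append(action)
    return actions


# Acute incident: 3% errors in the last hour, diluted over the longer windows.
print(fired_actions({1: 0.03, 6: 0.007, 72: 0.0008}, slo_target=0.999))
# Chronic drift: a steady 0.2% error ratio never trips the short windows.
print(fired_actions({1: 0.002, 6: 0.002, 72: 0.002}, slo_target=0.999))
```

The first call returns the page and ticket actions; the second returns only the design review, which is the case a single short-window alert would miss entirely.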

SLA (separate from SLO): if there's an external commitment (e.g. B2B contract with refund credits), set it looser than the SLO — 99.5% is typical with a 99.9% internal SLO, so you have internal margin before customers invoke the SLA.

“The service is green but users are complaining. Where do you look?”

Five common blind spots, in order of frequency:

1. Server-side success, client-side failure. You're measuring 2xx responses from the LB; the client saw a timeout, TLS error, or TCP reset before the response arrived. Fix: add real-user monitoring (RUM — Sentry, Datadog RUM) or synthetic probes from outside your VPC. Compare server SLI to client SLI.

2. p99 looks OK but p99.9 is terrible. A small cohort hits a slow path (hot key, specific shard, flaky downstream). A cohort smaller than 1% of requests never moves the global p99, so the aggregate hides their experience. Fix: segment latency histograms by user tier, tenant, or shard. Look at per-segment p99.

3. "Successful" responses that are actually errors. Your API returns 200 with {"error": "insufficient_funds"}. The SLI counts these as success; the user sees a failure. Fix: classify semantically, not by HTTP status. Define success/failure in application terms.

4. Downstream dependency is slow-but-succeeding. Your service is healthy; it's waiting on a 3rd-party payment processor that takes 10 s instead of 500 ms. Metrics at your service look fine because they're scoped to your code. Fix: track downstream-call latency separately; add it as an SLI for flows that depend on it.

5. Recent deploy. The dashboard doesn't overlay deploys. Something shipped 10 minutes before the complaints started. Fix: annotate dashboards with deploy markers. First question in any incident: "what just changed?"

The meta-fix: run a quarterly "blind spot drill." Take one real user complaint, reproduce the signal from server-side metrics alone. If you can't, that's your next investment.
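Blind spot 2 is easy to demonstrate with synthetic numbers. In the sketch below the shard names, latencies, and the nearest-rank percentile helper are all assumptions chosen to make the effect visible.

```python
import math
from collections import defaultdict


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; pct in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


def p99_by_segment(samples: list[tuple[str, float]]) -> dict[str, float]:
    """Group latencies by segment label and compute p99 within each group."""
    by_segment: dict[str, list[float]] = defaultdict(list)
    for segment, latency_ms in samples:
        by_segment[segment].append(latency_ms)
    return {seg: percentile(vals, 99) for seg, vals in by_segment.items()}


# 9,950 requests on a healthy shard at 80 ms; 50 (0.5%) on a hot shard at 4 s.
samples = [("shard-a", 80.0)] * 9_950 + [("shard-hot", 4_000.0)] * 50
global_p99 = percentile([ms for _, ms in samples], 99)
print(global_p99)               # 80.0: the global dashboard looks green
print(p99_by_segment(samples))  # {'shard-a': 80.0, 'shard-hot': 4000.0}
```

Every user on the hot shard waits four seconds, yet the global p99 is indistinguishable from a fully healthy system.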

“Explain the cost difference between metrics, logs, and traces at scale. How do you control each?”

At 100K QPS, rough monthly cost at typical SaaS pricing:

Metrics — cheapest. Time-series with bounded label cardinality. Cost is proportional to (series count × retention). Per-endpoint p99 at 100 endpoints × 5 services × 10 labels = ~5K series — a few $100/mo.

Controls: keep labels low-cardinality (no user_id in a label, ever). Pre-aggregate at the agent. Drop unused series.

Logs — most expensive. Per-event storage + full-text indexing. At 100K QPS with 1 KB logs × 30-day retention = ~260 TB × index overhead. Tens of $K/mo on managed platforms.

Controls: structured logs only. Sample 2xx logs at 1-5%, keep errors at 100%. Short retention for verbose logs (7 d) vs audit logs (years). Separate hot (queryable) from cold (S3) tiers.

Traces are medium-cost and controllable. Per-span storage, sampled. At 1% sampling × 10 spans per request × 100K QPS = 10K spans/sec × 500 B = ~5 MB/sec (roughly 430 GB/day before compression). A few $K/mo at 1% sampling.

Controls: head-based sampling (sample at the edge, commit from there) + tail-based sampling (always keep traces where any span errored, duration > threshold, or specific labels).

Rule of thumb:

If you'd chart it on a dashboard → metric.
If you'd query it with "what happened?" → log.
If you'd ask "where did the time go?" → trace.

Common anti-pattern: using logs as metrics. Grepping for "error" in 260 TB of logs is 100-1000× more expensive than a counter. If you query it regularly, promote it to a metric.
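The volume estimates are worth being able to reproduce on a whiteboard. A back-of-envelope sketch using the figures from this answer; the function names are hypothetical and the results are raw bytes before indexing or compression.

```python
SECONDS_PER_DAY = 86_400


def log_volume_tb(qps: float, bytes_per_log: int, retention_days: int,
                  sample_rate: float = 1.0) -> float:
    """Raw log bytes retained over the retention period, in decimal TB."""
    total = qps * sample_rate * bytes_per_log * SECONDS_PER_DAY * retention_days
    return total / 1e12


def span_throughput_mb_s(qps: float, sample_rate: float, spans_per_request: int,
                         bytes_per_span: int) -> float:
    """Raw span bytes per second after sampling, in decimal MB/s."""
    return qps * sample_rate * spans_per_request * bytes_per_span / 1e6


print(f"{log_volume_tb(100_000, 1_000, 30):.1f} TB of logs, unsampled")
print(f"{log_volume_tb(100_000, 1_000, 30, sample_rate=0.05):.1f} TB at 5% sampling")
print(f"{span_throughput_mb_s(100_000, 0.01, 10, 500):.1f} MB/s of spans")
```

Unsampled logs come to about 259 TB over 30 days while sampled spans arrive at about 5 MB/s, so the two pillars differ by orders of magnitude before any pricing is applied.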

“What's wrong with alerting on "CPU > 80% for 5 minutes"?”

Three problems:

1. CPU high isn't necessarily user-impacting. A batch job, a warmup, a cache rehydration can all spike CPU without affecting latency or error rate. Paging on this wakes on-call for something the user never saw. After a few of those, on-call mutes the alert — and misses the real one.

2. CPU normal isn't necessarily user-healthy. You can be at 30% CPU while a downstream dependency times out and your users see 10-second latencies. CPU alerts don't catch this class of problem at all.

3. The threshold is a guess. Why 80%? Why not 70%? Why not 90%? The right threshold depends on workload, which changes over time. Every CPU alert I've seen in production was set once and never re-tuned.

The better pattern — symptoms over causes:

Page on SLO burn (user-visible latency or error rate). That's the symptom the user experiences.
Use CPU / memory / saturation in triage dashboards — when the SLO alert fires, the on-call looks at saturation to understand why.
Alert on saturation only for capacity forecasting, not paging. If a resource will exhaust in N hours given current growth, ticket the capacity team.

The principle, in the spirit of Google SRE guidance: metrics that tell you a user experienced a problem page; metrics that help you figure out why get dashboarded.

The one exception: hard ceilings that will cause a user-visible outage soon (disk > 90% full, connection pool > 95%). Page on these because the symptom hasn't materialised yet but will. Treat them as leading indicators, not resource utilisation.
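Treating a hard ceiling as a leading indicator means alerting on time-to-exhaustion rather than on the raw percentage. A minimal sketch using linear extrapolation; the 4-hour and 72-hour cutoffs and the function names are assumptions, not recommendations.

```python
def hours_until_full(used_fraction: float, growth_per_hour: float) -> float:
    """Linear extrapolation: hours until the resource reaches 100%."""
    if growth_per_hour <= 0:
        return float("inf")  # flat or shrinking: never fills on this trend
    return (1.0 - used_fraction) / growth_per_hour


def capacity_action(used_fraction: float, growth_per_hour: float,
                    page_within_h: float = 4, ticket_within_h: float = 72) -> str:
    """Page only when exhaustion is imminent; otherwise ticket or stay quiet."""
    eta = hours_until_full(used_fraction, growth_per_hour)
    if eta <= page_within_h:
        return "page"
    if eta <= ticket_within_h:
        return "ticket"
    return "none"


print(capacity_action(0.92, 0.04))   # page: disk fills in about 2 hours
print(capacity_action(0.60, 0.01))   # ticket: about 40 hours of headroom
print(capacity_action(0.92, 0.0))    # none: 92% full but not growing
```

The last case is the point: a disk sitting at 92% with no growth is a dashboard item, while the same disk growing 4% per hour is about to become a user-visible outage.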

“Your team doesn't have SLOs yet. Where do you start?”

Phase 1 — measure what exists (1-2 weeks). Before setting a target, instrument:

Per-endpoint request rate, error rate, latency histogram. RED method.
The three or four most-user-impactful journeys (checkout, login, main feed).
Client-side signals if possible (RUM). Server-side lies about what users see.

Phase 2 — establish baseline (2-4 weeks). Look at current p50/p95/p99 and error ratios. Don't set SLOs yet. Understand the shape of the distribution — especially the tails.

Phase 3 — draft candidate SLOs. For each top journey, propose:

An SLI definition (what counts as success, with duration).
A target based on current performance minus a small margin. If today's p99 is 300 ms at 99.95%, propose 99.9% p99 < 500 ms. Tighter than reality breaks on day 1; much looser than reality wastes budget.
A 28-day rolling window.

Phase 4 — socialise. Show the SLO to product, on-call, and leadership. Every SLO involves a tradeoff (tighter SLO = more engineering, less feature velocity). Get explicit buy-in.

Phase 5 — ship alerting. Multi-window burn-rate alerts. Start noisy; tune down once false-positive rate is understood.

Phase 6 — iterate. Quarterly SLO review. If budget is always 100% consumed → SLO is too tight or service is structurally under-invested. If budget is always 100% remaining → SLO is too loose, customers are happier than you're measuring.

Anti-pattern to avoid: setting an SLO before you can measure it. "99.99%" on a system with no histograms is aspiration, not operations.

Code examples

yaml· Prometheus multi-window burn-rate alert (fast + slow)

# Google SRE workbook ch. 5 "Alerting on SLOs".
# SLO: 99.9% success. Error budget = 0.1%.
groups:
  - name: checkout-slo
    rules:
      # Fast-burn: 2% of 30-day budget consumed in 1h -> page.
      - alert: CheckoutSLOFastBurn
        expr: |
          (
            sum(rate(http_requests_errors_total{service="checkout"}[1h]))
            /
            sum(rate(http_requests_total{service="checkout"}[1h]))
          ) > (14.4 * 0.001)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "Checkout burning SLO budget fast (1h window)"

      # Slow-burn: 5% of budget in 6h -> ticket, investigate same day.
      - alert: CheckoutSLOSlowBurn
        expr: |
          (
            sum(rate(http_requests_errors_total{service="checkout"}[6h]))
            /
            sum(rate(http_requests_total{service="checkout"}[6h]))
          ) > (6 * 0.001)
        for: 15m
        labels: { severity: ticket }
        annotations:
          summary: "Checkout burning SLO budget slowly (6h window)"

yaml· OpenTelemetry Collector: tail sampling (always-errors + slow-tail + 1% baseline)

# Keeps cost bounded at 1% baseline while ALWAYS retaining the
# traces you actually need at 3 AM: errors and slow-tail latency.
processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    expected_new_traces_per_sec: 10
    policies:
      - name: errors-always
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: slow-tail
        type: latency
        latency: { threshold_ms: 500 }
      - name: baseline
        type: probabilistic
        probabilistic: { sampling_percentage: 1 }

Common mistakes

Alerting on causes, not symptoms

"CPU > 80%" fires at 2 AM and turns out to be a benign batch job. The on-call wakes up, sees no user impact, mutes the alert — and then misses the real one. Page on user-visible symptoms (SLO burn, error rate); use resource metrics for triage, not for paging.

Unbounded metric cardinality

Putting per-request identifiers such as user_id and request_id in metric labels explodes Prometheus cardinality. Keep metric labels low-cardinality (endpoint, status, region); put high-cardinality attributes in logs and traces.
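The explosion is multiplicative: the worst-case series count for one metric is the product of its label cardinalities. A small sketch with made-up cardinalities:

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case time series for one metric: product of label cardinalities."""
    return prod(label_cardinalities.values())


bounded = {"endpoint": 100, "status": 5, "region": 4}
print(series_count(bounded))                             # 2000
print(series_count({**bounded, "user_id": 1_000_000}))   # 2000000000
```

One unbounded label turns two thousand series into two billion, which is why user_id belongs in logs and traces instead.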

No SLO — only uptime

"99.9% uptime" doesn't tell you anything without a definition of "up". Define an SLI (a measurable indicator — e.g. fraction of 2xx responses with duration < 200 ms) and an SLO (target over a window) before the system ships. "Uptime" without this is vibes.

Logs as metrics (Advanced)

Grepping logs to count errors is fine at low scale and catastrophic at high scale. If you'd query it often, it's a metric — a counter or histogram. Logs are for narrative, not aggregation.

100% trace sampling (Advanced)

Keeping every trace at 1M QPS melts the trace backend and your storage bill. Sample at 1–10%, with tail sampling to always keep traces of errors and slow requests.

Practice drills

Write an SLO for a checkout API. Include SLI, target, window.

SLI: fraction of POST /checkout requests returning 2xx/3xx with server-side duration < 500 ms, measured from the load balancer. SLO: ≥ 99.9% of qualifying requests meet the SLI over a rolling 28-day window. Error budget: 0.1% × 28d × traffic ≈ 40 minutes of full outage equivalent. Fast-burn alert: consuming > 2% of budget in 1h → page. Slow-burn: > 10% in 72h → ticket.

Interviewer: "what's the difference between an SLI, an SLO, and an SLA?"

SLI (Service Level Indicator) is the measurement — a metric you compute, e.g. success ratio. SLO (Service Level Objective) is the internal target you promise your own org — e.g. SLI ≥ 99.9% over 28 days. SLA (Service Level Agreement) is the external commitment with contractual teeth — usually looser than the SLO (e.g. SLO 99.9%, SLA 99.5%) so you have internal margin before you owe customers credits.

Your service is up but users are complaining. Metrics are green. What are the likely blind spots?

(1) You're measuring server-side success but client-side retries or network errors are invisible — add RUM or synthetic checks. (2) p99 looks fine but p99.9 is terrible — a minority of users hit a slow path (hot key, specific shard). (3) Downstream dependency is slow-but-succeeding — your SLI should include duration, not just error rate. (4) Errors are buried in 200-with-JSON-error responses — classify semantically, not by HTTP status alone.

Cheat sheet

• Three pillars: metrics (aggregate), logs (narrate), traces (connect).
• Four golden signals: latency, traffic, errors, saturation.
• RED per request-driven service; USE per resource.
• SLO = target on an SLI; error budget = 1 − SLO, over a window (28–30 days).
• Alert on burn rate (fast + slow), not raw error counts or CPU.
• Metric labels = low cardinality. High-cardinality data → logs / traces.
• Trace sampling: 1–10%, always-sample on errors and slow tails.
