Add the operations your service exposes. Method, path, and status codes make your API much easier to review.
Policies that aren't specific to a single endpoint — auth, rate limits, versioning, and other notes.
Diagram
Draw the components and how traffic flows between them. Notes below the canvas are optional — the Walkthrough panel is the primary place to narrate flows.
Request walkthrough
Trace each core requirement as an ordered sequence of hops through your diagram. Use component names from your canvas for the From / To columns.
Deliver internal events to subscribed customer HTTP endpoints with at-least-once durability — every accepted event lands on every active subscription, or terminates as dead-letter.
Retry failed deliveries with exponential backoff and jitter; up to 12 attempts spread across 24 hours before dead-lettering.
Sign every webhook with HMAC-SHA256 over (timestamp + body) using a per-subscription secret; receivers reject any request whose timestamp is more than 5 minutes old (replay protection).
Per-subscription event-type filter (e.g. charge.* matches charge.succeeded, charge.failed).
Customer dashboard with full delivery history (each attempt: status, response code, latency, request body) and one-click replay for any event in the last 30 days.
Per-customer fairness — one slow endpoint must not delay deliveries for other customers.
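The signing requirement above can be sketched as follows. This is a minimal illustration, not a prescribed wire format: the "." separator between timestamp and body, the hex encoding, and the names sign, verify, and MAX_AGE_SECONDS are all assumptions made for the example.

```python
import hashlib
import hmac
import time
from typing import Optional

MAX_AGE_SECONDS = 5 * 60  # replay window taken from the requirement


def sign(secret: bytes, timestamp: int, body: bytes) -> str:
    """Return a hex HMAC-SHA256 over timestamp and body."""
    # Assumed layout: b"<timestamp>.<body>"; the separator is illustrative.
    message = str(timestamp).encode("ascii") + b"." + body
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify(secret: bytes, timestamp: int, body: bytes, signature: str,
           now: Optional[float] = None) -> bool:
    """Receiver-side check: reject stale timestamps, then compare signatures."""
    now = time.time() if now is None else now
    if now - timestamp > MAX_AGE_SECONDS:
        return False  # older than 5 minutes: treat as a replay
    expected = sign(secret, timestamp, body)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)
```

Because the timestamp is part of the signed message, an attacker cannot refresh it on a captured request without invalidating the signature.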
Storage schema
For each entity, declare how it's stored. Sharding key is the interesting one — pick the access pattern it optimises for.
A customer-registered endpoint: customer_id, URL, event-type filter, secret, active flag, retry policy override.
An internal event with id, event_type, body, created_at, retention_until. Produced once, fanned out to N subscriptions.
Per-(subscription × event × attempt) row: attempt_seq, status, response_code, latency_ms, last_error, next_retry_at.
A delivery that exhausted its retries; it surfaces in the dashboard and can be replayed manually within 30 days.
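One way the next_retry_at field could be derived from the retry requirement (12 attempts within 24 hours, exponential backoff with jitter) is sketched below. The doubling factor, the "equal jitter" variant, and the names next_retry_delay and BASE_DELAY are assumptions for illustration; other schedules satisfy the same requirement.

```python
import random
from typing import Optional

MAX_ATTEMPTS = 12
WINDOW_SECONDS = 24 * 60 * 60
# With doubling delays, the 11 waits between 12 attempts sum to
# BASE_DELAY * (2**11 - 1), so this base makes the worst case exactly 24 hours.
BASE_DELAY = WINDOW_SECONDS / (2 ** (MAX_ATTEMPTS - 1) - 1)  # about 42 seconds


def next_retry_delay(attempt_seq: int,
                     rng: Optional[random.Random] = None) -> Optional[float]:
    """Seconds to wait after failed attempt `attempt_seq` (1-based).

    Returns None when retries are exhausted and the delivery should be
    dead-lettered.
    """
    if attempt_seq >= MAX_ATTEMPTS:
        return None
    ceiling = BASE_DELAY * 2 ** (attempt_seq - 1)
    rng = rng or random
    # Equal jitter: half the wait is fixed, half is random, so the delay
    # stays within [ceiling / 2, ceiling] and retries do not synchronise.
    return ceiling / 2 + rng.uniform(0, ceiling / 2)
```

A worker would then set next_retry_at to the current time plus the returned delay, or move the row to the dead-letter entity when the function returns None. With this jitter choice the full schedule takes between 12 and 24 hours.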
Component choices
Pick one per row and give a one-line reason. These are the concrete technology decisions your diagram implies.
How traffic is distributed to your app servers.
The proxy / gateway tier that owns auth, routing, and per-tenant policy before traffic reaches the app.
Async work buffer for writes/fan-out.
Where hot reads are served from.
Primary durable store for entities.
The async worker tier that drains the queue and does the slow work (delivery, fan-out, embedding).
Where the decision is made — at the edge or a shared service.
Iterate on your design — don't start over.
Each scenario below probes a specific weakness in a typical HLD. Reference components from your diagram by name, describe what breaks and at what load, then name the minimum change that fixes it. Strong answers identify the precise failure mode — not just "scale it up".
A customer with 10K dead-lettered events from yesterday clicks "replay all" on the dashboard. What happens?
Probes: abuse rate limiting
A customer's endpoint goes from 100ms responses to 30s responses for 10 minutes, then back to normal. Walk me through their queue depth and dispatch rate minute-by-minute.
Probes: failure mode analysis