Add the operations your service exposes. Method, path, and status codes make your API much easier to review.
Policies that aren't specific to a single endpoint — auth, rate limits, versioning, and other notes.
Diagram
Draw the components and how traffic flows between them. Notes below the canvas are optional — the Walkthrough panel is the primary place to narrate flows.
Request walkthrough
Trace each core requirement as an ordered sequence of hops through your diagram. Use component names from your canvas for the From / To columns.
Route prompt requests to multiple model providers (OpenAI, Anthropic, in-house) with a per-tenant configurable fallback chain.
Stream model outputs token-by-token end-to-end (SSE) with low time-to-first-token.
Per-tenant rate limits (rps and tokens-per-minute) and monthly cost budgets, enforced before forwarding upstream.
Response cache keyed on (tenant, model, system prompt, user prompt, sampling params) with safe semantics for non-deterministic outputs.
Audit every request: tokens in/out, cost, latency, cache_hit, fallback_used, region, model — billable and queryable per tenant.
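As one illustration of the first requirement, a per-tenant fallback chain can be sketched as an ordered loop over providers. This is a minimal sketch, not a prescribed design: `route_with_fallback`, `RouteResult`, `ProviderError`, and the callable-per-provider shape are all hypothetical names invented for the example, and the provider functions are fakes rather than real client calls.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


class ProviderError(Exception):
    """Hypothetical error raised when a provider call fails (timeout, 5xx, ...)."""


class AllProvidersFailed(Exception):
    """Raised when every provider in the tenant's chain has failed."""


@dataclass
class RouteResult:
    provider: str        # name of the provider that answered
    text: str            # model output
    fallback_used: bool  # True when the first provider in the chain did not answer


def route_with_fallback(
    prompt: str,
    chain: Sequence[str],
    providers: dict[str, Callable[[str], str]],
) -> RouteResult:
    """Try each provider in the tenant's chain, in order, until one succeeds."""
    errors: list[str] = []
    for index, name in enumerate(chain):
        try:
            text = providers[name](prompt)
        except ProviderError as exc:
            # Remember the failure and move on to the next provider.
            errors.append(f"{name}: {exc}")
            continue
        return RouteResult(provider=name, text=text, fallback_used=index > 0)
    # Nothing in the chain answered; the caller decides what the client sees.
    raise AllProvidersFailed("; ".join(errors))


# Fake providers standing in for real upstream clients (assumption).
def flaky_provider(prompt: str) -> str:
    raise ProviderError("503 from upstream")


def healthy_provider(prompt: str) -> str:
    return f"echo: {prompt}"


providers = {"openai": flaky_provider, "anthropic": healthy_provider}
result = route_with_fallback("hi", ["openai", "anthropic"], providers)
```

The `fallback_used` flag maps onto the audit field of the same name. A streaming gateway would likely need extra care here, since switching providers after the first token has reached the client is no longer transparent.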
Storage schema
For each entity, declare how it's stored. The sharding key is the interesting one: choose it to suit the access pattern you most need to optimise.
A customer organization. Has API keys, model preferences, fallback chain, region pin, monthly budget, rate caps.
An inbound inference call: messages, model hint, sampling params, optional fallback chain override.
A reusable response keyed on (tenant, model, system_prompt, user_prompt, temperature, top_p). Caching is skipped when temperature > 0, because those outputs are non-deterministic.
Per-request row: tokens_in, tokens_out, cost_cents, latency_ms, ttft_ms, cache_hit, fallback_used, region, model, request_id.
Real-time per-tenant counter (cost + tokens) reset on month boundary; checked atomically before forwarding.
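Two of the entities above, the cached response and the budget counter, can be sketched in a few lines. This is a rough sketch under stated assumptions: `cache_key` and `BudgetCounter` are hypothetical names, the SHA-256 over a JSON array is just one possible key encoding, and the in-process lock only stands in for whatever atomic primitive the shared store actually offers.

```python
import hashlib
import json
import threading
from typing import Optional


def cache_key(
    tenant: str,
    model: str,
    system_prompt: str,
    user_prompt: str,
    temperature: float,
    top_p: float,
) -> Optional[str]:
    """Return a stable key for the response cache, or None when caching is skipped."""
    if temperature > 0:
        # Non-deterministic sampling: do not cache.
        return None
    # Normalise numbers to float so 0 and 0.0 produce the same key.
    fields = [tenant, model, system_prompt, user_prompt, float(temperature), float(top_p)]
    payload = json.dumps(fields, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class BudgetCounter:
    """Per-tenant monthly spend counter with an atomic check-and-reserve."""

    def __init__(self, monthly_limit_cents: int) -> None:
        self.monthly_limit_cents = monthly_limit_cents
        self._spent_cents = 0
        self._month: Optional[str] = None
        self._lock = threading.Lock()

    def try_reserve(self, cost_cents: int, month: str) -> bool:
        """Reserve cost_cents for the given month (e.g. "2024-05") if the budget allows."""
        with self._lock:
            if month != self._month:
                # Month boundary: reset the counter.
                self._month = month
                self._spent_cents = 0
            if self._spent_cents + cost_cents > self.monthly_limit_cents:
                return False
            self._spent_cents += cost_cents
            return True
```

The check and the increment happen under one lock so that two concurrent requests cannot both pass the check and overshoot the limit together. Because the real cost is only known after the response, the reserved amount here is assumed to be an estimate that a fuller design would reconcile afterwards.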
Component choices
Pick one per row and give a one-line reason. These are the concrete technology decisions your diagram implies.
How traffic is distributed to your app servers.
The proxy / gateway tier that owns auth, routing, and per-tenant policy before traffic reaches the app.
Where hot reads are served from.
Async work buffer for writes/fan-out.
Primary durable store for entities.
Where the decision is made — at the edge or a shared service.
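Wherever the per-tenant rate-limit decision ends up living, the check itself is often some form of token bucket. The sketch below is one possible shape, not the required one: `TokenBucket` and its parameters are hypothetical, it keeps state in process memory, and the optional `now` argument exists only so the example can be driven with fixed timestamps.

```python
import time
from typing import Optional


class TokenBucket:
    """Simple token bucket: refills at rate_per_sec up to capacity."""

    def __init__(self, rate_per_sec: float, capacity: float, now: Optional[float] = None) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.tokens = capacity  # start full
        self.updated = time.monotonic() if now is None else now

    def allow(self, cost: float = 1.0, now: Optional[float] = None) -> bool:
        """Return True and deduct cost if enough tokens are available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill for the time that has passed, capped at capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# One bucket per tenant for requests per second; a second bucket with a
# larger cost per call could cover tokens-per-minute in the same way.
rps_bucket = TokenBucket(rate_per_sec=2.0, capacity=2.0, now=0.0)
```

With a shared service the same logic would run against shared state instead of process memory, which is the trade-off the edge-versus-shared choice is about.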
Iterate on your design — don't start over.
Each scenario below probes a specific weakness in a typical HLD. Reference components from your diagram by name, describe what breaks and at what load, then name the minimum change that fixes it. Strong answers identify the precise failure mode — not just "scale it up".
A tenant suddenly jumps from $100/day average to $5,000 in one hour. Walk me through what happens automatically, and what an on-call engineer does.
Probes: observability
Both your primary AND secondary providers go down simultaneously for 10 minutes. What happens? What does the customer see?
Probes: failure mode analysis