Design an AI Inference Gateway

Endpoints

Add the operations your service exposes. Method, path, and status codes make your API much easier to review.

Start with a template

Applies to all endpoints

Policies that aren't specific to a single endpoint — auth, rate limits, versioning, and other notes.

Authentication

Rate limiting

Versioning

Notes (idempotency, redirect choice, error shape, anything else)

Diagram

Draw the components and how traffic flows between them. Notes below the canvas are optional — the Walkthrough panel is the primary place to narrate flows.

Draw the architecture so the components and connections tell the core story; flow narration in Notes is optional but recommended.

Components

Click palette → add

Drag edge dot → connect

Double-click node/edge → rename

Shift+drag or box-select → multi

Click any component on the left to add it here.

Drag the edge of a node to connect it to another.

Double-click any arrow to label it.

Notes (optional): flow narration, trade-offs, chosen alternatives

Request walkthrough

Trace each core requirement as an ordered sequence of hops through your diagram. Use component names from your canvas for the From / To columns.

Route prompt requests to multiple model providers (OpenAI, Anthropic, in-house) with a per-tenant configurable fallback chain.

From | To | Action / payload
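As one way to picture the fallback-chain requirement, here is a minimal sketch. It assumes each provider is wrapped in an adapter function that raises a hypothetical `ProviderError` on a retryable failure; the names `route_with_fallback`, `RouteResult`, and the adapter map are illustrative, not part of the exercise.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


class ProviderError(Exception):
    """Hypothetical error an adapter raises on a retryable upstream failure."""


@dataclass
class RouteResult:
    provider: str        # provider that actually served the request
    text: str            # model output
    fallback_used: bool  # True if the first provider in the chain did not serve it


def route_with_fallback(
    prompt: str,
    chain: List[str],
    adapters: Dict[str, Callable[[str], str]],
) -> RouteResult:
    """Try each provider in the tenant's chain in order; return the first success."""
    last_error: Optional[Exception] = None
    for index, name in enumerate(chain):
        try:
            text = adapters[name](prompt)
            return RouteResult(provider=name, text=text, fallback_used=index > 0)
        except ProviderError as exc:
            last_error = exc  # remember the failure and try the next provider
    raise ProviderError(f"all providers in chain failed: {chain}") from last_error
```

The `fallback_used` flag is returned with the result so the audit path can record it without re-deriving which provider answered.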

Stream model outputs token-by-token end-to-end (SSE) with low time-to-first-token.

From | To | Action / payload
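The streaming requirement can be sketched as a generator that relays each upstream token as a Server-Sent Events frame as soon as it arrives, recording time-to-first-token along the way. The function names, the `metrics` dictionary, and the `[DONE]` end-of-stream sentinel are assumptions made for illustration.

```python
import time
from typing import Callable, Dict, Iterable, Iterator


def sse_frame(payload: str) -> str:
    """Encode a payload as one SSE event; multi-line payloads get one data: line each."""
    return "".join(f"data: {line}\n" for line in payload.split("\n")) + "\n"


def sse_stream(
    tokens: Iterable[str],
    metrics: Dict[str, float],
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Relay upstream tokens as SSE frames without buffering; record TTFT."""
    start = clock()  # runs when the consumer first pulls from the generator
    first = True
    for token in tokens:
        if first:
            metrics["ttft_ms"] = (clock() - start) * 1000.0
            first = False
        yield sse_frame(token)  # forwarded immediately, one event per token
    # Assumed end-of-stream sentinel so clients know the stream is complete.
    yield sse_frame("[DONE]")
```

Low time-to-first-token depends on every hop forwarding frames as they arrive, so any proxy between the gateway and the client would also need response buffering disabled.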

Per-tenant rate limits (rps and tokens-per-minute) and monthly cost budgets, enforced before forwarding upstream.

From | To | Action / payload
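One common way to enforce both an rps cap and a tokens-per-minute cap is a pair of token buckets checked together before forwarding. The sketch below is a single-process illustration under that assumption; `TokenBucket`, `admit`, and the use of an estimated token count are hypothetical names and choices, and a multi-node gateway would need the same logic in a shared store.

```python
from dataclasses import dataclass


@dataclass
class TokenBucket:
    capacity: float        # maximum burst size
    refill_per_sec: float  # steady-state refill rate
    level: float           # units currently available
    updated_at: float      # time of the last refill, in seconds

    def refill(self, now: float) -> None:
        """Add the units earned since the last refill, capped at capacity."""
        elapsed = max(0.0, now - self.updated_at)
        self.level = min(self.capacity, self.level + elapsed * self.refill_per_sec)
        self.updated_at = max(self.updated_at, now)


def admit(rps: TokenBucket, tpm: TokenBucket, est_tokens: int, now: float) -> bool:
    """Admit a request only if BOTH limits have room; deduct from neither on reject."""
    rps.refill(now)
    tpm.refill(now)
    if rps.level < 1 or tpm.level < est_tokens:
        return False
    rps.level -= 1
    tpm.level -= est_tokens
    return True
```

Checking both buckets before deducting from either avoids charging a tenant's request quota for a call that the token quota then rejects.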

Response cache keyed on (tenant, model, system prompt, user prompt, sampling params) with safe semantics for non-deterministic outputs.

From | To | Action / payload
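A cache key over those fields might be derived as follows. This is a sketch, assuming a SHA-256 digest over a JSON-encoded list and assuming the cache is bypassed whenever temperature is above zero, as the CacheEntry description in this exercise states; the function name `cache_key` is illustrative.

```python
import hashlib
import json
from typing import Optional


def cache_key(
    tenant: str,
    model: str,
    system_prompt: str,
    user_prompt: str,
    temperature: float,
    top_p: float,
) -> Optional[str]:
    """Return a stable key for deterministic requests, or None to skip the cache."""
    if temperature > 0:
        return None  # sampled outputs are not safe to replay from cache
    # JSON-encode as a list so field boundaries are unambiguous
    # (plain concatenation would let "ab" + "c" collide with "a" + "bc").
    payload = json.dumps(
        [tenant, model, system_prompt, user_prompt, float(temperature), float(top_p)],
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Including the tenant in the key keeps one tenant's cached responses from ever being served to another.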

Audit every request: tokens in/out, cost, latency, cache_hit, fallback_used, region, model — billable and queryable per tenant.

From | To | Action / payload
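To make the audit requirement concrete, the sketch below models one record with the fields listed for AuditRecord in this exercise and rolls cost up per tenant. The `tenant_id` field and the `bill_by_tenant` helper are assumptions added for illustration; a real design would run this aggregation in the audit store rather than in memory.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass(frozen=True)
class AuditRecord:
    request_id: str
    tenant_id: str       # assumed field; needed to bill and query per tenant
    model: str
    region: str
    tokens_in: int
    tokens_out: int
    cost_cents: int
    latency_ms: int
    ttft_ms: int
    cache_hit: bool
    fallback_used: bool


def bill_by_tenant(records: Iterable[AuditRecord]) -> Dict[str, int]:
    """Sum cost_cents per tenant across the given records."""
    totals: Dict[str, int] = {}
    for record in records:
        totals[record.tenant_id] = totals.get(record.tenant_id, 0) + record.cost_cents
    return totals
```

Storing cost as integer cents avoids floating-point drift when many small charges are summed.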

Storage schema

For each entity, declare how it's stored. Sharding key is the interesting one — pick the access pattern it optimises for.

Tenant

A customer organization. Has API keys, model preferences, fallback chain, region pin, monthly budget, rate caps.

In-memory / derived

Storage type

Primary key

Sharding / partition key

Critical fields

Notes (indexes, TTL, access pattern)

PromptRequest

An inbound inference call: messages, model hint, sampling params, optional fallback chain override.

In-memory / derived

Storage type

Primary key

Sharding / partition key

Critical fields

Notes (indexes, TTL, access pattern)

CacheEntry

A reusable response keyed on (tenant, model, system_prompt, user_prompt, temperature, top_p). Skipped when temperature > 0.

In-memory / derived

Storage type

Primary key

Sharding / partition key

Critical fields

Notes (indexes, TTL, access pattern)

AuditRecord

Per-request row: tokens_in, tokens_out, cost_cents, latency_ms, ttft_ms, cache_hit, fallback_used, region, model, request_id.

In-memory / derived

Storage type

Primary key

Sharding / partition key

Critical fields

Notes (indexes, TTL, access pattern)

BudgetCounter

Real-time per-tenant counter (cost + tokens) reset on month boundary; checked atomically before forwarding.

In-memory / derived

Storage type

Primary key

Sharding / partition key

Critical fields

Notes (indexes, TTL, access pattern)
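The "checked atomically before forwarding" property of BudgetCounter can be illustrated with a check-and-reserve operation that either books the cost or rejects the request in one step. This is a single-process sketch in which a lock stands in for whatever atomic primitive the chosen store provides; the class layout, the month string format, and `try_reserve` are hypothetical, and only cost (not tokens) is tracked to keep it short.

```python
import threading
from dataclasses import dataclass, field


@dataclass
class BudgetCounter:
    limit_cents: int
    month: str = ""       # e.g. "2025-01"; the counter resets when this changes
    spent_cents: int = 0
    _lock: threading.Lock = field(default_factory=threading.Lock, repr=False)

    def try_reserve(self, cost_cents: int, month: str) -> bool:
        """Atomically reserve cost against the monthly budget, or reject."""
        with self._lock:
            if month != self.month:
                # Month boundary crossed: start a fresh counter.
                self.month = month
                self.spent_cents = 0
            if self.spent_cents + cost_cents > self.limit_cents:
                return False  # reject without mutating the counter
            self.spent_cents += cost_cents
            return True
```

Because the true cost is only known after the response completes, a design along these lines would typically reserve an estimate up front and reconcile the difference afterwards.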

Component choices

Pick one per row and give a one-line reason. These are the concrete technology decisions your diagram implies.

Load Balancer

How traffic is distributed to your app servers.

API Gateway / Proxy

The proxy / gateway tier that owns auth, routing, and per-tenant policy before traffic reaches the app.

Cache

Where hot reads are served from.

Queue / Stream

Async work buffer for writes/fan-out.

Database

Primary durable store for entities.

Topology

Where the decision is made — at the edge or a shared service.

Iterate on your design — don't start over.

Each scenario below probes a specific weakness in a typical HLD. Reference components from your diagram by name, describe what breaks and at what load, then name the minimum change that fixes it. Strong answers identify the precise failure mode — not just "scale it up".

A tenant suddenly jumps from $100/day average to $5,000 in one hour. Walk me through what happens automatically, and what an on-call engineer does.

Probes: observability

Your answer

Both your primary AND secondary providers go down simultaneously for 10 minutes. What happens? What does the customer see?

Probes: failure mode analysis

Your answer
