Design an AI Inference Gateway — SystemRound

Hard · System design · 50 min · 5 stages

Asked at: Anthropic, OpenAI, Stripe, Vercel, Cloudflare, Databricks, Notion

Design the inference layer for a B2B platform whose customers embed LLM features in their apps. Each tenant sends millions of prompt requests per day across multiple foundation-model providers. The gateway must route by tenant policy, cache responses safely, fall back across providers on outage, enforce per-tenant cost and rate budgets, stream tokens end-to-end, and keep an auditable log of every call. Cost runaway, provider outages, and noisy-neighbor tenants are the day-1 production failure modes.

Best after a few full reps. Expect follow-up questions, edge cases, and deeper trade-off discussion.

What this problem tests

Inference Endpoint · Streaming Protocol · Control-Plane APIs · Component Layering · Streaming Path · Cache + Fallback Integration

Round shape: 5 stages

Time budget: 50 min

Feedback loop: Grade anytime

Guided practice · Primary loop

Workspace-first, hints visible, stage retry available. The cheap, repeatable loop — build the answer shape before you take it under pressure.

Stage-by-stage workspace instead of a blank page.
Grade one stage or the whole answer whenever you want.
Compare your answer against reference checkpoints and model answers.

Solve once, compare against the checklist, then come back to the weak stage instead of starting over.

Mock interview · Pressure test

Strict timer, hints hidden, debrief deferred to the end. Use this once you can already structure a clean answer and want to pressure-test pacing and pushback.

Best once the answer shape is already in your head.
Pressure-test pacing, pushback handling, and communication.
Use diagnosis after the interview for exact misses and next study steps.

Best after one structured rep · timed · focused on pacing and communication.

Requirements

This is the framing pass. A strong answer quickly defines what the system must do, what quality bar it has to hit, and the numbers that will justify the rest of the design.

First 5 min of the round

What must exist

Functional Requirements

6 items

1. Route prompt requests to multiple model providers (OpenAI, Anthropic, in-house) with a per-tenant configurable fallback chain.

2. Stream model outputs token-by-token end-to-end (SSE) with low time-to-first-token.

3. Per-tenant rate limits (rps and tokens-per-minute) and monthly cost budgets, enforced before forwarding upstream.

4. Response cache keyed on (tenant, model, system prompt, user prompt, sampling params); safe for non-deterministic outputs.

5. Audit every request: tokens in/out, cost, latency, cache_hit, fallback_used, region, model. Every record is billable and queryable per tenant.

6. Below the line: PII redaction (opt-in), data residency (US/EU/APAC), Idempotency-Key for client retries.

What good looks like

Non-Functional Requirements

5 items

1. Latency target is TTFT, not full-response. p50 < 400ms, p99 < 1200ms on cache miss; cache hits < 50ms. Streaming makes full-response latency meaningless.

2. 99.95% gateway uptime even when ONE upstream provider is fully down for up to an hour. Multi-provider is the moat.

3. Cache hit rate > 30% on chat/RAG workloads. Below 20% the gateway is paying provider cost on the hot path for nothing.

4. Audit log loss < 0.01% of records. Billing depends on it: silent loss = silent revenue loss.

5. Budget overshoot bounded to ~60 seconds of spend under any failure mode. Hard caps are real.
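To see why the uptime target forces multi-provider fallback, it helps to turn 99.95% into a downtime budget. The sketch below assumes a 30-day measurement window (the source does not name one); `downtime_budget_minutes` is a hypothetical helper.

```python
MINUTES_PER_30_DAYS = 30 * 24 * 60  # 43,200 minutes (assumed measurement window)


def downtime_budget_minutes(
    availability: float, window_minutes: int = MINUTES_PER_30_DAYS
) -> float:
    """Minutes of allowed downtime in the window for a given availability target."""
    return (1.0 - availability) * window_minutes


budget = downtime_budget_minutes(0.9995)
provider_outage_minutes = 60  # "fully down for up to an hour"

print(f"downtime budget: {budget:.1f} min per 30 days")
print("one provider outage exceeds budget:", provider_outage_minutes > budget)
```

Under that assumption the budget is about 21.6 minutes per 30 days, so a single one-hour provider outage would blow it unless the gateway keeps serving from another provider.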

Numbers to anchor the design

Scale Estimation

5 items

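1. 500 enterprise tenants. 50M requests/day → ~580 rps average, ~10K rps design peak. Note that 12-hour business-window concentration with a 1.5x burst factor alone gives only ~1.7K rps, so the 10K figure assumes additional burst headroom (about 17x average).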

2. Long-tail: top 10 tenants drive 60% of volume. One noisy tenant can starve the rest if not isolated.

3. Audit log: 50M rows/day × ~500 bytes (with prompt hashes, no full bodies) = ~25 GB/day → ~750 GB/month → ~9 TB/year. ClickHouse-shaped.

4. Cache working set: assume top-1M unique prompts daily at 30% hit rate → ~300 GB working set with 1-week eviction. Redis-cluster sized.

5. Concurrent streams: 10K rps × ~5s avg full response = ~50K concurrent streams at peak. Connection pool sizing falls out of this.
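The estimates above are quick to check with back-of-envelope arithmetic. The sketch below only reuses the figures stated in this section; the variable names are illustrative, and the 30-day month and decimal gigabytes are assumptions.

```python
SECONDS_PER_DAY = 86_400

requests_per_day = 50_000_000
avg_rps = requests_per_day / SECONDS_PER_DAY  # ~579 rps

# All traffic squeezed into a 12-hour business window, with a 1.5x burst on top.
business_window_rps = requests_per_day / (12 * 3600)  # ~1,157 rps
burst_rps = 1.5 * business_window_rps                 # ~1,736 rps

# Audit log volume at ~500 bytes per row (decimal GB/TB, 30-day month).
audit_gb_day = requests_per_day * 500 / 1e9
audit_gb_month = audit_gb_day * 30
audit_tb_year = audit_gb_month * 12 / 1000

# Little's law: concurrency = arrival rate x average time in system.
peak_rps = 10_000
avg_stream_seconds = 5
concurrent_streams = peak_rps * avg_stream_seconds

print(round(avg_rps), round(burst_rps))
print(audit_gb_day, audit_gb_month, audit_tb_year)
print(concurrent_streams)
```

The storage and concurrency numbers reproduce exactly. The peak does not: window concentration plus a 1.5x burst lands near 1.7K rps, so 10K rps is best stated in the interview as a provisioning ceiling rather than a derived figure.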

How the round unfolds

Each stage has a distinct job. Treat them like separate deliverables instead of one giant answer, and the round becomes much easier to navigate.

4 design stages · 40 pts after framing

🔌

Stage 2 · ~7 min · 10 pts

API Design

Define the contract clearly: the endpoints, auth boundary, error semantics, and the one or two decisions that matter most.

What you should produce

Define the API. How do customer applications send prompts? How does streaming work end-to-end? How do tenants override defaults like model pinning...

Strong answers cover

Inference Endpoint · Streaming Protocol · Control-Plane APIs
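For the streaming protocol, it helps to show the wire format. The sketch below frames tokens as Server-Sent Events: each event is one or more `field: value` lines followed by a blank line. The payload fields (`request_id`, `index`, `delta`) and the closing `done` event are hypothetical choices for illustration, not a provider's actual schema.

```python
import json
from typing import Iterable, Iterator


def sse_events(tokens: Iterable[str], request_id: str) -> Iterator[str]:
    """Frame each token as one SSE event, then emit a terminal event."""
    for index, token in enumerate(tokens):
        # JSON-encoding keeps the payload on a single line even if the
        # token itself contains a newline, which SSE framing requires.
        payload = json.dumps(
            {"request_id": request_id, "index": index, "delta": token}
        )
        yield f"data: {payload}\n\n"
    # Explicit end-of-stream marker so clients can tell completion from a drop.
    yield "event: done\ndata: {}\n\n"


frames = list(sse_events(["Hel", "lo"], request_id="req_123"))
for frame in frames:
    print(repr(frame))
```

The `index` field is one way to let a client detect gaps after a reconnect. Whether the gateway supports resuming a stream is a design decision worth stating out loud.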

🏗️

Stage 3 · ~12 min · 10 pts

High-Level Architecture

Lay out the main components and trace the write path, read path, and any async path cleanly.

What you should walk through

Walk me through the high-level architecture.

Strong answers cover

Component Layering · Streaming Path · Cache + Fallback Integration

💾

Stage 4 · ~10 min · 10 pts

Data Model & Storage

Pick the store, show the schema or key model, and explain why that storage choice fits the access pattern.

What you should lock in

Walk me through the data model.

Strong answers cover

Cache Key Construction · Audit Log Schema · Budget Counter & Tenant Config
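Cache key construction is easiest to defend with a concrete sketch. The one below hashes a canonical JSON encoding of the fields named in the functional requirements (tenant, model, system prompt, user prompt, sampling params). The function names, the key prefix, and the rule "only cache when temperature is 0" are assumptions for illustration; the latter is one possible reading of "safe for non-deterministic outputs", not the only one.

```python
import hashlib
import json


def cache_key(
    tenant_id: str,
    model: str,
    system_prompt: str,
    user_prompt: str,
    sampling_params: dict,
) -> str:
    """Build a deterministic, tenant-scoped cache key."""
    canonical = json.dumps(
        {
            "tenant": tenant_id,
            "model": model,
            "system": system_prompt,
            "user": user_prompt,
            "sampling": sampling_params,
        },
        sort_keys=True,          # parameter order must not change the key
        separators=(",", ":"),   # no whitespace variation
        ensure_ascii=False,
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    # Tenant prefix (hypothetical layout) keeps per-tenant scans and purges cheap.
    return f"cache:{tenant_id}:{digest}"


def is_cacheable(sampling_params: dict) -> bool:
    """Assumed policy: only cache requests sampled with temperature 0."""
    return sampling_params.get("temperature", 1.0) == 0


a = cache_key("t1", "model-x", "sys", "hello", {"temperature": 0, "top_p": 1})
b = cache_key("t1", "model-x", "sys", "hello", {"top_p": 1, "temperature": 0})
c = cache_key("t2", "model-x", "sys", "hello", {"temperature": 0, "top_p": 1})
print(a == b, a == c)
```

Including the tenant in the hashed material, not just the prefix, means one tenant's cached response can never be served to another even if the rest of the request is identical.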

📈

Stage 5 · ~16 min · 10 pts

Scaling & Deep Dive

Name the first bottleneck, failure modes, and the trade-offs that keep the system fast and reliable under pressure.

What you should pressure-test

Time for the deep dive. The most interesting failure modes here are provider outages cascading into the gateway, noisy-neighbor tenants starving ev...

Strong answers cover

Provider-outage handling · Noisy-neighbor isolation · Cost-runaway protection · Observability
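For noisy-neighbor isolation, a per-tenant token bucket is a common starting point. The sketch below is a minimal in-process version with hypothetical class names and an injectable clock so the behavior is easy to test. A multi-node gateway would need shared or partitioned state, which this sketch does not attempt.

```python
import time
from typing import Callable


class TokenBucket:
    """Classic token bucket: refills at a fixed rate up to a capacity."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self.tokens = capacity      # start full so short bursts are allowed
        self.updated = clock()

    def allow(self, cost: float = 1.0) -> bool:
        now = self.clock()
        # Refill for the elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


class TenantLimiter:
    """One bucket per tenant, so one tenant's burst cannot drain another's."""

    def __init__(self, rate_per_sec: float, capacity: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_sec
        self.capacity = capacity
        self.clock = clock
        self._buckets: dict[str, TokenBucket] = {}

    def allow(self, tenant_id: str, cost: float = 1.0) -> bool:
        bucket = self._buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.capacity, self.clock)
            self._buckets[tenant_id] = bucket
        return bucket.allow(cost)


fake_now = [0.0]  # mutable fake clock for a deterministic demo
limiter = TenantLimiter(rate_per_sec=1.0, capacity=2.0, clock=lambda: fake_now[0])
burst = [limiter.allow("noisy") for _ in range(3)]
quiet_ok = limiter.allow("quiet")
print(burst, quiet_ok)
```

The `cost` argument lets the same bucket meter tokens-per-minute instead of requests by charging each call its token count, which matches the two rate-limit dimensions in the functional requirements.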

What a strong first rep looks like

Scope clearly

Translate the prompt into concrete requirements, scale, and trade-offs before drawing architecture.

Stay stage-specific

Give APIs in the API stage, data models in the storage stage, and failure modes in scaling. Don't blur them together.

Iterate fast

Grade early, compare to the reference checkpoints, fix the biggest misses, and re-submit the weak stage instead of starting over.