Abuse prevention & rate limiting
Token bucket, sliding window, per-user vs per-IP, bot detection.
Rate limits are the only thing between your free tier and a botnet. A system without them is not a product — it's a target.
Read this if your last attempt…
- You didn't mention rate limits in your API design
- You'd limit per-IP and call it done (goodbye, corporate NAT)
- You can't explain token bucket vs sliding window
- You don't know how distributed rate limiting stays consistent
The concept
Rate limiting answers two distinct questions: "how fast can a client go?" (capacity protection) and "is this client abusing us?" (fraud/abuse). The algorithm, the identity axis, the response contract, and the failure mode all matter — each is a design decision, not a default.
Four design axes
1. Algorithm — how you count.
- Token bucket: refills at R tokens/sec, capped at B. Allows bursts up to B, then sustains R. Default choice. O(1) storage (one counter + timestamp).
- Leaky bucket: requests queue and drain at R/s. Smooths bursts into constant output. Adds latency, so it is good for outbound shaping to a third party and bad for inbound user requests.
- Fixed window counter: count per calendar minute. Edge-doubling problem: a client hits B in the last second of minute 1 and B in the first second of minute 2 → 2B in 2 seconds. Avoid unless you accept the 2× burst.
- Sliding window log: exact count via a timestamp list. O(requests-in-window) memory; rarely the right tradeoff at scale.
- Sliding window counter: weighted blend of two fixed windows. ~95% accurate, O(1). Good compromise.
- GCRA (Generic Cell Rate Algorithm): leaky-bucket semantics in O(1) with a single timestamp. Implemented by the redis-cell Redis module; a fit for high-QPS production limiters.
Bucket refills at R tokens/sec, capped at B. Request consumes 1 token or is rejected.
Algorithm choice — pick by burst tolerance and storage cost.
| Algorithm | Bursts | Accuracy | Cost |
|---|---|---|---|
| Token bucket | Yes (up to B) | Good | O(1) — two numbers per key |
| Leaky bucket | No — smooths to R/s | Exact rate | O(1) + optional queue |
| Fixed window | Up to 2B at edges | Poor | O(1) — one counter |
| Sliding log | Yes | Exact | O(requests in window) |
| Sliding counter | Yes | Very good | O(1) — two counters |
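The fixed-window edge problem and the sliding-counter fix can be sketched in a few lines. This is an in-memory, single-key illustration rather than a production limiter; the class names, the `allow(nowMs)` signature, and the caller-supplied clock are assumptions made for the example.

```typescript
// Illustrative sketch: in-memory, one key, caller supplies the clock in milliseconds.
interface Limiter {
  allow(nowMs: number): boolean;
}

class FixedWindow implements Limiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private count = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(nowMs: number): boolean {
    const start = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (start !== this.windowStart) {
      // New calendar window: the counter resets to zero.
      this.windowStart = start;
      this.count = 0;
    }
    if (this.count >= this.limit) return false;
    this.count += 1;
    return true;
  }
}

class SlidingWindowCounter implements Limiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private windowStart = 0;
  private current = 0;
  private previous = 0;

  constructor(limit: number, windowMs: number) {
    this.limit = limit;
    this.windowMs = windowMs;
  }

  allow(nowMs: number): boolean {
    const start = Math.floor(nowMs / this.windowMs) * this.windowMs;
    if (start !== this.windowStart) {
      // Roll forward. If more than one window has passed, the previous one is empty.
      this.previous = start - this.windowStart === this.windowMs ? this.current : 0;
      this.current = 0;
      this.windowStart = start;
    }
    // Weight the previous window by how much of it the sliding window still covers.
    const overlap = 1 - (nowMs - start) / this.windowMs;
    const estimate = this.previous * overlap + this.current;
    if (estimate >= this.limit) return false;
    this.current += 1;
    return true;
  }
}

function countAllowed(limiter: Limiter, times: number[]): number {
  return times.filter((t) => limiter.allow(t)).length;
}

// 100 requests in the last second of minute 1, then 100 in the first second of minute 2.
const boundaryBurst: number[] = [];
for (let i = 0; i < 100; i++) boundaryBurst.push(59_000 + i * 10);
for (let i = 0; i < 100; i++) boundaryBurst.push(60_000 + i * 10);

const fixedAllowed = countAllowed(new FixedWindow(100, 60_000), boundaryBurst);
const slidingAllowed = countAllowed(new SlidingWindowCounter(100, 60_000), boundaryBurst);
```

With a limit of 100 per minute, the fixed window admits the whole boundary burst, while the sliding counter admits the first 100 and then only a trickle as the previous window's weight decays.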
How interviewers grade this
- You name the algorithm (token bucket default) and the burst + sustained rate.
- You list the identity axes you limit on (IP, API key, user, path).
- You distinguish local (per-gateway) from global (cross-gateway) limiting.
- You specify the response (429 with Retry-After; standardised headers like RateLimit-*).
- You size the limiter storage (Redis ops/s, memory).
Variants
Per-key token bucket in Redis
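One atomic Lua script per (identity, bucket): one round-trip per request.
The default for authenticated APIs. Key = (api_key, bucket). The script refills the bucket, debits one token, and returns allow or deny in a single atomic step. A plain INCR + EXPIRE counter is simpler, but it is a fixed-window counter rather than a token bucket and inherits the edge-doubling problem. Cluster keys by hash tag so the same identity hits the same shard.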
Pros
- Globally consistent
- Cheap per request (~1 ms Redis round-trip)
- Scales to millions of keys
Cons
- Redis is a dependency on the hot path
- Slight request-latency tax
- Hot keys (celebrity users) skew shards
Choose this variant when
- Public APIs
- Need cross-gateway consistency
- Redis already on the stack
Local limit with periodic sync
Each gateway keeps its own counter; sync to shared store every N seconds.
Eliminates Redis round-trip per request. Accepts small over-limit windows (the sync interval). Fine for DDoS protection; unsafe for precise quotas.
The regional variation: each region has its own bucket, usage streams to a central reconciler, and global quotas are enforced at billing time instead of on the hot path.
Pros
- Zero per-request external call
- Survives Redis outage
- Low latency tax
Cons
- Over-limit by up to (gateway count × local bucket) in worst case
- Not suitable for precise billing quotas
Choose this variant when
- High-QPS edge
- Approximation is acceptable
- Need to survive limiter outage
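A minimal sketch of the local-counter-with-sync idea follows, assuming a hypothetical `SharedCounter` store (standing in for something like a Redis counter). The interface, class names, and the synchronous store call are all assumptions for illustration; a real store would be asynchronous and the counters would also reset per window, which is omitted here.

```typescript
// Hypothetical shared store: adds a delta to the global counter and returns the new total.
interface SharedCounter {
  addAndGet(key: string, delta: number): number;
}

class LocalSyncedLimiter {
  private readonly limit: number;
  private readonly store: SharedCounter;
  private pending = new Map<string, number>();    // local hits not yet pushed
  private globalSeen = new Map<string, number>(); // last global total fetched

  constructor(limit: number, store: SharedCounter) {
    this.limit = limit;
    this.store = store;
  }

  // Hot path: no external call. Decides on last known global total + unsynced local hits.
  allow(key: string): boolean {
    const local = this.pending.get(key) ?? 0;
    const known = this.globalSeen.get(key) ?? 0;
    if (known + local >= this.limit) return false;
    this.pending.set(key, local + 1);
    return true;
  }

  // Run every N seconds from a timer: push local deltas, pull the fresh global total.
  sync(): void {
    const keys = new Set<string>();
    this.pending.forEach((_count, key) => keys.add(key));
    this.globalSeen.forEach((_count, key) => keys.add(key));
    keys.forEach((key) => {
      this.globalSeen.set(key, this.store.addAndGet(key, this.pending.get(key) ?? 0));
    });
    this.pending.clear();
  }
}

// In-memory stand-in for the shared store, used only to exercise the sketch.
class InMemoryCounter implements SharedCounter {
  private totals = new Map<string, number>();
  addAndGet(key: string, delta: number): number {
    const next = (this.totals.get(key) ?? 0) + delta;
    this.totals.set(key, next);
    return next;
  }
}

const sharedStore = new InMemoryCounter();
const gatewayA = new LocalSyncedLimiter(10, sharedStore);
const gatewayB = new LocalSyncedLimiter(10, sharedStore);

let admittedBeforeSync = 0;
for (let i = 0; i < 15; i++) {
  if (gatewayA.allow("key-1")) admittedBeforeSync++;
  if (gatewayB.allow("key-1")) admittedBeforeSync++;
}
gatewayA.sync();
gatewayB.sync();
```

Between syncs each gateway admits up to the full limit on its own, which is the (gateway count × local bucket) overage listed under Cons; after a sync both gateways see the global total and deny.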
Layered (edge + app + backend)
IP-level at CDN/edge, API-key at gateway, user-id at app.
Defence in depth. Each layer catches a different abuse class: CDN catches volumetric attacks, gateway catches API-key theft, app catches authenticated abuse.
A limit at only one layer is brittle. CDN alone misses authenticated abuse; gateway alone is bypassed by volumetric floods. Layer all three so failure at any single point leaves the rest enforcing.
Pros
- Multiple abuse classes covered
- Failure of one layer doesn't expose the system
Cons
- Operational complexity
- Must coordinate limits to avoid false positives
Choose this variant when
- Production public APIs
- Systems that have been abused once already
Worked example
Design: rate limits for a public REST API.
Layers:
1. CDN: per-IP, 1000 req/min. Catches volumetric abuse before it hits origin. Cloudflare / CloudFront rules.
2. API gateway: per-API-key, token bucket, burst=100, sustain=10/s (free tier); burst=1000, sustain=100/s (paid). Redis-backed, with one atomic Lua script per check (refill, debit, PEXPIRE).
3. Per-endpoint overrides: POST /search is 10× cheaper internally than POST /export; separate buckets.
Response contract:
- 429 Too Many Requests with Retry-After header.
- RateLimit-Limit / RateLimit-Remaining / RateLimit-Reset headers on every response (standard draft).
Sizing:
- 10k customers, peak 1k active, 100 req/s each → 100k req/s at peak.
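- Redis: 100k limiter checks/s. Plain INCR traffic at this rate can fit one well-provisioned primary, but atomic token-bucket scripts cost more per call, so plan to shard keys across several primaries.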
Edge cases:
- Corporate NAT: don't limit only on IP. Always combine with API key / cookie where possible.
- Unauthenticated (anonymous) traffic: with no user id or API key to limit on, the fallback identity is IP + cookie hash.
- Burst-friendly clients: size burst at 10× sustained for UX friendliness.
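The layers above can be captured as a small policy table. The sketch below is one possible shape, not a prescribed one: the tier numbers come from the worked example, while the key format, the `resolveBucket` helper, and the choice to scale the /export bucket down by its 10× relative cost are assumptions made for illustration.

```typescript
type Tier = "free" | "paid";

interface BucketPolicy {
  burst: number;
  sustainedPerSec: number;
}

// Tier limits taken from the worked example.
const TIER_POLICY: Record<Tier, BucketPolicy> = {
  free: { burst: 100, sustainedPerSec: 10 },
  paid: { burst: 1000, sustainedPerSec: 100 },
};

// Assumption: expensive endpoints are listed with their cost relative to a normal call.
const ENDPOINT_COST: { [endpoint: string]: number | undefined } = {
  "POST /export": 10,
};

function resolveBucket(
  apiKey: string,
  tier: Tier,
  method: string,
  path: string,
): { key: string; policy: BucketPolicy } {
  const endpoint = `${method} ${path}`;
  const cost = ENDPOINT_COST[endpoint];
  const base = TIER_POLICY[tier];
  if (cost === undefined) {
    // Ordinary endpoints share the key's default bucket.
    return { key: `rl:${apiKey}:default`, policy: base };
  }
  // Expensive endpoints get a separate, smaller bucket (hypothetical scaling rule).
  return {
    key: `rl:${apiKey}:${endpoint}`,
    policy: {
      burst: Math.max(1, Math.floor(base.burst / cost)),
      sustainedPerSec: base.sustainedPerSec / cost,
    },
  };
}

const searchBucket = resolveBucket("abc123", "free", "POST", "/search");
const exportBucket = resolveBucket("abc123", "free", "POST", "/export");
```

Keeping the policy in data rather than in branching code makes it easier to coordinate limits across layers, which the layered variant lists as its main operational cost.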
Good vs bad answer
Interviewer probe
“How do you rate-limit your API?”
Weak answer
"1000 requests per minute per IP."
Strong answer
"Token bucket per API key: burst 100, sustained 10/s for free tier; 10× that for paid. Redis-backed, with one atomic Lua script per check. CDN does a per-IP volumetric layer at 1000/min to absorb blunt attacks before they hit origin. Response is 429 + Retry-After + RateLimit-* headers so clients can back off gracefully. Per-endpoint buckets for expensive operations: /export doesn't share a bucket with /search."
Why it wins: Names algorithm, identity axis, layering, storage, response contract, and per-endpoint scaling.
When it comes up
- Any public API — the interviewer will ask "what stops abuse?"
- When you're designing a free tier or quota system
- During capacity discussions — a rogue client can DoS you
- When authentication or API keys come up
- When an expensive endpoint appears in the design (search, export, heavy compute)
Order of reveal
1. State the two problems rate limits solve. "Capacity protection (a rogue client can't overwhelm us) and abuse/fraud (one identity can't farm quota). Different axes, sometimes the same tool."
2. Pick the algorithm with a reason. "Token bucket by default: burst B, sustained R. Allows legitimate bursts without punishing the user; simpler to explain than sliding windows."
3. Name the identity axis, not just the IP. "Primary axis is API key for authenticated traffic. Per-user inside a shared key if abuse is per-user. IP only as a CDN-level volumetric safety net."
4. Layer the limits. "Three layers: CDN does per-IP volumetric, gateway does per-API-key, app does per-user and per-expensive-endpoint. Each catches a different abuse class."
5. Commit to a storage design. "Redis with one atomic Lua script per (key, bucket) that refills, debits, and sets a TTL. One round-trip per request, ~1 ms added latency. Clustered by hash tag so the same identity hits the same shard."
6. Specify the response contract. "429 Too Many Requests, Retry-After header, and RateLimit-Limit / Remaining / Reset headers on every response. Clients can back off gracefully."
7. Name the failure mode. "When the limiter is down, fail open for DDoS-style limits and fail local for quota-style limits. Never hard-fail the whole API because the counter is unreachable."
Signature phrases
- “Token bucket is the default; justify anything else” — Prevents fixed-window-edge-doubling mistakes.
- “Per-IP only denies service to corporate NATs” — Catches the single most common junior answer.
- “Layered limits: CDN, gateway, app” — Defence in depth in one phrase.
- “Fail open, not closed” — Prevents turning a limiter outage into a full API outage.
- “Expensive endpoints get their own bucket” — Shows you've costed work, not just request count.
- “429 with Retry-After and RateLimit-* headers” — The exact response contract clients need.
Likely follow-ups
“Walk me through an exact token bucket implementation in Redis. What commands, what edge cases?”
Data structure per identity: a hash bucket:{api_key} with two fields:
- tokens: current token count (float, supports partial refill).
- last_refill: millisecond timestamp of the last refill.
Request flow (atomic via Lua script to avoid races):
```lua
-- KEYS[1] = bucket key
-- ARGV    = { now_ms, refill_rate (tokens per ms), burst }
local now   = tonumber(ARGV[1])
local rate  = tonumber(ARGV[2])
local burst = tonumber(ARGV[3])

local data   = redis.call('HMGET', KEYS[1], 'tokens', 'last_refill')
local tokens = tonumber(data[1]) or burst  -- first request: start with a full bucket
local last   = tonumber(data[2]) or now

-- Refill based on elapsed time, capped at burst.
-- math.max guards against a gateway whose clock is behind the stored timestamp.
local elapsed = math.max(0, now - last)
tokens = math.min(burst, tokens + elapsed * rate)

local allowed = 0
if tokens >= 1 then
  tokens = tokens - 1
  allowed = 1
end

redis.call('HSET', KEYS[1], 'tokens', tokens, 'last_refill', math.max(now, last))
-- TTL idle identities out. The TTL must cover a full refill (burst / rate ms);
-- a shorter TTL would hand a slow-refill bucket a fresh burst too early.
redis.call('PEXPIRE', KEYS[1], math.max(60000, math.ceil(burst / rate)))
return allowed  -- 1 = allow, 0 = deny
```
Why a Lua script: two-command sequences (GET then SET) race under concurrency. Two gateways could both see 1 token and both allow. The script is atomic inside Redis.
Edge cases:
1. First request from an identity: the key doesn't exist; initialise to full burst. Handled by the "or burst" and "or now" fallbacks.
2. Long idle period: without a TTL, Redis keeps the key forever. PEXPIRE renews the TTL as the key is used; idle keys expire automatically.
3. Clock skew between gateways: use redis.call('TIME') inside the script instead of client-sent time. Cleaner, at the cost of one extra command.
4. Hot key: one celebrity API key could overload the shard hosting its bucket. Mitigate with a client-side local pre-check (skip the Redis call if locally we know the bucket is very empty) plus sharding the bucket across bucket:{api_key}:{shard_n}.
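Per-request cost: ~1 ms (Redis RTT within the region) + ~50 μs of script execution. Redis executes scripts on a single thread, so at ~50 μs each one primary saturates near 20K checks/s; 100K QPS therefore needs the keys sharded across several primaries (or a much cheaper script).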
“What happens when Redis goes down? What does the system do?”
Three strategies, depending on what the limit protects:
1. DDoS-style volumetric limits: FAIL OPEN.
- If the limiter is unavailable, allow the request.
- Log the gap; alert on sustained fail-open.
- Rationale: a DoS attack on the limiter (or a Redis cluster incident) should not itself cause a full API outage. Fail-open trades some worst-case overage for resilience.
- Downstream capacity absorbs it; overload alarms fire if there is actually a flood.
2. Quota / billing limits: FAIL LOCAL.
- Each gateway falls back to an in-memory bucket with a conservative limit (say 1/N of the global, where N is the gateway count).
- Still enforces some limit; accepts over-limit by a factor during the outage.
- Logs the identities that were throttled locally — post-incident, reconcile billing.
- Rationale: a 15-minute outage where a single customer consumes 10× their quota is a billing issue, not a business risk.
3. Hard, must-enforce limits (financial, compliance): FAIL CLOSED.
- Reject requests. Return 503 Service Unavailable.
- Only for flows where "serve no request" is genuinely better than "serve a request unlimited".
- Rare in practice. Most limits should not be in this category.
The anti-pattern to avoid: fail-closed by default. A single-point-of-failure limiter on the hot path is an outage waiting to happen. Default to fail-open, escalate fail-local only where quota matters, fail-closed almost never.
Operational tooling for fail-open:
- Kill switch — feature flag to force fail-open without redeploying. Useful when Redis is slow but not down.
- Circuit breaker on Redis calls — if Redis p99 spikes, short-circuit and fail-open proactively instead of letting every request pay the timeout.
- Degraded-mode metric — dashboard shows "limiter is in degraded mode" so incident response knows.
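The three failure modes and the circuit breaker can be sketched together. This is a minimal illustration under stated assumptions: `LimiterBreaker`, `degradedDecision`, and `checkLimit` are hypothetical names, the threshold and cooldown values are the caller's choice, and `remoteAllow` stands in for whatever call reaches the shared limiter (with its own timeout).

```typescript
type FailureMode = "open" | "local" | "closed";

interface Decision {
  allow: boolean;
  status?: number;
}

class LimiterBreaker {
  private readonly failureThreshold: number; // trips after this many consecutive failures
  private readonly cooldownMs: number;       // how long to skip the limiter once tripped
  private consecutiveFailures = 0;
  private openedAtMs: number | null = null;

  constructor(failureThreshold: number, cooldownMs: number) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // True while tripped and still cooling down: skip the remote call entirely.
  isTripped(nowMs: number): boolean {
    if (this.openedAtMs === null) return false;
    if (nowMs - this.openedAtMs >= this.cooldownMs) {
      // Cooldown over: let the next request probe the limiter again.
      this.openedAtMs = null;
      this.consecutiveFailures = 0;
      return false;
    }
    return true;
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
  }

  recordFailure(nowMs: number): void {
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.failureThreshold) this.openedAtMs = nowMs;
  }
}

// What to do with a request when the shared limiter cannot be consulted.
function degradedDecision(mode: FailureMode, localAllow: () => boolean): Decision {
  switch (mode) {
    case "open":
      return { allow: true };
    case "local":
      return localAllow() ? { allow: true } : { allow: false, status: 429 };
    case "closed":
      return { allow: false, status: 503 };
  }
}

async function checkLimit(
  breaker: LimiterBreaker,
  mode: FailureMode,
  remoteAllow: () => Promise<boolean>,
  localAllow: () => boolean,
  nowMs: number,
): Promise<Decision> {
  if (breaker.isTripped(nowMs)) return degradedDecision(mode, localAllow);
  try {
    const allow = await remoteAllow();
    breaker.recordSuccess();
    return allow ? { allow: true } : { allow: false, status: 429 };
  } catch {
    breaker.recordFailure(nowMs);
    return degradedDecision(mode, localAllow);
  }
}
```

The mode is chosen per limit rather than globally: a volumetric limit would pass "open", a quota limit "local" with a conservative in-memory bucket behind `localAllow`, and only a must-enforce limit "closed".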
“A user hits their rate limit. What should they see, and what headers does your response carry?”
Status code: 429 Too Many Requests (RFC 6585).
Body (JSON):
```json
{
  "error": "rate_limited",
  "message": "API rate limit exceeded. Retry after 12 seconds.",
  "limit": 100,
  "remaining": 0,
  "reset_at": "2026-04-23T14:35:00Z",
  "docs_url": "https://docs.example.com/api/rate-limits"
}
```
Headers (RateLimit-* fields from the IETF draft draft-ietf-httpapi-ratelimit-headers, increasingly supported):
- RateLimit-Limit: 100 (the policy limit).
- RateLimit-Remaining: 0 (how many requests they have left in the current window).
- RateLimit-Reset: 12 (seconds until the limit resets; the draft uses delta-seconds, whereas some legacy X-RateLimit-Reset headers carry a Unix timestamp instead).
- Retry-After: 12 (how long to wait before retrying, in seconds or as an HTTP-date). Treat it as required: this is how well-behaved clients back off.
On every response (not just 429): Emit RateLimit-Limit, RateLimit-Remaining, and RateLimit-Reset. Clients can proactively throttle themselves before hitting the wall. This reduces 429 volume and makes abuse detection clearer (abusive clients ignore these headers).
What NOT to do:
- Silent drops. Closing the connection without a 429 leaves the client guessing between network failure and rate limit.
- 403 Forbidden. Implies permanent denial; clients won't retry. Use 429.
- 200 with error in body. Breaks clients that rely on HTTP status for retry logic.
- Leaving the backoff math to the client. The server states how long to wait (Retry-After); don't rely on clients to calculate it.
Abuse-detection signals hidden here:
- Clients that ignore Retry-After and hammer anyway are candidates for aggressive throttling or IP-level blocks.
- Clients that back off exactly as instructed are good citizens; whitelist candidates for higher limits.
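From the client's side, honouring the contract mostly means parsing Retry-After correctly, since it may arrive as seconds or as an HTTP-date. A minimal sketch follows; the function name `retryAfterMs` and the choice to return null for a missing or unparseable header are assumptions for illustration.

```typescript
// Returns how long to wait in milliseconds, or null if the header is absent or unparseable.
function retryAfterMs(headerValue: string | null, nowMs: number): number | null {
  if (headerValue === null) return null;
  const trimmed = headerValue.trim();
  if (/^\d+$/.test(trimmed)) {
    // Delta-seconds form, e.g. "12".
    return Number(trimmed) * 1000;
  }
  // HTTP-date form, e.g. "Thu, 23 Apr 2026 14:35:00 GMT".
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return null;
  // A date already in the past means the client may retry immediately.
  return Math.max(0, dateMs - nowMs);
}
```

A client would call it with the header value and its current clock, for example `retryAfterMs(response.headers.get("Retry-After"), Date.now())`, and fall back to its own backoff only when the result is null.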
“How do you handle rate limits for a globally distributed API with gateways in 10 regions?”
The core tension: global consistency (a single quota across regions) vs per-request latency (cross-region Redis call adds 50-150 ms).
Three strategies, pick by use case:
1. Regional limits, no global sync. Each region enforces its own bucket (e.g., 100K req/min per region per key). Users stay in their region (geo-DNS). Simple, low-latency, per-request cost ~1 ms local Redis.
- Overage: worst case, a user with cross-region requests gets N× their intended limit (where N = number of regions). Usually acceptable.
- Use case: most public APIs. Customers rarely cross regions; the overage is small.
2. Global limit with local leases. Each region gets a share of the global quota — e.g. the us-east gateway is leased 60% of the 100K/min limit. Regions negotiate lease sizes periodically based on demand (every 30 s). Requests check the local lease without cross-region round-trip.
- Overage: bounded to the lease granularity. Adjustment lag means a surging region is briefly under-provisioned before the next negotiation.
- Use case: workloads with high imbalance (big customer concentrated in one region). Closer to global accuracy without per-request latency.
3. Strict global via cross-region Redis (or a consensus store). Every request hits a single authoritative limiter, possibly replicated for availability but coordinating through Raft or similar.
- Overage: none (strict).
- Cost: per-request RTT includes cross-region hop. ~100 ms added latency for cross-continent.
- Use case: hard compliance / billing quotas where accuracy matters more than latency. Rarely the right answer.
Practical recommendation for most APIs: per-region limits with a global cap on paid tiers enforced asynchronously (at billing time, not at request time). This gives low latency for the hot path and accurate billing for customers, at the cost of accepting small intra-second overages.
The interview takeaway: don't propose strict global limits reflexively. Name the overage tolerance and match the strategy. "Per-region, global billing reconciliation" is the senior answer.
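For the lease-based strategy, the periodic negotiation step could look like the sketch below. It is one possible allocation rule, not the only one: the `allocateLeases` name, the `RegionDemand` shape, and the per-region floor (so a quiet region is never leased zero) are assumptions made for the example.

```typescript
interface RegionDemand {
  region: string;
  recentRequests: number; // requests seen during the last negotiation interval
}

// Splits a global quota into regional leases, proportional to recent demand.
// Each region first receives floorPerRegion; the remainder is shared by demand.
function allocateLeases(
  globalLimit: number,
  demand: RegionDemand[],
  floorPerRegion: number,
): Map<string, number> {
  const distributable = globalLimit - floorPerRegion * demand.length;
  const totalDemand = demand.reduce((sum, d) => sum + d.recentRequests, 0);
  const leases = new Map<string, number>();
  for (const d of demand) {
    // Integer arithmetic with a floor, so the leases never sum above the global limit.
    const proportional =
      totalDemand > 0
        ? Math.floor((distributable * d.recentRequests) / totalDemand)
        : Math.floor(distributable / demand.length);
    leases.set(d.region, floorPerRegion + proportional);
  }
  return leases;
}

const exampleLeases = allocateLeases(
  100_000,
  [
    { region: "us-east", recentRequests: 6000 },
    { region: "eu-west", recentRequests: 3000 },
    { region: "ap-south", recentRequests: 1000 },
  ],
  5000,
);
```

Each gateway then enforces its lease locally with an ordinary token bucket; the floor trades a little accuracy for protection against a region being starved until the next negotiation.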
“Token bucket vs leaky bucket vs sliding window — when would you actually pick each?”
Token bucket — the default.
- Allows bursts up to B then sustained at R.
- Matches typical user behavior (occasional bursts) without punishing legitimate spikes.
- O(1) storage, trivial to implement in Redis.
- Pick when: most APIs. Unless you have a specific reason, this.
Leaky bucket — smoothing, not limiting.
- Queue at input, constant drain rate.
- Bursts get delayed, not rejected — adds latency.
- Pick when: you have a hard downstream rate limit (e.g., you call a 3rd-party API that accepts max 10 req/s) and want to smooth your outbound calls rather than drop them. It's an outbound shaping tool, not an inbound rejection tool.
- Don't pick when: your users prefer 429s to silent 5-second waits. Queueing violates user expectations.
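As an outbound shaper, a leaky bucket reduces to assigning each call the next free send slot. The sketch below assumes a hypothetical `OutboundShaper` class with a cap on how long a call may wait; the names and the caller-supplied clock are illustrative.

```typescript
class OutboundShaper {
  private readonly intervalMs: number;       // spacing between sends: 1000 / rate
  private readonly maxQueueDelayMs: number;  // longest wait tolerated before giving up
  private nextSlotMs = 0;

  constructor(ratePerSec: number, maxQueueDelayMs: number) {
    this.intervalMs = 1000 / ratePerSec;
    this.maxQueueDelayMs = maxQueueDelayMs;
  }

  // Returns how long to delay this call, or null if the queue is already too deep.
  schedule(nowMs: number): number | null {
    const slot = Math.max(nowMs, this.nextSlotMs);
    const delay = slot - nowMs;
    if (delay > this.maxQueueDelayMs) return null;
    this.nextSlotMs = slot + this.intervalMs;
    return delay;
  }
}

// Shape calls to a third party that accepts at most 10 req/s; wait at most 250 ms.
const shaper = new OutboundShaper(10, 250);
const delays = [shaper.schedule(0), shaper.schedule(0), shaper.schedule(0), shaper.schedule(0)];
```

The caller would wait for the returned delay before sending; a null result is the point where the shaper falls back to rejecting rather than queueing further.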
Fixed window counter — avoid.
- Count requests per calendar minute.
- The edge-doubling problem: a client can do B in the last second of minute 1 and B in the first of minute 2 → 2B in 2 seconds.
- Pick when: you explicitly accept the 2× burst and want O(1) simplicity. Rare.
Sliding window log — accurate but expensive.
- Store every request timestamp, count those in the last N seconds.
- O(requests-in-window) memory; O(log n) or O(n) per check.
- Pick when: you need exact accuracy for small N (tiny per-second bucket) and memory isn't a concern. Rarely the right tradeoff at scale.
Sliding window counter — the compromise.
- Two adjacent fixed windows, weighted by position in current window.
- ~95% accurate, O(1) storage.
- Pick when: you want fixed-window simplicity without the edge-doubling problem. Good for analytics-style limits where a small error is acceptable.
GCRA (Generic Cell Rate Algorithm) — the sophisticated option.
- Mathematically equivalent to leaky bucket, but O(1) with a single timestamp (no queue).
- Implemented by the redis-cell Redis module; precise and cheap.
- Pick when: you want leaky-bucket semantics without queueing cost. High-QPS production limiters.
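GCRA's single-timestamp trick is easiest to see in code. The sketch below is an in-memory illustration, assuming a hypothetical `Gcra` class and a caller-supplied clock; the stored value is the theoretical arrival time (TAT), and the burst is expressed as a tolerance of `burst` emission intervals.

```typescript
class Gcra {
  private readonly emissionIntervalMs: number; // T = 1000 / rate
  private readonly toleranceMs: number;        // how far ahead of schedule a client may run
  private tatMs = 0;                           // theoretical arrival time: the only stored state

  constructor(ratePerSec: number, burst: number) {
    this.emissionIntervalMs = 1000 / ratePerSec;
    this.toleranceMs = this.emissionIntervalMs * burst;
  }

  check(nowMs: number): { allowed: boolean; retryAfterMs: number } {
    // Where the schedule would stand if this request were admitted.
    const newTat = Math.max(this.tatMs, nowMs) + this.emissionIntervalMs;
    const allowAt = newTat - this.toleranceMs;
    if (nowMs < allowAt) {
      // Too far ahead of schedule: deny and leave the stored timestamp untouched.
      return { allowed: false, retryAfterMs: Math.ceil(allowAt - nowMs) };
    }
    this.tatMs = newTat;
    return { allowed: true, retryAfterMs: 0 };
  }
}

// 10 req/s sustained with a burst of 5.
const gcra = new Gcra(10, 5);
const burstResults: boolean[] = [];
for (let i = 0; i < 6; i++) burstResults.push(gcra.check(1000).allowed);
```

Denied requests cost nothing to store and the retry delay falls out of the same arithmetic, which is why the algorithm suits high-QPS limiters.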
The interview default: token bucket. If asked why not X, name the specific tradeoff (edge-doubling for fixed-window, added latency for leaky-bucket, memory cost for sliding-log).
Code examples
```lua
-- KEYS[1] = bucket key (e.g. "rl:api-key:abc123")
-- ARGV[1] = capacity (B)    ARGV[2] = refill rate (R tokens/sec)
-- ARGV[3] = now (ms)        ARGV[4] = cost (usually 1)
-- Returns: { allowed (0/1), tokens_remaining, retry_after_ms }
local capacity = tonumber(ARGV[1])
local rate     = tonumber(ARGV[2])
local now      = tonumber(ARGV[3])
local cost     = tonumber(ARGV[4])

local data   = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens = tonumber(data[1]) or capacity  -- unseen key starts full
local last   = tonumber(data[2]) or now

-- Refill since last check, capped at capacity.
local elapsed = math.max(0, now - last)
tokens = math.min(capacity, tokens + (elapsed * rate / 1000.0))

local allowed = 0
local retry_after = 0
if tokens >= cost then
  tokens = tokens - cost
  allowed = 1
else
  -- Time until enough tokens have refilled to cover the cost.
  retry_after = math.ceil((cost - tokens) * 1000.0 / rate)
end

-- HSET with several field/value pairs replaces the deprecated HMSET (Redis 4.0+).
redis.call('HSET', KEYS[1], 'tokens', tokens, 'ts', now)
-- Key expires after 2× fill time, which limits memory for idle keys.
redis.call('PEXPIRE', KEYS[1], math.ceil(2 * capacity * 1000 / rate))
return { allowed, math.floor(tokens), retry_after }
```
```typescript
import type { Request, Response, NextFunction } from 'express';
import Redis from 'ioredis';

const redis = new Redis(process.env.REDIS_URL!);
// SHA1 digest returned by SCRIPT LOAD for the Lua script above.
// Left empty here; EVALSHA fails (and the limiter fails open) until it is filled in.
const bucketScriptSha = '';

export function rateLimit(opts: { capacity: number; refillPerSec: number }) {
  return async (req: Request, res: Response, next: NextFunction) => {
    // Prefer the API key as identity; fall back to IP for unauthenticated calls.
    const key = `rl:${req.header('X-API-Key') ?? req.ip}`;
    const now = Date.now();
    try {
      const [allowed, remaining, retryMs] = (await redis.evalsha(
        bucketScriptSha, 1, key,
        opts.capacity, opts.refillPerSec, now, 1,
      )) as [number, number, number];

      // Emit headers on EVERY response so clients can self-throttle.
      res.setHeader('RateLimit-Limit', opts.capacity);
      res.setHeader('RateLimit-Remaining', remaining);
      res.setHeader(
        'RateLimit-Reset',
        // Seconds until the bucket is full again.
        Math.ceil((opts.capacity - remaining) / opts.refillPerSec),
      );

      if (allowed === 1) return next();

      const retrySec = Math.ceil(retryMs / 1000);
      res.setHeader('Retry-After', retrySec);
      return res.status(429).json({
        error: 'rate_limited',
        message: `Rate limit exceeded. Retry after ${retrySec}s.`,
        limit: opts.capacity,
        remaining,
        retry_after: retrySec,
      });
    } catch (err) {
      // Fail OPEN: a limiter outage must not take down the API.
      // console.warn keeps the example dependency-free; swap in your request logger.
      console.warn('rate-limit: redis unavailable, failing open', err);
      return next();
    }
  };
}
```
Common mistakes
Limiting per IP only. Corporate NATs hide thousands of users behind one IP, so rate-limiting the IP denies service to everyone in the office. Use API key or user id for authenticated paths; IP only as a CDN-level safety net.
Fixed windows with hard edges. An attacker times bursts across the window boundary and gets 2× the intended limit in 2 seconds. Prefer sliding window or token bucket.
No backoff signal in the response. Clients back off blind, either too aggressively (bad UX) or too gently (sustained over-limit). Emit Retry-After and the RateLimit-* headers on every response.
Failing closed when the limiter dies. The limiter goes down and all requests fail. Fallback: fail open (serve but log), or switch to local-counter mode until the limiter recovers. Never hard-fail the whole API because the quota check failed.
Practice drills
Explain token bucket in 30 seconds.
Bucket holds up to B tokens. Refills at R tokens/second. Every request removes one token; if the bucket is empty, reject. This allows bursts up to B then sustained at R/s. One counter + one timestamp per identity.
Your rate limiter is a single Redis. What happens when it goes down?
Two options: (a) fail-open — let requests through while the limiter is unavailable, log the gap, let downstream capacity absorb it; (b) fail to local counters — each gateway falls back to in-memory limiting with a conservative (lower) limit until Redis recovers. Almost always fail-open for DDoS-scale limits and fail-local for quota-style limits. Never fail-closed on the critical path without a very good reason.
Interviewer: "should limits be per API key or per user?"
Depends on who the API key represents. If one key per end-user (mobile app embedding a per-user token), they are the same. If one key per developer app used by many users, limit per key (the app) and separately per user where identifiable. The rule: limit at whatever axis matches the cost. Abuse by one user inside a shared key is a per-user problem; abuse of the key itself is a per-key problem.
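Limiting on two axes at once has one subtlety worth showing: debit a bucket only when every axis agrees, otherwise a throttled user quietly drains the shared key. The sketch below is in-memory and illustrative; `TokenBucket`, `allowOnAllAxes`, and the example limits are assumptions, and in Redis the same check would need to run inside one atomic script.

```typescript
class TokenBucket {
  private readonly capacity: number;
  private readonly ratePerSec: number;
  private tokens: number;
  private lastMs: number;

  constructor(capacity: number, ratePerSec: number, nowMs: number) {
    this.capacity = capacity;
    this.ratePerSec = ratePerSec;
    this.tokens = capacity;
    this.lastMs = nowMs;
  }

  // Refills for the elapsed time, then reports whether a token is available.
  canTake(nowMs: number): boolean {
    const elapsed = Math.max(0, nowMs - this.lastMs);
    this.tokens = Math.min(this.capacity, this.tokens + (elapsed * this.ratePerSec) / 1000);
    this.lastMs = Math.max(this.lastMs, nowMs);
    return this.tokens >= 1;
  }

  take(): void {
    this.tokens -= 1;
  }

  get available(): number {
    return Math.floor(this.tokens);
  }
}

// Admit only if every axis has a token; debit only after all of them agree.
function allowOnAllAxes(buckets: TokenBucket[], nowMs: number): boolean {
  if (!buckets.every((bucket) => bucket.canTake(nowMs))) return false;
  buckets.forEach((bucket) => bucket.take());
  return true;
}

// One developer app key shared by many users, plus a tighter per-user bucket.
const perKey = new TokenBucket(100, 10, 0);
const perUser = new TokenBucket(2, 1, 0);
const axisResults = [
  allowOnAllAxes([perKey, perUser], 0),
  allowOnAllAxes([perKey, perUser], 0),
  allowOnAllAxes([perKey, perUser], 0),
];
```

The third request is denied by the per-user bucket, and the per-key bucket keeps its token for the other users sharing that key.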
Cheat sheet
- Default algorithm: token bucket. Burst B + sustained R.
- Default storage: Redis with one atomic Lua script per bucket. One round-trip per request.
- Limit on identity, not just IP. API key > user id > IP.
- Layer: CDN (volumetric) → gateway (per-key) → app (per-user, per-path).
- Response: 429 + Retry-After + RateLimit-* headers.
- Fail-open on limiter outage: don't DoS yourself.
- Expensive endpoints get their own bucket.