Read-heavy
When to reach for this
Reach for this when…
- The interviewer states or implies a read:write ratio of 10:1, 100:1, or higher
- Content is shared, linked, browsed, or broadcast — URL shortener, news feed, product catalog, video metadata
- A single write fans out to a long tail of reads — one upload, millions of views
- NFRs emphasise latency and availability over write durability
- You catch yourself sketching read paths and thinking "this endpoint will get hammered"
Not really this pattern when…
- Writes are roughly as frequent as reads (live chat, collaborative editors, IoT telemetry)
- Every read must reflect the latest write within the same request (banking, inventory, ticket allocation)
- Data is highly personalised per user with no sharing across users (private DMs, account settings)
- The system is small enough that one well-indexed primary handles the load comfortably
Good vs bad answer
Interviewer probe
“How do you scale the read path?”
Weak answer
"Add Redis. That should handle the reads."
Strong answer
"Layered, with hit-rate targets per layer. CDN in front for anything publicly cacheable — 70%+ edge hit rate on shareable URLs, shell-and-slot for logged-in pages. In-process LRU on the app servers — top 100K keys, 5-second TTL — absorbs viral spikes before Redis sees them. Shared Redis behind that, cache-aside, 10-minute TTL with versioned keys for invalidation correctness, probabilistic early refresh for stampede defence. Read replicas for the long tail of cache misses; primary only on rare uncached reads or after a recent write (5-second primary pin for read-your-writes).
At this topology 95%+ of reads never reach the primary. The new bottleneck is cache memory and replication lag, both with explicit budgets and alerts. Hot keys get fanout — profile:taylor:0 through :9 — once monitoring shows them. I track hit rate per layer and alert on a 10% drop, because that is how I find the working set outgrowing the cache before users do."
Why it wins: Names every layer with a hit-rate target, addresses stampede and hot keys explicitly, has an answer for invalidation correctness and read-your-writes, and says what the new bottleneck will be — the signal of having actually operated this.
Cheat sheet
- •Read-heavy is a RATIO statement, not a throughput statement. State it explicitly.
- •Layer caches: CDN -> in-proc LRU -> shared Redis -> replicas -> primary. Each absorbs a slice.
- •Assign a hit-rate target to every layer. Alert on 10% drops. That is how you find regressions.
- •In-process LRU is a SHIELD on top of Redis, not a replacement for it.
- •Staleness budget first; TTL is just the budget made concrete.
- •Probabilistic early refresh, not distributed locks, for stampede defence.
- •Hot keys: in-proc LRU first, key fanout (suffixed copies) second, hot replica pool third.
- •Versioned cache keys sidestep the entire invalidation race-condition class.
- •Read-your-writes: primary pinning for N seconds OR write-through cache. Pick one and document it.
- •CDN with personalised content: shell-and-slot. Cache the shell long, fetch the slot per user.
- •Cold cache after deploy is a real outage. Roll cache fleets one shard at a time.
- •When this stops working: working set too big to cache, OR writes catch up to reads, OR personalisation defeats sharing.
Core concept
Read-heavy is a ratio statement, not a throughput statement. A service doing 50 writes per second against 5,000 reads per second is read-heavy. A service doing 50,000 writes per second against 50,000 reads per second is balanced, even though 50K reads/sec sounds like a lot. The distinction matters because the architectural moves are different: read-heavy systems push reads as far from the database as possible, while balanced and write-heavy systems shape the write path itself. Get the ratio wrong in the first two minutes and every later decision drifts in the wrong direction.
The intuition that holds across every read-heavy system is the hit-rate cascade. Each layer in front of the database has one job, which is to absorb traffic so the next layer down sees less. A million reads at the edge become three hundred thousand at the load balancer, one hundred and fifty thousand at the in-process LRU, fifteen thousand at the shared cache, and roughly ten thousand at the database. The database only sees about one percent of the original wave. That arithmetic — not horsepower, not clever code — is what lets a single primary survive at consumer scale.
Each layer absorbs a slice. The primary only sees the residual ~1% of the original wave — that arithmetic is what lets a single DB survive at consumer scale.
Most candidates jump straight to "add Redis" and stop there. The senior move is to be explicit about which layer absorbs which slice and to assign a hit-rate target to every layer. Targets are how you find regressions: when the in-process LRU drops from 50% to 20%, the access distribution has shifted and the database is about to feel it. When Redis drops below 90%, your working set has outgrown your cache RAM and you are quietly paying for a sharding decision you have not made yet. Without the targets, you only notice when something is on fire.
Behind the cascade are four design choices that most teams trip over. The first is the staleness budget. Every read-heavy system has one — "five minutes is fine for a product card", "thirty seconds for a price", "one second for an inventory count" — and that single number drives TTLs, write-through versus write-behind, and whether you need read-your-writes pinning. Without naming it, every cache decision becomes religious. The second is invalidation. The honest answer in production is "TTL plus explicit invalidation on every known write path", because invalidation alone is fragile (you will miss a path) and TTL alone is sloppy (users see stale data longer than they should). The third is the hot-key problem. Consistent distribution does not save you when one key gets a hundred times more traffic than the median; that is workload imbalance, not structural imbalance, and it needs replication or local caching, not better hashing. The fourth is read-your-writes: a user who just submitted a comment expects to see it instantly, even if your replicas are forty milliseconds behind the primary. You either pin them to the primary briefly, write through the cache, or accept they will hit refresh and curse your product.
Reads being far from the database is not just a performance story. It is a failure-domain story. When the database has a bad five minutes, the cache and the CDN keep serving the bulk of traffic. When the cache fleet has a bad five minutes, the in-process LRU and the database can together still serve the most popular keys. Layered caches mean layered redundancy — the same architecture that gives you throughput also gives you graceful degradation. The corollary is that a cold cache during a deploy or restart is a real outage in disguise. You will see plenty of postmortems where a perfectly fine database fell over because someone restarted Redis and every miss landed simultaneously.
The rest of this pattern walks the architecture from a single primary up through a CDN-dominant global topology, names the variants you can pick between at each step, and goes deep on the four traps that separate a senior answer from a staff one: cache stampede, hot-key replication, versioned invalidation, and read-your-writes consistency.
Canonical examples
- →URL shortener — extreme read skew on the redirect path
- →News feed read path — Facebook, LinkedIn, Twitter
- →Product catalog and search results — Amazon, Yelp, DoorDash
- →Video metadata, thumbnails, channel pages — YouTube
- →Event details and seat maps before a sale — Ticketmaster
- →Wikipedia-style content sites and DNS
Variants
Cache-aside with shared cache
The default. App checks Redis first, falls back to DB, populates cache on miss.
This is what 90% of read-heavy services run in production, and for good reason: the failure modes are well understood, every senior engineer has debugged one, and the library support is mature in every language. The contract is simple — read goes to cache; on miss, read from DB and populate; write goes to DB and invalidates (or updates) the cache.
The two implementation details that matter are TTL and what you do on cache miss. TTL is your staleness budget made concrete. Pick the number from the requirements ("product cards can be 5 minutes stale") and write it into the config; do not let it drift. On miss, naive code reads the DB and writes the cache without any coordination — fine when miss rate is low, catastrophic when a hot key expires (see the stampede deep dive). Production-grade cache-aside always pairs with single-flight or probabilistic early refresh.
The blind spot of cache-aside is invalidation. A write path that forgets to invalidate leaves stale data in the cache for the full TTL. Real teams plug this with explicit invalidation on every write path plus short TTLs as a safety net, plus CDC-based invalidation for the paths the app team did not write. The honest answer is "all three, layered".
Pros
- +Battle-tested; every cache library implements it natively
- +Cache failure is graceful — falls through to DB
- +TTL gives a hard upper bound on staleness
Cons
- −Invalidation is the team's responsibility on every write path
- −Cache miss does the full DB query — needs single-flight to avoid stampede
- −Two round-trips on miss (cache, then DB) — minor latency cost
Choose this variant when
- Default for any read-heavy service with a working set that fits in RAM
- When the access pattern has temporal locality (recently-read items get re-read)
- When 5-30 second staleness is acceptable
Read replicas without a cache
Scale reads at the database tier; skip the cache entirely.
Sometimes the right answer is no cache at all. If your data is highly relational, queries are varied, and the working set is too large to fit in cache RAM at any reasonable cost, read replicas are simpler than a cache layer that bounces between 40% and 70% hit rate.
Modern Postgres or MySQL on dedicated hardware will cheerfully serve 30,000-50,000 reads per second per replica, and you can run five or six replicas before coordinating gets painful. That is enough to handle the long-tail "browse" workload of a mid-sized e-commerce site without ever touching a cache. The win is operational simplicity: no TTL debate, no invalidation strategy, no stampede defence. The cost is replication lag — anywhere from milliseconds to seconds depending on write load — which means you need a story for read-your-writes.
Where this shines is complex query workloads: analytics-style reads, ad-hoc joins, full-text search done in the DB. A cache cannot easily memoise these because the query space is unbounded. Where it falls over is the hot-key problem. Replicas distribute load by query, not by key — the same celebrity profile fetched across all replicas still costs CPU on each one. For mixed workloads, you usually end up combining replicas (long tail) with a small cache (hot tail).
Pros
- +Fewer moving parts — no separate cache fleet to operate
- +No cache invalidation work
- +Strong consistency on the primary; replicas only lag, never diverge
Cons
- −Replicas cost full DB instances — far more $/QPS than Redis
- −Replication lag breaks read-your-writes without pinning
- −Hot keys still saturate every replica; cache is needed for that anyway
Choose this variant when
- Query patterns are too varied to memoise (lots of unique queries)
- Working set is too large for cache to be cost-effective
- You have strong DB ops muscle and weaker cache ops muscle
CDN/edge-dominant
Put a CDN in front of as much of the read path as you can. Origin only sees misses.
The cheapest read is the one your origin never sees. For publicly cacheable content — product pages without per-user data, public posts, video metadata, search results without personalisation — a CDN with a 30-90 second TTL absorbs 70-95% of traffic before it ever reaches your region. You pay flat-rate egress to the CDN, not per-request CPU at origin.
The art is in the cache key. CDNs cache by URL plus a controlled set of vary headers. If your URLs include user-specific query parameters or you vary on too many headers, every user gets their own entry and the hit rate collapses. The discipline is to keep cacheable URLs free of personalisation, push personal data into a separate request issued by the page itself (the "shell + slot" trick — see the deep dive), and sign URLs only when you must.
Modern edge platforms (Cloudflare Workers, Fastly Compute, Vercel Edge) blur the line further: you can run a small amount of code at every PoP, fetch a personalised slot from origin, and stitch it into a cached shell. Used well, you can get 80%+ edge hit rate even on logged-in pages.
Pros
- +Drops latency to 10-30ms globally without you running anything
- +Origin load reduces 5-20x for shareable content
- +CDNs handle DDoS-grade volumetric traffic for free
Cons
- −Cache key discipline is brittle — one wrong vary header and hit rate dies
- −Invalidation across edges is eventually consistent (seconds to minutes)
- −Personalised content needs the shell-and-slot pattern or stays at origin
Choose this variant when
- Content is shareable across users (public, anonymous, or coarsely segmented)
- You need global latency but write to a single primary region
- Static-asset workloads or read-mostly REST/GraphQL endpoints
Materialised read model
Maintain a separate denormalised table shaped exactly for the read query.
Sometimes the cheapest read is not the one you cache, but the one you precompute. For queries that join half a dozen tables or aggregate millions of rows, no amount of indexing will save you — the query is fundamentally expensive. A materialised read model flips the cost: writers pay a small extra cost to update a denormalised view; readers do a single-row, single-table lookup.
Writes hit the normalised OLTP store and feed a stream that materialises a read-shaped table. Reads become single-table lookups.
Implementations sit on a spectrum. At the simple end you have Postgres materialised views, refreshed on a schedule — fine when staleness of minutes is acceptable. In the middle you have double-writes from the application: every write goes to the normalised OLTP table and the denormalised summary table in the same transaction. This works but couples writers to the read shape; every new read pattern needs a writer change. At the high end you have CDC-driven materialisation: writers only touch the OLTP store, a CDC stream (Debezium, Postgres logical replication) feeds an event log, and a stateless materialiser job updates the read model. New read shapes mean a new materialiser, no writer changes.
The mistake to avoid is making the read model too far from the OLTP truth. Materialise the shape of the query, not a wholly different domain. Keep the relationship to the source obvious so debugging "why is the summary wrong?" stays tractable.
Pros
- +O(1) reads regardless of how complex the underlying join is
- +Pre-pays the read cost at write time, when the system is less loaded
- +Decouples read query shape from OLTP schema
Cons
- −Eventual consistency window between OLTP and view
- −Operational overhead — another store to back up, monitor, fail over
- −Schema evolution is painful when many materialisers are running
Choose this variant when
- Read query joins 4+ tables or aggregates millions of rows
- Staleness of seconds-to-minutes is acceptable
- The same expensive query is run by many users (high read amplification)
Scaling path
v1 — single primary, nothing fancy
Get something shipping. One DB handles everything.
Good for up to a few thousand RPS. Bottleneck: primary CPU and connection count.
Modern Postgres or MySQL on a properly-sized box happily serves a few thousand RPS for a normal CRUD workload. Zero cache, zero replica, zero operational surface. The thing that breaks is not the database — it is the team panicking and adding complexity before the database is the bottleneck.
When this is enough: internal tools, early-stage consumer products under ~10k DAU, B2B with low concurrent users, anything where you can fit the working set in DB RAM.
Anti-pattern at this stage: adding Redis "for performance" when your DB is at 12% CPU. You are paying invalidation tax for a bottleneck that does not exist.
What triggers the next iteration
- Primary CPU climbs past 60% sustained — read queries are the dominant cost
- Read latency degrades during write bursts (writes hold locks, reads queue)
- Connection pool exhausts under traffic spikes — clients see timeouts before the DB does
v2 — cache + read replicas
Keep the primary reserved for writes; absorb the read tsunami elsewhere.
Replicas absorb read load; Redis absorbs the hot slice. Writes still hit primary.
Add Redis (cache-aside) and 2+ read replicas behind the app. Route reads cache → replica → primary; writes still hit the primary and replicate out. Hit-rate target: 90%+ on the cache. Instrument and alert when it drops below 80% — that is your signal that the working set has outgrown cache RAM, not a transient blip.
The replication lag budget is where read-your-writes lives. Pick a strategy and write it down: (1) sticky-read to primary for N seconds after a write per user, (2) write-through cache so the user reads from cache even before the replica has caught up, or (3) semi-sync replication if your DB supports it. Without a documented strategy, every team picks a different one and you end up with weird user-visible bugs.
What triggers the next iteration
- Cache hit rate drops below 80% — working set outgrew cache RAM
- Replication lag exceeds your freshness budget under write bursts
- Hot key on one Redis shard saturates that shard while peers idle
v3 — CDN + edge in front of the region
Drop the majority of reads before they enter your region.
For public, cacheable, read-dominated workloads. 95%+ requests never reach origin.
Push public, cacheable content to a CDN with 30-90s TTLs. For logged-in pages, use the shell-and-slot pattern: cache the full HTML shell at the edge, fetch only the small personalised widget from origin. Goal: 80%+ edge hit rate on the read path.
Beyond CDN, in-process LRU on the app servers absorbs hot-key spikes before they hit Redis. A 50K-entry LRU keyed on user-ID + recent-N-items costs maybe 100MB RAM per app server and routinely catches 30-50% of cache traffic that would otherwise hit Redis. The combined topology is often called four-tier: CDN → in-proc LRU → shared Redis → DB.
What triggers the next iteration
- Personalisation fragments the cache key too much for the CDN to help
- Cross-region replication lag breaks read-your-writes for travelling users
- Origin cost actually rises because misses are expensive over long distance
v4 — multi-region with edge dominance
Serve global users from local edges and replicas; keep the write path simple.
CDN absorbs the bulk worldwide; regional replicas serve the rest; primary in one region for writes.
Once you have international users, every cross-region round trip is 100-300ms. The CDN gives you free global presence for cacheable content. For the rest, deploy regional read replicas and a regional cache, with the primary in one region for writes. Reads serve from the closest replica; writes round-trip to the primary region. This is active-passive at the data tier — far simpler than active-active and good enough for most read-heavy workloads.
The thing that breaks at v4 is not architecture, it is cache coherency across regions. A write in US-east invalidates Redis in US-east immediately, but EU-west may serve stale data for the replication lag plus the invalidation broadcast time. Either accept the staleness, broadcast invalidations on a global pub/sub channel, or use versioned cache keys (see the deep dive) and let stale entries become naturally unreachable.
What triggers the next iteration
- Cross-region cache invalidation lags behind the primary by replication time
- Write latency from far regions becomes noticeable to users
- You start needing actual multi-region writes — a different pattern
Deep dives
When to add an index, when to denormalise, when to cache
Each layer absorbs a slice. The primary only sees the residual ~1% of the original wave — that arithmetic is what lets a single DB survive at consumer scale.
Three different tools, three different problems. Confusing them is one of the most common mid-level mistakes in interviews.
Indexing fixes single-table query cost. If your query is "find user by email" and the table has a million rows, you need an index on email. Without it the database scans every row; with it the lookup is a B-tree walk in microseconds. The tradeoff is write cost (every write updates every relevant index) and disk space (~10% of the indexed column's data per index). The fear of "too many indexes hurting writes" is wildly overblown for typical OLTP workloads — under-indexing kills more services than over-indexing ever has. In an interview, when you sketch a schema, name the indexes you would add right then. It signals you are thinking about the read path, not just the data shape.
Denormalisation fixes cross-table query cost. If your query joins users to orders to order_items to products and you serve it 10K times per second, no index will make it fast enough. Denormalisation stores redundant copies — the user's name lives on every order row, the product name lives on every order_item — and reads become single-table lookups. The cost is write complexity: a name change has to propagate everywhere the name is duplicated. For high read:write ratios this is a great trade. The materialised read model variant takes denormalisation to its logical conclusion: a separate table shaped exactly for the read query, kept up-to-date asynchronously.
Caching fixes repeated query cost. If the same query runs many times with the same answer, caching memoises the result. The cost is staleness (the cached answer can be wrong for the TTL window) and invalidation work (every write path must invalidate). Caching does not help with query cost itself — every cache miss still pays the full query cost. So caching makes sense only when (a) the same data is read many times, and (b) the underlying query is already reasonably fast. Caching a slow query just hides it; the first cache miss still hurts and the stampede on TTL expiry can take down the database.
The order matters. Add indexes first — they are free correctness. Denormalise second when query shape itself is the bottleneck. Cache last, after you have done everything you can to make the underlying read cheap. A team that reaches for Redis before adding the obvious index is shipping a more complex system to solve a problem they have not yet earned.
The hot-key problem and request coalescing
Celebrity profile cached under N suffixed keys. Clients pick a random suffix on read; load spreads across N shards instead of one.
A celebrity posts. Within seconds, a million users hit the same key in your cache. Your Redis cluster is happy serving the data — it is in memory, the lookup is O(1) — but a single Redis shard has finite CPU and finite network bandwidth. Serialising the same response a million times per second over a 10 Gbps NIC saturates the wire long before the CPU.
Consistent hashing does not help. By design every read of profile:taylor lands on the same shard. The whole point of the hash is determinism — and that is exactly what is killing you now. This is workload imbalance, not structural imbalance, and it needs different tools.
The first defence is in-process LRU on the app servers. Each app box holds the top-N hottest keys in local memory with a short TTL (5-10 seconds). On a hot-key event, the first request from each app server fetches once from Redis and populates the LRU; the next 99,999 requests on that server serve from local RAM. With 500 app servers and a 5-second LRU TTL, Redis sees roughly 500 requests per 5 seconds for that key — 100 RPS instead of 500K RPS. The cost is staleness inside the LRU window; the win is enormous.
The second defence is request coalescing (also called single-flight) for cache misses on the same server. When a hot key expires and a thousand concurrent requests on one server all miss simultaneously, naive code launches a thousand parallel fetches against Redis or the DB. Single-flight collapses them into one: the first request to miss takes ownership of the fetch and stores its in-flight Future on a map; every subsequent miss within the request looks up the map and awaits the same Future. The origin sees one query; the thousand callers all get the same answer.
Concurrent misses for the same key collapse into one origin call. All callers wait on a shared future and read the same result.
The third defence is key fanout — replicate the hot key to N slots and have clients pick a random slot per read. profile:taylor:0 through profile:taylor:9 each store the same data, distributed across ten different Redis shards. A million reads per second now spread across ten shards, 100K each, comfortably within a single-shard budget. The cost is N× memory and N× invalidation work. You only do this for keys you have identified as hot — usually a small celebrity-tier set surfaced by the cache hit-rate metrics — not for every key.
Celebrity profile cached under N suffixed keys. Clients pick a random suffix on read; load spreads across N shards instead of one.
In production these three defences usually combine. In-process LRU catches the bulk before Redis ever sees it. Single-flight protects the cache miss path. Key fanout protects the long-running hot keys that cannot be absorbed by the LRU alone. A staff-level answer names all three and explains which one solves which sub-problem.
Cache stampede on TTL expiry
When a hot key expires, every concurrent reader misses simultaneously. The origin gets the full uncached load until the cache repopulates.
Cache expiration is binary. One moment the entry is there; the next millisecond it is gone. For a key serving 100K reads per second, the moment of expiry is a synchronised cliff: 100K concurrent misses, all racing to refetch from the database. The database — sized for the normal cache-miss rate of maybe 1K queries per second — sees a 100x spike, queues fill, queries time out, the cache cannot repopulate, and you have a self-inflicted DDoS.
When a hot key expires, every concurrent reader misses simultaneously. The origin gets the full uncached load until the cache repopulates.
The naive fix is a distributed lock: only one process gets to refresh, others wait. This works in theory and breaks in practice. If the lock holder is slow or dies mid-refresh, every waiter sits idle until lock timeout, and your tail latency becomes pathological. Lock-based stampede defence requires careful timeout tuning, fallback behaviour for lock acquisition failure, and a story for what happens when the lock service itself is degraded. It is the classic fragile-by-default solution.
The better answer is probabilistic early refresh (the XFetch algorithm, from Vattani et al., 2015). Instead of waiting for TTL to expire, every read at fetch time computes a random refresh probability that rises as the entry approaches expiry. At 50% of TTL the chance is roughly 1%; at 90% of TTL it climbs to ~12%; at 99% it might be 40%. The stampede flattens into a smear of background refreshes spread across the last 10-20% of the TTL window. Most readers still get cached data; a handful of unlucky ones trigger a refresh; the database sees a steady trickle instead of a synchronised cliff.
Refresh probability rises as TTL approaches. The stampede flattens into a smear of background refreshes; most readers still hit a fresh cache.
For your most critical keys — the ones whose stampede would take you down — you can go further with active background refresh. A worker continuously refreshes the entry well before TTL, so user reads never trigger a recompute at all. The cost is wasted work refreshing data that may not be requested; the win is that user-visible latency is never tied to recompute cost. You only do this for a small hand-picked set of keys (the homepage, the top product, the celebrity profile), not for everything.
The defences compose: probabilistic early refresh as the default for all keys, plus active background refresh for the small hot tail you cannot afford to let stampede. The distributed-lock approach is rarely the right answer in production — name it in the interview to show you know about it, then explain why you would prefer the probabilistic approach.
Cache invalidation: versioned keys and the deleted-items cache
Writer increments version atomically with the row. Readers compose the key from the current version, so old entries are simply unreachable.
"There are only two hard things in computer science: cache invalidation and naming things." — Phil Karlton. He was not joking. The naive approach is "DEL the cache entry on every write", and it has three failure modes that you will eventually hit. First, the application has many write paths and one of them forgets to invalidate. Second, the data is cached in multiple places (app cache, edge cache, browser cache) and the DELETE only fires on one of them. Third — the killer — there is a race window between the DELETE and the next write where a concurrent reader can fetch stale data, populate the cache with it, and undo your invalidation.
The race goes like this: writer commits the DB update at T=0; writer DELETEs the cache at T=1ms. But between T=0 and T=1ms a reader checks the cache, sees the old entry (still there, since DELETE has not landed yet), and... actually that is fine, reader gets stale-by-1ms data. The real bug is more subtle: writer commits at T=0; reader misses cache at T=0.5ms (slightly before writer DELETE); reader queries DB, gets the new value, populates cache. Writer DELETE lands at T=1ms — and removes the new value. Now another reader misses, queries DB, sees the new value, populates cache with it. OK, the system recovers. But under contention this can loop, and you can get sustained windows of stale data.
The cleanest production answer for entity-level data is cache versioning. Each row has a version column, incremented atomically with every update inside the same transaction. The cache key includes the version: event:7:v42. When the writer increments to v43 and commits, every reader that fetches the row's version next will compose a key for v43, miss, and populate it cleanly. The old v42 entry is never deleted — it is just unreachable, and falls out of the cache via TTL eventually.
Writer increments version atomically with the row. Readers compose the key from the current version, so old entries are simply unreachable.
The beauty of versioning is that it sidesteps every race condition. A "late writer" cannot stomp the new version because version increments are atomic in the database. There is no partial-invalidation problem because nothing is being deleted. CDN edges and browser caches naturally expire stale versions because the version is part of the URL or response key. The cost is two cache lookups per read (one for the version, one for the data) and accumulated stale entries — both manageable with a small versions cache and reasonable TTLs.
Versioning shines for single-entity caches (user profiles, product details, event metadata). It does not directly solve computed cache invalidation — feeds, search results, aggregates that combine many entities. For those, a complementary trick is the deleted-items cache: a small, fast-changing set of "things that were just removed/hidden/changed", consulted as a filter when serving cached aggregates. A user blocks another user; the blocked user's ID goes into a deleted-items cache for that user's feed; the cached feed is served as-is, with the blocked user's posts filtered out at render time. The cached aggregate stays valid; the deleted-items cache is small enough to stay fast.
Combining versioned keys for entities and a deleted-items cache for aggregates handles the vast majority of real-world invalidation problems without ever needing a hard "purge all caches" button.
Read-your-writes without breaking the cache
After a write, pin that user's reads to the primary for a short window so they never read a stale replica.
User submits a comment, the page refreshes, and... the comment is not there. They click again. Still not there. Third click, it shows up. The user concludes your product is broken; the engineer looks at the code and shrugs because the comment did write successfully. The bug is replication lag colliding with the read-your-writes expectation.
The user model is simple: "if I just did this, I should see it immediately." The system is more complicated: you wrote to the primary, the cache was invalidated, but the reader landed on a replica that is 200ms behind the primary, and the cache has not yet been repopulated either, and so the reader sees a stale view. Multiply by a thousand users and you have a steady stream of "your product is broken" support tickets.
There are three production-tested fixes, each with tradeoffs.
Primary pinning is the most common. After a user writes, set a short-lived flag in the user's session ("wrote-recently") with an expiry slightly longer than your replication lag — say 5 seconds. While the flag is set, route that user's reads to the primary, bypassing replicas. After the flag expires, normal replica routing resumes. This is precise (only the user who just wrote pays the primary-load cost) and bounded (the window is short by design). The downside is bookkeeping: you need a session, a flag, and a routing layer that knows about it.
After a write, pin that user's reads to the primary for a short window so they never read a stale replica.
Write-through cache is more invasive but elegant when it fits. Instead of "write DB, invalidate cache", the write path goes "write DB, update cache to the new value". The reader hits the cache, sees the new value, and never queries a replica that might be stale. This works beautifully for entity reads and breaks down for aggregate reads where the cache cannot be updated cleanly from a single write. Most production systems use it for the user's own profile and recent-activity reads, and fall back to primary pinning for everything else.
Quorum reads at the database level — semi-sync replication, Aurora's read-after-write consistency, Spanner's external consistency — push the problem down to the data tier. The application reads from any replica, but the replica is guaranteed to have applied the write before serving the read. The cost is write latency: the primary cannot ack the write until enough replicas have applied it. Some databases support this natively; many do not. Where it works, it is the cleanest answer because the application code does not need to think about lag at all.
In an interview, name primary pinning as the default, mention write-through for entity caches, and reach for quorum reads only when the data store gives them to you for free.
CDN with logged-in content: the shell-and-slot trick
Render the cacheable shell at the edge; fetch only the small personalised slot from origin. The bulk of the page hits the CDN.
The first 80% of CDN benefit comes from caching publicly shareable URLs — product pages, blog posts, public profiles. The next 15% comes from the trick that lets you cache logged-in pages too: separate the page into a shared shell and a personalised slot, and cache only the shell.
Render the cacheable shell at the edge; fetch only the small personalised slot from origin. The bulk of the page hits the CDN.
The shell is everything that does not change per user: the layout HTML, the navigation, the product details, the comments, the article body. It can be cached at the edge with a long TTL because nothing about it depends on who is requesting it. The slot is the small piece that does vary per user: their name in the header, their saved-items count, their cart, their recommendation rail. The slot is fetched separately, either by a small client-side request after the shell renders or — better — by an edge worker that stitches the slot into the shell at the PoP before sending the response.
Done well, you get edge hit rates on logged-in pages that approach what you see on anonymous pages. The shell hits cache; the slot hits origin; users see the page in 30ms instead of 300ms.
There are two failure modes to watch for. First, leaky personalisation in the shell: a developer adds the user's name to a shell template "just for this one feature", and now every shell request varies by user, and the cache hit rate collapses to zero overnight. Mitigations: lint rules, code review for shell templates, and a hit-rate alert on any shell URL whose hit rate drops below 60%. Second, slot fan-out: every page renders a slot, and the slot endpoint becomes the new bottleneck. Mitigations: cache the slot itself (very short TTL, varied by user ID), batch multiple slots into one origin call, or push the slot data into the JWT/session cookie if it is small enough.
The pattern generalises beyond CDN. The same shell-and-slot decomposition applies to GraphQL persisted queries (cache the query response, fetch the user-specific fragment separately), mobile API responses (cache the public payload, fetch the personal extras), and even SSR streaming (flush the shell early, await the slot).
Once the team is fluent with shell-and-slot, the operational rule becomes: if it is not a slot, it does not vary. Anything that varies must be a slot, with its own cache key, its own TTL, and its own latency budget. That discipline is what keeps a CDN-dominant architecture from drifting back to a CDN-bypassed one as the product grows.
Case studies
Memcache at planet scale (Nishtala et al., NSDI 2013)
Facebook's Memcache deployment is the canonical reference for read-heavy at extreme scale. By 2013 the published numbers had a single regional cluster handling billions of requests per second with a 99.9th percentile under a few milliseconds. The architecture is layered exactly the way every read-heavy system eventually becomes: per-cluster mcrouter proxies in front of memcached pools, regional pools shared across clusters, and a write-aware invalidation pipeline (McSqueal) that reads MySQL binlogs and broadcasts invalidations to every mcrouter.
The two ideas worth memorising are leases and the gutter pool. Leases are Facebook's stampede defence and their thundering-herd defence rolled together. When a client misses, mcrouter hands it a token; only the lease-holder is allowed to populate the cache, while concurrent missers get a "wait briefly" signal. If the lease holder dies, the lease times out and another client gets a fresh one. This is request coalescing implemented at the proxy tier — the application code does not have to know about it. The gutter pool handles cache failures: if a memcached node goes down, mcrouter routes its keys to a small "gutter" pool with a short TTL, so a server failure does not become a database thundering herd. The gutter is sized for the miss rate during a node failure, not the full hit rate, which is why it can be much smaller than the primary pool.
The invalidation pipeline is the part most teams skip and regret. McSqueal tails the MySQL binlog, parses out which keys are affected by each write, and broadcasts invalidations to every regional mcrouter. The application code does not invalidate the cache directly — the binlog is the source of truth, and missed invalidations from buggy app paths are impossible by construction. The cost is a build-out you only justify when the invalidation correctness problem becomes existential.
Takeaway
At extreme scale the cache invalidation problem is solved at the data tier (binlog → broadcast), not the app tier. Stampede and node-failure defences move into the proxy layer (leases, gutter pool) so application code stays simple.
EVCache — multi-region replicated cache
Netflix's EVCache is a Memcached-based distributed cache with two design choices that set it apart from a vanilla deployment: per-region replication and active cache warming. The published metrics put it in the 200+ billion ops/day range across the fleet, with sub-millisecond p99 read latency.
Per-region replication means every cached item is stored on multiple nodes within a region, behind a client-side library that knows the replica set. A client read goes to one replica; on miss it tries the next; only after exhausting the in-region replica set does it fall back to the underlying datastore (Cassandra in many cases). The benefit is that a cache node failure does not become a cache miss for the keys it owned — peers serve them. The cost is N× memory per item, but cache RAM is cheap relative to the database CPU you save by not having to refill cold keys on every node failure.
Cache warming is the operational habit that separates Netflix's cache layer from most companies'. When a new node joins or an existing node restarts, an internal "warmer" service replays a recent slice of the production traffic against the new node — copying values from peers — so the node enters service already populated. Without warming, a cold node taking traffic would multiply database load by every request its peers normally would have absorbed; with warming, a deploy or a node bounce is invisible to the database. Netflix has talked publicly about how this single discipline lets them roll cache fleets aggressively without operational drama.
The architectural lesson generalises: at sufficient scale, cache state itself becomes operational state worth managing. Backups, warming, replication, and failover for the cache are not optional infrastructure; they are first-class concerns. Teams that treat the cache as ephemeral end up with mysterious database CPU spikes during otherwise routine deploys.
Takeaway
Cache state worth managing as first-class infrastructure: replicate within region, warm new nodes from peers, and treat cache deploys with the same care as DB deploys. Saves the database from every restart wave.
URL shortener — the extreme read:write ratio
A URL shortener is the purest read-heavy workload that exists. A user creates a short link once. The link is then redirected on every click — by their followers, by retweets of their tweet, by anyone who finds the link in a years-old article. The lifetime read:write ratio for a single popular link easily reaches the millions. Across the corpus of all links, Bitly publicly cited handling on the order of billions of clicks per month with a tiny write rate by comparison.
The architectural conclusions follow directly from the ratio. First, the redirect path is essentially always cached. Short codes map to long URLs, and the mapping is immutable for the lifetime of the link — you never need to invalidate. A simple Redis lookup with effectively infinite TTL handles 99.99% of redirect traffic. The database is consulted only on the rare cache miss for a long-tail link nobody has clicked in months, plus on the create path for new links. Even a modest Redis cluster with mostly-permanent caching dominates a database with no caching here.
Second, the CDN is doing a huge amount of the work. A short link clicked from a popular Twitter post will see the same URL fetched millions of times within minutes. CDNs cache 301/302 redirects natively, and with appropriate cache headers (Cache-Control: public, max-age=86400), edge nodes absorb the bulk of redirect traffic without ever reaching origin. The origin sees per-link traffic only on first click in a region or after edge expiry — a tiny fraction of the actual click volume.
Third, the analytics path is decoupled from the redirect path. Every redirect emits a click event into a queue (Kafka, similar). The redirect itself does not wait for analytics — it serves the 301 immediately and emits the event fire-and-forget. Analytics aggregation happens asynchronously: a stream processor consumes the events, updates per-link counters, and writes them to a separate analytics store. The redirect path stays a single Redis lookup; click counts converge in seconds without ever blocking the user.
What you learn from Bitly is the discipline of finding the immutable kernel. The short-code-to-URL mapping is immutable; cache it forever. The click count is mutable but eventually consistent; aggregate it asynchronously. The redirect is latency-critical; serve it from cache or CDN. Each piece gets its own treatment based on its own consistency and latency requirements. That decomposition is what makes a 10000:1 read:write ratio look easy.
Takeaway
When read:write is extreme, find the immutable kernel and cache it forever. Decouple analytics from the user-facing path. Push as much as possible to CDN.
Decision levers
Cache technology — Redis vs Memcached vs in-process
Redis is the default for most teams: data structures, TTL, pub/sub for invalidation, persistence options, mature client libraries. Memcached is brutally simple — strings only, fixed TTLs, no persistence — and shines when you want a pure cache and nothing else (Facebook's deployment is the canonical case). In-process LRU is not a replacement for either; it is a fronting tier. Use it for the top 10K-100K hot keys per app server. The combination "in-proc LRU on top of shared Redis on top of a DB" is what most production read paths actually look like.
Staleness budget
Name a number from the requirements. "Five-minute stale product card" or "thirty-second stale price" or "one-second stale inventory". This number drives TTL, cache pattern (write-through vs write-behind vs cache-aside), and whether you need read-your-writes pinning. Without naming it, you will have arguments about TTL forever; with it, the answer is mechanical.
Invalidation strategy
Three real options: TTL-only (sloppy but self-healing — fine for short TTLs), app-level invalidation on write paths (correct in theory, fragile in practice — easy to miss a path), and CDC-driven invalidation (binlog tail → broadcast → invalidate, expensive but bulletproof). Production-grade: combine TTL as a safety net with app-level for known paths, and reach for CDC when invalidation correctness becomes existential.
Read-replica consistency
Pick a story for read-your-writes and write it down. (1) Sticky-read to primary for N seconds after a user's write — most common, low cost. (2) Write-through cache so the user reads from the cache directly — cleanest for entity reads. (3) Quorum / semi-sync replication if the data store gives it to you — best when available. (4) Accept the lag for low-stakes data (analytics dashboards) — fine, just be explicit.
Hot-key handling
A read-heavy system with no hot-key plan is a future incident. Layered defence: in-process LRU catches most hot-key traffic before it hits the cache; single-flight handles miss-time stampedes; key fanout (replicate one key to N suffixes) handles celebrity-tier sustained hot keys. Surface "top-N hot keys" in your cache metrics — if you cannot list them, you do not have the visibility you need.
Failure modes
Every deploy flushes in-process LRU; if shared Redis also gets restarted or rolled, the primary is suddenly taking 100% of traffic. Fixes: rolling restarts (never restart the whole cache fleet at once), pre-warm new nodes from peers (Netflix EVCache style), warm-up traffic before shifting real traffic.
A viral item lands all traffic on one Redis shard while peers idle. Fix stack: in-process LRU absorbs most hot-key reads before they reach Redis; key fanout (suffixed copies) when the LRU window is too small; per-cluster replicas of the hot tier.
TTL fires, thousands of concurrent readers miss simultaneously, primary melts. Fixes: probabilistic early refresh as the default; active background refresh for the small set of keys you cannot afford to let stampede; single-flight at the application layer for the miss path.
A new write path forgot to call DEL on the cache; users see stale data until TTL fires. Fixes: short TTLs as a safety net; versioned cache keys so old entries are unreachable; CDC-driven invalidation when correctness matters more than infra cost.
User writes a comment, page refresh shows nothing, user clicks again, still nothing. Fixes: primary pinning for N seconds after a user write; write-through cache for entity reads; quorum reads when supported.
Someone added a query param or a vary header that fragments the cache key per user. Hit rate craters. Fixes: lint rule for cacheable URLs; alert on hit-rate drops > 10%; cookie/header allowlist at the edge; shell-and-slot for personalised content.
Decision table
Picking the right scaling move for your read pressure.
| Approach | Best for | Hit-rate ceiling | Staleness window | Operational cost |
|---|---|---|---|---|
| Index + denormalise inside DB | Single-table reads, mid traffic | n/a (query cost) | 0 (consistent) | Low |
| Read replicas | Varied query workloads, < 100K QPS | n/a | Replication lag (ms-s) | Medium |
| Cache-aside (Redis) | Hot working set, predictable keys | 90-95% | TTL window | Medium |
| CDN / edge | Public, shareable URLs | 70-95% | CDN TTL (s-min) | Low (managed) |
| Materialised read model | Complex joins, high read amplification | n/a | CDC lag (s) | High |
| Hot-key fanout | Celebrity-tier keys | Approaches 100% of cluster | TTL × N copies | Medium |
Worked example
Setup. The interviewer wants the read path for an event page on Ticketmaster — Taylor Swift Eras Tour, tickets going on sale at noon. Two minutes before sale, the page is being refreshed by tens of millions of fans. After the sale opens, the page must show real-time seat availability while every fan is hammering the refresh button.
The first move is to recognise that this is not one read pattern but four, each with its own cacheability and staleness budget.
- 1Static event metadata — title, date, venue, headliner image. Cacheable for hours; never changes. Goes into CDN with long TTL.
- 2Seat-map layout — the visual grid of the venue, which seat is which section. Cacheable for hours; the venue layout is fixed. CDN with long TTL.
- 3Per-seat availability — which seats are sold. Cannot be cached at the CDN; changes every second; must be near-real-time.
- 4Personalised state — user's saved seats, payment methods, ticket-limit eligibility. Per-user; not shareable.
This decomposition is what unlocks the rest. The first two pieces are 80% of the page weight and can sit at the edge essentially forever. The third piece is small (a sparse bitmap of which seats are taken) and has tight freshness requirements. The fourth piece is a small per-user fetch.
Architecture sketch. I would put a CDN in front of everything, with the event page rendered as a shared shell containing the metadata and seat-map layout. The shell hits CDN with a 60-second TTL — long enough to absorb the rush, short enough that any errata get fixed quickly. Inside the shell, two slots: one fetches per-seat availability from a regional Redis cluster; the other fetches user state from origin.
The per-seat availability cache is the interesting one. It is updated by the seat-reservation service every time a seat changes state, with a 1-2 second TTL as a safety net. Reads are extremely concentrated — every fan refreshing the same event is hitting the same handful of keys. This is a textbook hot-key scenario. Defence in depth: in-process LRU on the front-end app servers with a 1-second TTL absorbs most of the traffic; the Redis tier uses key fanout — event:123:seats:0 through event:123:seats:9 — so reads spread across 10 shards.
For read-your-writes: a fan who just clicked a seat and got a "selected" confirmation must see that seat in their selection. Two paths. The selection state for that user's session lives in a per-user Redis key, populated on the click. The global seat availability gets updated by the reservation service and propagates to everyone within 1-2 seconds. The user does not wait on the global propagation — they see their own selection from their session state immediately.
The write path stays separate. Actual seat purchase is a write — a different code path that goes straight to the primary database with a strong consistency guarantee, taking a row-level lock on the seat. The write path is sized for the actual purchase rate (a few thousand per second when the sale opens), not the read rate (tens of millions per second of refreshes). This is the most important architectural move: refusing to put the read pressure on the write path.
What breaks. The thing that breaks first when this goes live is the hot-key cache. If the LRU window is too short or the fanout is too narrow, the Redis cluster gets a single-shard CPU spike. Mitigation: bump the LRU window to 2 seconds (still inside the freshness budget), bump fanout from 10 to 20 keys, and pre-warm the cache 5 minutes before sale opens by replaying yesterday's traffic shape against the new event ID.
The result. At the topology above, the CDN absorbs 80%+ of total page-weight traffic. The Redis cluster sees the per-seat-availability load, with the hot-key defences keeping any one shard at 30% CPU. The primary database sees only the actual purchase writes plus rare cache misses on long-tail events nobody is interested in. Every fan gets a sub-100ms page load globally, and the primary survives the sale.
Interview playbook
When it comes up
- The interviewer specifies a read:write ratio of 10:1 or higher
- The prompt describes content that is shared across many users (URL shortener, news feed, product catalog)
- You sketch a HLD and they ask "how do you scale the read path?"
- A capacity-estimation step shows reads at 100K+ QPS while writes are at 1-10K
- They mention "thundering herd", "hot key", "cache stampede", or "celebrity tier" — all read-heavy probes
Order of reveal
- 1Recognise the ratio out loud. "This is read-heavy by roughly 100:1, so the architecture is going to be shaped by pushing reads as far from the primary as possible. The write path stays simple."
- 2Name the layered cascade with hit-rate targets. "CDN at the edge for shareable content — 70-95% hit rate. In-process LRU on app servers for hot keys — 50% of what gets through. Shared Redis for the working set — 90%+ hit rate. Read replicas for long-tail misses. Primary only for the residual ~1%."
- 3Pick a cache pattern and a staleness budget. "Cache-aside with a 5-minute TTL is the default. The staleness budget comes from the requirement; let me restate it: users tolerate roughly N-second stale reads on this data."
- 4Pre-emptively name stampede defence. "For hot keys I would use probabilistic early refresh so the TTL expiry does not synchronise; plus single-flight at the app layer for the miss path. For the celebrity tier, in-process LRU absorbs the bulk before Redis sees it."
- 5Address invalidation correctness. "Two layers: explicit invalidation on every write path, plus a short TTL as a safety net. For correctness-critical data I would use versioned cache keys so race conditions cannot leave stale entries reachable."
- 6Address read-your-writes. "After a user write I pin their reads to the primary for ~5 seconds via a session flag. Their cache entry gets updated through write-through so they see the new value immediately."
- 7Name the new bottleneck. "At this topology the bottleneck moves from DB CPU to cache memory and replication lag. I track hit rate per layer with alerts on a 10% drop — that is how I find the working set outgrowing the cache before users notice."
Signature phrases
- “Read-heavy is a ratio statement, not a throughput statement” — Frames the whole problem correctly in one sentence.
- “Hit-rate targets per layer, alerted on regression” — Signals you have operated this — most candidates do not even mention monitoring.
- “In-process LRU is a shield, not a replacement” — Distinguishes the four-tier topology from "just add Redis".
- “Staleness budget, then TTL” — Anchors cache decisions in requirements rather than vibes.
- “Probabilistic early refresh, not distributed locks” — Names the right stampede defence and rejects the fragile one.
- “Versioned cache keys sidestep the invalidation race” — Shows you know the subtle correctness bug that bites real systems.
Likely follow-ups
?“What if the cache fleet goes down?”Reveal
Three layers of defence. First, rolling restarts only — never restart more than one shard at a time, so the surviving shards keep absorbing the bulk of traffic. Second, gutter pool (the Facebook approach): when a shard fails, route its keys to a small fallback pool with a short TTL; the gutter is sized for the miss rate during failure, not the hit rate, so it can be much smaller. Third, graceful degradation at the app layer: if Redis is unreachable entirely, the app falls back to read replicas, with circuit breakers so it does not pile on the database when Redis is recovering. The thing you absolutely never do is "Redis is down so let everything hit the primary" — that is how a cache outage becomes a database outage becomes a full outage.
?“You said 90%+ Redis hit rate. What if it is 70%?”Reveal
That is the signal that your working set has outgrown your cache RAM. Two responses. Short term, scale the cache horizontally — more shards, more total RAM — or evict more aggressively (LRU is fine, but LFU may keep the long tail in cache better depending on access pattern). Longer term, examine why the working set is growing: is it actual data growth (more users, more items) or is it access-pattern fragmentation (someone added a new query that misses a different long tail)? Profile the cache misses by key prefix and you will usually find one feature is responsible for half of them. Sometimes the right fix is to cache that feature differently (a separate Redis cluster, a different TTL, a different key structure) rather than to scale the main cache.
?“Why probabilistic refresh instead of a distributed lock?”Reveal
Distributed locks couple the stampede defence to the availability of the lock service. If the lock service is slow or partitioned, your cache refresh stalls and every reader times out. The probabilistic approach is stateless and decentralised: each reader independently rolls a die based on age-vs-TTL, and the math guarantees that with N concurrent readers you get a small handful of refreshes spread across the last 10-20% of the TTL window, not a synchronised cliff. There is no shared state, no lock service to operate, and the failure mode is "one extra DB query" rather than "everyone waits". I would still reach for a distributed lock for the write path on a critical key (singleton refresh of a long-running aggregate), but for cache-aside refresh, probabilistic wins.
?“How do you handle a key that is read 1M times per second?”Reveal
Layered defence. First, in-process LRU on the app fleet — assume 1000 app servers, a 5-second LRU TTL: that is one Redis read per server per 5 seconds, so Redis sees 200 reads per second for that key, not 1M. Second, key fanout at Redis if the in-proc LRU is not enough or the freshness window is shorter — store the value under key:0 through key:9, clients pick a random suffix on read, load spreads across 10 shards. Third, replicate the hot tier: a small "hot pool" of Redis replicas dedicated to the celebrity-tier keys, with the application reading from a random replica. The first two are usually enough; the third is for sustained extreme cases like Twitter's most-followed accounts.
?“How do CDN cache invalidation and Redis cache invalidation differ?”Reveal
Redis invalidation is synchronous and within your control — a DEL or a versioned key update lands in milliseconds. CDN invalidation is asynchronous, eventually consistent across PoPs, and partly outside your control: a purge API call propagates to global edges over seconds to minutes, depending on the CDN. The implication: do not rely on CDN purge for correctness-critical updates. For data that must invalidate immediately, either do not edge-cache it (hit origin every time, cached only at Redis), or use cache versioning at the URL level — /event/7?v=43 — so a write changes the URL and old edges become naturally unreachable. The shell-and-slot pattern often helps: cache the shell long, push the volatile bits into the slot which has a much shorter edge TTL or hits origin every request.
?“When does this pattern stop being enough?”Reveal
Three signals. First, the working set itself is too big to cache cost-effectively: when adding more cache RAM costs more than just sharding the database, you have crossed the line. Second, writes start dominating because some new feature changed the workload — at that point you are no longer read-heavy, and the architecture should shape around writes (LSM-friendly stores, append-only logs, CQRS). Third, personalisation defeats sharing: if every user gets a unique answer to the same query, no cache layer above the per-user store can help. At that point you are into precomputation patterns (fan-out-on-write feeds, materialised per-user views) — which is a different pattern, not a different scale of this one.
Code snippets
import asyncio
from typing import Awaitable, Callable, TypeVar
T = TypeVar("T")
class SingleFlight:
"""Collapse concurrent loads of the same key into one call.
On a hot-key cache miss, a thousand concurrent readers all
`get(K)` simultaneously. Without this, a thousand parallel DB
queries fire. With this, one query fires; the other 999 await
the same future.
"""
def __init__(self) -> None:
self._inflight: dict[str, asyncio.Future[T]] = {}
async def do(self, key: str, loader: Callable[[], Awaitable[T]]) -> T:
existing = self._inflight.get(key)
if existing is not None:
return await existing # join the in-flight load
future: asyncio.Future[T] = asyncio.get_event_loop().create_future()
self._inflight[key] = future
try:
value = await loader()
future.set_result(value)
return value
except Exception as exc:
future.set_exception(exc)
raise
finally:
del self._inflight[key] # leave nothing for next caller
# Usage in a cache-aside read:
single = SingleFlight()
async def get_profile(uid: str) -> dict:
cached = await redis.get(f"profile:{uid}")
if cached is not None:
return cached
return await single.do(f"profile:{uid}", lambda: load_and_populate(uid))import math, random, time
# delta = how long the loader takes (seconds, EWMA)
# beta = aggressiveness; 1.0 is the original paper, 1.5-2.0 in practice
def should_refresh_early(
fetched_at: float, ttl: float, delta: float, beta: float = 1.0,
) -> bool:
"""Returns True with rising probability as the entry approaches expiry.
XFetch from Vattani et al. (2015). Each reader independently rolls a
weighted coin; the math guarantees that across many concurrent
readers, a small handful trigger a refresh in the last ~10-20% of
the TTL window — flattening the stampede instead of synchronising it.
"""
now = time.time()
expires_at = fetched_at + ttl
# rand in (0, 1); -ln(r) is exponentially distributed
rand = random.random()
# closer to expiry => bigger left-hand side; once it passes
# (expires_at - now), trigger.
return now - delta * beta * math.log(rand) >= expires_at
# Usage in a cache-aside read with metadata stored alongside the value:
async def get_with_xfetch(key: str) -> dict:
entry = await redis.hgetall(key) # value, fetched_at, delta, ttl
if entry and not should_refresh_early(
float(entry["fetched_at"]),
float(entry["ttl"]),
float(entry["delta"]),
):
return decode(entry["value"]) # serve cached, no refresh
# Either entry is missing, or we lost the early-refresh dice roll.
# Single-flight protects against concurrent refreshers on this server.
return await single.do(key, lambda: refresh_into_cache(key))import random
HOT_KEY_FANOUT = 10 # 10 copies for celebrity-tier keys
def hot_key(base: str, fanout: int = HOT_KEY_FANOUT) -> str:
"""Pick a random suffix on read; spreads load across `fanout` shards."""
return f"{base}:{random.randrange(fanout)}"
# READ — random suffix, normal cache-aside afterwards.
async def get_celebrity_profile(uid: str) -> dict:
base = f"profile:{uid}"
cached = await redis.get(hot_key(base))
if cached is not None:
return cached
# On miss, populate one suffix; over time all suffixes warm up.
return await single.do(base, lambda: load_and_populate_all_suffixes(base, uid))
# WRITE — invalidate ALL suffixes; failing to do this leaves stale copies.
async def invalidate_celebrity_profile(uid: str) -> None:
base = f"profile:{uid}"
pipe = redis.pipeline()
for i in range(HOT_KEY_FANOUT):
pipe.delete(f"{base}:{i}")
await pipe.execute()
# Note: only do this for keys you have IDENTIFIED as hot via metrics.
# Applying fanout to every key is just N x memory cost for no benefit.# The version is a column on the row, incremented atomically with every write.
# Readers compose the cache key from the version, so old entries are never
# seen again — they fall out via TTL eventually, but no race window matters.
async def update_event(event_id: int, fields: dict) -> None:
async with db.transaction():
await db.execute(
"UPDATE events SET name=$1, venue=$2, version=version+1 "
"WHERE id=$3",
fields["name"], fields["venue"], event_id,
)
# No DEL on cache. The old version's key becomes unreachable.
async def get_event(event_id: int) -> dict:
# Version lookup itself is cached with a short TTL — usually <1ms.
version = await get_event_version(event_id)
key = f"event:{event_id}:v{version}"
cached = await redis.get(key)
if cached is not None:
return cached
row = await db.fetchrow("SELECT * FROM events WHERE id=$1", event_id)
await redis.setex(key, 600, encode(row)) # 10 min TTL safety net
return row
async def get_event_version(event_id: int) -> int:
cached = await redis.get(f"event:{event_id}:ver")
if cached is not None:
return int(cached)
v = await db.fetchval("SELECT version FROM events WHERE id=$1", event_id)
await redis.setex(f"event:{event_id}:ver", 5, str(v)) # short TTL on the pointer
return v-- Normalised OLTP store (writers update this).
CREATE TABLE orders (
id BIGSERIAL PRIMARY KEY,
user_id BIGINT NOT NULL REFERENCES users(id),
placed_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE TABLE order_items (
order_id BIGINT NOT NULL REFERENCES orders(id),
product_id BIGINT NOT NULL REFERENCES products(id),
qty INTEGER NOT NULL,
unit_price NUMERIC(10,2) NOT NULL
);
-- Denormalised read model (readers read THIS, no joins).
CREATE TABLE order_summary (
order_id BIGINT PRIMARY KEY,
user_name TEXT NOT NULL,
placed_at TIMESTAMPTZ NOT NULL,
line_items JSONB NOT NULL,
grand_total NUMERIC(12,2) NOT NULL
);
-- Maintained by a CDC consumer (Debezium / Postgres logical replication).
-- On every (orders, order_items, products, users) change, recompute the
-- affected summary row and UPSERT it. Reads are then a single PK lookup:
SELECT line_items, grand_total
FROM order_summary
WHERE order_id = 12345;
-- Index Scan using order_summary_pkey (cost=0.42..8.44 rows=1)Drills
A read:write ratio of 1:1 with 50K reads/sec — read-heavy or not?Reveal
Not read-heavy in the architectural sense. The ratio is what matters; 50K writes/sec at parity with reads means the architecture has to shape around writes too — LSM-friendly stores, write-optimised partitioning, careful indexing. You would still cache hot reads, but the dominant scaling work is on the write path, not the read path. Read-heavy starts at roughly 10:1; below that, "balanced" is a more useful frame.
Your Redis hit rate dropped from 92% to 68% overnight. What changed and what do you check?Reveal
Three likely causes. (1) Working set growth: more data, working set no longer fits in cache RAM. Check cache memory usage and eviction rate — high eviction with hit-rate drop confirms it. Fix: scale Redis horizontally or add an LFU eviction policy. (2) Access-pattern fragmentation: a new feature or query is missing a different long tail. Profile cache misses by key prefix; one prefix usually accounts for most of the new misses. Fix: cache the new feature separately or rethink the key structure. (3) TTL change: someone shortened a TTL or added an aggressive invalidation pattern. Check recent config and code changes in the cache layer. Cause #1 is most common at sustained growth; #2 is most common after a feature launch; #3 is most common immediately after a deploy.
Explain why a distributed lock is the wrong stampede defence.Reveal
Distributed locks couple the cache refresh path to the availability of the lock service. If the lock service is slow, partitioned, or has a correctness bug (Redlock under network partitions, for example), your refresh stalls. Every reader waiting on the lock times out. Worse, the failure mode is "thousands of users waiting for a lock that will never be granted", which is operationally hard to recover from. Probabilistic early refresh has no shared state and degrades gracefully — at worst, you get a few extra DB queries; nobody waits. The exception is when you genuinely need a singleton refresh of a uniquely expensive aggregate (a recommendation model warm-up, say) — there a lock is correct because the cost of N parallel refreshes is far worse than serialising on a lock service.
A user posts a comment, refreshes, and does not see it. They are reading from a read replica with 200ms lag. Three production fixes in order of preference.Reveal
(1) Primary pinning — set a "wrote-recently" flag in the user's session for 5 seconds (twice the worst-case lag); during that window route their reads to primary. Cheap, precise, no architectural change. (2) Write-through cache — when the user posts, update the cache directly with the new value as well as writing to the DB. The user's next read hits the cache, not a replica. Clean for entity reads (their own profile, their own posts) but does not generalise to aggregate reads where cache update from a single write is awkward. (3) Quorum / semi-sync replication — at the data store level, configure the primary to ack writes only after N replicas have applied them. Best when supported (Aurora, Spanner, semi-sync MySQL); otherwise too costly to retrofit. Most teams use #1 as the default and add #2 for the user's own profile/feed reads.
Your celebrity profile gets 1M req/sec. Walk through the layered defence.Reveal
Layer 1 — In-process LRU. Assume 1000 app servers; LRU TTL = 5 seconds. Each server fetches once per 5 seconds. Redis traffic for that key: 1000 / 5 = 200 reads/sec, regardless of the user-facing 1M/sec. Layer 2 — Key fanout. If the LRU window is too long for the freshness budget, store the value under profile:taylor:0 through :9. Reads pick a random suffix; load spreads across 10 Redis shards, 100K reads/sec each, comfortably below a single-shard budget. Layer 3 — Hot replica pool. For sustained celebrity load, dedicate a small set of Redis replicas to the hot tier. The application library knows about the hot keys and reads from a random replica in the pool. Layer 4 — Cache the response itself, not the data. If the celebrity profile rendering is expensive, cache the rendered HTML/JSON, not just the underlying row. What you do NOT do: rely on consistent hashing to "spread" the hot key. By design, consistent hashing pins it to one shard. The whole point is to break that pinning intentionally with the layered defences above.
When does layering more cache stop being the right answer?Reveal
Three failure modes signal you have outgrown the pattern. (1) Personalisation defeats sharing. If every user gets a unique answer to the same query (heavily personalised feeds, per-user search ranking), no shared cache layer above the per-user store helps you. The answer becomes precomputation — fan-out on write to per-user inboxes — which is a different pattern (see fan-out). (2) Writes catch up to reads. New feature flips the ratio; you are no longer read-heavy. The architecture should shape around writes (LSM stores, append-only logs, CQRS). (3) Working set is unbounded. A long-tail access pattern with no temporal locality means every read is roughly a fresh read; cache hit rate stays low regardless of cache size. Solutions: bigger primaries, more replicas, sharding by access pattern. Adding more cache here is throwing money at a metric you cannot move.