Content recommendation
When to reach for this
Reach for this when…
- "Recommended for you" feeds
- Personalised home screens
- E-commerce cross-sell / up-sell
- "People you may know" / friend suggestions
- Music / video autoplay
Not really this pattern when…
- Small catalogue (<1000 items) — just popularity-sort
- Pure chronological feed (no personalisation)
- Regulatory prohibition on personalisation
Good vs bad answer
Interviewer probe
“Design a "Recommended for you" feed.”
Weak answer
"Train a neural network on click data and use it to recommend."
Strong answer
"Two-stage. Candidate generation blends channels: two-tower embedding ANN (top 500), collaborative filtering (top 200), trending + fresh items (top 100) — ~1000 candidates. Ranker is a GBDT scoring each candidate with features from a feature store: user 7-day activity, item age, item CTR, context (time, device). Feature store (Feast or in-house) serves the same features to training jobs and to the online ranker — no skew. Top 20 returned, with 10% epsilon-greedy swap-ins for exploration. Training data = logged impressions + clicks; retrain daily. Cold-start user: fall back to popular-in-segment. Cold-start item: content-based ANN from item metadata. Latency budget: 50 ms p99 — caps candidate count, batched feature fetch, warm cache."
Why it wins: it names the two stages explicitly, unifies training and serving features through a feature store, and covers exploration, cold start, and a latency budget.
Cheat sheet
- Two stages: candidate gen (cheap, many) + ranker (expensive, few).
- Always multi-channel candidate gen. No single source.
- Feature store: same code for train + serve.
- Start with GBDT. NN later.
- Always explore. 5–10% random.
- Retrain on a cadence matching drift.
- Cold-start plan for users AND items.
Core concept
The two-stage pattern dominates: candidate generation narrows millions of items to ~1000 quickly; ranking scores those with an expensive model and picks top K.
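The two-stage shape can be sketched in a few lines. This is a minimal illustration, not a production design: `recommend`, `retrieve`, and `score` are hypothetical names, and the toy scoring table stands in for a real ranking model.

```python
from typing import Callable


def recommend(
    user_id: str,
    retrieve: Callable[[str], list[str]],
    score: Callable[[str, str], float],
    k: int = 20,
    max_candidates: int = 1000,
) -> list[str]:
    """Two-stage recommendation: cheap retrieval, then expensive ranking."""
    # Stage 1: cheap retrieval, capped so the ranker never sees more than max_candidates.
    shortlist = retrieve(user_id)[:max_candidates]
    # Stage 2: the expensive scorer runs only on the shortlist.
    ranked = sorted(shortlist, key=lambda item_id: score(user_id, item_id), reverse=True)
    return ranked[:k]


# Toy usage with a hypothetical retrieval function and a hand-written score table.
toy_scores = {"a": 0.1, "b": 0.7, "c": 0.4, "d": 0.9}
top = recommend(
    "u1",
    retrieve=lambda user_id: ["a", "b", "c", "d"],
    score=lambda user_id, item_id: toy_scores[item_id],
    k=2,
)
print(top)  # ['d', 'b']
```

The cap on the shortlist is what keeps the cost of the ranker bounded regardless of catalogue size.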
Candidate generation techniques (cheap retrieval):
- Collaborative filtering / matrix factorisation — users who liked X also liked Y. Batch-computed; often via ALS or two-tower embeddings.
- Content-based — item embedding + user embedding; ANN (FAISS, ScaNN, Pinecone) for nearest neighbours. Millisecond retrieval on 100M+ items.
- Heuristic channels — "trending now", "new in your city", "friends of friends". Simple SQL/Redis aggregations. Always include a few.
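To make the embedding-retrieval channel concrete, here is an exact (brute-force) nearest-neighbour search in plain Python. It is only a stand-in: an ANN index such as FAISS or ScaNN approximates this result at far larger scale. The vectors and item names are invented for illustration.

```python
import heapq
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_items(user_vec: list[float], item_vecs: dict[str, list[float]], k: int) -> list[str]:
    """Exact top-k items by cosine similarity to the user embedding."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: cosine(user_vec, item_vecs[item_id]))


# Hypothetical 2-d embeddings: first axis "jazz-ness", second axis "rock-ness".
item_vecs = {
    "jazz_album": [0.9, 0.1],
    "rock_album": [0.1, 0.9],
    "fusion_album": [0.7, 0.7],
}
print(nearest_items([1.0, 0.0], item_vecs, k=2))  # ['jazz_album', 'fusion_album']
```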
Ranking (expensive scoring): a model (GBDT, DeepFM, two-tower) scores each candidate against rich features. Features come from a feature store (Feast, Tecton, in-house): user features (demographics, recent activity), item features (title, category, age, CTR), context features (time of day, device). Training labels come from implicit feedback (clicks, watch time).
The feedback loop: every served request generates logs (impressions, clicks, plays). Those logs are the training data for tomorrow's model. Retrain daily or weekly. Watch for feedback-loop pathologies: the model recommends what it's already recommending and stops exploring.
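One way to turn serving logs into training rows is to join impressions to clicks and label each impression. The sketch below assumes a log schema with `request_id` and `item_id` fields; that schema and the function name are assumptions for illustration.

```python
def label_impressions(impressions: list[dict], clicks: list[dict]) -> list[dict]:
    """Attach label=1 to impressions that were clicked and label=0 to the rest."""
    clicked = {(c["request_id"], c["item_id"]) for c in clicks}
    return [
        {**imp, "label": int((imp["request_id"], imp["item_id"]) in clicked)}
        for imp in impressions
    ]


# Hypothetical logs from a single served request.
impressions = [
    {"request_id": "r1", "user_id": "u1", "item_id": "a", "position": 0},
    {"request_id": "r1", "user_id": "u1", "item_id": "b", "position": 1},
]
clicks = [{"request_id": "r1", "item_id": "b"}]
rows = label_impressions(impressions, clicks)
print([row["label"] for row in rows])  # [0, 1]
```

Only items that were actually shown get a label, which is why the training data inherits whatever the previous system chose to display.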
Exploration matters: pure greedy → filter bubble. Inject a small fraction of out-of-distribution recommendations (epsilon-greedy, Thompson sampling) to keep the model learning.
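A minimal epsilon-greedy sketch, assuming exploration is applied per slot after ranking: each slot is replaced by a random item from an exploration pool with probability `epsilon`. The function name, the pool, and the per-slot scheme are illustrative choices, not the only way to do it.

```python
import random


def explore_slate(ranked, exploration_pool, k=20, epsilon=0.1, rng=None):
    """Return the top-k slate with each slot swapped for a random pool item with probability epsilon."""
    if rng is None:
        rng = random.Random()
    slate = list(ranked[:k])
    # Only items not already on the slate are eligible, so a swap never creates a duplicate.
    pool = [item for item in dict.fromkeys(exploration_pool) if item not in slate]
    for slot in range(len(slate)):
        if pool and rng.random() < epsilon:
            slate[slot] = pool.pop(rng.randrange(len(pool)))
    return slate


# Hypothetical ranked list and long-tail pool.
ranked = [f"top_{i}" for i in range(30)]
long_tail = [f"tail_{i}" for i in range(100)]
slate = explore_slate(ranked, long_tail, k=20, epsilon=0.1, rng=random.Random(7))
explored = [item for item in slate if item.startswith("tail_")]
print(len(slate), len(explored))
```

In practice the exploratory slots should be flagged in the impression log so their clicks can be analysed separately.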
Canonical examples
- YouTube watch-next
- Spotify Discover Weekly
- Amazon "customers also bought"
- Instagram / TikTok feed
- Netflix home row
Decision levers
Candidate-gen source mix
Never one source. Always: CF (or two-tower embeddings) + content-based ANN + trending + a small fresh-items channel. Weighted blend keeps the slate diverse.
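A quota-based blend is one simple way to mix channels. The sketch below takes up to a fixed number of items from each channel and drops duplicates across channels; the channel names and quotas are made up, and a real system might interleave or weight by score instead.

```python
def blend_channels(channel_results: dict[str, list[str]], quotas: dict[str, int]) -> list[str]:
    """Take up to `quota` new items from each channel, skipping items already taken."""
    seen: set[str] = set()
    blended: list[str] = []
    for name, quota in quotas.items():
        taken = 0
        for item_id in channel_results.get(name, []):
            if taken == quota:
                break
            if item_id not in seen:
                seen.add(item_id)
                blended.append(item_id)
                taken += 1
    return blended


# Hypothetical channel outputs, each already sorted best-first.
channel_results = {
    "two_tower": ["a", "b", "c"],
    "cf": ["b", "d"],
    "trending": ["e", "a", "f"],
}
blended = blend_channels(channel_results, {"two_tower": 2, "cf": 2, "trending": 1})
print(blended)  # ['a', 'b', 'd', 'e']
```

Because duplicates do not count against a quota, a channel that overlaps heavily with earlier ones still contributes its own distinct items.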
Offline vs online features
Batch features (user 30-day activity): computed daily, served from feature store. Real-time features (last 5 clicks): streamed via Kafka → feature store's online layer. Skew between offline training features and online serving features is the #1 cause of "model works in training, fails in prod".
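The usual defence against skew is a single feature definition that both the training job and the online ranker call. The sketch below is a plain-Python illustration of that idea, not the API of Feast or any other feature store; the event schema and feature names are assumptions.

```python
from datetime import datetime, timedelta, timezone


def user_features(events: list[dict], now: datetime) -> dict[str, float]:
    """Single feature definition shared by the training job and the online ranker."""
    week_ago = now - timedelta(days=7)
    # Only events inside [now - 7d, now] count, so training can pass a past
    # timestamp as `now` and get the values serving would have seen then.
    recent = [e for e in events if week_ago <= e["ts"] <= now]
    clicks = sum(1 for e in recent if e["type"] == "click")
    impressions = sum(1 for e in recent if e["type"] == "impression")
    return {
        "clicks_7d": float(clicks),
        "ctr_7d": clicks / impressions if impressions else 0.0,
    }


# Hypothetical event history for one user.
now = datetime(2024, 1, 10, tzinfo=timezone.utc)
events = [
    {"ts": now - timedelta(days=1), "type": "click"},
    {"ts": now - timedelta(days=1), "type": "impression"},
    {"ts": now - timedelta(days=2), "type": "impression"},
    {"ts": now - timedelta(days=10), "type": "click"},
]
serving_view = user_features(events, now)                        # online: now = request time
training_view = user_features(events, now - timedelta(days=9))   # offline: now = impression time
print(serving_view, training_view)
```

Passing the impression timestamp as `now` during training keeps future events out of the features, which is the point-in-time correctness a feature store is meant to enforce.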
Ranking model
Start with GBDT (LightGBM / XGBoost) — fast, interpretable, strong baseline. Graduate to NN (DeepFM, two-tower) when GBDT plateaus and you have the infra. Don't start with deep learning.
Exploration strategy
Epsilon-greedy (simple; gives up a small fraction of slots to random picks) or Thompson sampling (principled, bandit framework). 5–10% exploration is typical. Without it, coverage collapses.
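For comparison with epsilon-greedy, here is a small Thompson-sampling sketch for click-through rate. It assumes a Beta-Bernoulli model with a uniform Beta(1, 1) prior per item; the `stats` layout and item names are hypothetical.

```python
import random
from collections import Counter


def thompson_pick(stats: dict[str, tuple[int, int]], rng: random.Random) -> str:
    """Pick the item with the highest sampled CTR; stats maps item -> (clicks, impressions)."""
    def sample(item_id: str) -> float:
        clicks, impressions = stats[item_id]
        # Posterior under a uniform prior: Beta(1 + successes, 1 + failures).
        return rng.betavariate(1 + clicks, 1 + impressions - clicks)
    return max(stats, key=sample)


# Hypothetical items: one proven, one proven weak, one with no data yet.
stats = {"proven": (900, 1000), "weak": (10, 1000), "new": (0, 0)}
rng = random.Random(0)
picks = Counter(thompson_pick(stats, rng) for _ in range(2000))
print(picks.most_common())
```

The item with no data has a wide posterior, so it wins a meaningful share of draws until evidence accumulates, while the item with a well-established low CTR is almost never shown. Exploration is spent where uncertainty is, rather than uniformly at random.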
Failure modes
Train/serve skew
Training features use DB joins; serving features use cached values. They diverge, so offline metrics lie. Fix: one codepath generates features for both train and serve (feature store).
Feedback loop
The model recommends what it already recommends; users click only what's shown; the model reinforces itself. Coverage shrinks. Fix: exploration.
Cold start
A new user or new item has no history. Fix: content-based fallback for items; popularity-in-segment for users; progressive reveal as data accrues.
Stale model
Model trained 6 months ago. User preferences drifted; item catalogue changed. Fix: retrain on a schedule (daily for fast-moving catalogues).
Latency blow-up
Scoring N candidates with an expensive NN, each needing K feature lookups, multiplies into a latency explosion. Budget: ~50 ms for the whole rec call. Fix: cap candidate count; batch feature fetches; warm caches.
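A back-of-envelope model shows why batching feature fetches matters for the latency budget. All numbers below (10 ms retrieval, 8 ms per batched fetch, 1 ms per single fetch, 20 µs scoring per candidate) are invented for illustration, and the model assumes fetch round trips run sequentially.

```python
import math


def rec_call_latency_ms(
    n_candidates: int,
    retrieval_ms: float,
    fetch_ms_per_batch: float,
    fetch_batch_size: int,
    score_us_per_candidate: float,
) -> float:
    """Rough end-to-end latency: retrieval + sequential feature fetches + scoring."""
    fetch_round_trips = math.ceil(n_candidates / fetch_batch_size)
    fetch_ms = fetch_round_trips * fetch_ms_per_batch
    scoring_ms = n_candidates * score_us_per_candidate / 1000
    return retrieval_ms + fetch_ms + scoring_ms


# 1000 candidates, one batched fetch vs one fetch per candidate (hypothetical timings).
batched = rec_call_latency_ms(1000, 10, 8, 1000, 20)
unbatched = rec_call_latency_ms(1000, 10, 1, 1, 20)
print(batched, unbatched)  # 38.0 1030.0
```

Under these assumed timings the batched call fits a 50 ms budget, while per-candidate fetches overshoot it by a factor of about twenty.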
Drills
Why not just one big model?
Latency. Scoring a deep model against 100M candidates is impossible in 50 ms. Candidate gen is cheap retrieval (ANN, SQL) — narrows to thousands. Ranking is expensive scoring — runs on those thousands. Two stages = each tuned for its job.
Offline AUC up, online CTR flat. Diagnose.
Top suspects: (1) train/serve feature skew — offline features computed differently than online; (2) selection bias — training on logged impressions means the model learned what the old system showed, not what users actually want; (3) exploration rate too low — no new signal reaching the model; (4) business metric mismatch — AUC ≠ CTR. Investigate in that order.
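For the first suspect, one practical check is to log the feature values the online ranker actually used and compare them with the offline values for the same requests. The sketch below assumes both sides can be joined on a `request_id`; the row layout and function name are hypothetical.

```python
def skew_report(offline_rows: list[dict], online_rows: list[dict], tolerance: float = 1e-6) -> dict[str, float]:
    """Per-feature fraction of joined requests where offline and online values disagree."""
    online_by_request = {row["request_id"]: row["features"] for row in online_rows}
    mismatches: dict[str, int] = {}
    compared = 0
    for row in offline_rows:
        online = online_by_request.get(row["request_id"])
        if online is None:
            continue  # request was not logged online; nothing to compare
        compared += 1
        for name, offline_value in row["features"].items():
            # A feature missing online counts as a mismatch.
            differs = name not in online or abs(offline_value - online[name]) > tolerance
            mismatches[name] = mismatches.get(name, 0) + int(differs)
    return {name: count / compared for name, count in mismatches.items()}


# Hypothetical logged features for two requests.
offline_rows = [
    {"request_id": "r1", "features": {"clicks_7d": 3.0, "item_age_days": 2.0}},
    {"request_id": "r2", "features": {"clicks_7d": 5.0, "item_age_days": 1.0}},
]
online_rows = [
    {"request_id": "r1", "features": {"clicks_7d": 3.0, "item_age_days": 2.0}},
    {"request_id": "r2", "features": {"clicks_7d": 4.0, "item_age_days": 1.0}},
]
report = skew_report(offline_rows, online_rows)
print(report)  # {'clicks_7d': 0.5, 'item_age_days': 0.0}
```

A feature with a high mismatch rate is the first place to look before blaming the model itself.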