Search over content
When to reach for this
Reach for this when…
- Full-text search over a corpus (millions+ of docs)
- Faceted search on e-commerce
- Log / metric exploration
- "Search for..." in the prompt — always this
Not really this pattern when…
- Lookup by exact key (that's a primary index)
- Dataset < 10k rows (SQL with a btree is fine)
- Structured-only queries on well-known columns (SQL wins)
Good vs bad answer
Interviewer probe
“Design product search for 10M products.”
Weak answer
"SQL with ILIKE '%term%'."
Strong answer
"Elasticsearch with BM25 base relevance. Postgres is source of truth; Debezium CDC publishes changes to Kafka; an indexer consumer upserts into ES. Fields: standard + English stemming on name/description; keyword (un-analysed) on categories and SKUs for exact filter; numeric on price and rating for sort + range. Query: multi_match with field boosts, function_score for popularity + recency, filters as must-clauses that don't affect the score. Rebuild: alias pattern — new index products-v4, bulk reindex from DB, atomic alias flip, keep v3 for rollback. Could use Meilisearch if we don't need ES's aggregation power — simpler ops."
Why it wins: Names engine, pipeline, analyzer strategy, BM25 + boosts, rebuild pattern, and the alternative for smaller scale.
Cheat sheet
- •DB = truth. Search index = derived view.
- •CDC / outbox publishes changes async.
- •<10M docs: Postgres FTS often enough.
- •>10M or facets: Elasticsearch / OpenSearch.
- •Alias pattern for atomic reindex + rollback.
- •BM25 base + function_score business signals.
- •LIKE is not search.
Core concept
Search engines (Elasticsearch, OpenSearch, Meilisearch, Typesense, Vespa) are built on an inverted index: for every term, a posting list of (doc_id, positions, frequency). A query looks up posting lists, intersects/unions doc_ids, scores with BM25, sorts, returns top K.
Three separable layers:
- 1Ingestion — DB is truth; CDC or outbox publishes changes to the indexer, which updates the search engine. Async; accept a few seconds of search lag.
- 2Index — sharded by doc_id (or by tenant for multi-tenant). Replicated for HA and read throughput. Field-level analysis (language, stemming, stop words) — different fields use different analyzers.
- 3Query / re-rank — coordinator fans out to shards, merges top-K, applies business signals (popularity, recency, personalisation) via function_score or a separate re-ranker.
Relevance (BM25 + signals): BM25 is the base score. Layer business signals on top: popularity boost (log1p(sales)), recency decay, personalisation. Never replace BM25 — boost it.
Rebuild-ability is non-optional. Analyzer change, schema change, bug — you'll rebuild. Use the alias pattern: products alias points at products-v4; rebuild v5 in background; flip alias atomically; keep v4 for rollback.
Canonical examples
- →Product search
- →Wiki / documentation search
- →Log exploration UI (Kibana)
- →Feed search / hashtag search
- →Support-ticket search
Decision levers
Engine choice
Postgres FTS up to ~10M docs and modest QPS. Meilisearch / Typesense for great DX at moderate scale. Elasticsearch / OpenSearch for billions + facets + custom scoring. Vespa when search IS the product and you need ML re-rank.
Ingestion pipeline
Never dual-write. Outbox or CDC (Debezium). App writes one DB transaction; CDC publishes to Kafka; indexer consumer upserts to search engine. Idempotent.
Analyzer per field
Free-text (name, description): standard + language stemming. Identifiers (category, SKU): keyword — un-analysed, exact match. Numeric: numeric fields for sort + filter. Wrong analyzer = zero matches.
Relevance strategy
BM25 base. function_score multiplies by popularity + recency. A/B test changes on CTR and conversion. Offline: ML re-rank on top of the top-K, for pros.
Failure modes
Elasticsearch has crashed and corrupted. Treat it as a derived view; never the only copy of the data. Rebuild from DB is the recovery path.
Couples write latency to indexer availability. Make it async via CDC/outbox; accept a few seconds of search lag.
Full table scan per query. Dies at hundreds of thousands of rows. Step up to Postgres FTS, then a real search engine.
Schema / analyzer change with no alias + rebuild pattern = downtime. Plan alias pattern from day 1.
Per-user re-ranking of every result blows cache and latency. Boost per-user at the top-K level, not at scan time.
Drills
Explain an inverted index in 30 seconds.Reveal
For every term, store a posting list of (doc_id, term_frequency, positions). A search looks up each term's list, intersects (AND) or unions (OR) doc_ids, scores with BM25, sorts, returns top K. The index is expensive to build, cheap to query — opposite of a btree.
You changed the analyzer. How to deploy?Reveal
Alias pattern. Create products-v4 with new analyzer. Bulk reindex from the source DB (or ES reindex API). Verify results. Flip products alias from v3 to v4 atomically. Keep v3 for 24h for rollback. Delete v3 after confidence.