- Each shard = a Lucene index = many immutable segments.
- Indexing: docs go to an in-memory buffer and the translog. A refresh (every 1s by default) writes the buffer out as a new segment, which makes those docs searchable.
- Updates: ES doesn’t update in place. The old doc is marked deleted in its segment and the new version is written to a new segment. Background merges reclaim the space.
- Merging is expensive (IO-heavy), so it is throttled.
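The segment lifecycle above can be illustrated with a toy in-memory model. This is only a sketch: `ToyShard` and `Segment` are hypothetical names, the translog is left out, and real Lucene segments are on-disk structures rather than dicts.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Docs never change once written; deletes are recorded as tombstones."""
    docs: dict
    deleted: set = field(default_factory=set)

    def live(self) -> dict:
        return {k: v for k, v in self.docs.items() if k not in self.deleted}


class ToyShard:
    def __init__(self):
        self.buffer = {}    # indexing buffer: accepted but not yet searchable
        self.segments = []  # searchable segments, oldest first

    def index(self, doc_id, doc):
        # New docs and updates both land in the buffer first.
        self.buffer[doc_id] = doc

    def refresh(self):
        if not self.buffer:
            return
        # An update tombstones the old copy instead of rewriting it.
        for seg in self.segments:
            seg.deleted.update(seg.docs.keys() & self.buffer.keys())
        self.segments.append(Segment(dict(self.buffer)))
        self.buffer.clear()

    def search(self, doc_id):
        for seg in self.segments:
            if doc_id in seg.live():
                return seg.docs[doc_id]
        return None

    def merge(self):
        # Copy live docs into one new segment; tombstoned docs are dropped.
        merged = {}
        for seg in self.segments:
            merged.update(seg.live())
        self.segments = [Segment(merged)] if merged else []
```

Indexing a doc, refreshing, updating it and refreshing again leaves two segments with a tombstone in the first; `merge()` then collapses them into one segment holding only the latest version.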
- Default similarity: BM25 (replaced TF-IDF in 5.0).
- Components:
- TF (term frequency) — saturating: many occurrences boost less.
- IDF (inverse document frequency) — rare terms weighted higher.
- Field length norm — short fields with the term weighted higher.
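To see how the three components interact, here is a minimal sketch of the textbook BM25 score for a single term. The function and parameter names are illustrative, `k1=1.2` and `b=0.75` are the commonly cited defaults, and the exact constants Lucene applies can differ between versions.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Score of one query term against one field value (textbook BM25)."""
    # IDF: terms that appear in few documents get a larger weight.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Field length norm: fields longer than average are penalised.
    length_norm = 1 - b + b * (doc_len / avg_doc_len)
    # TF saturation: the fraction approaches (k1 + 1) as tf grows.
    return idf * (tf * (k1 + 1)) / (tf + k1 * length_norm)
```

Going from 1 to 2 occurrences raises the score more than going from 9 to 10, which is the saturation behaviour described above.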
- `_score` is returned per hit. Results sort by `_score` (default) or by fields.
- `bool.filter` doesn’t contribute to the score (and is cacheable). Use it for facets/permission checks.
- Custom scoring: `function_score`, `script_score`, `rank_feature`.
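As a sketch of the scoring/filtering split, the request body below puts the full-text clause in `must` (scored) and the permission check in `filter` (not scored). The field names `title` and `tenant_id` and the helper name are assumptions made for the example.

```python
def build_search_body(text, tenant_id, size=10):
    """Search body: `must` contributes to _score, `filter` only narrows."""
    return {
        "size": size,
        "query": {
            "bool": {
                # Scored clause: relevance comes from here.
                "must": [{"match": {"title": text}}],
                # Non-scored clause: yes/no check, eligible for caching.
                "filter": [{"term": {"tenant_id": tenant_id}}],
            }
        },
    }
```

The resulting dict can be serialized with `json.dumps` and sent as the body of a `_search` request.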
- Standard: the most general analyzer. Tokenizes on word boundaries and lowercases. No stemming.
- English: standard + English stop words + Porter stemmer.
- Common tokenizers:
- `keyword`: no split (entire text as one token).
- `ngram` (e.g. min 1, max 3) / `edge_ngram`: partial-match, search-as-you-type.
- For autocomplete: `edge_ngram` index analyzer + `standard` search analyzer (avoids n-gramming the query).
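- Common pitfall: the index-time and query-time analyzers must produce compatible tokens, or you get phantom mismatches. Default to the same analyzer on both sides and diverge only deliberately, as in the autocomplete case.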
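The autocomplete setup above could look roughly like the index body below, with a small pure-Python helper showing what an edge n-gram tokenizer emits. The analyzer, tokenizer and field names (`autocomplete`, `autocomplete_tokenizer`, `title`) and the gram sizes are assumptions for illustration.

```python
def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of `token`, mimicking what an edge n-gram tokenizer emits."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


AUTOCOMPLETE_INDEX = {
    "settings": {
        "analysis": {
            "tokenizer": {
                "autocomplete_tokenizer": {
                    "type": "edge_ngram",
                    "min_gram": 1,
                    "max_gram": 10,
                    "token_chars": ["letter", "digit"],
                }
            },
            "analyzer": {
                "autocomplete": {
                    "type": "custom",
                    "tokenizer": "autocomplete_tokenizer",
                    "filter": ["lowercase"],
                }
            },
        }
    },
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                # Index time: store every prefix of every word.
                "analyzer": "autocomplete",
                # Search time: keep the typed prefix as a single token.
                "search_analyzer": "standard",
            }
        }
    },
}
```

With this split, the query `qu` stays one token and matches the indexed prefix `qu` of `quick`. If the query were n-grammed too, it would also produce `q` and match every word starting with that letter.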
- Don’t use dynamic mapping in production for user-controlled fields → mapping explosion (every new field becomes part of mapping forever).
- Set `dynamic: 'strict'` or use dynamic templates to constrain.
- Disable `_source` only if you really don’t need it (you lose reindex/update ability).
- Set `index: false` on fields you only need to fetch, not search.
- Keyword vs text: use text for full-text, keyword for exact match/aggregations. Multi-field is the norm.
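A mapping that follows these guidelines might look like the sketch below. The index layout and field names (`title`, `status`, `thumbnail_url`) are hypothetical, and `unknown_fields` is only a local stand-in for the check that a strict mapping performs server-side.

```python
PRODUCT_INDEX = {
    "mappings": {
        # Reject documents that carry fields missing from the mapping.
        "dynamic": "strict",
        "properties": {
            # Multi-field: analyzed text for search, keyword for exact
            # match, sorting and aggregations.
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
            },
            "status": {"type": "keyword"},
            # Only ever fetched, never searched.
            "thumbnail_url": {"type": "keyword", "index": False},
        },
    }
}


def unknown_fields(doc, index_body=PRODUCT_INDEX):
    """Top-level fields of `doc` that a strict mapping would refuse."""
    known = index_body["mappings"]["properties"].keys()
    return sorted(set(doc) - set(known))
```

Running `unknown_fields` in the ingest path before sending a document gives an early, readable error instead of a rejected index request.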
- Number of primary shards is fixed at index creation. Changing it means reindexing or using the split/shrink APIs.
- Rule of thumb: shard size 20-50GB. Too many shards = cluster overhead. Too few = no parallelism and hot shards.
- Replicas can be changed dynamically.
- Time-series: roll over indices daily/weekly/monthly + ILM. Use data streams (7.9+).
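The sizing rule of thumb turns into simple arithmetic. The helper below is a rough sketch: the 30GB target is an assumed midpoint of the 20-50GB range, and real sizing should also account for growth and query load.

```python
import math


def primary_shard_count(expected_index_gb, target_shard_gb=30):
    """Primaries needed to keep each shard near the target size."""
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_gb / target_shard_gb))
```

For example, an index expected to reach 600GB would get 20 primaries at a 30GB target, while a 10GB index stays at a single primary.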
- Client → coordinating node → query phase (the request goes to one copy of each shard) → fetch phase (retrieve the top-K docs).
- “Query then Fetch”: the first phase returns ids + scores; the second fetches `_source`.
- `search_type=dfs_query_then_fetch` for cross-shard IDF accuracy (rarely needed).
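The two phases can be sketched with a toy coordinator. Everything here is illustrative: shards are plain dicts, the "score" is a raw term count rather than BM25, and the function names are made up for the example.

```python
import heapq


def query_phase(shard_docs, term, k):
    """Each shard scores locally and returns only (score, doc_id) pairs."""
    scored = [
        (doc["text"].split().count(term), doc_id)
        for doc_id, doc in shard_docs.items()
    ]
    return heapq.nlargest(k, [pair for pair in scored if pair[0] > 0])


def search(shards, term, k):
    # Query phase: collect up to k lightweight candidates per shard.
    candidates = []
    for shard_no, shard_docs in enumerate(shards):
        for score, doc_id in query_phase(shard_docs, term, k):
            candidates.append((score, doc_id, shard_no))
    # The coordinator merges them into the global top k.
    top = heapq.nlargest(k, candidates)
    # Fetch phase: only the winners have their full source loaded.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_no][doc_id]}
        for score, doc_id, shard_no in top
    ]
```

Because each shard scores with its own local statistics, scores from different shards are not perfectly comparable; that gap is what `dfs_query_then_fetch` closes.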
- Near real-time: written docs become searchable on next refresh (1s default).
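- Refresh interval can be tuned: -1 disables it (useful during bulk indexing); restore it and refresh once when the load finishes.
- Durability: the translog is fsynced per request by default. Switching to async raises throughput but risks losing the most recent acknowledged writes on a crash (up to the sync interval, 5s by default).
- Replication: writes go to the primary first, then are replicated to the in-sync replicas before the ack.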
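A bulk load that follows the refresh and durability notes above could be planned as the sequence of REST calls below. This is a sketch: the helper name and the `relax_durability` flag are hypothetical, and the bulk payload itself is omitted.

```python
def bulk_load_plan(index, relax_durability=False):
    """Ordered (method, path, body) steps for a one-off bulk load."""
    before = {"index": {"refresh_interval": "-1"}}
    after = {"index": {"refresh_interval": "1s"}}
    if relax_durability:
        # Trades crash safety for throughput while the load runs.
        before["index"]["translog"] = {"durability": "async"}
        after["index"]["translog"] = {"durability": "request"}
    return [
        ("PUT", f"/{index}/_settings", before),   # stop periodic refreshes
        ("POST", "/_bulk", None),                 # NDJSON payload omitted
        ("POST", f"/{index}/_refresh", None),     # make the docs searchable
        ("PUT", f"/{index}/_settings", after),    # restore normal settings
    ]
```

Keeping the plan as data makes it easy to log or dry-run before handing each step to whatever HTTP client the project already uses.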
- Why is a `term` query on a text field unreliable? Text is analyzed at index time (lowercased, possibly stemmed), while `term` matches the exact token. Use a keyword subfield.
- Difference between filter and must? Filter has no scoring; cached; faster. Must scores.
- How to do exact phrase match? `match_phrase`, with optional `slop`.
- Aggregation on text field fails. Why? Text is not aggregable. Use a `keyword` subfield or `fielddata: true` (memory-heavy, avoid).
- How to handle synonyms? Synonym token filter at index or search time. Search-time more flexible.
- What’s `_source`? The original JSON sent in. Stored separately from indexed terms. Disabling saves space but breaks update/reindex.
- How does ES handle concurrent updates? Optimistic concurrency control via `if_seq_no` / `if_primary_term` (the older `_version` check still applies to external versioning). A conflict returns HTTP 409.
- When would you use nested vs flattened? Nested when queries on different sub-fields of same array element must correlate. Flattened (flat object map) when storing arbitrary deeply-nested user input cheaply.
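- What is split-brain and how does ES prevent it? A network partition lets two groups of nodes each elect their own master and diverge. Pre-7.0: set `discovery.zen.minimum_master_nodes = (N/2)+1`, where N is the number of master-eligible nodes. 7.x+: quorum-based master election (Raft-like) manages the voting configuration itself.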
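The quorum rule from the split-brain answer is easy to check numerically. The helper names below are illustrative, and the sketch models only the majority requirement, not the actual election protocol.

```python
def quorum(master_eligible):
    """Smallest majority of the master-eligible nodes: (N/2)+1."""
    if master_eligible < 1:
        raise ValueError("need at least one master-eligible node")
    return master_eligible // 2 + 1


def can_elect_master(partition_size, master_eligible):
    """A partition may elect a master only if it holds a majority."""
    return partition_size >= quorum(master_eligible)
```

With 3 master-eligible nodes split 2/1, only the side with 2 nodes can elect a master. With 4 nodes split 2/2, neither side can, which is one reason odd counts are usually preferred.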
- Single huge index for all logs forever — split by time.
- Leading wildcards in wildcard / regexp queries: every term in the field has to be scanned.
- Sorting on text field — needs fielddata, kills heap.
- Deeply nested JSON structures with `nested`: combinatorial explosion.
- Using ES as primary store — it’s a search engine, expect re-indexing from source of truth.
- `dense_vector` field with `index: true` enables HNSW kNN.
- Combine with filters via the `knn` query.
- Use for semantic search alongside BM25 (hybrid: RRF reciprocal rank fusion).
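Reciprocal rank fusion itself is small enough to sketch client-side. The function name is illustrative, and `k=60` is the commonly used rank constant rather than something the notes above specify.

```python
def rrf_fuse(rankings, k=60):
    """Fuse ranked id lists: each doc scores sum(1 / (k + rank))."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Given a BM25 ranking `["a", "b", "c"]` and a kNN ranking `["b", "d", "a"]`, docs that appear in both lists rise to the top. RRF uses only ranks, so the two score scales never need to be normalized against each other.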
- Heap: about 50% of RAM, capped at roughly 31GB so the JVM can keep using compressed oops.
- Disk watermarks: low (85%), high (90%), flood stage (95%). Above flood stage, indices with a shard on that node are made read-only.
- Snapshots to S3/GCS via repository plugin. Incremental.
- Reindex API for mapping changes — blue/green index swap via aliases.
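The blue/green swap can be sketched as two request bodies: one for `POST /_reindex` and one for `POST /_aliases`. The helper names and the index and alias names in the usage note are hypothetical.

```python
def reindex_body(old_index, new_index):
    """Body for POST /_reindex: copy docs into the newly mapped index."""
    return {"source": {"index": old_index}, "dest": {"index": new_index}}


def alias_swap_actions(alias, old_index, new_index):
    """Body for POST /_aliases: both actions are applied atomically."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

A typical flow would create `products_v2` with the new mapping, reindex from `products_v1`, then swap the `products` alias. Clients that only ever query the alias see no gap between the two indices.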