Elasticsearch — Theory (interview deep-dive)

Each shard = a Lucene index = many immutable segments.
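Indexing: docs go to an in-memory buffer and the translog. Refresh (1s default) writes the buffer out as a new segment (in the filesystem cache, not yet fsynced) → searchable. A later flush commits segments to disk and truncates the translog.
Updates: ES doesn’t update in place. The old copy is marked deleted in its segment and the new version is written to a new segment. Background merges reclaim the space.
Merging is expensive (IO-heavy), so it is throttled.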
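
The write path above can be sketched as a toy in-memory model. This is not how Lucene is implemented; `ToyShard`, `Segment` and their methods are hypothetical names, and the model only illustrates why a doc is invisible until refresh and why an update leaves a dead copy behind until a merge.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Written once; later deletes are recorded as tombstones beside it."""
    docs: dict                                   # doc_id -> source
    deleted: set = field(default_factory=set)    # tombstoned doc ids


class ToyShard:
    def __init__(self):
        self.buffer = {}      # indexed but not yet searchable
        self.translog = []    # append-only log, replayed after a crash
        self.segments = []

    def index(self, doc_id, source):
        self.translog.append((doc_id, source))
        self.buffer[doc_id] = source

    def refresh(self):
        """Turn the buffer into a new segment, making its docs searchable."""
        if not self.buffer:
            return
        # Older copies are never rewritten, only marked deleted.
        for seg in self.segments:
            seg.deleted |= seg.docs.keys() & self.buffer.keys()
        self.segments.append(Segment(docs=dict(self.buffer)))
        self.buffer.clear()

    def search(self, doc_id):
        """Look only at segments; the buffer is not searchable."""
        for seg in reversed(self.segments):
            if doc_id in seg.docs and doc_id not in seg.deleted:
                return seg.docs[doc_id]
        return None

    def merge(self):
        """Rewrite all live docs into one segment, dropping dead copies."""
        live = {}
        for seg in self.segments:
            for doc_id, source in seg.docs.items():
                if doc_id not in seg.deleted:
                    live[doc_id] = source
        self.segments = [Segment(docs=live)]

    def stored_copies(self):
        return sum(len(seg.docs) for seg in self.segments)
```

Indexing a doc, refreshing, then indexing a new version and refreshing again leaves two stored copies (one tombstoned) until `merge()` runs.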

Default similarity: BM25 (replaced TF-IDF in 5.0).
Components:
- TF (term frequency) — saturating: many occurrences boost less.
- IDF (inverse document frequency) — rare terms weighted higher.
- Field length norm — short fields with the term weighted higher.
_score returned per hit. Sortable by _score (default) or fields.
bool.filter clauses don’t contribute to the score and are cacheable. Use them for yes/no conditions such as facet selections and permission checks.
Custom scoring: function_score, script_score, rank_feature.
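
The three components can be seen in the textbook BM25 formula for a single term. The sketch below assumes the commonly cited form with the default parameters k1 = 1.2 and b = 0.75; Lucene's actual implementation differs in details (for example how field lengths are encoded), so treat the numbers as illustrative only. The function name and argument names are made up for this example.

```python
import math


def bm25_term_score(tf, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    """Textbook BM25 score of one term in one field of one document."""
    # IDF: the fewer documents contain the term, the larger this factor.
    idf = math.log(1 + (n_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    # Field length norm: above 1 for longer-than-average fields, below 1 for shorter.
    length_norm = 1 - b + b * doc_len / avg_doc_len
    # Saturating TF: approaches (k1 + 1) as tf grows, never exceeds it.
    tf_part = tf * (k1 + 1) / (tf + k1 * length_norm)
    return idf * tf_part
```

With an average-length field, going from 1 to 2 occurrences raises the TF factor from 1.0 to about 1.375, while going from 9 to 10 adds only about 0.02, which is the saturation described above.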

Standard: the general-purpose default; tokenizes on word boundaries and lowercases. No stemming, and no stop words unless configured.
English: standard tokenizer + lowercase + English stop words + Porter stemmer.
Common tokenizers:
- keyword — no split (entire text as one token).
- ngram(1,3) / edge_ngram — partial-match search-as-you-type.
For autocomplete: edge_ngram index analyzer + standard search analyzer (avoids ngram’ing query).
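Common pitfall: index-time and search-time analyzers that produce incompatible tokens cause phantom mismatches. Keep them the same unless there is a deliberate reason to differ (such as the edge_ngram case above).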
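
A rough sketch of the autocomplete setup described above. The index body uses the `edge_ngram` tokenizer with `analyzer` / `search_analyzer` on the field; the analyzer, tokenizer and field names are invented for illustration. The small Python functions only imitate the analyzers (real analysis is more involved) to show why the query side should not be ngrammed.

```python
import re

# Index body: edge_ngram at index time, standard analyzer at search time.
# "autocomplete_tok", "autocomplete_index" and "title" are hypothetical names.
AUTOCOMPLETE_INDEX = {
    "settings": {"analysis": {
        "tokenizer": {"autocomplete_tok": {
            "type": "edge_ngram", "min_gram": 1, "max_gram": 10,
            "token_chars": ["letter", "digit"]}},
        "analyzer": {"autocomplete_index": {
            "type": "custom", "tokenizer": "autocomplete_tok",
            "filter": ["lowercase"]}},
    }},
    "mappings": {"properties": {"title": {
        "type": "text",
        "analyzer": "autocomplete_index",
        "search_analyzer": "standard"}}},
}


def standard_tokens(text):
    """Crude stand-in for the standard analyzer: split and lowercase."""
    return [t for t in re.split(r"[^0-9A-Za-z]+", text.lower()) if t]


def edge_ngrams(token, min_gram=1, max_gram=10):
    """Prefixes of the token, from min_gram up to max_gram characters."""
    return [token[:n] for n in range(min_gram, min(max_gram, len(token)) + 1)]


def matches(query, doc_text, ngram_query=False):
    """True if any query token equals an indexed token (OR semantics)."""
    indexed = {g for t in standard_tokens(doc_text) for g in edge_ngrams(t)}
    query_tokens = standard_tokens(query)
    if ngram_query:
        # Wrong setup: the query gets ngrammed too.
        query_tokens = [g for t in query_tokens for g in edge_ngrams(t)]
    return any(t in indexed for t in query_tokens)
```

In this toy, `matches("ela", "Elasticsearch")` is true and `matches("elk", "Elasticsearch")` is false, but with `ngram_query=True` the query "elk" matches through its prefixes "e" and "el".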

Don’t use dynamic mapping in production for user-controlled fields → mapping explosion (every new field becomes part of mapping forever).
Set dynamic: 'strict' or use dynamic templates to constrain.
Disable _source only if you really don’t need it (lose reindex/update ability).
Set index: false on fields you only need to fetch, not search.
Keyword vs text: use text for full-text, keyword for exact match/aggregations. Multi-field is the norm.
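
A minimal mapping sketch that combines these points: strict dynamic mapping, a text field with a keyword subfield, and a fetch-only field with `index: false`. The field names (`title`, `thumbnail_url`, `created_at`) are hypothetical, and `unknown_fields` is a helper written for this example to show which top-level fields a strict mapping would reject; it is not an Elasticsearch API.

```python
PRODUCT_INDEX = {
    "mappings": {
        "dynamic": "strict",   # unmapped fields are rejected, not auto-added
        "properties": {
            # Full-text search on title, exact match/aggregations on title.raw.
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword", "ignore_above": 256}},
            },
            # Returned with hits via _source, but not searchable.
            "thumbnail_url": {"type": "keyword", "index": False},
            "created_at": {"type": "date"},
        },
    }
}


def unknown_fields(doc, index_body):
    """Top-level doc fields that are missing from the mapping."""
    mapped = index_body["mappings"]["properties"].keys()
    return sorted(set(doc) - set(mapped))
```

A document such as `{"title": "Mug", "colour": "red"}` would be rejected under this mapping because `colour` is not declared.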

Number of primary shards is fixed at index creation; changing it means a reindex or the split/shrink APIs.
Rule of thumb: shard size 20-50GB. Too many shards = cluster overhead (heap, cluster state). Too few = no parallelism and hot spots.
Replicas can be changed dynamically.
Time-series: roll over indices daily/weekly/monthly + ILM. Use data streams (7.9+).
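
The sizing rule of thumb is simple arithmetic, sketched below together with an ILM policy body that rolls over by age or primary shard size. The 30GB target, the 7d/50gb rollover thresholds and the 30d retention are illustrative values chosen for the example, not recommendations from the notes above.

```python
import math


def primary_shard_count(expected_index_gb, target_shard_gb=30.0):
    """Primary shards needed to keep each shard near the target size."""
    if expected_index_gb <= 0 or target_shard_gb <= 0:
        raise ValueError("sizes must be positive")
    return max(1, math.ceil(expected_index_gb / target_shard_gb))


# ILM policy body: roll over the write index weekly or once a primary
# shard reaches 50GB, whichever comes first; delete after 30 days.
LOGS_ILM_POLICY = {
    "policy": {
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {
                        "max_age": "7d",
                        "max_primary_shard_size": "50gb",
                    }
                }
            },
            "delete": {"min_age": "30d", "actions": {"delete": {}}},
        }
    }
}
```

For example, an index expected to reach 600GB would get 20 primaries at a 30GB target, while a 10GB index needs only one.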

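Client → coordinating node → query phase: the request goes to one copy (primary or replica) of each shard, and each returns its local top-K → fetch phase: the coordinating node merges these into the global top-K and fetches those docs.
“Query then Fetch”: the first phase returns only ids + scores; the second fetches _source from the shards that hold the winning docs.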
search_type=dfs_query_then_fetch for cross-shard IDF accuracy (rarely needed).
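
The two phases can be sketched as a scatter-gather over in-memory "shards". This is a simplification under stated assumptions: scores are precomputed, each shard is a plain dict of `doc_id -> (score, source)`, and the function names are invented for the example.

```python
import heapq


def shard_query(shard_docs, k):
    """Query phase on one shard copy: local top k as (score, doc_id) only."""
    scored = ((score, doc_id) for doc_id, (score, _source) in shard_docs.items())
    return heapq.nlargest(k, scored)


def search(shards, k):
    """Coordinating node: merge per-shard results, then fetch the winners."""
    # Query phase: every shard contributes at most k lightweight entries.
    candidates = []
    for shard_no, docs in enumerate(shards):
        for score, doc_id in shard_query(docs, k):
            candidates.append((score, shard_no, doc_id))
    top = heapq.nlargest(k, candidates)

    # Fetch phase: only the global top k are loaded with their _source.
    return [
        {"_id": doc_id, "_score": score, "_source": shards[shard_no][doc_id][1]}
        for score, shard_no, doc_id in top
    ]
```

The point of the split is that `_source` is read for k documents in total, not for k documents per shard.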

Near real-time: written docs become searchable on next refresh (1s default).
Refresh interval can be tuned: -1 disables (bulk indexing), then refresh once.
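Durability: the translog is fsynced on every request by default (index.translog.durability: request). Setting it to async raises throughput, at the risk of losing the last few seconds of acknowledged writes on a crash.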
Replication: writes go to primary first, then replicated to in-sync replicas before ack.
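
A bulk-load sequence built from these settings might look like the plan below. It only assembles the request method, path and body for each step, so it runs without a cluster; sending the requests with a client is left out. The function name and the choice to restore "1s" / "request" afterwards are assumptions for the example.

```python
# Settings for the duration of a bulk load: no periodic refresh,
# translog fsynced in the background instead of per request.
BULK_LOAD_SETTINGS = {
    "index": {"refresh_interval": "-1", "translog": {"durability": "async"}}
}

# Settings to restore afterwards (the defaults named in the notes).
NORMAL_SETTINGS = {
    "index": {"refresh_interval": "1s", "translog": {"durability": "request"}}
}


def bulk_load_plan(index):
    """Ordered (method, path, body) steps for a one-off bulk load."""
    return [
        ("PUT", f"/{index}/_settings", BULK_LOAD_SETTINGS),
        ("POST", f"/{index}/_bulk", None),      # body: the NDJSON bulk payload
        ("POST", f"/{index}/_refresh", None),   # make everything searchable once
        ("PUT", f"/{index}/_settings", NORMAL_SETTINGS),
    ]
```

Restoring the settings as the last step matters: leaving `refresh_interval` at -1 means newly written docs never become searchable without an explicit refresh.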

Why is a term query on a text field unreliable? Text is analyzed at index time (lowercased, possibly stemmed), but term looks up the exact, unanalyzed value. Use a keyword subfield.
Difference between filter and must? Filter has no scoring; cached; faster. Must scores.
How to do exact phrase match? match_phrase, with optional slop.
Aggregation on text field fails — why? Text is not aggregable. Use keyword subfield or fielddata: true (memory-heavy, avoid).
How to handle synonyms? Synonym token filter at index or search time. Search-time more flexible.
What’s _source? Original JSON sent in. Stored separately from indexed terms. Disabling saves space but breaks update/reindex.
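How does ES handle concurrent updates? Optimistic concurrency control via if_seq_no / if_primary_term (checks against the internal _version were the older mechanism and were removed for this purpose in 7.0; version still applies with external versioning). A conflict returns HTTP 409.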
When would you use nested vs flattened? Nested when queries on different sub-fields of same array element must correlate. Flattened (flat object map) when storing arbitrary deeply-nested user input cheaply.
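What is split-brain and how does ES prevent it? Split-brain: a network partition lets two groups of nodes each elect a master, so cluster state diverges. Before 7.0, set discovery.zen.minimum_master_nodes = (N/2)+1, with N the number of master-eligible nodes. From 7.0 the setting is gone and master election is quorum-based (Raft-like), with the voting configuration managed by the cluster itself.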
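
For the concurrent-updates question above, the compare-and-set behaviour of if_seq_no / if_primary_term can be imitated with a toy in-memory store. `ToyIndex` and `VersionConflict` are hypothetical names; a real cluster assigns sequence numbers per shard and returns an HTTP 409 response instead of raising a Python exception.

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 response."""
    status = 409


class ToyIndex:
    def __init__(self):
        self._docs = {}
        self._next_seq_no = 0
        self.primary_term = 1   # would increase when a new primary is promoted

    def get(self, doc_id):
        return dict(self._docs[doc_id])

    def index(self, doc_id, source, if_seq_no=None, if_primary_term=None):
        """Write the doc; with conditions set, only if it is unchanged."""
        current = self._docs.get(doc_id)
        if if_seq_no is not None or if_primary_term is not None:
            expected = (if_seq_no, if_primary_term)
            actual = (
                (current["_seq_no"], current["_primary_term"]) if current else None
            )
            if expected != actual:
                raise VersionConflict(f"expected {expected}, found {actual}")
        entry = {
            "_source": source,
            "_seq_no": self._next_seq_no,
            "_primary_term": self.primary_term,
        }
        self._next_seq_no += 1
        self._docs[doc_id] = entry
        return dict(entry)
```

Two clients that read the same `_seq_no` and both write back conditionally will see the first write succeed and the second fail; the loser re-reads the doc and retries.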

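Single huge index for all logs forever: split by time.
Leading wildcards in wildcard / regexp queries: every term in the field must be scanned.
Sorting on a text field: needs fielddata, which eats heap. Sort on a keyword subfield instead.
Deep JSON structures mapped as nested at several levels: each nested object is indexed as a separate hidden Lucene document, so document counts balloon.
Using ES as primary store: it’s a search engine, expect re-indexing from the source of truth.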

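Heap: at most 50% of RAM, capped at about 31GB so the JVM keeps using compressed oops; the rest of RAM serves the filesystem cache.
Disk watermarks: low (85%, no new shards allocated to the node), high (90%, shards relocated away), flood stage (95%, indices with a shard on the node become read-only, deletes still allowed).
Snapshots to S3/GCS via the repository plugins (bundled by default since 8.0). Incremental.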
Reindex API for mapping changes — blue/green index swap via aliases.
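
The blue/green swap can be sketched as two request bodies: one for `POST /_reindex` and one for `POST /_aliases`. The helper names and the index/alias names in the usage note are hypothetical; the code only builds the bodies and does not talk to a cluster.

```python
def reindex_body(source_index, dest_index):
    """Body for POST /_reindex: copy docs into the index with the new mapping."""
    return {"source": {"index": source_index}, "dest": {"index": dest_index}}


def alias_swap_body(alias, old_index, new_index):
    """Body for POST /_aliases: both actions are applied atomically,
    so readers of the alias never see a moment with no index behind it."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }
```

A typical sequence would be: create `products_v2` with the new mapping, send `reindex_body("products_v1", "products_v2")`, then send `alias_swap_body("products", "products_v1", "products_v2")` so that clients querying `products` switch over without a config change.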