Elasticsearch — Basics

What it is

Distributed search & analytics engine. Built on Apache Lucene.
JSON over HTTP API. Schema-flexible: dynamic mapping infers field types, or you define explicit mappings.
Use cases: full-text search, log/metric aggregation (ELK), geo search, vector search, observability.

Core concepts

Cluster — set of nodes.
Node — single ES instance. Roles: master, data, ingest, ml, among others. Every node can coordinate requests; a node with no roles is coordinating-only.
Index — logical collection of documents (~ DB table).
Document — JSON record (~ row).
Shard — Lucene index. Primary + replicas.
Mapping — schema: field types, analyzers.

Inverted index

Core data structure for full-text search. Maps term → list of docs containing it.
Built per shard. Tokens come from analysis (analyzer).
Term dictionary + posting list. Posting lists can also store positions (for phrase queries) and offsets (to speed up highlighting), depending on the field's index_options.
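A toy version of the structure above, as a minimal sketch: it is plain Python, not Lucene, and the "analysis" step is just lowercase plus whitespace splitting. The function names and sample documents are made up for illustration. It shows why stored positions make phrase matching possible.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]} (a toy posting list)."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Stand-in for real analysis: lowercase and split on whitespace.
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index

def phrase_match(index, phrase):
    """Return IDs of docs where the phrase terms sit at consecutive positions."""
    terms = phrase.lower().split()
    if not terms or any(t not in index for t in terms):
        return set()
    # Only docs containing every term can match.
    candidates = set(index[terms[0]])
    for t in terms[1:]:
        candidates &= set(index[t])
    hits = set()
    for doc_id in candidates:
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits

# Hypothetical sample documents.
docs = {
    1: "Villa for rent in Dubai",
    2: "Dubai apartment for sale",
    3: "Rent a car in Dubai",
}
index = build_inverted_index(docs)
print(sorted(index["dubai"]))            # doc IDs containing the term
print(phrase_match(index, "in dubai"))   # docs containing the exact phrase
```

A term lookup only needs the doc IDs; the phrase lookup is the part that relies on positions.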

Analyzer

Pipeline: char filters → tokenizer → token filters.

Char filter: strip HTML, replace patterns.
Tokenizer: splits text — standard, whitespace, keyword, pattern, ngram, edge_ngram.
Token filter: lowercase, stop, stemmer, synonym, asciifolding, ngram.

Standard analyzer: tokenize on word boundaries, lowercase, no stemming.
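The three pipeline stages can be imitated in a few lines of Python. This is a rough sketch, not the real analyzers: the regexes, the stop-word set, and all function names are assumptions chosen for the example.

```python
import re

def strip_html(text):
    # Char filter: replace anything that looks like an HTML tag with a space.
    return re.sub(r"<[^>]+>", " ", text)

def word_tokenizer(text):
    # Tokenizer: crude stand-in for splitting on word boundaries.
    return re.findall(r"\w+", text)

def lowercase(tokens):
    # Token filter: normalize case.
    return [t.lower() for t in tokens]

def remove_stopwords(tokens, stopwords=frozenset({"the", "a", "in", "for"})):
    # Token filter: drop common words (hypothetical stop list).
    return [t for t in tokens if t not in stopwords]

def analyze(text, char_filters, tokenizer, token_filters):
    """Run char filters, then the tokenizer, then token filters, in order."""
    for char_filter in char_filters:
        text = char_filter(text)
    tokens = tokenizer(text)
    for token_filter in token_filters:
        tokens = token_filter(tokens)
    return tokens

custom = analyze(
    "<b>Flats</b> for Rent in the Marina",
    [strip_html], word_tokenizer, [lowercase, remove_stopwords],
)
# Roughly "standard"-like behavior: split on words, lowercase, no stemming.
standard_like = analyze("Quick-Brown FOXES", [], word_tokenizer, [lowercase])
print(custom, standard_like)
```

Note that "foxes" stays plural in the second call because no stemmer is in the filter chain. On a real cluster, the `_analyze` API shows the tokens an analyzer actually produces.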

Field types (mapping)

text — analyzed, tokenized, full-text searchable, NOT sortable/aggregable directly.
keyword — exact value, sortable, aggregable, used for filters.
integer/long/short/byte/float/double/scaled_float.
date — ISO8601 or epoch.
boolean, geo_point, geo_shape, ip, binary.
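nested — array of objects, each indexed as a separate hidden document; query with the nested query so conditions apply to one object at a time.
object — default for JSON objects. Arrays of objects are flattened into multi-valued fields, so the link between fields of the same object is lost.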
dense_vector, sparse_vector — for KNN/vector search.
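The object vs nested difference is easiest to see with a false positive. The sketch below imitates the flattening in plain Python; the `reviews` data and the helper names are hypothetical, and the real matching happens inside Lucene, not like this.

```python
def flatten_object_array(field, objects):
    """Mimic 'object' indexing: one multi-valued field per key."""
    flat = {}
    for obj in objects:
        for key, value in obj.items():
            flat.setdefault(f"{field}.{key}", []).append(value)
    return flat

def object_match(flat, field, conditions):
    # 'object': each condition is checked against the pooled values.
    return all(v in flat.get(f"{field}.{k}", []) for k, v in conditions.items())

def nested_match(objects, conditions):
    # 'nested': all conditions must hold inside the same inner object.
    return any(all(o.get(k) == v for k, v in conditions.items()) for o in objects)

reviews = [
    {"author": "alice", "stars": 5},
    {"author": "bob", "stars": 1},
]
flat = flatten_object_array("reviews", reviews)
print(flat)

# "A 5-star review by bob" does not exist, yet the flattened form matches it.
wanted = {"author": "bob", "stars": 5}
print(object_match(flat, "reviews", wanted))   # True: false positive
print(nested_match(reviews, wanted))           # False
```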

Dual-mapping pattern (multi-fields):

{ "title": { "type":"text", "fields": { "raw": { "type":"keyword" } } } }
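A sketch of how the multi-field is typically used: search on the analyzed `title`, sort and aggregate on the exact `title.raw`. The bodies are plain Python dicts to send with whatever HTTP client you use. The aggregation name `by_title` and the query text are assumptions for the example.

```python
import json

# Body for index creation (PUT /<index>): text field with a keyword sub-field.
mapping = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            }
        }
    }
}

# Body for a search: full-text match on the text field,
# sort and terms aggregation on the keyword sub-field.
search_body = {
    "query": {"match": {"title": "rental dubai"}},
    "sort": [{"title.raw": "asc"}],
    "aggs": {"by_title": {"terms": {"field": "title.raw"}}},
}

print(json.dumps(mapping, indent=2))
print(json.dumps(search_body, indent=2))
```

Sorting or aggregating on `title` itself would be rejected by default, because text fields do not have doc values.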

Query DSL

Match — analyzed query (full-text).
Term — exact value (use on keyword/numeric, not text).
Range — gte/lte.
Bool — combine with must (AND, scoring), should (OR, scoring), filter (AND, no scoring, cacheable), must_not (NOT, no scoring).
Multi-match — query across multiple fields with weight (title^3).
Function score / Rank features — boost based on numeric fields, decay, script.
kNN — approximate nearest-neighbor search over dense_vector fields (since 8.0).

{
  "query": {
    "bool": {
      "must": { "match": { "title": "rental dubai" } },
      "filter": [
        { "term": { "status": "active" } },
        { "range": { "price": { "lte": 5000 } } }
      ]
    }
  }
}
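Bool queries are often assembled in application code. The helper below is one possible sketch, not part of any client library: `bool_query` is a made-up name, and the `description` and `flagged` fields are hypothetical. It combines a weighted multi_match with the filters from the example above.

```python
import json

def bool_query(must=(), should=(), filter_=(), must_not=()):
    """Assemble a bool query body, omitting empty clause lists."""
    clauses = {
        "must": list(must),
        "should": list(should),
        "filter": list(filter_),
        "must_not": list(must_not),
    }
    return {"query": {"bool": {k: v for k, v in clauses.items() if v}}}

body = bool_query(
    # Scoring clause: title matches count three times as much.
    must=[{"multi_match": {"query": "rental dubai",
                           "fields": ["title^3", "description"]}}],
    # Non-scoring clauses: only include or exclude documents.
    filter_=[
        {"term": {"status": "active"}},
        {"range": {"price": {"lte": 5000}}},
    ],
    must_not=[{"term": {"flagged": True}}],
)
print(json.dumps(body, indent=2))
```

Putting yes/no conditions in filter rather than must keeps them out of the relevance score and lets Elasticsearch cache them.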

Aggregations

Bucket — group: terms, range, date_histogram, histogram, geohash_grid.
Metric — compute: avg, sum, min, max, cardinality, percentiles, stats.
Pipeline — agg over agg results (moving_fn, derivative, cumulative_sum). The older moving_avg was removed in 8.0.
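The three kinds of aggregation layer on top of each other. The sketch below first imitates that layering in plain Python on made-up documents, then shows a request body with the same shape. The `created_at` field and the aggregation names are assumptions for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical documents, already reduced to a month key and a price.
docs = [
    {"month": "2024-01", "price": 100},
    {"month": "2024-01", "price": 300},
    {"month": "2024-02", "price": 500},
    {"month": "2024-03", "price": 400},
]

# Bucket: group documents by key (like date_histogram or terms).
buckets = defaultdict(list)
for doc in docs:
    buckets[doc["month"]].append(doc["price"])

# Metric: one number per bucket (like avg).
avg_price = {month: mean(prices) for month, prices in sorted(buckets.items())}

# Pipeline: computed from bucket outputs, not from documents
# (like derivative: difference from the previous bucket).
months = list(avg_price)
derivative = {cur: avg_price[cur] - avg_price[prev]
              for prev, cur in zip(months, months[1:])}

# The same shape as a search request body.
request_body = {
    "size": 0,  # return aggregation results only, no hits
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "created_at", "calendar_interval": "month"},
            "aggs": {
                "avg_price": {"avg": {"field": "price"}},
                "price_change": {"derivative": {"buckets_path": "avg_price"}},
            },
        }
    },
}
print(avg_price, derivative)
```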

Cluster basics

Primary shard stores data; replicas for HA + read scaling.
Default: 1 primary, 1 replica (per index, since 7.x).
Allocation: master node assigns shards. Use cluster.routing.allocation.awareness for rack/zone awareness.
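Refresh: the in-memory indexing buffer is written to a new searchable segment every 1s by default (index.refresh_interval). With the default setting, shards that have not been searched recently skip periodic refreshes. Near-real-time, not real-time.
Translog: write-ahead log that keeps operations durable between flushes (Lucene commits); replayed on recovery.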
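A toy model of the write path, to make "near-real-time" concrete. This is a deliberately simplified sketch: `ToyShard` and its methods are invented for illustration, and a real shard has far more moving parts (segment merging, fsync policy, replicas).

```python
class ToyShard:
    def __init__(self):
        self.buffer = []     # indexed but not yet searchable
        self.segments = []   # searchable segments
        self.translog = []   # operations recorded since the last flush

    def index(self, doc):
        # Every write is logged for durability and buffered for the next refresh.
        self.translog.append(doc)
        self.buffer.append(doc)

    def refresh(self):
        # Turn the buffer into a new searchable segment.
        if self.buffer:
            self.segments.append(list(self.buffer))
            self.buffer.clear()

    def flush(self):
        # Commit segments; the logged operations are no longer needed for recovery.
        self.refresh()
        self.translog.clear()

    def search(self, predicate):
        # Searches only see refreshed segments, never the buffer.
        return [d for seg in self.segments for d in seg if predicate(d)]

shard = ToyShard()
shard.index({"id": 1, "name": "x"})
before_refresh = shard.search(lambda d: d["id"] == 1)   # not visible yet
shard.refresh()
after_refresh = shard.search(lambda d: d["id"] == 1)    # visible now
print(before_refresh, after_refresh, len(shard.translog))
```

The gap between `index` and `refresh` is the window in which a document exists but does not yet show up in searches.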

Common operations

PUT /products
PUT /products/_mapping { ... }
POST /products/_doc { "name":"x" }
PUT /products/_doc/123 { ... }
GET /products/_doc/123
POST /products/_update/123 { "doc": { "price": 99 } }
DELETE /products/_doc/123
POST /products/_search { "query": {...} }
POST /_bulk
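The `_bulk` body is newline-delimited JSON rather than a single JSON document: an action line, then (for index operations) a source line. A minimal sketch of building one; the helper name `bulk_index_body` and the sample document are assumptions.

```python
import json

def bulk_index_body(index, docs):
    """Build an NDJSON _bulk body: one action line plus one source line per doc."""
    lines = []
    for doc_id, source in docs.items():
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        lines.append(json.dumps(source))
    # The body must end with a newline.
    return "\n".join(lines) + "\n"

body = bulk_index_body("products", {"123": {"name": "x", "price": 99}})
print(body, end="")
```

Send it with `Content-Type: application/x-ndjson`, and check the `errors` flag in the response, since individual items can fail while the request as a whole succeeds.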

Tooling

Kibana — UI, dashboards, dev tools.
Logstash, Beats (Filebeat, Metricbeat) — ingestion.
Ingest pipelines (in ES) — lightweight transforms applied to documents at index time.
ILM (Index Lifecycle Management) — hot/warm/cold/frozen tiers + delete.
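For a sense of what an ILM policy looks like, here is a small sketch of a policy body as a Python dict, suitable for `PUT _ilm/policy/<name>`. The thresholds (7d, 50gb, 30d, 90d) are arbitrary example values, not recommendations.

```python
import json

ilm_policy = {
    "policy": {
        "phases": {
            # Hot: actively written; roll over to a new index by age or size.
            "hot": {
                "actions": {
                    "rollover": {"max_age": "7d", "max_primary_shard_size": "50gb"}
                }
            },
            # Warm: no longer written; merge down to fewer segments.
            "warm": {
                "min_age": "30d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            # Delete: drop the index once it is old enough.
            "delete": {"min_age": "90d", "actions": {"delete": {}}},
        }
    }
}
print(json.dumps(ilm_policy, indent=2))
```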