Elasticsearch Shards & Apache Lucene
3 min readAug 25, 2024
Apache Lucene
- High performance text search engine library.
- Written in Java.
- Open-source.
1. How Lucene stores data for very fast queries and aggregation?
- It maintains inverted index, which is a data structure that tracks which documents contains certains values.
- According to this inverted index, when you query for documents containing the word ‘name’, instead of reading in every documents, check the inverted index and get all the documents that contain the word.
- Lucene has all the information, how many documents match(Frequency column) and which specific documents(Document Id column).
2. Lucene Analyzer
- Standard Analyzer- removes stop words, lowercases, tokenizes, recognizes emails and URLs.
- Simple Analyzer- lowercase and tokenizes
- Stop Analyzer- lowercases, tokenizes, splits by non letter characters, remove stop words.
- White space Analyzer- Splits by white space character.
- Keyword Analyzer- Entire sentence is a single token.
- Language Analyzer- Understands English, French and Spanish
Elasticsearch Shards
1. Primary Shards
- Scaling Lucene horizontally.
- Shard is a Lucene index defined and stored within a node.(being an instance of Lucene)
- More Shards means more Lucene in same node. It means it consumes more system resource.
- Elasticsearch adds more shards to scale appropriately and high performance.
2. Replication Shards
- First data is written to the primary shards and then replicated to one or more replica shards.
- Replica shards can never live on the same node as their primary shards.
Example- Here I am using one node docker cluster.
GET https://internal-dev.pageroonline.com:14443/elasticsearch/_cluster/allocation/explain?filter_path=index,node_allocation_decisions.node_name,node_allocation_decisions.deciders.*
{
"index": "test-000001",
"shard": 0,
"primary": false
}
Result-
Replica shard is unassigned because Elasticsearch is not allowed to create replica shard in the same node.
{
"index": "test-000001",
"node_allocation_decisions": [
{
"node_name": "elasticsearch-81",
"deciders": [
{
"decider": "same_shard",
"decision": "NO",
"explanation": "a copy of this shard is already allocated to this node [[test-000001][0], node[FqEoVvqYRrCHFtgaT5LC6Q], [P], s[STARTED], a[id=kBrMU1W-TgGLunVYj-Z7Xg], failed_attempts[0]]"
}
]
}
]
}
- Replica shards can accept only read requests.
- Will always be completely up to date with the original primary shards.
- We can add additional replica shards into the existing index. Therefore increasing the read bandwidth.
- Replica shard is promoted as primary shard when the primary shard is failed. It will handle by Elasticsearch.
Elasticsearch ensures fault tolerance by running multiple nodes and using replica shards.
Index
- a group of shards where each shard is holding part of the same dataset.
The cluster shard limit defaults to 1000 shards per non-frozen data node for normal (non-frozen) indices.
3000 shards per frozen data node for frozen indices.
Both primary and replica shards of all open indices count toward the limit, including unassigned shards.
References
https://www.elastic.co/guide/en/elasticsearch/reference/current/misc-cluster-settings.html#cluster-shard-limit
https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html