Elasticsearch – Elastic Stack Tutorial (Part 2)

Elasticsearch

Elasticsearch is a powerful search engine written in Java, with clients available for most major languages. Data is stored as JSON documents and can be searched quickly and easily via an HTTP interface. Elasticsearch is often run together with Logstash, which collects and processes logs, and Kibana, which visualizes the data. Together these are referred to as the Elastic Stack, and Elasticsearch’s role in the stack is to store the data and make it searchable.

This is the second part of the Elastic Stack tutorial. If you haven’t read the first article about Logstash yet, I recommend doing that before proceeding.

What is Elasticsearch and why is it used?

Elasticsearch is a very popular search engine used by many enterprises. At the time of writing, the project is released as open source under the Apache License, which means that it is free to use and easy to get started with.

The main advantage of using Elasticsearch is its speed. It can process search queries in near real time, and its write speeds are good too. Elasticsearch is also easy to scale: it is a distributed system that scales horizontally, so you simply add or remove nodes in the cluster as needed. Both the speed and the flexible scalability are made possible by sharding.

Shard

Documents are stored in indexes. For example, you could have one index for customer data, one for logs and one for events related to reporting. When you create an index, you decide how many primary shards it should have. A shard can be thought of as a subindex, an index inside an index. Splitting an index into shards lets your cluster scale horizontally, because the shards can be placed on different nodes. It also lets the work be spread across multiple shards simultaneously, which increases performance considerably.

Elasticsearch cluster with sharding

The number of shards is set when the index is created, and the only way to change it is to recreate the index. It is therefore very important to consider up front that you might have to scale up in the future.
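To make the point about choosing the shard count at creation time concrete, here is a minimal Python sketch of an index-creation request. It only builds the request and prints it, so it runs without a cluster. The index name `customers`, the helper name and the `localhost:9200` address are assumptions for illustration. The `number_of_shards` setting is the one Elasticsearch documents for the create index API.

```python
import json
import urllib.request


def build_create_index_request(base_url, index_name, primary_shards):
    """Build (but do not send) a PUT request that creates an index.

    The shard count is part of the index settings at creation time.
    """
    body = {"settings": {"index": {"number_of_shards": primary_shards}}}
    return urllib.request.Request(
        url=f"{base_url}/{index_name}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical index name and local node address, for illustration only.
request = build_create_index_request("http://localhost:9200", "customers", 5)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))

# To actually create the index against a running node, send the request:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Keeping the request construction separate from the sending makes it easy to inspect what would be sent before touching a real cluster.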

How many shards should I assign per Elasticsearch index?

Let’s say that we created an index with only one primary shard, and later we realize that we have to scale to two nodes. Scaling to two nodes will have no effect at all, because only one primary shard exists for the index. If we had assigned two primary shards, Elasticsearch would automatically move one of the shards to the second node.

That sounds great, so you might think that it is a good idea to allocate many shards for an index. That is not always the case, though. Each shard is a Lucene index, which uses memory and CPU cycles. Additionally, when performing a search query, Elasticsearch needs to query all the shards. This is perfectly fine if the shards live on separate nodes, but if too many of them run on the same node, they compete for the same resources. Also, spreading the data across too many shards, so that each shard contains very little data, leads to poor performance.
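The scaling argument above can be illustrated with a small Python sketch. It is a deliberately simplified model that hands out shards to nodes round-robin. It is not how Elasticsearch’s real allocator decides placement, and the function name is made up for this example, but it shows why a second node is idle when only one primary shard exists.

```python
def spread_shards(primary_shards, nodes):
    """Assign shard numbers to node numbers round-robin.

    Simplified model for illustration only; the real allocator
    takes many more factors into account.
    """
    allocation = {node: [] for node in range(nodes)}
    for shard in range(primary_shards):
        allocation[shard % nodes].append(shard)
    return allocation


# One primary shard, two nodes: the second node has nothing to do.
print(spread_shards(1, 2))  # {0: [0], 1: []}

# Two primary shards, two nodes: each node holds one shard.
print(spread_shards(2, 2))  # {0: [0], 1: [1]}

# Five primary shards, three nodes: every node holds at least one shard.
print(spread_shards(5, 3))  # {0: [0, 3], 1: [1, 4], 2: [2]}
```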

The default number of shards per index is 5 (in Elasticsearch versions before 7.0; from 7.0 onward the default is 1), and 5 is an OK value in my opinion. It is fine to run on a single node, and at the same time it makes it possible to scale to 5 nodes and take full advantage of them. Additionally, even though it might seem like a hassle to reindex the documents, you are unlikely to suddenly decide to go from three nodes to, for example, 30. In that rare case, you might as well reindex the documents with more shards assigned to the index. However, if you think that there is even a small chance that you will need to scale up at some point, it is better to allocate a few extra shards so that you don’t end up painted into a corner.

Viewing data

If you followed the previous article about Logstash you should have been able to send data to Elasticsearch via Logstash. But how do we view the data? Well, Elasticsearch doesn’t actually ship with a native GUI. Instead, it provides a very robust REST interface for viewing and querying data. An example of a query can be seen below.

POST http://localhost:9200/INDEX_NAME/_search

The query above matches all documents in the index; by default, only the first 10 hits are returned. If you don’t know or remember your index name, you can also query for that.

GET http://localhost:9200/_cat/indices

For more information on the provided REST interface, see the official Elasticsearch documentation.
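The two requests above can also be issued from code. The following is a minimal Python sketch using only the standard library. The index name `my-index`, the helper names and the `localhost:9200` address are assumptions for illustration. The explicit `match_all` query body is optional and is only included to show where a query would go.

```python
import json
import urllib.request

# Assumed address of a local single-node setup.
BASE_URL = "http://localhost:9200"


def build_search_request(index_name):
    """Build a POST request that searches an index with a match_all query."""
    body = {"query": {"match_all": {}}}
    return urllib.request.Request(
        url=f"{BASE_URL}/{index_name}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_list_indices_request():
    """Build a GET request that lists the indexes in the cluster."""
    return urllib.request.Request(url=f"{BASE_URL}/_cat/indices", method="GET")


def send(request):
    """Send a request to a running node and return the response body as text."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Hypothetical index name, for illustration only.
search_request = build_search_request("my-index")
indices_request = build_list_indices_request()
print(search_request.get_method(), search_request.full_url)
print(indices_request.get_method(), indices_request.full_url)

# With Elasticsearch running locally:
# print(send(indices_request))
# print(send(search_request))
```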

Even though the REST API is very robust, it can be nice to have access to a GUI sometimes, especially for viewing and checking the status of the cluster. There are a couple of third-party GUIs developed and provided by the community. Popular options include cerebro, Elastic HQ and elasticsearch-head. Make sure you choose a GUI that works with the version of Elasticsearch that you are running. In earlier Elasticsearch versions, for example 2.x, it was popular to install GUIs as plugins in the Elasticsearch installation. Lately, however, it has become far more popular to run the GUI as a standalone front-end server, while Elasticsearch remains strictly a backend, as it was meant to be.

Final words

We have gone through some of the basics of how Elasticsearch works under the hood, and how to configure it appropriately in terms of indexes and shards. Sharding is hard to optimize perfectly because you can’t always estimate beforehand how much you will need to scale. You don’t want to allocate too few shards, because if you decide to scale up by adding new nodes, these nodes will be more or less useless: there are no free shards to assign to them. On the other hand, you don’t want to go to the extreme with shards either, since each shard is a Lucene index under the hood, which requires resources. The good news is that you can assign a few extra shards. If you think that you might need to scale to 5 nodes for the holidays but normally only need 3, it is perfectly fine to run with 5 shards all the time. Re-indexing with more shards is also always an option, even though you most likely don’t want to do that too often.

In the next article, we are going to look at Kibana and how we can use it to visualize our data.

Thanks for reading!
