In this presentation, we are going to discuss how Elasticsearch handles the various operations like insert, update, delete. We would also cover what is an inverted index and how segment merging works.
Click through for the slides; they helped me firm up a few thoughts I had about Elasticsearch.
Emily Chang has a big, four-part series on monitoring Elasticsearch performance. Part 1 is a nice introduction to Elasticsearch and important metrics out of the box:
The three most common types of nodes in Elasticsearch are:
Master-eligible nodes: By default, every node is master-eligible unless otherwise specified. Each cluster automatically elects a master node from all of the master-eligible nodes. In the event that the current master node experiences a failure (such as a power outage, hardware failure, or an out-of-memory error), master-eligible nodes elect a new master. The master node is responsible for coordinating cluster tasks like distributing shards across nodes, and creating and deleting indices. Any master-eligible node is also able to function as a data node. However, in larger clusters, users may launch dedicated master-eligible nodes that do not store any data (by adding
node.data: falseto the config file), in order to improve reliability. In high-usage environments, moving the master role away from data nodes helps ensure that there will always be enough resources allocated to tasks that only master-eligible nodes can handle.
Data nodes: By default, every node is a data node that stores data in the form of shards (more about that in the section below) and performs actions related to indexing, searching, and aggregating data. In larger clusters, you may choose to create dedicated data nodes by adding
node.master: falseto the config file, ensuring that these nodes have enough resources to handle data-related requests without the additional workload of cluster-related administrative tasks.
Client nodes: If you set
node.datato false, you will end up with a client node, which is designed to act as a load balancer that helps route indexing and search requests. Client nodes help shoulder some of the search workload so that data and master-eligible nodes can focus on their core tasks. Depending on your use case, client nodes may not be necessary because data nodes are able to handle request routing on their own. However, adding client nodes to your cluster makes sense if your search/index workload is heavy enough to benefit from having dedicated client nodes to help route requests.
The Node Stats API is a powerful tool that provides access to nearly every metric from Part 1, with the exception of overall cluster health and pending tasks, which are only available via the Cluster Health API and the Pending Tasks API, respectively. The command to query the Node Stats API is:
The output includes very detailed information about every node running in your cluster. You can also query a specific node by specifying the ID, address, name, or attribute of the node. In the command below, we are querying two nodes by their names, node1 and node2 (
node.namein each node’s configuration file):
Each node’s metrics are divided into several sections, listed here along with the metrics they contain from Part 1.
The Datadog Agent is open source software that collects and reports metrics from each of your nodes, so you can view and monitor them in one place. Installing the Agent usually only takes a single command. View installation instructions for various platforms here. You can also install the Agent automatically with configuration management tools like Chef orPuppet.
How to solve 5 Elasticsearch performance and scaling problems
This post is the final part of a 4-part series on monitoring Elasticsearch performance. Part 1 provides an overview of Elasticsearch and its key performance metrics, Part 2 explains how to collect these metrics, and Part 3 describes how to monitor Elasticsearch with Datadog.
Like a car, Elasticsearch was designed to allow its users to get up and running quickly, without having to understand all of its inner workings. However, it’s only a matter of time before you run into engine trouble here or there. This article will walk through five common Elasticsearch challenges, and how to deal with them.
If you recall from Part 1, cluster status is reported as red if one or more primary shards (and its replicas) is missing, and yellow if one or more replica shards is missing. Normally, this happens when a node drops off the cluster for whatever reason (hardware failure, long garbage collection time, etc.). Once the node recovers, its shards will remain in an initializing state before they transition back to active status.
The number of initializing shards typically peaks when a node rejoins the cluster, and then drops back down as the shards transition into an active state, as shown in the graph below.
During this initialization period, your cluster state may transition from green to yellow or red until the shards on the recovering node regain active status. In many cases, a brief status change to yellow or red may not require any action on your part.
However, if you notice that your cluster status is lingering in red or yellow state for an extended period of time, verify that the cluster is recognizing the correct number of Elasticsearch nodes, either by consulting Datadog’s dashboard or by querying the Cluster Health API detailed in Part 2.
If the number of active nodes is lower than expected, it means that at least one of your nodes lost its connection and hasn’t been able to rejoin the cluster. To find out which node(s) left the cluster, check the logs (located by default in the
logs folder of your Elasticsearch home directory) for a line similar to the following:
[TIMESTAMP] ... Cluster health status changed from [GREEN] to RED
Reasons for node failure can vary, ranging from hardware or hypervisor failures, to out-of-memory errors. Check any of the monitoring tools outlined here for unusual changes in performance metrics that may have occurred around the same time the node failed, such as a sudden spike in the current rate of search or indexing requests. Once you have an idea of what may have happened, if it is a temporary failure, you can try to get the disconnected node(s) to recover and rejoin the cluster. If it is a permanent failure, and you are not able to recover the node, you can add new nodes and let Elasticsearch take care of recovering from any available replica shards; replica shards can be promoted to primary shards and redistributed on the new nodes you just added.
However, if you lost both the primary and replica copy of a shard, you can try to recover as much of the missing data as possible by using Elasticsearch’s snapshot and restore module. If you’re not already familiar with this module, it can be used to store snapshots of indices over time in a remote repository for backup purposes.
If all of your data nodes are running low on disk space, you will need to add more data nodes to your cluster. You will also need to make sure that your indices have enough primary shards to be able to balance their data across all those nodes.
However, if only certain nodes are running out of disk space, this is usually a sign that you initialized an index with too few shards. If an index is composed of a few very large shards, it’s hard for Elasticsearch to distribute these shards across nodes in a balanced manner.
This is the most thorough look at Elasticsearch internals that I’ve seen (although admittedly that’s not something I’m usually on the lookout for).
The most popular options for securing Elasticsearch and Kibana are compared in the table below.
Shield is a security plugin developed by the same company that developed Elasticsearch. It allows you to easily protect this data with a username and password while simplifying your architecture. Advanced security features like encryption, role-based access control, IP filtering, and auditing are also available when you need them.
NGINX is an open source web server. It can act as a proxy server and can do load balancing, among other things. In combination with LUA and external scripts, it can be used for securing Elasticsearch and Kibana. We will be using this approach in this tutorial.
Searchguard is an open source alternative for Shield. It provides almost all the same functionalities as Shield, except for some features like LDAP authentication. However, these features are available in the paid variant.
Click through for a detailed NGINX setup.
Building a database on Elasticsearch is something of a double-edged sword. Crate has been able to focus on hard problems like query planning, joins, aggregations, and so on–without having to take on the tough work of building a storage layer, cluster membership, replication algorithm, etc. However, Crate is tightly coupled to Elasticsearch, and is dependent on the Elastic team for improvements to that technology. Elasticsearch’s consistency issues have been well-known for years, and the process to fix them is still ongoing. It’s not clear what Crate can do to get out of this situation: a rewrite would be complex and expensive (and introduce new and unknown failure modes), whereas fixing Elasticsearch’s consistency problems could easily consume person-years of engineering time that a small company can ill-afford.
There are good reasons to use Crate: distributed SQL stores, especially with Crate’s capacity for aggregations and joins, are hard to come by. Moreover, Crate introduces several helpful features not present in Elasticsearch. That said, the risk of data loss is real, and is unlikely to be resolved at any point in the near future. I recommend that Crate users avoid using Crate as their system of record–at least, where each record matters. Like Elasticsearch itself, you should use a safer database as your primary store, and continuously backfill data from that primary store into Crate for querying. Crate may also be suitable for cases where occasional data loss or corruption does is mostly harmless, e.g. high-volume sensor data, observability, analytics, etc.
Every time the Jepsen series gets updated, I make time to read.
When you are looking for a node or a cluster, it is a good idea to have a name which describes what you will need to find and separates one from another.
The default cluster name of “elasticsearch ” could allow any nodes to join the cluster, even if this was not intended. Thus, it is a good idea to give the cluster a distinct identifier instead.
There are a few things to think about in this post.
Kibana is the natural UI choice for partnering Elasticsearch, and it has the advantage of being Web-based and Dockerized, so it’s cross-platform and easy to share. But PowerBI is a lot more powerful, and the multitude of available connectors mean it’s easy to build a single dashboard which pulls data from multiple sources.
Using Elasticsearch for one of those sources is simple, although it will need some custom work to query your indexes and navigate the documents to get the field you want. You can even publish your reports to PowerBI in the cloud and limit access using Azure Active Directory – which gives you a nice, integrated security story.
I tend to be very hard on Kibana, particularly because it makes the easy stuff easy and the hard stuff impossible, so I think that this is an interesting alternative to Kibana.
On the top level hits.total is the total number of the docs using an empty search query, and max_score is the maximum score a document can take in a specific query. In our case it’s one, since no query was specified.
In __shards.total_ the value is the number of Lucene indexes that Elasticsearch created for that index. The default number is always 5 unless we specify otherwise on index creation time. More details about shards are explained here.
ElasticSearch is designed to store things like logs and monitoring metrics, and the interface is JSON. This makes it very useful for certain tasks and infuriatingly difficult to do other things (like advanced aggregations). Still, in a medium-sized or larger environment, this is probably a technology you either are using today or want to use.