Monitoring Elasticsearch Performance

Emily Chang has a big, four-part series on monitoring Elasticsearch performance.  Part 1 is a nice introduction to Elasticsearch and important metrics out of the box:

The three most common types of nodes in Elasticsearch are:

  • Master-eligible nodes: By default, every node is master-eligible unless otherwise specified. Each cluster automatically elects a master node from all of the master-eligible nodes. In the event that the current master node experiences a failure (such as a power outage, hardware failure, or an out-of-memory error), master-eligible nodes elect a new master. The master node is responsible for coordinating cluster tasks like distributing shards across nodes, and creating and deleting indices. Any master-eligible node is also able to function as a data node. However, in larger clusters, users may launch dedicated master-eligible nodes that do not store any data (by adding node.data: false to the config file), in order to improve reliability. In high-usage environments, moving the master role away from data nodes helps ensure that there will always be enough resources allocated to tasks that only master-eligible nodes can handle.

  • Data nodes: By default, every node is a data node that stores data in the form of shards (more about that in the section below) and performs actions related to indexing, searching, and aggregating data. In larger clusters, you may choose to create dedicated data nodes by adding node.master: false to the config file, ensuring that these nodes have enough resources to handle data-related requests without the additional workload of cluster-related administrative tasks.

  • Client nodes: If you set node.master and node.data to false, you will end up with a client node, which is designed to act as a load balancer that helps route indexing and search requests. Client nodes help shoulder some of the search workload so that data and master-eligible nodes can focus on their core tasks. Depending on your use case, client nodes may not be necessary because data nodes are able to handle request routing on their own. However, adding client nodes to your cluster makes sense if your search/index workload is heavy enough to benefit from having dedicated client nodes to help route requests.
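The three bullets above boil down to two boolean settings. As a rough sketch (the function name node_role and the role labels are made up for illustration; node.master and node.data are the settings named in the excerpt, and newer Elasticsearch releases configure roles differently), the mapping looks like this:

```python
def node_role(master: bool = True, data: bool = True) -> str:
    """Classify a node from its node.master / node.data settings.

    Both arguments default to True, mirroring the "by default" wording
    in the excerpt. The returned labels are illustrative only.
    """
    if master and data:
        return "master-eligible data node (default)"
    if master:
        return "dedicated master-eligible node"
    if data:
        return "dedicated data node"
    return "client node"


if __name__ == "__main__":
    for master, data in [(True, True), (True, False), (False, True), (False, False)]:
        print(f"node.master={master}, node.data={data} -> {node_role(master, data)}")
```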

Part 2 shows how to collect metrics using various APIs:

The Node Stats API is a powerful tool that provides access to nearly every metric from Part 1, with the exception of overall cluster health and pending tasks, which are only available via the Cluster Health API and the Pending Tasks API, respectively. The command to query the Node Stats API is:

curl localhost:9200/_nodes/stats

The output includes very detailed information about every node running in your cluster. You can also query a specific node by specifying the ID, address, name, or attribute of the node. In the command below, we are querying two nodes by their names, node1 and node2 (node.name in each node’s configuration file):

curl localhost:9200/_nodes/node1,node2/stats

Each node’s metrics are divided into several sections, listed here along with the metrics they contain from Part 1.
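If you would rather script these calls than type curl commands, something along these lines should work. It is a minimal sketch using only the Python standard library; the helper names node_stats_url and fetch_node_stats are hypothetical, and the host and port simply mirror the localhost:9200 used in the commands above.

```python
import json
from urllib.request import urlopen


def node_stats_url(nodes=(), host="localhost", port=9200):
    """Build a Node Stats API URL, optionally limited to specific nodes."""
    base = f"http://{host}:{port}/_nodes"
    if nodes:
        # Node names are comma-separated, as in the second curl command.
        return f"{base}/{','.join(nodes)}/stats"
    return f"{base}/stats"


def fetch_node_stats(nodes=(), host="localhost", port=9200, timeout=10):
    """Fetch and decode the JSON response (needs a reachable cluster)."""
    with urlopen(node_stats_url(nodes, host, port), timeout=timeout) as resp:
        return json.load(resp)


if __name__ == "__main__":
    print(node_stats_url())
    print(node_stats_url(["node1", "node2"]))
```

Only fetch_node_stats touches the network, so the URL builder can be checked without a running cluster.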

Part 3 is a brief for using Datadog for metrics collection and display:

The Datadog Agent is open source software that collects and reports metrics from each of your nodes, so you can view and monitor them in one place. Installing the Agent usually only takes a single command. View installation instructions for various platforms here. You can also install the Agent automatically with configuration management tools like Chef orPuppet.

Part 4 walks through some common Elasticsearch performance issues:

How to solve 5 Elasticsearch performance and scaling problems

This post is the final part of a 4-part series on monitoring Elasticsearch performance. Part 1 provides an overview of Elasticsearch and its key performance metrics, Part 2 explains how to collect these metrics, and Part 3 describes how to monitor Elasticsearch with Datadog.

Like a car, Elasticsearch was designed to allow its users to get up and running quickly, without having to understand all of its inner workings. However, it’s only a matter of time before you run into engine trouble here or there. This article will walk through five common Elasticsearch challenges, and how to deal with them.

Problem #1: My cluster status is red or yellow. What should I do?


If you recall from Part 1, cluster status is reported as red if one or more primary shards (and their replicas) are missing, and yellow if one or more replica shards are missing. Normally, this happens when a node drops off the cluster for whatever reason (hardware failure, long garbage collection time, etc.). Once the node recovers, its shards will remain in an initializing state before they transition back to active status.
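The red/yellow/green rule can be written down in a few lines. This is only a sketch of the rule as stated above, not how Elasticsearch computes it internally; the (is_primary, state) tuple layout is an assumption for illustration, and the state names are the shard states Elasticsearch reports (STARTED and RELOCATING are treated as active here).

```python
ACTIVE_STATES = {"STARTED", "RELOCATING"}


def cluster_status(shards):
    """Derive a status color from an iterable of (is_primary, state) pairs."""
    shards = list(shards)
    # Any primary that is not active makes the cluster red.
    if any(primary and state not in ACTIVE_STATES for primary, state in shards):
        return "red"
    # All primaries are active, but at least one replica is not: yellow.
    if any(not primary and state not in ACTIVE_STATES for primary, state in shards):
        return "yellow"
    return "green"


if __name__ == "__main__":
    print(cluster_status([(True, "STARTED"), (False, "INITIALIZING")]))
```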

The number of initializing shards typically peaks when a node rejoins the cluster, and then drops back down as the shards transition into an active state, as shown in the graph below.


During this initialization period, your cluster state may transition from green to yellow or red until the shards on the recovering node regain active status. In many cases, a brief status change to yellow or red may not require any action on your part.


However, if you notice that your cluster status is lingering in red or yellow state for an extended period of time, verify that the cluster is recognizing the correct number of Elasticsearch nodes, either by consulting Datadog’s dashboard or by querying the Cluster Health API detailed in Part 2.
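A check like the one described might look as follows. It is a sketch that works on an already-decoded Cluster Health API response; the function name health_findings and the expected_nodes parameter are hypothetical, while status, number_of_nodes, and initializing_shards are fields that the API returns.

```python
def health_findings(health: dict, expected_nodes: int) -> list:
    """Return human-readable findings for a decoded /_cluster/health response."""
    findings = []
    status = health.get("status")
    if status != "green":
        findings.append(f"cluster status is {status}")
    seen = health.get("number_of_nodes", 0)
    if seen < expected_nodes:
        findings.append(
            f"{expected_nodes - seen} node(s) missing: "
            f"expected {expected_nodes}, cluster sees {seen}"
        )
    initializing = health.get("initializing_shards", 0)
    if initializing > 0:
        findings.append(f"{initializing} shard(s) still initializing")
    return findings


if __name__ == "__main__":
    # Hand-written sample for illustration, not output from a real cluster.
    sample = {"status": "yellow", "number_of_nodes": 2, "initializing_shards": 3}
    for finding in health_findings(sample, expected_nodes=3):
        print(finding)
```

An empty list means the status is green, every expected node is present, and nothing is initializing.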


If the number of active nodes is lower than expected, it means that at least one of your nodes lost its connection and hasn’t been able to rejoin the cluster. To find out which node(s) left the cluster, check the logs (located by default in the logs folder of your Elasticsearch home directory) for a line similar to the following:

[TIMESTAMP] ... Cluster health status changed from [GREEN] to [RED]
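Rather than reading the log by eye, you could scan it for that message. The sketch below assumes the wording shown above and nothing more about the log layout; the function name status_changes is made up, and the brackets around the second status are treated as optional.

```python
import re

# Matches the status-change message; brackets around the new status are optional.
STATUS_CHANGE = re.compile(
    r"Cluster health status changed from \[(?P<old>\w+)\] to \[?(?P<new>\w+)\]?"
)


def status_changes(lines):
    """Yield (line_number, old_status, new_status) for each matching log line."""
    for number, line in enumerate(lines, start=1):
        match = STATUS_CHANGE.search(line)
        if match:
            yield number, match.group("old"), match.group("new")


if __name__ == "__main__":
    # Placeholder lines in the style of the excerpt, not real log output.
    sample = [
        "[TIMESTAMP] ... some unrelated message",
        "[TIMESTAMP] ... Cluster health status changed from [GREEN] to [RED]",
    ]
    for number, old, new in status_changes(sample):
        print(f"line {number}: {old} -> {new}")
```

To scan a real file, pass an open file handle, since iterating over it yields lines.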

Reasons for node failure can vary, ranging from hardware or hypervisor failures, to out-of-memory errors. Check any of the monitoring tools outlined here for unusual changes in performance metrics that may have occurred around the same time the node failed, such as a sudden spike in the current rate of search or indexing requests. Once you have an idea of what may have happened, if it is a temporary failure, you can try to get the disconnected node(s) to recover and rejoin the cluster. If it is a permanent failure, and you are not able to recover the node, you can add new nodes and let Elasticsearch take care of recovering from any available replica shards; replica shards can be promoted to primary shards and redistributed on the new nodes you just added.

However, if you lost both the primary and replica copy of a shard, you can try to recover as much of the missing data as possible by using Elasticsearch’s snapshot and restore module. If you’re not already familiar with this module, it can be used to store snapshots of indices over time in a remote repository for backup purposes.

Problem #2: Help! Data nodes are running out of disk space

If all of your data nodes are running low on disk space, you will need to add more data nodes to your cluster. You will also need to make sure that your indices have enough primary shards to be able to balance their data across all those nodes.

However, if only certain nodes are running out of disk space, this is usually a sign that you initialized an index with too few shards. If an index is composed of a few very large shards, it’s hard for Elasticsearch to distribute these shards across nodes in a balanced manner.
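The distinction between "all nodes" and "only certain nodes" is easy to automate once you have per-node disk usage (from the _cat/allocation API, for example). Here is a sketch; the function name, the verdict strings, and the 85 percent default threshold are assumptions chosen for illustration.

```python
def diagnose_disk(used_percent_by_node, threshold=85.0):
    """Classify disk pressure from a {node_name: percent_used} mapping.

    Returns (verdict, nodes_at_or_over_threshold).
    """
    low = sorted(
        node for node, used in used_percent_by_node.items() if used >= threshold
    )
    if not low:
        return "ok", low
    if len(low) == len(used_percent_by_node):
        # Every data node is short on space: add more data nodes.
        return "all_nodes_low", low
    # Only some nodes are affected: possibly too few, very large shards.
    return "some_nodes_low", low


if __name__ == "__main__":
    print(diagnose_disk({"node1": 91.0, "node2": 42.5, "node3": 40.1}))
```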

This is the most thorough look at Elasticsearch internals that I’ve seen (although admittedly that’s not something I’m usually on the lookout for).

Analyzing The Simpsons

Todd Schneider has a fun analysis of the Simpsons:

Per Wikipedia:

While later seasons would focus on Homer, Bart was the lead character in most of the first three seasons

I’ve heard this argument before, that the show was originally about Bart before switching its focus to Homer, but the actual scripts only seem to partially support it.

Bart accounted for a significantly larger share of the show’s dialogue in season 1 than in any future season, but Homer’s share has always been higher than Bart’s. Dialogue share might not tell the whole story about a character’s prominence, but the fact is that Homer has always been the most talkative character on the show.

My reading is that it took a couple seasons for show writers to realize that Homer is the funniest character and that Bart’s character was too context-sensitive to be consistently funny.  It took quite a bit more time before merchandisers figured that out, to the extent that they ever did.

Using Pester To Validate Script Installations

Rob Sewell wants to use Pester to guarantee that he has Ola’s maintenance scripts installed on a server:

First I thought about what I would look for in SSMS when I had installed the maintenance solution and made a list of the things that I would check which looked something like this. This would be the checklist you would create (or have already created) for yourself or a junior following this install. This is how easily you can turn that checklist into a Pester Test, remove the human element, and open your install for automated testing.
  • SQL Server Agent is running – Otherwise the jobs won’t run 🙂

  • We should have 4 backup jobs with a name of:

      • DatabaseBackup – SYSTEM_DATABASES – FULL

      • DatabaseBackup – USER_DATABASES – FULL

      • DatabaseBackup – USER_DATABASES – DIFF

      • DatabaseBackup – USER_DATABASES – LOG

  • We should have Integrity Check and Index Optimisation Jobs

  • We should have the clean up jobs

  • All jobs should be scheduled

  • All jobs should be enabled

  • The jobs should have succeeded
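Pester is a PowerShell framework, so Rob's actual tests are PowerShell. Purely to illustrate the checklist-as-assertions idea, here is a Python sketch covering the four backup jobs; the dictionary layout (name, enabled, scheduled, last_outcome) is hypothetical, and the job names are written with plain hyphens on the assumption that this is how they appear in SQL Server Agent.

```python
EXPECTED_BACKUP_JOBS = [
    "DatabaseBackup - SYSTEM_DATABASES - FULL",
    "DatabaseBackup - USER_DATABASES - FULL",
    "DatabaseBackup - USER_DATABASES - DIFF",
    "DatabaseBackup - USER_DATABASES - LOG",
]


def check_jobs(jobs):
    """Return a list of checklist failures for the expected backup jobs."""
    by_name = {job["name"]: job for job in jobs}
    failures = []
    for name in EXPECTED_BACKUP_JOBS:
        job = by_name.get(name)
        if job is None:
            failures.append(f"missing job: {name}")
            continue
        if not job.get("enabled"):
            failures.append(f"job not enabled: {name}")
        if not job.get("scheduled"):
            failures.append(f"job not scheduled: {name}")
        if job.get("last_outcome") != "Succeeded":
            failures.append(f"job did not succeed: {name}")
    return failures


if __name__ == "__main__":
    for failure in check_jobs([]):
        print(failure)
```

Each line of the checklist becomes one condition, and an empty result means the install passed.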

There’s a very nice script and walkthrough of the process if you click through.

Using Xgboost In Azure ML Studio

Koos van Strien wants to use the xgboost model in Azure ML Studio:

Because the high-level path of bringing trained R models from the local R environment towards the cloud Azure ML is almost identical to the Python one I showed two weeks ago, I use the same four steps to guide you through the process:

  1. Export the trained model

  2. Zip the exported files

  3. Upload to the Azure ML environment

  4. Embed in your Azure ML solution
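Step 2 is the only one that is tool-agnostic, so here is a small sketch of it using Python's standard zipfile module. Koos's post works in R; the function name and the file names in the usage example are hypothetical, and the upload and embed steps happen in Azure ML Studio itself and are not shown.

```python
import zipfile
from pathlib import Path


def zip_model_files(paths, archive_path):
    """Bundle exported model files into one flat zip archive."""
    archive_path = Path(archive_path)
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for path in map(Path, paths):
            # Store each file under its bare name so the archive has no folders.
            archive.write(path, arcname=path.name)
    return archive_path


if __name__ == "__main__":
    import tempfile

    with tempfile.TemporaryDirectory() as tmp:
        model = Path(tmp) / "model.rda"  # hypothetical exported model file
        model.write_bytes(b"placeholder")
        print(zip_model_files([model], Path(tmp) / "model.zip").name)
```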

Read the whole thing.

Integrating Spark With Hive

Rahul Kumar wants to write Scala code to access the Hive datastore:

Hello geeks, we have discussed how to start programming with Spark in Scala. In this blog, we will discuss how we can use Hive with Spark 2.0.

When you start to work with Hive, you need HiveContext (inherits SQLContext), core-site.xml, hdfs-site.xml, and hive-site.xml for Spark. In case you don’t configure hive-site.xml then the context automatically creates metastore_db in the current directory and creates the warehouse directory indicated by HiveConf (which defaults to /user/hive/warehouse).

Rahul has made his demo code available on GitHub.

Choosing A Data Platform

Lukas Eder discusses when to use a relational database versus some non-relational database:

This question obviously assumes that you’re starting out with an RDBMS, which is classically the database system that solves pretty much any problem decently enough not to be replaced easily. What does this mean? Simply put:

  • RDBMS have been around forever, so they have a huge advantage compared to “newcomers” in the market, who don’t have all the excellent tooling, community, support, maturity yet
  • E.F. Codd’s work may have been the single biggest influence on our whole industry. There has hardly been anything as revolutionary as the relational model ever since. It’s hard for an alternative database to be equally universal, i.e. they’re mostly solving niche problems

Having said so, sometimes you do have a niche problem. For instance a graph database problem. In fact, a graph is nothing fundamentally different from what you can represent in the relational model. It is easy to model a graph with a many-to-many relationship table.
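To make the last point concrete, here is a minimal sketch of a directed graph stored in two relational tables, using Python's built-in sqlite3 module. The table names (node, edge), the helper names, and the sample edges are all made up for illustration.

```python
import sqlite3


def build_graph(edges):
    """Create an in-memory database holding a directed graph.

    edges is a list of (from_name, to_name) tuples.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE node (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
        CREATE TABLE edge (
            from_id INTEGER NOT NULL REFERENCES node(id),
            to_id   INTEGER NOT NULL REFERENCES node(id),
            PRIMARY KEY (from_id, to_id)
        );
        """
    )
    names = sorted({name for edge in edges for name in edge})
    conn.executemany("INSERT INTO node (name) VALUES (?)", [(n,) for n in names])
    # The edge table is the many-to-many relationship between nodes.
    conn.executemany(
        """INSERT INTO edge (from_id, to_id)
           SELECT a.id, b.id FROM node a, node b
           WHERE a.name = ? AND b.name = ?""",
        edges,
    )
    return conn


def neighbors(conn, name):
    """Return the names of nodes directly reachable from the given node."""
    rows = conn.execute(
        """SELECT b.name FROM edge
           JOIN node a ON a.id = edge.from_id
           JOIN node b ON b.id = edge.to_id
           WHERE a.name = ? ORDER BY b.name""",
        (name,),
    )
    return [row[0] for row in rows]


if __name__ == "__main__":
    graph = build_graph([("a", "b"), ("a", "c"), ("b", "c")])
    print(neighbors(graph, "a"))
```

One-hop lookups like this are a simple join. Deeper traversals need recursive queries or repeated joins.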

If you want a checklist, here’s how I would approach this question (ceteris paribus and limiting myself to about 100 words):

  1. Are you dealing with streaming millions of rows per second, or streaming from tens of thousands of endpoints concurrently?  Kafka and the Hadoop streaming stack.
  2. Is your problem something that you’ve already solved with a relational database, and your solution works well enough?  Relational database.
  3. Are there multiple “paths” to get to interesting data?  Relational database.
  4. Shopping carts (write-heavy, focused on availability over consistency)?  Riak/Cassandra/Dynamo at large scale, else relational database.
  5. Type duplication?  Relational database.
  6. Petabytes of data being analyzed asynchronously?  Hadoop.
  7. Other data platforms if they fit specific niche requirements around data structure.

There’s a lot more to this discussion than a simple numbered list, but I think it’s reasonable to start with relational databases and move away if and only if there’s a compelling reason.

SQL Server Configuration Section On Azure VM

Jack Li diagnoses an issue in which the SQL Server Configuration section of an Azure Virtual Machine only appeared under certain circumstances:

If you created a SQL Server VM via the Azure portal, there will be a section called “SQL Server Configuration” which was introduced via blog “Introducing a simplified configuration experience for SQL Server in Azure Virtual Machines”. Here is a screenshot of that setting.  It allows you to configure various things like auto backup, patching or storage etc.

I got a customer who created a SQL VM via PowerShell.  But that VM doesn’t have the section “SQL Server Configuration”.   Using his PowerShell script, I was able to reproduce the behavior.  When I created via portal UI, I got the “SQL Server Configuration”.

Read on for the solution.

Bug Scripting Sequences

Daniel Janik notices a small bug in how Management Studio scripts out sequences:

Have you ever scripted an object from SQL Server Management Studio (SSMS)? It does a really good job. You get nice cleanly formatted scripts that start with USE statements to select the database. They even have some simple comments.

Have you ever written a Sequence? Turns out if you script one you’ll notice that Microsoft left you an extra surprise. Double USE statements. Does it matter much? No. Should they fix it? Yes. I noticed this behavior when sequences were first released and it still exists in the latest version of SSMS for SQL 2016 (13.0.1500.23) as of this posting.

Yeah, that’s a tiny bug, but I can see it being annoying.

