Grafana On Elasticsearch

Daniel Berman shows how to replace Kibana with Grafana:

While very similar in terms of what can be done with the data itself within the two tools. The main differences between Kibana and Grafana lie in configuring how the data is displayed. Grafana has richer display features and more options for playing around with how the data is represented in the graphs.

While it takes some time getting accustomed to building graphs in Grafana — especially if you’re coming from Kibana — the data displayed in Grafana dashboards can be read and analyzed more easily.

I prefer Grafana over Kibana for a few reasons, so I’m happy to see Grafana articles popping up.

Logstash Filters

Nicolas Frankel explains how the grok and dissect filters work in Logstash:

The Grok filter gets the job done. But it seems to suffer from performance issues, especially if the pattern doesn’t match. An alternative is to use the dissect filter instead, which is based on separators.

Unfortunately, there’s no app for that – but it’s much easier to write a separator-based filter than a regex-based one. The mapping equivalent to the above is:

%{timestamp} %{+timestamp} %{level}[%{application},%{traceId},%{spanId},%{zipkin}]\n
%{pid} %{}[%{thread}] %{class}:%{log}
(broken on 2 lines for better readability)

One of the big secrets to effective debugging of code is having good logging mechanisms in place.

Ways To Crash Elasticsearch

Roi Ravhon shows how to take down an Elasticsearch instance:

Cardinality aggregation is used to count distinct values in a data set. For example, if you want to know the number of IPs used in your system, you can use this aggregation on an IP field and then count the results.

Despite the usefulness, cardinality can also be a touchy Elasticsearch feature to use. Performing a unique count on a field with a multitude of possible values when configuring a visualization, for example, can bring Elasticsearch to a halt.

Most of it comes down to writing good queries.  But if you don’t know what good Elasticsearch queries look like, read on.

Elasticsearch 5.0

Itamar Syn-hershko looks at the new functionality in the latest version of Elasticsearch:

One fundamental feature of Elasticsearch is scoring – or results ranking by relevance. The part that handles it is a Lucene component called Similarity. ES 5.0 now makes Okapi BM25 the default similarity and that’s quite an important change. The default has long been tf/idf, which is both simpler to understand but easier to be fooled by rogue results. BM25 is a probabalistic approach to ranking that almost always gives better results than the more vanilla tf/idf. I’ve been recommending customers to use BM25 over tf/idf for a long time now, and we also rely on it at Forter for doing quite a lot of interesting stuff. Overall, a good move by ES and I can finally archive a year’s long advise. Britta Weber has a great talk on explaining the difference, and BM25 in particular, definitely a recommended watch.

This is one of several search-related features in the latest version.  Looks like a solid release.

Unassigned Shards In Elasticsearch

Emily Chang shows how to find and fix unassigned shards in Elasticsearch:

As nodes join and leave the cluster, the master node reassigns shards automatically, ensuring that multiple copies of a shard aren’t assigned to the same node. In other words, the master node will not assign a primary shard to the same node as its replica, nor will it assign two replicas of the same shard to the same node. A shard may linger in an unassigned state if there are not enough nodes to distribute the shards accordingly.

To avoid this issue, make sure that every index in your cluster is initialized with fewer replicas per primary shard than the number of nodes in your cluster by following the formula below:
N >= R + 1

Where N is the number of nodes in your cluster, and R is the largest shard replication factor across all indices in your cluster.

Read the whole thing if you’re an Elasticsearch administrator.

Regular Expressions In Lucene

Kendra Little looks at Azure Search searches:

I wanted to be able to find all architect jobs using something like ‘%rchit%’ as well, because there’s not a lot of great ways to do this in SQL Server.

In SQL Server, you can use a traditional B-Tree index to seek, but only based on the letters at the beginning of a character column.  If I want to know every business title that contains ‘%rchit%’, I’m going to have to scan an entire index.

SQL Server fulltext indexes don’t solve the double-wildcard problem, either. Fulltext indexes support word prefix searches– so a fulltext index would be great at finding all job titles that contain a word that starts with ‘Arch%’.

Sometimes that’s enough. But a lot of times, you do need to find a substring anywhere in a word. And sometimes you do want to offload that from your database.

This is the kind of problem Lucene (and its follow-up implementations, like Elasticsearch) was designed to solve.  Read on for more details as Kendra solves the problem in Azure Search.

Elasticsearch Write Operations

Kunal Kapoor has a presentation on Elasticsearch write operations (inserts, updates, and deletes) and explains what’s going on:

In this presentation, we are going to discuss how Elasticsearch handles the various operations like insert, update, delete. We would also cover what is an inverted index and how segment merging works.

Click through for the slides; they helped me firm up a few thoughts I had about Elasticsearch.

Monitoring Elasticsearch Performance

Emily Chang has a big, four-part series on monitoring Elasticsearch performance.  Part 1 is a nice introduction to Elasticsearch and important metrics out of the box:

The three most common types of nodes in Elasticsearch are:

  • Master-eligible nodes: By default, every node is master-eligible unless otherwise specified. Each cluster automatically elects a master node from all of the master-eligible nodes. In the event that the current master node experiences a failure (such as a power outage, hardware failure, or an out-of-memory error), master-eligible nodes elect a new master. The master node is responsible for coordinating cluster tasks like distributing shards across nodes, and creating and deleting indices. Any master-eligible node is also able to function as a data node. However, in larger clusters, users may launch dedicated master-eligible nodes that do not store any data (by adding false to the config file), in order to improve reliability. In high-usage environments, moving the master role away from data nodes helps ensure that there will always be enough resources allocated to tasks that only master-eligible nodes can handle.

  • Data nodes: By default, every node is a data node that stores data in the form of shards (more about that in the section below) and performs actions related to indexing, searching, and aggregating data. In larger clusters, you may choose to create dedicated data nodes by addingnode.master: false to the config file, ensuring that these nodes have enough resources to handle data-related requests without the additional workload of cluster-related administrative tasks.

  • Client nodes: If you set node.master and to false, you will end up with a client node, which is designed to act as a load balancer that helps route indexing and search requests. Client nodes help shoulder some of the search workload so that data and master-eligible nodes can focus on their core tasks. Depending on your use case, client nodes may not be necessary because data nodes are able to handle request routing on their own. However, adding client nodes to your cluster makes sense if your search/index workload is heavy enough to benefit from having dedicated client nodes to help route requests.

Part 2 shows how to collect metrics using various APIs:

The Node Stats API is a powerful tool that provides access to nearly every metric from Part 1, with the exception of overall cluster health and pending tasks, which are only available via the Cluster Health API and the Pending Tasks API, respectively. The command to query the Node Stats API is:

curl localhost:9200/_nodes/stats

The output includes very detailed information about every node running in your cluster. You can also query a specific node by specifying the ID, address, name, or attribute of the node. In the command below, we are querying two nodes by their names, node1 and node2 ( in each node’s configuration file):

curl localhost:9200/_nodes/node1,node2/stats

Each node’s metrics are divided into several sections, listed here along with the metrics they contain from Part 1.

Part 3 is a brief for using Datadog for metrics collection and display:

The Datadog Agent is open source software that collects and reports metrics from each of your nodes, so you can view and monitor them in one place. Installing the Agent usually only takes a single command. View installation instructions for various platforms here. You can also install the Agent automatically with configuration management tools like Chef orPuppet.

Part 4 walks through some common Elasticsearch performance issues:

How to solve 5 Elasticsearch performance and scaling problemsseries /

This post is the final part of a 4-part series on monitoring Elasticsearch performance. Part 1 provides an overview of Elasticsearch and its key performance metrics, Part 2 explains how to collect these metrics, and Part 3 describes how to monitor Elasticsearch with Datadog.

Like a car, Elasticsearch was designed to allow its users to get up and running quickly, without having to understand all of its inner workings. However, it’s only a matter of time before you run into engine trouble here or there. This article will walk through five common Elasticsearch challenges, and how to deal with them.

Problem #1: My cluster status is red or yellow. What should I do?


If you recall from Part 1, cluster status is reported as red if one or more primary shards (and its replicas) is missing, and yellow if one or more replica shards is missing. Normally, this happens when a node drops off the cluster for whatever reason (hardware failure, long garbage collection time, etc.). Once the node recovers, its shards will remain in an initializing state before they transition back to active status.

The number of initializing shards typically peaks when a node rejoins the cluster, and then drops back down as the shards transition into an active state, as shown in the graph below.


During this initialization period, your cluster state may transition from green to yellow or red until the shards on the recovering node regain active status. In many cases, a brief status change to yellow or red may not require any action on your part.


However, if you notice that your cluster status is lingering in red or yellow state for an extended period of time, verify that the cluster is recognizing the correct number of Elasticsearch nodes, either by consulting Datadog’s dashboard or by querying the Cluster Health API detailed in Part 2.


If the number of active nodes is lower than expected, it means that at least one of your nodes lost its connection and hasn’t been able to rejoin the cluster. To find out which node(s) left the cluster, check the logs (located by default in the logs folder of your Elasticsearch home directory) for a line similar to the following:

[TIMESTAMP] ... Cluster health status changed from [GREEN] to RED

Reasons for node failure can vary, ranging from hardware or hypervisor failures, to out-of-memory errors. Check any of the monitoring tools outlined here for unusual changes in performance metrics that may have occurred around the same time the node failed, such as a sudden spike in the current rate of search or indexing requests. Once you have an idea of what may have happened, if it is a temporary failure, you can try to get the disconnected node(s) to recover and rejoin the cluster. If it is a permanent failure, and you are not able to recover the node, you can add new nodes and let Elasticsearch take care of recovering from any available replica shards; replica shards can be promoted to primary shards and redistributed on the new nodes you just added.

However, if you lost both the primary and replica copy of a shard, you can try to recover as much of the missing data as possible by using Elasticsearch’s snapshot and restore module. If you’re not already familiar with this module, it can be used to store snapshots of indices over time in a remote repository for backup purposes.

Problem #2: Help! Data nodes are running out of disk space

If all of your data nodes are running low on disk space, you will need to add more data nodes to your cluster. You will also need to make sure that your indices have enough primary shards to be able to balance their data across all those nodes.

However, if only certain nodes are running out of disk space, this is usually a sign that you initialized an index with too few shards. If an index is composed of a few very large shards, it’s hard for Elasticsearch to distribute these shards across nodes in a balanced manner.

This is the most thorough look at Elasticsearch internals that I’ve seen (although admittedly that’s not something I’m usually on the lookout for).

Securing Elasticsearch And Kibana

Vikash Selvin shows how to secure instances of Elasticsearch and Kibana:

The most popular options for securing Elasticsearch and Kibana are compared in the table below.

Shield is a security plugin developed by the same company that developed Elasticsearch. It allows you to easily protect this data with a username and password while simplifying your architecture. Advanced security features like encryption, role-based access control, IP filtering, and auditing are also available when you need them.

NGINX is an open source web server. It can act as a proxy server and can do load balancing, among other things. In combination with LUA and external scripts, it can be used for securing Elasticsearch and Kibana. We will be using this approach in this tutorial.

Searchguard is an open source alternative for Shield. It provides almost all the same functionalities as Shield, except for some features like LDAP authentication. However, these features are available in the paid variant.

Click through for a detailed NGINX setup.

Jepsen: Crate

Kyle Kingsbury checks out Crate, a SQL database built on Elasticsearch:

Building a database on Elasticsearch is something of a double-edged sword. Crate has been able to focus on hard problems like query planning, joins, aggregations, and so on–without having to take on the tough work of building a storage layer, cluster membership, replication algorithm, etc. However, Crate is tightly coupled to Elasticsearch, and is dependent on the Elastic team for improvements to that technology. Elasticsearch’s consistency issues have been well-known for years, and the process to fix them is still ongoing. It’s not clear what Crate can do to get out of this situation: a rewrite would be complex and expensive (and introduce new and unknown failure modes), whereas fixing Elasticsearch’s consistency problems could easily consume person-years of engineering time that a small company can ill-afford.

There are good reasons to use Crate: distributed SQL stores, especially with Crate’s capacity for aggregations and joins, are hard to come by. Moreover, Crate introduces several helpful features not present in Elasticsearch. That said, the risk of data loss is real, and is unlikely to be resolved at any point in the near future. I recommend that Crate users avoid using Crate as their system of record–at least, where each record matters. Like Elasticsearch itself, you should use a safer database as your primary store, and continuously backfill data from that primary store into Crate for querying. Crate may also be suitable for cases where occasional data loss or corruption does is mostly harmless, e.g. high-volume sensor data, observability, analytics, etc.

Every time the Jepsen series gets updated, I make time to read.


April 2017
« Mar