Press "Enter" to skip to content

Category: Administration

Monitoring Elasticsearch Performance

Emily Chang has a big, four-part series on monitoring Elasticsearch performance.  Part 1 is a nice introduction to Elasticsearch and important metrics out of the box:

The three most common types of nodes in Elasticsearch are:

  • Master-eligible nodes: By default, every node is master-eligible unless otherwise specified. Each cluster automatically elects a master node from all of the master-eligible nodes. In the event that the current master node experiences a failure (such as a power outage, hardware failure, or an out-of-memory error), master-eligible nodes elect a new master. The master node is responsible for coordinating cluster tasks like distributing shards across nodes, and creating and deleting indices. Any master-eligible node is also able to function as a data node. However, in larger clusters, users may launch dedicated master-eligible nodes that do not store any data (by adding node.data: false to the config file), in order to improve reliability. In high-usage environments, moving the master role away from data nodes helps ensure that there will always be enough resources allocated to tasks that only master-eligible nodes can handle.

  • Data nodes: By default, every node is a data node that stores data in the form of shards (more about that in the section below) and performs actions related to indexing, searching, and aggregating data. In larger clusters, you may choose to create dedicated data nodes by addingnode.master: false to the config file, ensuring that these nodes have enough resources to handle data-related requests without the additional workload of cluster-related administrative tasks.

  • Client nodes: If you set node.master and node.data to false, you will end up with a client node, which is designed to act as a load balancer that helps route indexing and search requests. Client nodes help shoulder some of the search workload so that data and master-eligible nodes can focus on their core tasks. Depending on your use case, client nodes may not be necessary because data nodes are able to handle request routing on their own. However, adding client nodes to your cluster makes sense if your search/index workload is heavy enough to benefit from having dedicated client nodes to help route requests.

Part 2 shows how to collect metrics using various APIs:

The Node Stats API is a powerful tool that provides access to nearly every metric from Part 1, with the exception of overall cluster health and pending tasks, which are only available via the Cluster Health API and the Pending Tasks API, respectively. The command to query the Node Stats API is:

curl localhost:9200/_nodes/stats

The output includes very detailed information about every node running in your cluster. You can also query a specific node by specifying the ID, address, name, or attribute of the node. In the command below, we are querying two nodes by their names, node1 and node2 (node.name in each node’s configuration file):

curl localhost:9200/_nodes/node1,node2/stats

Each node’s metrics are divided into several sections, listed here along with the metrics they contain from Part 1.

Part 3 is a brief for using Datadog for metrics collection and display:

The Datadog Agent is open source software that collects and reports metrics from each of your nodes, so you can view and monitor them in one place. Installing the Agent usually only takes a single command. View installation instructions for various platforms here. You can also install the Agent automatically with configuration management tools like Chef orPuppet.

Part 4 walks through some common Elasticsearch performance issues:

How to solve 5 Elasticsearch performance and scaling problemsseries /

This post is the final part of a 4-part series on monitoring Elasticsearch performance. Part 1 provides an overview of Elasticsearch and its key performance metrics, Part 2 explains how to collect these metrics, and Part 3 describes how to monitor Elasticsearch with Datadog.

Like a car, Elasticsearch was designed to allow its users to get up and running quickly, without having to understand all of its inner workings. However, it’s only a matter of time before you run into engine trouble here or there. This article will walk through five common Elasticsearch challenges, and how to deal with them.

Problem #1: My cluster status is red or yellow. What should I do?

es-cluster-status.png

If you recall from Part 1, cluster status is reported as red if one or more primary shards (and its replicas) is missing, and yellow if one or more replica shards is missing. Normally, this happens when a node drops off the cluster for whatever reason (hardware failure, long garbage collection time, etc.). Once the node recovers, its shards will remain in an initializing state before they transition back to active status.

The number of initializing shards typically peaks when a node rejoins the cluster, and then drops back down as the shards transition into an active state, as shown in the graph below.

initializing-shards.png

During this initialization period, your cluster state may transition from green to yellow or red until the shards on the recovering node regain active status. In many cases, a brief status change to yellow or red may not require any action on your part.

node-status.png

However, if you notice that your cluster status is lingering in red or yellow state for an extended period of time, verify that the cluster is recognizing the correct number of Elasticsearch nodes, either by consulting Datadog’s dashboard or by querying the Cluster Health API detailed in Part 2.

es-num-of-nodes.png

If the number of active nodes is lower than expected, it means that at least one of your nodes lost its connection and hasn’t been able to rejoin the cluster. To find out which node(s) left the cluster, check the logs (located by default in the logs folder of your Elasticsearch home directory) for a line similar to the following:

[TIMESTAMP] ... Cluster health status changed from [GREEN] to RED

Reasons for node failure can vary, ranging from hardware or hypervisor failures, to out-of-memory errors. Check any of the monitoring tools outlined here for unusual changes in performance metrics that may have occurred around the same time the node failed, such as a sudden spike in the current rate of search or indexing requests. Once you have an idea of what may have happened, if it is a temporary failure, you can try to get the disconnected node(s) to recover and rejoin the cluster. If it is a permanent failure, and you are not able to recover the node, you can add new nodes and let Elasticsearch take care of recovering from any available replica shards; replica shards can be promoted to primary shards and redistributed on the new nodes you just added.

However, if you lost both the primary and replica copy of a shard, you can try to recover as much of the missing data as possible by using Elasticsearch’s snapshot and restore module. If you’re not already familiar with this module, it can be used to store snapshots of indices over time in a remote repository for backup purposes.

Problem #2: Help! Data nodes are running out of disk space

If all of your data nodes are running low on disk space, you will need to add more data nodes to your cluster. You will also need to make sure that your indices have enough primary shards to be able to balance their data across all those nodes.

However, if only certain nodes are running out of disk space, this is usually a sign that you initialized an index with too few shards. If an index is composed of a few very large shards, it’s hard for Elasticsearch to distribute these shards across nodes in a balanced manner.

This is the most thorough look at Elasticsearch internals that I’ve seen (although admittedly that’s not something I’m usually on the lookout for).

Comments closed

Using Pester To Validate Script Installations

Rob Sewell wants to use Pester to guarantee that he has Ola’s maintenance scripts installed on a server:

First I thought about what I would look for in SSMS when I had installed the maintenance solution and made a list of the things that I would check which looked something like this. This would be the checklist you would create (or have already created) for yourself or a junior following this install. This is how easy you can turn that checklist into a Pester Test and remove the human element and open your install for automated testing
  • SQL Server Agent is running – Otherwise the jobs won’t run🙂

  • We should have 4 backup jobs with a name of

  • DatabaseBackup – SYSTEM_DATABASES – FULL

  • DatabaseBackup – USER_DATABASES – FULL

  • DatabaseBackup – USER_DATABASES – DIFF

  • DatabaseBackup – USER_DATABASES – LOG

  • We should have Integrity Check and Index Optimisation Jobs

  • We should have the clean up jobs

  • All jobs should be scheduled

  • All jobs should be enabled

  • The jobs should have succeeded

There’s a very nice script and walkthrough of the process if you click through.

Comments closed

SSISDB Management

Andy Leondard describes steps you can take to maintain the SSIS Catalog:

Back It Up

As with all SQL Server database, please back up SSISDB. What follows is a (very) basic guide describing one simple method to backup your SSISDB database. Please, please, please learn more about SQL Server backup and restore options and their implications before backing up an SSISDB database in your enterprise. Feel free to use the steps I describe on your laptop or a virtual machine. And please remember…

Backups are useless. Restores are priceless. Conduct practice Disaster Recovery exercises in which you restore databases and then test functionality. You’ll be glad you did. Here is a link containing Microsoft’s advice on restoring the SSISDB database in SQL Server 2016.

The advice is pretty similar to what you’d expect for any other database, but there are a couple twists around SSISDB functionality, so do read on.

Comments closed

Automating Deadlock Execution Plan Collection

Michael J. Swart comes up with a system to collect execution plans at the time of a deadlock and log them to a table for further research:

So How Do I Get To The Execution Plans?
So when I look at a deadlock graph, I can see there are sql_handles. Given that, I can grab the plan_handle and then the query plan from the cache, but I’m going to need to collect it automatically at the time the deadlock is generated. So I’m going to need

  • XML shredding skills

  • Ability to navigate DMVs to get at the cached query plans

  • A way to programatically respond to deadlock graph events (like an XE handler or a trigger)

If you don’t have the funding to get a third-party tool in place which collects this information, this could be a good fit.

Comments closed

TCP Chimney Offloading

Wayne Sheffield offers advice on TCP Chimney Offloading:

With all of these changes to the OS, which setting should we use for SQL Server? In general, for all of these operating systems, I recommend that TCP Chimney Offload be disabled – because you can see odd connectivity problems in any other state. Notice in the above quote that Microsoft says that this feature is best used for applications with long-lived connections that transfer large amounts of data – hopefully your OLTP database is performing lots of short-lived connections and they are not transferring large amounts of data (if they are, I can help you with that!).

Definitely worth a read.

Comments closed

Changing Identity Start Value

Kenneth Fisher has a good post on what happens when you change the seed value of an identity column:

Well Paul told me this wasn’t the case. Now when Paul tells me something I believe him, but I also like to run tests. So I decided to usesys.fn_PhysLocCracker(%%physloc%%). %%physloc%% returns a varbinary that gives you the location of the row. When passed tosys.fn_PhysLocCracker(%%physloc%%) it returns the database file, page in the file, and slot number where the row can be found. So to start with I create an identity(1,1) and I run 20 inserts, one at a time, checking row locations each time. This is to confirm I’m right about this part.

Clicking through is worth it for the hypnotizing animated GIFs.

Comments closed

Why Force Query Store Plans

Grant Fritchey explains the wherefore behind query store plan forcing:

But, what else does Force Plan do for you? What if you never experience bad parameter sniffing (you do, but I’m not going to argue the point)? Is there something else that Force Plan can do for you? Heck yes! The whole point of creating the Query Store was in order to address Plan Regression. What the heck is plan regression? When Microsoft makes any change to the Query Optimizer, and those changes come all the time, it’s possible that you might see a change in your execution plans. Most of the time, it’s going to be a positive change. That’s why they’re changing the Optimizer after all, to improve it. However, sometimes, you’re benefiting from the old behavior of the Optimizer and that new plan doesn’t work as well as the old plan. This is plan regression. When Microsoft changed the Cardinality Estimation engine in SQL Server 2014, more than a few people experienced the new estimator giving row estimates that resulted in a different execution plan that didn’t perform as well as the old plan. This is plan regression. What to do?

This is a good read.

Comments closed

Error Handling With Extended Events, Part 2

Dave Mason continues his discussion of using Extended Events to handle errors:

In the last post, we explored a couple of examples of using Extended Events to enhance T-SQL error handling. There was some potential there. But a hard-coded SPID was necessary: we couldn’t use the code examples for anything automated. It was cumbersome, too. Let’s change that, shall we?

To make the code easier to work with, I moved most of it into three stored procs: one each to create an XEvent session, get the XEvent session data, and drop the XEvent session. There’s also a table type. This will negate the need to declare a temp table over and over. The four objects can be created in any database you choose. I opted to create them in [tempdb]. The code for each is below in the four tabs.

This is a very interesting solution.

Comments closed

Automating DMV Scripts

Sander Stad has a Powershell script to automate using Glenn Berry’s excellent DMV queries:

I’ve used Glenn’s DMV scripts for years but always found them tedious to execute because there are about 70 individual scripts that either query instance or retrieve database information. Glenn did make it easier for you by creating Excel templates to save the information in.
There are separate scripts for each version of SQL Server that updated every month. Glenn only updates the versions for 2012 to 2016 with new features. The scripts are very well documented and even contain hints about how to solve some issues.

Click through for more information on how to install this Powershell module.

Comments closed

Azure SQL Database Size Quotas

Dimitri Furman discusses the MAXSIZE property on an Azure SQL Database:

Customers can use this ability to allow scaling down to a lower service objective, when otherwise scaling down wouldn’t be possible because the database is too large.

While this capability is useful for some customers, the fact that the actual size quota for the database may be different from the maximum size quota for the selected service objective can be unexpected, particularly for customers who are used to working with the traditional SQL Server, where there is no explicit size quota at the database level. Exceeding the unexpectedly low database size quota will prevent new space allocations within the database, which can be a serious problem for many types of applications.

One more thing to think about, I suppose.

Comments closed