Besides alerting for the hardware health, monitoring answers questions about the health of the overall distributed data pipeline. The Site Reliability Engineering book identifies “The Four Golden Signals” as the minimum of what you need to be able to determine: latency, traffic, errors, and saturation.
Latency is the time it takes for work to happen. In the case of data pipelines, that work is a message that has gone through many systems. To time it, you need to have some kind of work unit identifier that is reflected in the metrics that happen on the many segments of the workflow. One way to do this is to have an ID on the message, and have components place that ID in their logs. Alternatively, the messaging system itself could manage that in metadata attached to the messages.
Traffic is the demand from external sources, or the size of what is available to be consumed. Measuring traffic requires metrics that either specifically mean a new arrival or a new volume of data to be processed, or rules about metrics that allow you to proxy the measure of traffic.
Errors are particularly tricky to monitor in data pipelines because these systems don’t typically error out on the first sign of trouble. Some errors in data are to be expected and are captured and corrected. However, there are other errors that may be tolerated by the pipeline, but need to be feed into the monitoring system as error events. This requires specific logic in an application’s error capture code to emit this information in a way that will be captured by the monitoring system.
Saturation is the workload consuming all the resources available for doing work. Saturation can be the memory, network, compute, or disk of any system in the data pipeline. The kinds of indicators that we discussed in the previous post on tuning are all about avoiding saturation.
Larry then applies these concepts and gives links to some useful tools.
I did a dangerous thing, and I want to make sure that YOU DO NOT do the same.
I was creating a couple of extended events sessions and was playing around with some actions. I ended up with the following code where I was after a guy called Shane:
The probability that you intend to set a breakpoint in SQL Server via Extended Event is quite low (low enough that if you’re doing it, you should already know what you’re doing), but click through to see exactly what damage you can do.
Now we are ready to attach the database using the TSQL below. For this demo, I used Management Studio from my Laptop, to connect to SQL Server.
In the TSQL we need to use the FOR ATTACH_REBUILD_LOG argument as we have no log file to attach. It will create a 1MB log file in the default log file directory.
It’s better to restore a full backup, but there’s more than one way to connect a database.
We have a client who has no idea how or when Common Criteria was enabled on their production system. All they know is that performance has been slowly degrading. After collecting performance data, we found that there were high LCK_M_SCH_M waits which is a schema modification lock that prevents access to a table while a DDL operation occurs. We also found blocked process records where a LOGIN_STATS table in the master database was waiting a lot. This table is used to hold login statistics. When there are a lot of logins and outs there can be contention in this table.
When you enable Common Criteria compliance, something called Residual Information Protection (RIP) is enabled. RIP is an additional security measure for memory and it makes it so that in memory a specific bit pattern must be present before memory can be reallocated(overwritten) to a new resource or login. So with lots of logins and outs, there is a performance hit in memory because overwriting the memory allocation has to be done.
It’s worth reading the whole thing.
The first error message holds the key. It’s not enough to just enable the service, you need to run that command also (which will also start the service for you as well as doing something internally) or restart SQL Server after the service is enabled.
Here’s where we go to the next level. You now know that full-text indexing can have a problem on startup that isn’t visible externally but surely this could never happen to you, right? But I guarantee you that if you have 500 servers many of them currently have this issue. When a developer or application support person encounters it that’s going to take half an hour or an hour of investigation, then logging incidents, then passing it on to you, for another half hour or hour of investigation, and then more time, and more time, and then doing a root cause analysis, and then sharing it with your team, and then…
Wouldn’t it be so much easier to just write a validation test for this now so that if this ever occurs you can just proactively go fix it? Well you can!
It’s a good read.
Monitoring Availability Groups can be tricky. The DMVs don’t update the log_send_rate and redo_rate all the time especially if nothing is happening so if you try to use those for calculations when monitoring you could false results worse yet pages in the middle of the night. In order to calculate the log_send_rate and redo_rate you need to capture the perfmon counters ‘Log Bytes Flushed/sec’, ‘Redone Bytes/sec’, and ‘Log Bytes Received/sec’ into temporary tables WAITFOR 1 second then capture them again. Below is query that captures this along side what you see in the DMV for when Microsoft might fix the issue.
Click through for a script as well as a Powershell cmdlet wrapper for running against a set of hosts in your Central Management Server.
The previous query would cause problems on many different systems, regardless of whether you’re using Databricks or another data warehousing tool. Luckily, as an user of Databricks, this customer has a feature available that can help solve this problem called the Query Watchdog.
Note: Query Watchdog is available on clusters created with version 2.1-db3 and greater.
A Query Watchdog is a simple process that checks whether or not a given query is creating too many output rows for the number of input rows at a task level. We can set a property to control this and in this example we will use a ratio of 1000 (which is the default).
It looks like this is an all-or-nothing process, but a very interesting start.
In this post, I want to publish a few functions that I created around SSRS. They are related to and depend on each other.
Get-SSRS – Given the SSRS URI returns the WSDL endpoint
Get-SSRSReport – Returns one or more reports based on inputs
Get-SSRSSharedDataSource – Returns one or more shared data sources based on inputs
Get-SSRSReportDataSource – Returns the data source information on a report by report basis based on inputs
Set-SSRSReportDataSource – Sets the data source of a report to the given data source.
Install-SSRS – Deploys an SSRS report to a specific folder and also optionally sets the datasource for the deployed report
If we rerun the query and then take a look at the first operator in the execution plan, we can see that the Plan Guide is in use… and that the query hash has changed. It no longer matches the original query. Now it matches the query that included the query hint. This actually makes perfect sense. The Plan Guide is basically changing the query from the first example above, into the second.
Now, what happens when we toss in the Query Store
The query behavior is exactly what you want, but some of the metadata is no longer correct.
Recently came across a situation where reporting logins were interfering with nightly jobs due to blocking. After a number of attempts of trying to resolve the blocking, it was decided that a stored procedure that disabled the login and killed the user sessions was the most pragmatic solution. This is the code I came up with to resolve the issue.
Click through for the script. This is definitely a last-ditch option, but it’s good to have in your bag of tricks.