PBM Schedule Failures

Dave Turpin diagnoses an issue where scheduled Policy-Based Management policy checks were failing:

While it is easy to build and test policies by executing them on demand (especially powerful when run through Central Management Server) I had some issues getting my policies to run in “on schedule” mode.

To be more specific, my policies that use the ExecuteSQL function have been an issue.  What I was finding was:

  • The policy would run fine “on demand” but…
  • When I run the policy through the PBM scheduler, the policy would fail.

Dealing with false positives is not a good start for any monitoring service, so getting to the root of the issue was critical.

Read on for the solution.

Indexing Woes

Shane O’Neill relates a tale of trying to create an index with a SQL Agent job.  Easy, right?

Now I’m angry too since I count these failures as personal and I don’t like failing, so I get cracking on the investigation.
Straight away, that error message doesn’t help my mood.
I’m not indexing a view!
I’m not including computed columns!
It’s not a filtered index!
The columns are not xml data types, or spatial operations!
And nowhere, nowhere am I using double quotes to justify needing to set QUOTED_IDENTIFIER on!

SO WTF SQL SERVER, WHY ARE YOU GIVING ME THESE ERRORS???

Read the whole thing.

Adding PowerShell Job Steps To Existing SQL Agent Jobs

Rob Sewell uses PowerShell to add a PowerShell job step to a set of existing SQL Agent jobs:

I put all of our jobs that I required on the estate into a variable called $Jobs. (You will need to fill the $Servers variable with the names of your instances, maybe from a database or CMS or a text file, and of course you can add more logic to filter those servers as required.)

$Jobs = (Get-SQLAgentJob -ServerInstance $Servers).Where{$_.Name -like '*PartOfNameOfJob*' -and $_.IsEnabled -eq $true}

Of course to add a PowerShell Job step the target server needs to be SQL 2008 or higher. If you have an estate with older versions it is worth creating a SMO server object (you can use a snippet) and checking the version and then getting the jobs like this

Click through for the process.

Preventing Event Storms

Kenneth Fisher has some good advice when dealing with event notifications:

One of the most common ways to get an event notification is by email. So what happens when you get 500 emails in a day and only one or two are actionable? Do you read every single email? Spending quite literally hours to find those one or two gems? Or do you just ignore the whole lot and wait for some other notification that there is a problem. Say, by a user calling you?

Next, let’s say you have a job that runs every few minutes checking if an instance is down. When that instance goes down you get an immediate email. Which is awesome! Of course then while you are trying to fix the issue you get dozens more emails about the same outage. That is at best distracting and at worst makes it take longer for you to fix the issue.

Fun story time:  at one point during my work career, there was a person (not me!) who accidentally broke every single SQL Agent job on dozens of instances and nobody noticed it for hours.  These weren’t production instances so it wasn’t the end of the world or anything…except that included in the broken jobs were a bunch which ran every minute.  And alerted every minute.  Via e-mail.  The entire database team essentially lost e-mail access for 3 days as there were so many messages coming in that it overwhelmed our provider’s ability to serve messages to us.  This type of mistake can happen, and if we had put into place some of the things Kenneth talks about, the consequences would have been less severe.
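
To make the every-minute e-mail flood concrete, here is a minimal sketch of one common mitigation: suppress repeat alerts for the same problem during a cooldown window. This is not necessarily the approach Kenneth describes; the class name, the alert key, and the one-hour window are all hypothetical choices for illustration.

```python
class AlertThrottle:
    """Allow one alert per key per cooldown window (hypothetical helper)."""

    def __init__(self, cooldown_seconds):
        self.cooldown_seconds = cooldown_seconds
        self._last_sent = {}  # alert key -> time the last alert was sent

    def should_send(self, alert_key, now):
        """Return True if an alert for this key may be sent at time `now` (seconds)."""
        last = self._last_sent.get(alert_key)
        if last is not None and now - last < self.cooldown_seconds:
            return False  # still inside the cooldown window: suppress
        self._last_sent[alert_key] = now
        return True


# A job that fails (and tries to alert) every minute for three hours.
throttle = AlertThrottle(cooldown_seconds=3600)
sent = sum(
    throttle.should_send("Job X failed", minute * 60) for minute in range(180)
)
```

With the one-hour cooldown assumed here, the three-hour outage produces 3 messages rather than 180. Keying on the job or instance name means a different failure still gets through immediately.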

Improving Agent Alerts

Chris Bell has a way of making SQL Agent error messages a lot better:

Not very helpful. Sure, I know the job failed, and what step it failed on, but now I have to connect to the agent and look up the history to determine if this is something I have to worry about.

It would be nice to receive the details seen in the history of the job showing up in the email alert received.

Recently I have been working with systems that had all the alert on failure configured so we knew when things failed and could jump on re-running them if needed. We even had them showing up into a data team slack channel, so we had a history as well as notification to everyone on the team at the same time. The problem is that there were not any details in the alerts we received so we had to be able to connect and figure out what to do next or hope that our paid monitoring service would act on something after reading the details of the failure.

Chris has provided a script and gives some recommendations on job configuration which might reduce the number of alerts you get.

Using PowerShell To Add SQL Agent Job Steps

Rob Sewell gives a detailed walkthrough of a small PowerShell script which adds a job step to every SQL Server Agent job:

This code was run on PowerShell version 5 and will not run on PowerShell version 3 or earlier as it uses the where method.
I put all of our jobs that I required on the estate into a variable called $Jobs. (You will need to fill the $Servers variable with the names of your instances, maybe from a database or CMS or a text file.)

Click through for the script and line-by-line explanation.

Altering Job Steps With PowerShell

Rob Sewell uses PowerShell to modify hundreds of SQL Agent job steps on hundreds of SQL Server instances:

We will use the sqlserver module, so you will need to have installed the latest version of SSMS from https://sqlps.io/dl

This code was run using PowerShell version 5 and will not work on PowerShell version 3 or lower as it uses the where method.

Let’s grab all of our jobs on the estate. (You will need to fill the $Servers variable with the names of your instances, maybe from a database or CMS or a text file)

One oddity with SQL Agent jobs is that you absolutely need to call the Alter() method at the end or else the changes will not actually take effect.

Agent Schedule Unique IDs

Chris Sommer has a gripe about the schedule_uid column in SQL Agent jobs:

When you script out a SQL Agent Job you’ll notice that the job schedule will have a schedule_uid parameter (providing your job has a schedule). The gotcha lies in that schedule_uid. If you create another job schedule with the same schedule_uid, it will overwrite the schedule for any jobs that are using it. That is, any other jobs that are using that schedule_uid will start using the new schedule. Normally I consider UIDs as very unique and chances of a collision are low, but if you do a fair amount of copying jobs between SQL Servers there’s a good chance this will bite you eventually. That’s what happened to us (more than once).

Let’s see if I can explain it better in an example.

Something to think about when scripting SQL Agent jobs.
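
One possible guard when copying a scripted job between servers is to swap in a fresh GUID before running the script on the target. The sketch below is an illustration, not Chris's fix: it assumes the script carries the value in the form @schedule_uid=N'...', and the function name is hypothetical.

```python
import re
import uuid

# Matches @schedule_uid=N'<36-character GUID>' (assumed script format).
SCHEDULE_UID_PATTERN = re.compile(
    r"(@schedule_uid\s*=\s*N?')[0-9a-fA-F-]{36}(')"
)


def with_fresh_schedule_uid(job_script):
    """Return the job script with every schedule_uid replaced by a new GUID."""

    def _replace(match):
        # Keep the text around the GUID, replace only the GUID itself.
        return f"{match.group(1)}{uuid.uuid4()}{match.group(2)}"

    return SCHEDULE_UID_PATTERN.sub(_replace, job_script)


original = (
    "EXEC msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'Nightly', "
    "@schedule_uid=N'11111111-2222-3333-4444-555555555555'"
)
rewritten = with_fresh_schedule_uid(original)
```

Each occurrence gets its own new GUID, so a script that deliberately shares one schedule across several jobs would need a mapping from old to new values instead.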

Finding Abnormally Long-Running Jobs

Lori Brown has a script to find jobs running longer than their 30-day average:

I needed to update some of our long running job monitoring code to improve it from the version that we have right now. I like this version because it uses msdb.dbo.syssessions (https://msdn.microsoft.com/en-us/library/ms175016.aspx) to validate that a job is actually running. I also wanted to know the percent difference between the current run duration versus an average duration per job from the past 30 days. I decided to place the calculated average into a table variable and then join on it to get my results. I also used the IIF function (https://msdn.microsoft.com/en-us/library/hh213574.aspx) to help me avoid a divide by zero error that comes up when the average duration equals 0.

One thing which could cut down on false positives would be to calculate the standard deviation as well.  I wouldn’t automatically assume that job executions were normally distributed, but if you look at things more than one standard deviation away from the mean, it should remove noise of jobs which are just a little over the average but not in dangerous territory.
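
As a rough sketch of that idea (in Python rather than T-SQL, with made-up durations), the check below flags a run only when it exceeds the historical mean by more than a chosen number of standard deviations. The function name and the default of one standard deviation are assumptions for illustration, and this is not Lori's script.

```python
from statistics import mean, stdev


def is_abnormally_long(current_seconds, history_seconds, num_stdevs=1.0):
    """Return True when the current run exceeds mean + num_stdevs * stdev."""
    if len(history_seconds) < 2:
        # The sample standard deviation needs at least two runs;
        # with less history, do not flag anything.
        return False
    threshold = mean(history_seconds) + num_stdevs * stdev(history_seconds)
    return current_seconds > threshold


# Hypothetical durations (in seconds) for one job over the past 30 days.
history = [100, 110, 90, 105, 95]

slightly_over = is_abnormally_long(104, history)  # above average, within noise
well_over = is_abnormally_long(130, history)      # far above the usual range
```

Here the average is 100 seconds, so a 104-second run is over the mean but is not flagged, while the 130-second run is. Raising num_stdevs trades sensitivity for fewer alerts.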

SQL Agent Alerts

David Alcock has a script to create SQL Agent alerts for common errors:

These alerts cover a range of errors from potential IO subsystem problems to failed logins, all of which are things a DBA needs to know about, and quickly too.
As well as error notifications you can set up alerts to cover performance conditions. The final statement in the script below sets up an alert that triggers when Page Life Expectancy drops below 1000. In all honesty I don’t set up these performance alerts that often but I wanted to show you the kind of thing that is possible and would be handy if you don’t have any third party monitoring.

He follows this up with a post on appropriate response:

But what do I mean by sensible? Typically I see a number of problems with alerting setups; either alerts are inadequate and don’t cover the necessary errors (or there are none at all) but I also see the notifications to alerts not being set up correctly meaning problems go backwards and forwards delaying any fixes.
The other problem I see is an over provision of alerts. This usually is because one or more other monitoring systems have been deployed and error notifications have been duplicated as a result. Imagine having an operational tool like System Centre, some SQL monitoring software and native alerting all pinging the same message to the one recipient mailbox. Now on top of that let’s say the alerts have not been configured correctly so information emails are being issued every second. It’s a scary thought but it is easy to see how a critical error might be missed in this scenario.

If you don’t have automatic alerts for high-severity errors, this is an easy way of gaining insight into the problems your server is experiencing.
