One of the most common ways to get an event notification is by email. So what happens when you get 500 emails in a day and only one or two are actionable? Do you read every single email? Spending quite literally hours to find those one or two gems? Or do you just ignore the whole lot and wait for some other notification that there is a problem. Say, by a user calling you?
Next, let’s say you have a job that runs every few minutes checking if an instance is down. When that instance goes down you get an immediate email. Which is awesome! Of course then while you are trying to fix the issue you get dozens more emails about the same outage. That is at best distracting and at worst makes it take longer for you to fix the issue.
Fun story time: at one point during my work career, there was a person (not me!) who accidentally broke every single SQL Agent job on dozens of instances and nobody noticed it for hours. These weren’t production instances so it wasn’t the end of the world or anything…except that included in the broken jobs were a bunch which ran every minute. And alerted every minute. Via e-mail. The entire database team essentially lost e-mail access for 3 days as there were so many messages coming in that it overwhelmed our provider’s ability to serve messages to us. This type of mistake can happen, and if we had put into place some of the things Kenneth talks about, the consequences would have been less severe.
Not very helpful. Sure, I know the job failed, and what step it failed on, but now I have to connect to the agent and look up the history to determine if this is something I have to worry about.
It would be nice to receive the details seen in the history of the job showing up in the email alert received.
Recently I have been working with systems that had all the alert on failure configured so we knew when things failed and could jump on re-running them if needed. We even had them showing up into a data team slack channel, so we had a history as well as notification to everyone on the team at the same time. The problem is that there were not any details in the alerts we received so we had to be able to connect and figure out what to do next or hope that our paid monitoring service would act on something after reading the details of the failure.
Chris has provided a script and gives some recommendations on job configuration which might reduce the number of alerts you get.
This code was run on PowerShell version 5 and will not run on PowerShell version 3 or earlier as it uses the where method
I put all of our jobs that I required on the estate into a variable called $Jobs. (You will need to fill the $Servers variable with the names of your instances, maybe from a database or CMS or a text file
Click through for the script and line-by-line explanation.
We will use the sqlserver module, so you will need to have installed the latest version of SSMS from https://sqlps.io/dl
This code was run using PowerShell version 5 and will not work on Powershell version 3 or lower as it uses the where method.
Lets grab all of our jobs on the estate. (You will need to fill the $Servers variable with the names of your instances, maybe from a database or CMS or a text file)
One oddity with SQL Agent jobs is that you absolutely need to call the Alter() method at the end or else the changes will not actually take effect.
When you script out a SQL Agent Job you’ll notice that the job schedule will have a schedule_uid parameter (providing your job has a schedule). The gottcha lies in that schedule_uid. If you create another job schedule with the same schedule_uid, it will overwrite the schedule for any jobs that are using it. i.e. Any other jobs that are using that schedule_uid will start using the new schedule. Normally I consider UID’s as very unique and chances of a collision are low, but if you do a fair amount of copying jobs between SQL Servers there’s a good chance this will bite you eventually. That’s what happened to us (more than once).
Lets see if I can explain it better in an example.
Something to think about when scripting SQL Agent jobs.
I needed to update some of our long running job monitoring code to improve it from the version that we have right now. I like this version because it uses msdb.dbo.syssessions (https://msdn.microsoft.com/en-us/library/ms175016.aspx) to validate that a job is actually running. I also wanted to know the percent difference between the current run duration versus an average duration per job from the past 30 days. I decided to place the calculated average into a table variable and then join on it to get my results. I also used the IIF function (https://msdn.microsoft.com/en-us/library/hh213574.aspx) to help me avoid a divide by zero error that comes up when the average duration equals 0.
One thing which could cut down on false positives would be to calculate the standard deviation as well. I wouldn’t automatically assume that job executions were normally distributed, but if you look at things more than one standard deviation away from the mean, it should remove noise of jobs which are just a little over the average but not in dangerous territory.
These alerts cover a range of errors from potential IO subsystem problems to failed logins, all of which are things a DBA needs to know about, and quickly too.As well as error notifications you can set up alerts to cover performance conditions. The final statement in the script below sets up an alert that triggers when Page Life Expectancy drops below 1000. In all honesty I don’t set up these performance alerts that often but I wanted to show you the kind of thing that is possible and would be handy if you don’t have any third party monitoring.
But what do I mean by sensible? Typically I see a number of problems with alerting setups; either alerts are inadequate and don’t cover the necessary errors (or there are none at all) but I also see the notifications to alerts not being set up correctly meaning problems go backwards and forwards delaying any fixes.The other problem I see is an over provision of alerts. This usually is because one or more other monitoring systems have been deployed and error notifications have been duplicated as a result. Imagine having an operational tool like System Centre, some SQL monitoring software and native alerting all pinging the same message to the one recipient mailbox. Now on top of that let’s say the alerts have not been configured correctly so information emails are being issued every second. It’s a scary thought but it is easy to see how a critical error might be missed in this scenario.
If you don’t have automatic alerts for high-severity errors, this is an easy way of gaining insight into the problems your server is experiencing.
If all you have is a hammer, everything will eventually start looking like a nail. This is generally known as Maslow’s hammer and refers to the fact that you use the tools you know to solve any problem, regardless if that’s what the problem actually needs. With that said, I frequently need a way to visualize the load distribution of scheduled jobs over a day or week, but I could never be bothered to set up a web server, learn a procedural programming language or build custom visualizations in PowerBI.
So here’s how to do that without leaving Management Studio.
Click through for discussion and link to the code.
The first two parts of this series addressed the general approach that I use in an SSIS script task to discover and alert on missed SQL Agent jobs. With apologies for the delay in producing this final post in the series, here I bring these approaches together and present the complete package.
To create the SSIS, start with an empty SSIS package and add a data flow task. In the task, add the following transformations.
Regardless of how you do it, knowing when jobs fail is important enough to build some infrastructure around answering this question.
The first job took 15 hours 41 minutes 53 seconds, the second 1 minute 25 seconds, the third 21 seconds. This makes it quite tricky to calculate the duration in a suitable datatype. In T-SQL people use scripts like the following from MSSQLTips.com
((run_duration/10000*3600 + (run_duration/100)%100*60 + run_duration%100 + 31 ) / 60) as ‘RunDurationMinutes’
I wish that some version of SQL Server would fix this “clever” duration. We’ve had the time datatype since 2008; at least add a new column with run duration as a time value if you’re that concerned with backwards compatibility.