Finding Failed Agent Jobs

Adrian Buckman has a stored procedure which retrieves failed SQL Agent jobs over a given timeframe:

So here is what it does:

  1. Check for failed agent jobs within the dates you specify (provided the agent log has data for this period). @FromDate defaults to the past 12 hours if no value is passed in, and @ToDate defaults to GetDate() if no value is passed in.

  2. Check that any failed jobs that have occurred within the date range have not subsequently succeeded. If the job has succeeded since its last failure, it will not be included in the results.

  3. Check that the job is not currently running. If the job is still running when the procedure executes, it will be excluded from the results.

  4. If a failed agent job passes the above checks, the last failure message is obtained for the job and shown in the results along with the job name, failed step, and failed datetime.

Read on for the script.
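The four checks can be modelled in a few lines. This is a minimal Python sketch of the logic, not Adrian's T-SQL procedure: it assumes job history has already been loaded into hypothetical `JobRun` records, that the set of currently running job names is known, and it takes "last failure" to mean the most recent failure inside the requested window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class JobRun:
    # Hypothetical record shape for illustration; not an msdb table.
    job_name: str
    step_name: str
    finished_at: datetime
    succeeded: bool
    message: str


def failed_jobs(history, running_jobs, from_date=None, to_date=None, now=None):
    """Return (job, failed step, failed at, message) for jobs passing all checks."""
    now = now or datetime.now()
    to_date = to_date or now                            # default: current time
    from_date = from_date or now - timedelta(hours=12)  # default: past 12 hours

    runs_by_job = {}
    for run in history:
        runs_by_job.setdefault(run.job_name, []).append(run)

    results = []
    for job, runs in runs_by_job.items():
        # Check 3: skip jobs that are still running.
        if job in running_jobs:
            continue
        # Check 1: failures inside the requested window.
        failures = [r for r in runs
                    if not r.succeeded and from_date <= r.finished_at <= to_date]
        if not failures:
            continue
        last_failure = max(failures, key=lambda r: r.finished_at)
        # Check 2: skip jobs that have succeeded since their last failure.
        if any(r.succeeded and r.finished_at > last_failure.finished_at for r in runs):
            continue
        # Step 4: report the details of the last failure.
        results.append((job, last_failure.step_name,
                        last_failure.finished_at, last_failure.message))
    return results
```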

Availability Group Agent Alerts

Tracy Boggiano builds a set of SQL Server alerts related to Availability Group happenings and issues:

For Availability Groups we have a few extra error numbers we care about. Error number 1480 tells you when a server changes roles, so we can know when a server flips from a secondary to a primary, or from a primary to a secondary. Error number 35264 tells you when data movement has been suspended on any database. This can occur for many reasons. One I have seen is when you expanded the mount point on your primary but forgot to expand it on the secondary, so the data or log file runs out of space on the secondary and cannot grow there. Error number 35265 tells you when data movement has resumed on any database. Error number 41404 lets you know if your AG is offline, which can be bad if you expected an automatic failover. Error number 41405 lets you know if an Availability Group can't automatically fail over for any reason. In the latter two cases you will want to look at your SQL Server error logs and the AlwaysOn_health Extended Events session.

Click through for the alert scripts.
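SQL Agent alerts like these are normally created inside SQL Server with msdb.dbo.sp_add_alert and its @message_id parameter. If the alert e-mails also feed other tooling, a lookup like this Python sketch can turn the error number into a readable summary. The descriptions paraphrase the quote above; the dictionary and the `describe_ag_error` function are hypothetical.

```python
# Error numbers from the post above, with paraphrased meanings.
AG_ALERT_ERRORS = {
    1480: "Replica changed role (primary <-> secondary)",
    35264: "Data movement suspended on a database",
    35265: "Data movement resumed on a database",
    41404: "Availability group is offline",
    41405: "Availability group cannot automatically fail over",
}

# The two cases where the post suggests reviewing the logs.
NEEDS_LOG_REVIEW = {41404, 41405}


def describe_ag_error(error_number):
    """Return a short summary for a known AG error number, or None if unknown."""
    summary = AG_ALERT_ERRORS.get(error_number)
    if summary is None:
        return None
    if error_number in NEEDS_LOG_REVIEW:
        summary += "; check the SQL Server error log and the AlwaysOn_health session"
    return summary
```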

Monthly Job Run Time Averages

Tywan Terrell has a script to see how his monthly SQL Agent jobs are performing in terms of average run time:

Sometimes, as an ETL developer or database administrator, you will need insight into SQL Agent job execution times. This insight can be used to proactively monitor the processing times of the various jobs running within your data environment.

Information about job execution times is stored in the msdb database, in the sysjobhistory table. This table has the start time and the run duration, which I have used to create a report that shows the average job start and end times by month for all jobs running on an instance of SQL Server.

This is a very useful start.  If I start counting on this data, I’d do two things:  first, save it somewhere else permanently (because you want to clear out SQL Agent job history occasionally so the GUI doesn’t choke when you try to view job history); and second, look more at percentiles, particularly 95th and 99th percentiles for frequently-running jobs.
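Both the monthly averages and the percentiles can be sketched in Python. In msdb.dbo.sysjobhistory, run_date is an integer in YYYYMMDD form, run_time and run_duration are integers in HHMMSS form, and rows with step_id = 0 hold the overall job outcome. The sketch assumes those rows have already been fetched as (job_name, run_date, run_time, run_duration) tuples; the function names are hypothetical, average end time is measured from midnight of the start day (so it can exceed 24 hours), and the percentile uses the nearest-rank method, which is only one of several definitions (inside the database, T-SQL's PERCENTILE_DISC and PERCENTILE_CONT are alternatives).

```python
import math
from collections import defaultdict
from datetime import timedelta


def hhmmss_to_seconds(value):
    """Convert an msdb HHMMSS integer (e.g. 13502 -> 01:35:02) to seconds."""
    hours, rest = divmod(value, 10000)
    minutes, seconds = divmod(rest, 100)
    return hours * 3600 + minutes * 60 + seconds


def monthly_averages(rows):
    """rows: (job_name, run_date, run_time, run_duration) tuples.

    Returns {(job_name, "YYYY-MM"): (avg start, avg end)} as offsets from midnight.
    """
    buckets = defaultdict(list)
    for job, run_date, run_time, run_duration in rows:
        month = f"{run_date // 10000:04d}-{(run_date // 100) % 100:02d}"
        start = hhmmss_to_seconds(run_time)
        end = start + hhmmss_to_seconds(run_duration)
        buckets[(job, month)].append((start, end))
    return {
        key: (timedelta(seconds=round(sum(s for s, _ in runs) / len(runs))),
              timedelta(seconds=round(sum(e for _, e in runs) / len(runs))))
        for key, runs in buckets.items()
    }


def nearest_rank_percentile(values, pct):
    """Smallest value such that at least pct percent of values are <= it."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need a non-empty list and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]
```

Feeding the per-job run_duration values (converted with hhmmss_to_seconds) into nearest_rank_percentile with 95 or 99 gives the tail figures mentioned above.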

SQL Agent’s 5 Second Rule

Ewald Cress uncovers a change in the way the SQL Agent scheduler works in SQL Server 2016 compared to prior versions:

Upon completion of a job, the next run time is calculated based on the last scheduled time plus the schedule interval. However, allowance is made for the edge cases where the completed invocation overruns into the next start time. In such a situation, there isn’t a “catch-up” run; instead, the schedule is advanced iteratively until it reaches a future point in time.

However, 2016 introduces a new twist. When applying the “is the proposed next schedule time after Now()?” check, it adds five seconds to Now(). In other words, the question becomes “Is the proposed next schedule time more than five seconds in the future?”

Ewald jumps into the debugger to understand this better, so click through for that.
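The rule as quoted can be modelled as a loop. This is a simplified Python sketch, not SQL Agent's actual code: `next_run_time` is a hypothetical name, `grace=timedelta(0)` stands in for the pre-2016 "after Now()?" check, and the default of five seconds stands in for the 2016 check.

```python
from datetime import datetime, timedelta


def next_run_time(last_scheduled: datetime, interval: timedelta, now: datetime,
                  grace: timedelta = timedelta(seconds=5)) -> datetime:
    """Advance the schedule until it lies more than `grace` after `now`.

    There is no catch-up run: slots that have already passed are skipped.
    """
    candidate = last_scheduled + interval
    while candidate <= now + grace:
        candidate += interval
    return candidate
```

Under this model, a once-a-minute job scheduled at 12:00:00 that finishes at 12:00:57 gets 12:01:00 with the old check, but with the five-second margin 12:01:00 is rejected (only three seconds away) and the next run becomes 12:02:00.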

Replacing SQL Agent In Azure

Bob Rubocki has some Q&A on using Azure to automate the kinds of tasks you'd normally run as SQL Agent jobs:

Q: Is there any way to handle the execution of SSIS packages stored locally?

A: Azure Automation works on Azure resources.  It cannot be used for executing local SSIS packages.

In some cases, you may still need a scheduling tool (which might be a VM with SQL Agent).

Connecting To Linux SQL Agent Using Powershell

Slava Murygin shows how to connect to a SQL Agent running on Linux using the SqlServer Powershell module:

From this point we will work directly with SQL Server.
To establish a connection, you have to run the following script.
The most important lines are the second and third:
– In the second line you have to provide your SQL Server instance address, replacing “<your_server_instance>” with something like “” or “\MSSQLSERVER,1433”.
– When the second line runs, it will ask you for SQL Server credentials, so you have to enter a SQL user name and its password.

Slava does note some limitations at present, but a lot of the functionality seems to be there.

PBM Schedule Failures

Dave Turpin diagnoses an issue where scheduled Policy-Based Management policy checks were failing:

While it is easy to build and test policies by executing them on demand (especially powerful when run through Central Management Server), I had some issues getting my policies to run in “on schedule” mode.

To be more specific, my policies that use the ExecuteSQL function have been an issue.  What I was finding was:

  • The policy would run fine “on demand” but…
  • When I run the policy through the PBM scheduler, the policy would fail.

Dealing with false positives is not a good start for any monitoring service, so getting to the root of the issue was critical.

Read on for the solution.

Indexing Woes

Shane O’Neill relates a tale of trying to create an index with a SQL Agent job.  Easy, right?

Now I’m angry too since I count these failures as personal and I don’t like failing, so I get cracking on the investigation.
Straight away, that error message doesn’t help my mood.
I’m not indexing a view!
I’m not including computed columns!
It’s not a filtered index!
The columns are not xml data types, or spatial operations!
And nowhere, nowhere am I using double quotes to justify needing to set QUOTED_IDENTIFIER on!


Read the whole thing.

Adding Powershell Job Steps To Existing SQL Agent Jobs

Rob Sewell uses Powershell to add a Powershell job step to a set of existing SQL Agent jobs:

I put all of the jobs that I required on the estate into a variable called $Jobs. (You will need to fill the $Servers variable with the names of your instances, maybe from a database, CMS, or a text file, and of course you can add more logic to filter those servers as required.)

$Jobs = (Get-SQLAgentJob -ServerInstance $Servers).Where{$_.Name -like '*PartOfNameOfJob*' -and $_.IsEnabled -eq $true}

Of course, to add a PowerShell job step the target server needs to be SQL Server 2008 or higher. If you have an estate with older versions, it is worth creating an SMO server object (you can use a snippet), checking the version, and then getting the jobs like this:

Click through for the process.

Preventing Event Storms

Kenneth Fisher has some good advice when dealing with event notifications:

One of the most common ways to get an event notification is by email. So what happens when you get 500 emails in a day and only one or two are actionable? Do you read every single email, spending quite literally hours to find those one or two gems? Or do you just ignore the whole lot and wait for some other notification that there is a problem, say, a user calling you?

Next, let’s say you have a job that runs every few minutes checking if an instance is down. When that instance goes down you get an immediate email. Which is awesome! Of course then while you are trying to fix the issue you get dozens more emails about the same outage. That is at best distracting and at worst makes it take longer for you to fix the issue.

Fun story time:  at one point during my work career, there was a person (not me!) who accidentally broke every single SQL Agent job on dozens of instances and nobody noticed it for hours.  These weren’t production instances so it wasn’t the end of the world or anything…except that included in the broken jobs were a bunch which ran every minute.  And alerted every minute.  Via e-mail.  The entire database team essentially lost e-mail access for 3 days as there were so many messages coming in that it overwhelmed our provider’s ability to serve messages to us.  This type of mistake can happen, and if we had put into place some of the things Kenneth talks about, the consequences would have been less severe.
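One common way to blunt a storm like that is to suppress repeats of the same alert for a while. The Python sketch below is a generic illustration of a suppression window, not necessarily one of the techniques Kenneth describes; `AlertThrottle`, the alert keys, and the 30-minute window are all hypothetical.

```python
from datetime import datetime, timedelta


class AlertThrottle:
    """Send the first alert for a key, then suppress repeats inside a time window."""

    def __init__(self, window: timedelta):
        self.window = window
        self._last_sent = {}  # alert key -> time the last alert was actually sent

    def should_send(self, key: str, at: datetime) -> bool:
        last = self._last_sent.get(key)
        if last is not None and at - last < self.window:
            return False  # same problem, still inside the quiet period
        self._last_sent[key] = at
        return True
```

With a 30-minute window, an instance-down check that fires every minute sends one e-mail when the outage starts and at most one more every 30 minutes after that, instead of dozens.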


June 2017