Press "Enter" to skip to content

Curated SQL Posts

Tracking Database Recovery with Extended Events

Jason Brimhall takes us through the extended events which show progress on database recovery:

Recently, I wrote a rewrite of my database recovery progress report script. That script touches on both the error log and some DMVs along with some fuzzy logic to join the data sets together. That script may not be the most complex script out there, but it is more complex than using the power of XE.

Database recovery (crash recovery) is a nerve wrenching situation under the wrong conditions. It can be as bad as a root canal and just as necessary to endure that pain at times. When the business is waiting on you waiting on the server to finish crash recovery, you feel nervous at best. If you can be of some use and provide some information back to the business, that anxiety dissipates and the business becomes more calm as well. While the previous script can help you get that information easily enough, I want to introduce the easiest method to capture that information currently available.

Click through for more information, as well as a couple of scripts.
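As a rough sketch of what such a session can look like, the snippet below first checks the XE metadata for a recovery-progress event and then creates a bare-bones session for it. The event name database_recovery_progress_report is my assumption (it is a debug-channel event that only exists on newer builds), so verify it against sys.dm_xe_objects; Jason’s post has the full, tested scripts.

```sql
-- Check whether a recovery progress event exists on this build; the name
-- database_recovery_progress_report is an assumption, so verify it here first.
SELECT p.name AS package_name,
       o.name AS event_name,
       o.description
FROM sys.dm_xe_objects AS o
JOIN sys.dm_xe_packages AS p
    ON p.guid = o.package_guid
WHERE o.object_type = 'event'
  AND o.name LIKE '%recovery%progress%';

-- If the event is available, a minimal session might look like this.
CREATE EVENT SESSION [DBRecoveryProgress] ON SERVER
    ADD EVENT sqlserver.database_recovery_progress_report
    ADD TARGET package0.event_file (SET filename = N'DBRecoveryProgress')
    WITH (STARTUP_STATE = ON);  -- start with the instance so crash recovery is captured

ALTER EVENT SESSION [DBRecoveryProgress] ON SERVER STATE = START;
```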

Blob Storage for Database Backups

Randolph West has a couple of tools to help upload and download database backup files:

I wrote it because AzCopy was weak and inconsistent. It was fragile, needing constant attention and monitoring in case a journalling file got stuck. Also, AzCopy didn’t keep files in sync. If a file was deleted locally (as part of a cleanup to delete old backups), AzCopy was unable to delete files remotely, so it was messy to maintain files in Blob Storage containers. The uploader was written to keep files in sync, and not have to fuss with AzCopy.

The real value of this tool though, is being able to recover the latest backup files (full, differential and transaction logs where available) which are needed to recover from a catastrophic failure. Without any knowledge of the backups, just knowing the database name, it can parse the list of files in Azure, download the necessary ones to recover, and build a T-SQL script to restore them. Literally all you need to do is run the downloader, then run the restore script.

Randolph talks about how the state of AzCopy has changed and offers up some new guidance as well as tooling updates.
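To make the “build a T-SQL script to restore them” step concrete, the generated script presumably follows the standard full/differential/log restore chain. The database and file names below are placeholders of mine, not output from Randolph’s tool:

```sql
-- Illustrative restore chain only; the downloader builds the real script
-- from the actual file names it pulled out of Blob Storage.
RESTORE DATABASE [YourDatabase]
    FROM DISK = N'C:\Restore\YourDatabase_FULL.bak'
    WITH NORECOVERY, REPLACE;

RESTORE DATABASE [YourDatabase]
    FROM DISK = N'C:\Restore\YourDatabase_DIFF.bak'
    WITH NORECOVERY;

RESTORE LOG [YourDatabase]
    FROM DISK = N'C:\Restore\YourDatabase_LOG_001.trn'
    WITH NORECOVERY;

-- ...one RESTORE LOG per remaining log backup, then bring the database online.
RESTORE DATABASE [YourDatabase] WITH RECOVERY;
```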

SQL on Linux with Local Repositories

Kevin Chant explains how you can use local repositories to make installing SQL Server on Linux easier when you have servers lacking Internet access:

Now, some of you more experienced Linux users probably know there is another way to do this. When I previously had to create a Hadoop environment I had to implement a local repository.

In other words, I copied the contents of the online repository locally and created my own repository on a server.

That way I could manage installing software onto multiple clients using the same secured files in a local location instead of online, which means I could install using the same method I would have done with internet access.

Kevin provides links and some notes on the process.

Extended Event Duration Units of Measure

Emanuele Meazzo shows how we can find out whether that duration column is milliseconds or microseconds:

Even if I use Extended Events almost every day, I always forget the unit of measure of each duration counter, since they’re basically arbitrary: seconds, milliseconds, microseconds? Whatever; it depends on the dev who implemented that specific counter.

That’s why I’ve added to the Tsql.tech GitHub repository the following code, which extracts the descriptions from the XE DMVs in order to identify the unit of measure.

Click through for the script as well as the results against Azure SQL Database.
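The underlying idea, sketched below, is that the XE metadata DMVs spell the unit out in the column descriptions (for example, “in microseconds”). This is a simplified stand-in for Emanuele’s script, which does more thorough parsing of those descriptions:

```sql
-- List duration-style columns for every event and show their descriptions,
-- which usually state the unit of measure explicitly.
SELECT o.name        AS event_name,
       c.name        AS column_name,
       c.description AS column_description
FROM sys.dm_xe_objects AS o
JOIN sys.dm_xe_object_columns AS c
    ON  c.object_name         = o.name
    AND c.object_package_guid = o.package_guid
WHERE o.object_type = 'event'
  AND c.column_type = 'data'
  AND c.name LIKE '%duration%'
ORDER BY o.name, c.name;
```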

Fixing Transaction Logs Having Too Many Virtual Log Files

Max Vernon has a script to reduce the number of virtual log files you have in your transaction log:

The script shrinks the existing log file down to the smallest size possible, then grows it back to the prior size using a sensible growth increment. If the database is in simple recovery model, the script runs a checkpoint command to attempt to force transaction log truncation before growing the file. If the database is in full recovery model, one or more transaction log backups are taken to the NUL: device, in addition to the checkpoint operation, to allow the necessary transaction log truncation to occur. Ensure you take and test a full backup of all databases you run this script against, prior to running the script. For databases in full recovery model, you’ll need to take a database backup after the script runs since the prior log chain will be broken by the log backups that have been taken to the NUL: device.

Click through for an important warning as well as the script itself.
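This is not Max’s script, but a bare-bones manual version of the sequence he describes (force truncation, shrink, then regrow in one sensible step) might look like the following; the database name, file name, and sizes are placeholders:

```sql
USE [YourDatabase];

-- FULL recovery model: back the log up to the NUL: device so it can truncate.
-- This breaks the existing log chain, so take a fresh full backup afterwards.
BACKUP LOG [YourDatabase] TO DISK = N'NUL:';
CHECKPOINT;  -- in SIMPLE recovery model, the checkpoint alone allows truncation

-- Shrink the log file down to (near) its minimum size.
DBCC SHRINKFILE (N'YourDatabase_log', 1);

-- Grow it back to the target size in large increments, which recreates a small
-- number of reasonably sized VLFs instead of thousands of tiny ones.
ALTER DATABASE [YourDatabase]
    MODIFY FILE (NAME = N'YourDatabase_log', SIZE = 8192MB, FILEGROWTH = 1024MB);
```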

Telling a Story When Data is Always Changing

Meagan Longoria explains how you can tell a story with data when the data is not static:

If you are a BI/report developer, you know this challenge well. You may follow all the guidelines: choose a good color palette, make visuals that highlight the important data points, get rid of clutter. But what happens when your data refreshes tomorrow or next month or next year? It’s much easier to make a chart with static data. You can format it so it communicates exactly the right message. But out here in Automated Reporting Land, that is not the end of our duties. We have to make some effort to accommodate future data values.

Meagan uses the Phone-a-Friend option and gets an interesting inversion of the normal solution.

Benefits of Partitioning in Spark

The Hadoop in Real World team take a look at how appropriate partitioning can make your Spark jobs much faster:

Shuffle is an expensive operation whether you do it with plain old MapReduce programs or with Spark. Shuffle is the process of bringing Key Value pairs from different mappers (or tasks in Spark) by Key into a single reducer (task in Spark). So all key value pairs of the same key will end up in one task (node). So we can loop through the key value pairs and do the needed aggregation.

Since production jobs usually involve a lot of tasks in Spark, the movement of key value pairs between nodes during shuffle (from one task to another) will cause a significant bottleneck. In some cases shuffle is not avoidable, but in many instances you could avoid shuffle by structuring your data a little differently. Avoiding shuffle will have a positive impact on performance.

Read the whole thing. Getting partitions right is critical to writing scalable Spark jobs.

Dates in Base R

Michael Toth explains some of the functionality available in base R (that is, not packages like lubridate) for working with dates:

When working with R date formats, you’re generally going to be trying to accomplish one of two different but related goals:

1. Converting a character string like “Jan 30 1989” to a Date type

2. Getting an R Date object to print in a specific format for a graph or other output

You may need to handle both of these goals in the same analysis, but it’s best to think of them as two separate exercises. Knowing which goal you are trying to accomplish is important because you will need to use different functions to accomplish each of these. Let’s tackle them one at a time.

There are some good insights in the post. H/T R-bloggers

Matrix Operations with JSON

Phil Factor takes a look at using JSON to perform memoization:

For the SQL Server developer, matrices are probably most valuable for solving more complex string-searching problems, using Dynamic Programming. Once you get into the mindset of this sort of technique, a number of seemingly-intractable problems become easier. Here are fifty common data structure problems that can be solved using Dynamic Programming. Until SQL Server 2017, these were hard to do in SQL because of the lack of support for this style of programming. Memoization, one of the principles behind the technique, is easy to do in SQL but it is very tricky to convert existing procedural algorithms to use table variables. It is usually easier and quicker to use strings as pseudo-variables as I did with Edit Distance and the Levenshtein algorithm, the longest common subsequence, and the Longest Common Substring. The problem with doing this is that the code to fetch the array values can be very difficult to decipher or debug. JSON can do it very easily with path array references.

The results aren’t fantastic but the code is easier at least.
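As a rough illustration of the path-array idea (not Phil’s code): from SQL Server 2017 onward the JSON path argument can be supplied in a variable, which lets a JSON string behave like an indexable two-dimensional array:

```sql
-- A 3x3 matrix stored as a JSON array of arrays (toy example).
DECLARE @m nvarchar(MAX) = N'[[1,2,3],[4,5,6],[7,8,9]]';
DECLARE @i int = 1, @j int = 2;  -- zero-based row and column indexes

-- Build the path in a variable (supported from SQL Server 2017 onwards).
DECLARE @path nvarchar(50) = CONCAT('$[', @i, '][', @j, ']');

-- Read cell (1,2) -> 6.
SELECT JSON_VALUE(@m, @path) AS cell_value;

-- Write 60 into the same cell and inspect the whole matrix.
SET @m = JSON_MODIFY(@m, @path, 60);
SELECT @m AS updated_matrix;
```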

Generating Sketchy Data

I have a post on building up a data set for my forensic accounting series:

This is where stuff gets crazy. First, I created a table named #ValuePerCategory, which has the mean price and the price standard deviation for each expense category. To get this information, I trawled through the catalog and picked reasonable-enough values for each of the categories. This is my level of commitment to getting things right(ish). The standard deviations, though, I just made up. I didn’t look at huge numbers of products and calculate these values myself. That’s the limit of my commitment to excellence and why I don’t have a giant banner on my stadium.

It’s also why John Madden never coached me.
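If you want the flavor of it without clicking through, the general shape of that table plus a normally distributed price draw might look like the sketch below; the categories, means, and standard deviations here are invented for illustration and are not the values from the post:

```sql
-- Hypothetical stand-in for the #ValuePerCategory table described above.
CREATE TABLE #ValuePerCategory
(
    ExpenseCategory varchar(50)   NOT NULL,
    MeanPrice       decimal(10,2) NOT NULL,
    PriceStdDev     decimal(10,2) NOT NULL
);

INSERT INTO #ValuePerCategory (ExpenseCategory, MeanPrice, PriceStdDev)
VALUES ('Office Supplies',  24.99,   9.50),
       ('Travel',          450.00, 140.00),
       ('Software',        199.00,  65.00);

-- Draw one price per category from an approximately normal distribution
-- using the Box-Muller transform over two uniform random numbers.
SELECT vpc.ExpenseCategory,
       ROUND(vpc.MeanPrice + vpc.PriceStdDev *
             SQRT(-2.0 * LOG(RAND(CHECKSUM(NEWID())))) *
             COS(2.0 * PI() * RAND(CHECKSUM(NEWID()))), 2) AS GeneratedPrice
FROM #ValuePerCategory AS vpc;
```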
