But what if you can’t fix the source data? I was asked this question the other week, and since I had been asked about it before and not come up with a good answer, I decided to spend some time researching the problem.
What I found was that it was relatively easy to write some M code that gave me the correct results, but very hard to write code that performed acceptably well on a large data set (I was testing on a CSV file containing almost half a million rows). Here’s the code for the function I ended up with:
It’s nice to see that Power Query & Power BI have methods to get around this sort of issue, but it sounds like even those methods are limited in value.
The most critical thing as a SQL Server DBA is to ensure that your databases can be restored in the event of the loss of a server for whatever reason: disk crash, fire in the server room, tribble invasion, whatever. To do this, not only do you have to back up your databases, you also have to test restores! Create a database and restore the backups of your production DB to them. It’s the safest way to make sure that everything works. This test restore can be automated to run every night, but that’s outside the scope of what I want to talk about right now.
There are lots of places that problems can creep in, this is just one part of how you’ll need to monitor systems. This is how I’ve done things for a number of years, and thus far it has served me well.
Depending upon your instance count, average database size, maintenance windows, etc. etc. etc., some of these things may or may not work, but the principle is the same: protect the data, and automate your processes to protect that data. This is a good article to read for ideas, and then from there dig into other administrative blog posts, videos, and books to become better versed in the tools and techniques available to protect your data.
Fortunately we can find queries with high CPU time using sys.dm_exec_query_stats DMV. This DMV, created in SQL Server 2008, keeps performance statistics for cached query plans, allowing us to find the queries and query plans that are most harming our system.
Glenn Berry’s fantastic set of diagnostic queries also includes a couple for finding CPU consumers.
There are many ways of doing it in SQL. (and Rosetta Code is a good place to view solutions in various other languages). I believe that Peter Larsson holds the record for the fastest calculation of the Luhn test for a sixteen-digit credit card, with this code. As it stands, it isn’t a general solution, but it can be modified for different lengths of bank card.
Phil has two interesting T-SQL functions in the code and wants to find more.
Whilst working on an upcoming post I realised that I needed a decent data set on which to test some of the new visualisations in SSRS 2016. I remembered this post on open data sets by Kendra Little that mentioned Data.gov. There are a lot of different data sets here but I settled on a set describing social media usage by organisations in New York.
The Import-Export wizard is nice for these types of one-off data loads—it’s the gateway into SSIS.
However I wanted to add some tests. Does this really work? What if I don’t have a backslash? I thought the best way to do this was with a few tSQLt tests, which I quickly built. The entire process was 5-10 minutes, which isn’t a lot longer than if I had been running random tests myself with a variety of strings.
The advantage of tests is that if I come up with a new case, or another potential bug, I copy the test over, change the string and I have a new test, plus all the regressions. I’m not depending on my memory to run the test cases.
I first put the code in a function, which makes it easier to test.
tsqlt is a great tool for database unit testing.
I added a column: RollinAvgSteps = AVERAGEX(FILTER(fitbit_export_20160214, EARLIER(fitbit_export_20160214[Date])>=fitbit_export_20160214[Date]),fitbit_export_20160214[Steps])
…which takes the average of my steps to date. There are a bunch of ways to achieve this, but this is the way that I chose. And you can see that the average line is (happily) improving! Oh, and because I pulled down the extract on the 14th, there’s a dip at the end. My numbers were much healthier by the end of the day, and despite spending way too long NOT walking, I did about 7244 steps that day.
You can see the result at http://bit.ly/RobFitbit
I like the rolling average that Rob added in.
The first thing you’ll notice is that a single space is stored the same way in both columns. With an empty string, on the other hand, we see a difference. Char columns are fixed length. So even though we inserted an empty string into it we get back a single space.
The next major difference is that varchar columns require an extra two bytes of storage. So a varchar(1) column actually uses three bytes not just the one byte that char(1) does.
This is exactly the type of scenario row-level compression improves.
Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here? First up, Powershell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset. The fact that our most common processes spit out UTF16, while not a problem, is something you need to get on top of (see Posh above). AZCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections. Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but, what you should note in the above descriptions, at no point did I need to go and recreate my EXTERNAL TABLE from Azure SQL Data Warehouse. I just kept updating the files through my scripts and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice. You’re going to want to go through your data types on your data before you start the migration. You’re going to have to plan for a lot of time cleaning your data if you haven’t already.
Grant’s method is the right way, especially for early tests. In practice, for the type of data you’d put into Polybase, you might want to create the external table to allow rejecting a certain number of rows—Grant didn’t specify the REJECT_TYPE and REJECT_VALUE attributes creating his external table, so the default of 0 rows was used. In a warehouse with billions of rows, hand-fixing all of that data is a nasty proposition, and if you’re writing queries whose results likely won’t change if a few (dozen?) records get dropped, rejecting bad data might be a good way to keep some of your sanity.
At the most basic level, the precedence rule states that once there is an override row for a database, that database will never leave that level…it will never default back to the default row. So in this example, MinionDev is at the database level for its settings, so it will never go back up to the more generic MinionDefault row. Once you’re at a level, you stay at that level.
Even if you don’t use Minion Backup, this is an interesting post because you can walk through Sean’s design process and think about the approach he took to get to his final result.