I added a column: RollinAvgSteps = AVERAGEX(FILTER(fitbit_export_20160214, EARLIER(fitbit_export_20160214[Date])>=fitbit_export_20160214[Date]),fitbit_export_20160214[Steps])
…which takes the average of my steps to date. There are a bunch of ways to achieve this, but this is the way that I chose. And you can see that the average line is (happily) improving! Oh, and because I pulled down the extract on the 14th, there’s a dip at the end. My numbers were much healthier by the end of the day, and despite spending way too long NOT walking, I did about 7244 steps that day.
You can see the result at http://bit.ly/RobFitbit
I like the rolling average that Rob added in.
The first thing you’ll notice is that a single space is stored the same way in both columns. With an empty string, on the other hand, we see a difference. Char columns are fixed length. So even though we inserted an empty string into it we get back a single space.
The next major difference is that varchar columns require an extra two bytes of storage. So a varchar(1) column actually uses three bytes not just the one byte that char(1) does.
This is exactly the type of scenario row-level compression improves.
Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here? First up, Powershell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset. The fact that our most common processes spit out UTF16, while not a problem, is something you need to get on top of (see Posh above). AZCopy works really well and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections. Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but, what you should note in the above descriptions, at no point did I need to go and recreate my EXTERNAL TABLE from Azure SQL Data Warehouse. I just kept updating the files through my scripts and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice. You’re going to want to go through your data types on your data before you start the migration. You’re going to have to plan for a lot of time cleaning your data if you haven’t already.
Grant’s method is the right way, especially for early tests. In practice, for the type of data you’d put into Polybase, you might want to create the external table to allow rejecting a certain number of rows—Grant didn’t specify the REJECT_TYPE and REJECT_VALUE attributes creating his external table, so the default of 0 rows was used. In a warehouse with billions of rows, hand-fixing all of that data is a nasty proposition, and if you’re writing queries whose results likely won’t change if a few (dozen?) records get dropped, rejecting bad data might be a good way to keep some of your sanity.
At the most basic level, the precedence rule states that once there is an override row for a database, that database will never leave that level…it will never default back to the default row. So in this example, MinionDev is at the database level for its settings, so it will never go back up to the more generic MinionDefault row. Once you’re at a level, you stay at that level.
Even if you don’t use Minion Backup, this is an interesting post because you can walk through Sean’s design process and think about the approach he took to get to his final result.
You can think of row compression as working by treating certain fixed-length data types as variable-length data types. By removing certain metadata, NULL and 0 values, and the padding of fixed-length values, SQL Server can reduce the total size of a row.
The easiest way to think of it is that char(n) no longer takes n bytes for every row, but instead gets treated more like varchar(n) where the storage used varies for each value. The behavior for each data type varies, with some data types getting more or less (or no) savings compared to others.
Row-level compression is the “safer” of the two primary compression options, but I almost never use it. That might just be a function of the my particular workloads, of course.