Whilst working on an upcoming post, I realised that I needed a decent data set on which to test some of the new visualisations in SSRS 2016. I remembered this post on open data sets by Kendra Little that mentioned Data.gov. There are a lot of different data sets there, but I settled on a set describing social media usage by organisations in New York.
The Import-Export wizard is nice for these types of one-off data loads—it’s the gateway into SSIS.
However, I wanted to add some tests. Does this really work? What if I don’t have a backslash? I thought the best way to do this was with a few tSQLt tests, which I quickly built. The entire process took 5-10 minutes, which isn’t a lot longer than if I had been running random tests myself with a variety of strings.

The advantage of tests is that if I come up with a new case, or another potential bug, I copy an existing test over, change the string, and I have a new test, plus all of the regression coverage. I’m not depending on my memory to run the test cases.
I first put the code in a function, which makes it easier to test.
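As a rough sketch of what such a test can look like (the function name and expected behaviour below are stand-ins of my own, not necessarily what the original post used):

EXEC tSQLt.NewTestClass 'StringTests';
GO

-- dbo.GetTextAfterBackslash is a hypothetical scalar function under test
CREATE PROCEDURE StringTests.[test returns input unchanged when no backslash present]
AS
BEGIN
    DECLARE @Actual nvarchar(100) = dbo.GetTextAfterBackslash(N'NoBackslashHere');

    EXEC tSQLt.AssertEqualsString
        @Expected = N'NoBackslashHere',
        @Actual = @Actual;
END;
GO

EXEC tSQLt.Run 'StringTests';

Adding the next case really is just a copy of the procedure with a different input string and expected value.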
tSQLt is a great tool for database unit testing.
I added a column:

RollinAvgSteps =
AVERAGEX(
    FILTER(
        fitbit_export_20160214,
        EARLIER(fitbit_export_20160214[Date]) >= fitbit_export_20160214[Date]
    ),
    fitbit_export_20160214[Steps]
)
…which takes the average of my steps to date. There are a bunch of ways to achieve this, but this is the way that I chose. And you can see that the average line is (happily) improving! Oh, and because I pulled down the extract on the 14th, there’s a dip at the end. My numbers were much healthier by the end of the day, and despite spending way too long NOT walking, I did about 7244 steps that day.
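That DAX averages every row whose Date falls on or before the current row’s Date. For comparison (this isn’t from Rob’s post), the same running average in T-SQL with a window function, against a hypothetical dbo.FitbitSteps table:

-- Running to-date average of steps; multiplying by 1.0 avoids integer division
SELECT s.[Date],
       s.Steps,
       AVG(1.0 * s.Steps) OVER (ORDER BY s.[Date]
                                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS RollinAvgSteps
FROM dbo.FitbitSteps AS s;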
You can see the result at http://bit.ly/RobFitbit
I like the rolling average that Rob added in.
The first thing you’ll notice is that a single space is stored the same way in both columns. With an empty string, on the other hand, we see a difference. Char columns are fixed length and padded out with spaces, so even though we inserted an empty string, we get back a single space.

The next major difference is that varchar columns require an extra two bytes of storage overhead. So a varchar(1) column actually uses three bytes, not just the one byte that char(1) does.
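A quick way to see the padding behaviour for yourself (the table and column names here are mine):

CREATE TABLE dbo.CharDemo (FixedCol char(1), VarCol varchar(1));

INSERT INTO dbo.CharDemo VALUES (' ', ' ');  -- single space
INSERT INTO dbo.CharDemo VALUES ('', '');    -- empty string

-- char(1) reports one byte for both rows because it pads to full width;
-- varchar(1) reports zero bytes for the empty string
SELECT DATALENGTH(FixedCol) AS FixedBytes,
       DATALENGTH(VarCol) AS VarBytes
FROM dbo.CharDemo;

Note that DATALENGTH only shows the data bytes; the extra two bytes of variable-length overhead for varchar don’t appear in this output.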
This is exactly the type of scenario row-level compression improves.
Let’s bottom line this. Data is dirty. Any ETL process is likely to expose the trashy data, so what are the real issues here?

First up, Powershell is my bestest buddy. Scripting everything through Posh right out of the gate made a huge difference in my ability to constantly reset. The fact that our most common processes spit out UTF16, while not a problem, is something you need to get on top of (see Posh above).

AZCopy works really well, and the command line is easy to implement, but you’ll need to go through the process of setting up all the appropriate connections.

Polybase is wicked cool. Yes, I had to hop through hoops to get the data clean, but what you should note in the descriptions above is that at no point did I need to go and recreate my EXTERNAL TABLE in Azure SQL Data Warehouse. I just kept updating the files through my scripts, and as soon as they were loaded up to blob storage, I ran a query against them. No additional work required. That’s nice.

You’re going to want to go through the data types on your data before you start the migration, and you’re going to have to plan for a lot of time cleaning your data if you haven’t already.
Grant’s method is the right way, especially for early tests. In practice, for the type of data you’d put into Polybase, you might want to create the external table to allow rejecting a certain number of rows (Grant didn’t specify the REJECT_TYPE and REJECT_VALUE attributes when creating his external table, so the default of 0 rows was used). In a warehouse with billions of rows, hand-fixing all of that data is a nasty proposition, and if you’re writing queries whose results likely won’t change if a few (dozen?) records get dropped, rejecting bad data might be a good way to keep some of your sanity.
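For reference, the reject settings live in the WITH clause of the external table definition. The column list, data source, and file format names below are placeholders:

CREATE EXTERNAL TABLE dbo.SalesStaging
(
    SaleID int,
    SaleAmount decimal(10, 2)
)
WITH
(
    LOCATION = '/sales/',
    DATA_SOURCE = AzureBlobStorage,  -- previously created external data source
    FILE_FORMAT = CsvFileFormat,     -- previously created external file format
    REJECT_TYPE = VALUE,             -- reject based on an absolute row count
    REJECT_VALUE = 100               -- tolerate up to 100 bad rows per query
);

REJECT_TYPE = PERCENTAGE (paired with REJECT_SAMPLE_VALUE) is the alternative if you’d rather express your tolerance as a share of rows read.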
At the most basic level, the precedence rule states that once there is an override row for a database, that database will never leave that level…it will never default back to the default row. So in this example, MinionDev is at the database level for its settings, so it will never go back up to the more generic MinionDefault row. Once you’re at a level, you stay at that level.
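To make the rule concrete, here’s a generic sketch of that sort of lookup. This is not Minion Backup’s actual schema, just an illustration of override-beats-default precedence:

-- Hypothetical settings table: one 'MinionDefault' row plus optional per-database rows
SELECT TOP (1) s.DBName, s.RetentionDays
FROM dbo.BackupSettings AS s
WHERE s.DBName IN (N'MinionDev', N'MinionDefault')
ORDER BY CASE s.DBName WHEN N'MinionDev' THEN 0 ELSE 1 END;  -- the override row sorts first

As long as a MinionDev row exists, the MinionDefault row never comes into play.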
Even if you don’t use Minion Backup, this is an interesting post because you can walk through Sean’s design process and think about the approach he took to get to his final result.
You can think of row compression as working by treating certain fixed-length data types as variable-length data types. By removing certain metadata, NULL and 0 values, and the padding of fixed-length values, SQL Server can reduce the total size of a row.
The easiest way to think of it is that char(n) no longer takes n bytes for every row, but instead gets treated more like varchar(n) where the storage used varies for each value. The behavior for each data type varies, with some data types getting more or less (or no) savings compared to others.
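If you want to see what row compression would buy you on a specific table before committing, SQL Server can estimate the savings (the object names below are placeholders):

-- Estimate the space savings from row compression
EXEC sp_estimate_data_compression_savings
    @schema_name = N'dbo',
    @object_name = N'MyLargeTable',
    @index_id = NULL,
    @partition_number = NULL,
    @data_compression = N'ROW';

-- Apply row compression to the table
ALTER TABLE dbo.MyLargeTable REBUILD WITH (DATA_COMPRESSION = ROW);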
Row-level compression is the “safer” of the two primary compression options, but I almost never use it. That might just be a function of my particular workloads, of course.
The management and maintenance of servers quickly becomes complex without standardisation. PowerShell gives us many ways of responding to different problems, and occasionally of bypassing certain restrictive techniques. The danger is that we end up with a plethora of scripts to manage a server, each more complicated than the last. This works, but we do much better with a standard way of automating a task.
DSC gives us a declarative model for system configuration management. What that really means is that we can specify how we want a workstation or server (a ‘node’) to be configured and we leave it to PowerShell and the Windows Workflow engine to make it happen on those target ‘nodes’. We don’t have to specify how we want it to happen.
DSC is great for keeping servers and server configurations in sync.
This graphic in no way lists all of the versions of Excel which Microsoft sells. What about Office 365 Enterprise E1? Surely you would get Power Pivot functionality with that, right? No. How much more money is Power Pivot going to cost you? Well, if you have Office 365, you are paying $8 a month for the Office software, including Excel. There is no guarantee that spending more money will provide Power Pivot, though. Office 365 ProPlus, which has Power Pivot, will run you $12 a month, but if you have Office Small Business Premium, which runs $12.50 a month, you won’t get Power Pivot. Check the version of Excel 2016 by going to File -> Account, then look at what is listed. If the version isn’t Office 365 ProPlus or one of the other versions listed in the graphic, there will be no way to make Power Pivot appear.
Based on Ginger’s explanation this seems like something that will be very confusing for some Excel users.
The principle of least privilege should apply everywhere, certainly in production, but also in development. If you limit permissions in development, you might cause a few headaches, but you’ll understand the issues and solve them early on. More importantly, if you have security flaws, they aren’t in production systems where data is exposed.
SQL Server security isn’t that hard, but it can be cumbersome. Set it up properly in development, keep your scripts (even from the GUI), and then use those scripts for your production systems.
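As a minimal sketch of that approach (all names below are placeholders), a script like this can live in source control and run identically in development and production:

-- A role holding only the permissions the application actually needs
CREATE ROLE AppReader;
GRANT SELECT ON SCHEMA::dbo TO AppReader;
GRANT EXECUTE ON SCHEMA::dbo TO AppReader;

-- Map the application's login to a database user and add it to the role
CREATE USER AppUser FOR LOGIN AppLogin;
ALTER ROLE AppReader ADD MEMBER AppUser;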
Red Gate’s usually pretty good about publishing minimum permission requirements; some vendors will simply say “you need sysadmin or db_owner.” I’m not enthralled with vendors that take the lazy way out.