Author: Kevin Feasel

Shuffling Data And Zipping Results In T-SQL

Published 2018-08-14 by Kevin Feasel

Phil Factor continues his series on pseudonymization:

The problems come with uncommon values. If you are pseudonymizing a medical database that is required for research purposes on people with potentially embarrassing diseases, and it appears on the dark web, anyone with a rare or unusual surname or first-name comes up on the list, so the shuffle doesn’t help the privacy of Fortescue Ceresole, or whatever his name may be.

If you are spoofing data entirely, you don’t necessarily have this problem because your constructed value will have no relationship to the original value. If it comes from a list of common names or if you randomly create a name ‘Thomas’, it will have no relationship to the original names in the database as long as you did things correctly and shuffle the list. Although a Markov string can produce an identical name that is uncommon, it can be eliminated from the list by an outer join with the original data.

After you shuffle data, you ‘zip’ it. Zipping lists is something you come across in procedural programming, and Linq has a good example. A .net array has an order, and all you are doing is to join by the order of the element in the list. If you randomize that order, you get a shuffle.

Read on for an example using the AdventureWorks Person.Person table.

Comments closed

Improving Power Query Performance By Reducing Variables

Published 2018-08-14 by Kevin Feasel

Reza Rad shows us an extreme case of variable fatigue in Power Query:

This is a sample Power Query file, which in that I do a very simple transformation. The transformation is adding one to the existing number. However, in this sample, we are doing it over thousands of steps! one step at a time, we are adding thousands to a number. The main reason to do it this way is to show you what is the performance you get when you have too many variables (or let’s say steps) in Power Query.

Reza reduces the number of steps and turns a 15-minute operation into one which finishes in under a second.

Comments closed

Tips For SQL Server On Linux + Docker

Published 2018-08-14 by Kevin Feasel

Kellyn Pot’vin-Gorman has a set of tips for working with SQL Server on Linux using Docker:

Easy Download

Docker is available for Mac and Windows with a simple installation by the defaults.

Download the correct installation for your OS type.

Run the installer and keep all the defaults, choosing Linux containers, not Windows containers

Reboot Windows workstations- Done.

Incredibly Simple MSSQL Container Install

Microsoft has done a great job of creating a very small, (maybe a bit too small, but we’ll get into that later…) image that can be used to create a running Linux container with SQL Server. This grants to student a great opportunity to simulate much of what it would be like to work on a real Linux server.

Click through to read the whole thing. And Kellyn’s not kidding about the image missing basic packages.

Comments closed

Using Uncertainty For Model Interpretation

Published 2018-08-13 by Kevin Feasel

Yoel Zeldes and Inbar Naor explain how uncertainty can help you understand your models better:

One prominent example is that of high risk applications. Let’s say you’re building a model that helps doctors decide on the preferred treatment for patients. In this case we should not only care about the accuracy of the model, but also about how certain the model is of its prediction. If the uncertainty is too high, the doctor should to take this into account.

Self-driving cars are another interesting example. When the model is uncertain if there is a pedestrian on the road we could use this information to slow the car down or trigger an alert so the driver can take charge.

Uncertainty can also help us with out of data examples. If the model wasn’t trained using examples similar to the sample at hand it might be better if it’s able to say “sorry, I don’t know”. This could have prevented the embarrassing mistake Google photos had when they misclassified African Americans as gorillas. Mistakes like that sometimes happen due to an insufficiently diverse training set.

The last usage of uncertainty, which is the purpose of this post, is as a tool for practitioners to debug their model. We’ll dive into this in a moment, but first, let’s talk about different types of uncertainty.

Interesting argument.

Comments closed

Tic-Tac-Toe In T-SQL

Published 2018-08-13 by Kevin Feasel

Riley Major continues his series on tic-tac-toe:

We could give it some smarts. For example, we know that in our game, we can only choose X or O, so we could put a data constraint on the Play column. And we know that in our game, you can’t play on the same space more than once, so we could put a unique constraint on the combination of GameID and Position. You could even be really creative and enforce our game’s alternating player moves by putting a data constraint on the Player column such that it equaled 1 when (Turn Modulo 2) equaled 1 and 2 otherwise. (Really it wouldn’t need to be a data column at that point, just a calculated column.)

But imposing those restrictions robs our data structure from its raison d’être. It’s no longer a general purpose game play storage system; it only works for our game.

With modern SQL Server, you could have this cake and eat some of it too with filtered indexes and more complex data constraints. If you denormalized a bit and copied your GameTypecolumn from your Games table, you could enforce your single-use positions by creating a unique index on GameID and Position, but when filtered by a particular GameType (type = Tic-Tac-Toe). Similarly, your data constraints on Play and Player could be tailored to each GameType. This will make your table definition complex.

Instead of saddling the data storage itself with all of those rules, we could enforce all of the game mechanics through our data interpretation and manipulation logic. When we saved a game move, we could make sure that an X or O was played and it could check to see whether the specified square was already used. When we analyzed a game to determine a win, all of the criteria could be housed in that consuming query. But this flexible design isn’t done inflicting its complexity on us.

Riley covers a number of T-SQL features in the process of this post.

Comments closed

Writing ssisUnit Tests With C#

Published 2018-08-13 by Kevin Feasel

Bartosz Ratajczyk shows us how to create ssisUnit tests in MSTest with C#:

In the post about using MSTest framework to execute ssisUnit tests, I used parts of the ssisUnit API model. If you want, you can write all your tests using this model, and this post will guide you through the first steps. I will show you how to write one of the previously prepared XML tests using C# and (again) MSTest.

Why MSTest? Because I don’t want to write some application that will contain all the tests I want to run, display if they pass or not. When I write the MSTest tests, I can run them using the Test Explorer in VS, using a command line, or in TFS.

UIs are great for learning how to do things and for one-off actions, but writing code scales much better in terms of time.

Comments closed

Last Observation Carried Forward In T-SQL

Published 2018-08-13 by Kevin Feasel

Pawan Khowal shows one example of implementing Last Observation Carried Forward in T-SQL:

A very close friend given this to me. In this puzzle you have to fill the price of SKU & Color Id for missing months. Note that SKU & Color Id should be considered as a business unit. So you have to set the previous value available to the missing month. Please check out the sample input and the expected output. In this solution I have not considered any performance considerations.

Included is one solution, though there are others.

Comments closed

Configuring tempdb

Published 2018-08-13 by Kevin Feasel

Jeff Mlakar looks at some basic guidelines for tempdb and shows how to configure this database:

The basic guidelines are:

Each tempdb data file should be the same initial size
Autogrowth to tempdb files should be an explicit value in MB instead of a percentage. Choose a reasonable value based on the workload. Ex. 64MB, 128MB, 256MB, etc.
The number of tempdb files should be 1 per logical processor core up to 8. At that point the performance should be monitored and if more tempdb files are needed they should be added in sets of 4.
Ideally the tempdb files are sized up to the max they will need and never have to autogrow.
Use trace flags 1117 and 1118 for versions of SQL Server < 2016. In SQL Server 2016 these trace flags are defaults.
- Trace flag 1117: when a file in the filegroup meets the autogrow threshold, all files in the filegroup grow together
- Trace flag 1118: Removes most single page allocations on the server, reducing contention on the SGAM page. TLDR; no more mixed extents – use the whole page.

There are some good pieces of advice here, and Jeff includes a great example of a terrible setup.

Comments closed

Making A Readable Presentation Template

Published 2018-08-13 by Kevin Feasel

Meagan Longoria has some advice for presentation templates:

The title text is 36pt Segoe UI Light, the subheading text is 24pt Segoe UI, and the speaker info text is 14 pt Segoe UI.

Those font sizes alone make it very hard to read from the back of even the smaller rooms at the conference.

In addition to being too small, the gray text for the speaker info doesn’t have enough contrast from the white background. We want to get a contrast ratio of at least 4.5:1 (but 7:1 would be better). The contrast ratio for these colors is 4.0.

While sans serif fonts are generally thought to be easier to read in presentations, it’s better to use fonts with a stroke width that is not too thin – not necessarily wider characters, but thicker lines that make up each letter. So Segoe UI Light would not be my first choice for a title font, but Segoe UI or Segoe UI Bold might be ok.

Also, the red used on the right half of the slide is VERY bright for an element that is purely decorative, to the point that it might be distracting for some people. And the reason we need to squish our title into two lines of too-small text is because that giant red shape takes up half the page. What is more important: a “pretty” red shape to make our slide look snazzy or being able to clearly read the title of the presentation?

There’s a lot along these lines, and it’s great food for thought. Meagan includes a set of recommendations at the end, so be sure to catch those.

Comments closed

Binning And Recoding In R

Published 2018-08-10 by Kevin Feasel

Sebastian Sauer shows a few methods of practical data reshaping in R:

Recoding means changing the levels of a variable, for instance changing “1” to “woman” and “2” to “man”. Binning means aggregating several variable levels to one, for instance aggregating the values From “1.00 meter” to “1.60 meter” to “small_size”.

Both operations are frequently necessary in practical data analysis. In this post, we review some methods to accomplish these two tasks.

Click through for examples of techniques you can use.

Comments closed