# Day: March 9, 2018

So looking at the code I wrote, you may have expected`c2` to hold `NA, 3, 5, ...` where it’s taking the current value and the prior value to make a window of width 2. Another reasonable alternative is that you may have expected `c2` to hold `NA, NA, 3, ...` where it’s summing up the prior two values. But hey, it’s kinda working like `cumsum()` right so that’s ok! But wait, check out `c3`. I gave `c3` a window of width 3 and it gave me `NA, 6, 9, ...` which looks like it’s summing the prior value, the current value, and the next value. …. That’s weird right?

It turns out the default behaviour for these rolling calculations is to center align the window, which means the window sits over the current value and tries it’s best to fit over the prior and next values equally. In the case of us giving it an even number it decided to put the window over the next values more than the prior values.

Knowing your window is critical when using a window function, and knowing that some functions have different default windows than others helps you be prepared.

The starting point was a recent proposal of Wilkinson’s, his HDoutliers algorithm. The plot above shows the default O3 plot for this method applied to the stackloss dataset. (Detailed explanations of O3 plots are in the OutliersO3 vignettes.) The stackloss dataset is a small example (21 cases and 4 variables) and there is an illuminating and entertaining article (Dodge, 1996) that tells you a lot about it.

Wilkinson’s algorithm finds 6 outliers for the whole dataset (the bottom row of the plot). Overall, for various combinations of variables, 14 of the cases are found to be potential outliers (out of 21!). There are no rows for 11 of the possible 15 combinations of variables because no outliers are found with them. If using a tolerance level of 0.05 seems a little bit lax, using 0.01 finds no outliers at all for any variable combination.

I have two new posts in my launching a data science project series.  The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning?  Sure, but it’s pretty useful to look at on its own.  Here, we use heuristics to guesstimate labels and train the model based on those guesstimates.  For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us.  The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet.  So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.

At this point, I want to build the Keras model. I’m creating a `build_model` function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the `RMSProp` optimizer. This is a good default optimizer for neural networks, although you might try `Adagrad``Adam`, or `AdaMax` as well. Our loss function is Mean Squared Error, which is good for dealing with finding the error in a regression. Finally, I’m interested in the Mean Absolute Error–that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to \$0 this is, the better.

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.

Creating comma separated strings from a column, or delimited strings as I like to call it, is a very common problem in SQL. Beginning with SQL Server 2017 and Azure SQL Database, there is now another option to the existing set of solutions, STRING_AGG().

I would like to convince you to use STRING_AGG over the other methods. So, let us begin with the competing solutions.

I completely agree and have been switching code over to use STRING_AGG since upgrading to 2017.  The code is so much clearer as a result compared to STUFF + FOR XML PATH concatenation.

Last week I got involved with a customer issue. A refresh of the data imported to a PBIX always works in Power BI Desktop, but the refresh operation intermittently fails in the Power BI Service. Their workaround had been to refresh the PBIX in Desktop and re-upload the file to the Service. This post is about finding and fixing the root cause of the issue – this is as of March 2018, so this behavior may very well change in the future.

Turns out, the source of the problem was that the customer’s Open Orders table can contain invalid dates – not all rows, just some rows. Since Open Orders data can fluctuate, that explains why it presented as an intermittent refresh issue. Here’s a simple mockup that shows one row which contains an invalid date:

At this point, we have two open questions:
(1) What is causing the refresh error?
(2) Why is the refresh behavior different in the Service than the Desktop tool?

Read on for the explanation of the difference, as well as a fix to prevent refresh errors due to invalid dates.

The optimizer assumes that people write a semi join (indirectly e.g. using `EXISTS`) with the expectation that the row being searched for will be found. An apply semi join row goal is set by the optimizer to help find that expected matching row quickly.

For anti join (expressed e.g. using `NOT EXISTS`) the optimizer’s assumption is that a matching row will not be found. An apply anti join row goal is not set by the optimizer, because it expects to have to check all rows to confirm there is no match.

If there does turn out to be a matching row, the apply anti join might take longer to locate this row than it would if a row goal had been used. Nevertheless, the anti join will still terminate its search as soon as the (unexpected) match is encountered.

This is a shorter article but very useful in understanding row goals, along with the rest of his series.

Almost everything is made up of smaller pieces. When you are running across a problem, (well, after you check the stupid things) start breaking up what you are doing into smaller pieces. I’m going to give a simplified version of an actual example of something I did recently. No code examples (sorry), just discussion, and only most of the tests so you can get the idea.

The problem
I’m running a job that runs a bat file that uses SQLCMD to run a stored procedure that references a linked server. I’m getting a double hop error.

There’s some good advice in here.  My slight modification to this is that if you have strong priors as to the probability distribution of the cause, hit the top one or two items (in terms of likelihood of being the cause of failure) first, and if that doesn’t fix it, then go from the beginning.  This assumes a reasonable probability distribution, which typically means that you have experienced the problem (or something much like it) before.

If you happen to be managing SQL Servers with a large number of databases and availability groups, it can sometimes be difficult to keep track of which database belongs to which availability group.

sp_WhatsMyAG will tell you just that.  You can either provide it with the database name and it’ll tell you the AG that your specified database belongs to or you can leave the parameter NULL and you’ll get the AG of the database whose context you’re currently in.

Click through for the script.