Press "Enter" to skip to content

Month: March 2018

Data Modeling And Neural Networks

I have two new posts in my launching a data science project series.  The first one covers data modeling theory:

Wait, isn’t self-supervised learning just a subset of supervised learning?  Sure, but it’s pretty useful to look at on its own.  Here, we use heuristics to guesstimate labels and train the model based on those guesstimates.  For example, let’s say that we want to train a neural network or Markov chain generator to read the works of Shakespeare and generate beautiful prose for us.  The way the recursive model would work is to take what words have already been written and then predict the most likely next word or punctuation character.

We don’t have “labeled” data within the works of Shakespeare, though; instead, our training data’s “label” is the next word in the play or sonnet.  So we train our model based on the chains of words, treating the problem as interdependent rather than a bunch of independent words just hanging around.
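To make the self-supervised idea concrete, here is a minimal sketch in R of turning raw text into training pairs, where each word’s “label” is simply the word that follows it.  The sample text and variable names are mine, for illustration only.

```r
# Minimal sketch: build (current word, next word) training pairs from raw text.
# A real model would use longer context windows, but the labeling idea is the same.
text <- "But soft what light through yonder window breaks"

words <- unlist(strsplit(tolower(text), "\\s+"))

training_pairs <- data.frame(
  current_word = words[-length(words)],  # the "feature"
  next_word    = words[-1],              # the self-generated "label"
  stringsAsFactors = FALSE
)

training_pairs
```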

Then, we implement a data model using a neural network:

At this point, I want to build the Keras model. I’m creating a build_model function in case I want to run this over and over. In a real-life scenario, I would perform various optimizations, do cross-validation, etc. In this scenario, however, I am just going to run one time against the full training data set, and then evaluate it against the test data set.

Inside the function, we start by declaring a Keras model. Then, I add three layers to the model. The first layer is a dense (fully-connected) layer which accepts the training data as inputs and uses the Rectified Linear Unit (ReLU) activation mechanism. This is a decent first guess for activation mechanisms. We then have a dropout layer, which reduces the risk of overfitting on the training data. Finally, I have a dense layer for my output, which will give me the salary.

I compile the model using the RMSProp optimizer. This is a good default optimizer for neural networks, although you might try Adagrad, Adam, or AdaMax as well. Our loss function is Mean Squared Error, which is a good fit for measuring the error in a regression. Finally, I’m interested in the Mean Absolute Error: that is, the dollar amount difference between our function’s prediction and the actual salary. The closer to $0 this is, the better.
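The post walks through the full code; as a rough sketch, the structure described above looks something like this with the keras package in R.  The layer sizes and dropout rate here are placeholders of my own, not necessarily the values in the post.

```r
library(keras)

# Sketch of a build_model function: dense (ReLU) -> dropout -> dense output,
# compiled with RMSProp, MSE loss, and MAE as the reported metric.
# Unit counts and the dropout rate are illustrative, not the post's exact values.
build_model <- function(num_features) {
  model <- keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu", input_shape = num_features) %>%
    layer_dropout(rate = 0.2) %>%
    layer_dense(units = 1)   # single output: the predicted salary

  model %>% compile(
    optimizer = "rmsprop",
    loss = "mse",
    metrics = c("mean_absolute_error")
  )

  model
}

# Example usage, assuming train_x is a numeric matrix and train_y a vector:
# model <- build_model(ncol(train_x))
# model %>% fit(train_x, train_y, epochs = 50, batch_size = 32)
```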

Click through for those two posts, including seeing how close I get to a reasonable model with my neural network.


Using STRING_AGG In SQL Server 2017

Derik Hammer talks about one of the nicer T-SQL additions in SQL Server 2017:

Creating comma-separated strings from a column, or delimited strings as I like to call them, is a very common problem in SQL. Beginning with SQL Server 2017 and Azure SQL Database, there is now another option beyond the existing set of solutions: STRING_AGG().

I would like to convince you to use STRING_AGG over the other methods. So, let us begin with the competing solutions.

I completely agree and have been switching code over to use STRING_AGG since upgrading to 2017.  The code is so much clearer as a result compared to STUFF + FOR XML PATH concatenation.
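To give a flavor of the difference, here is roughly what the two approaches look like against a hypothetical dbo.Tags table, building one tag list per post:

```sql
-- Hypothetical Posts and Tags tables, for illustration only.

-- Prior to SQL Server 2017: STUFF plus FOR XML PATH.
SELECT
    p.PostID,
    STUFF(
        (SELECT ',' + t.TagName
         FROM dbo.Tags t
         WHERE t.PostID = p.PostID
         FOR XML PATH(''), TYPE).value('.', 'NVARCHAR(MAX)'),
        1, 1, '') AS TagList
FROM dbo.Posts p;

-- SQL Server 2017 and Azure SQL Database: STRING_AGG.
SELECT
    t.PostID,
    STRING_AGG(t.TagName, ',') WITHIN GROUP (ORDER BY t.TagName) AS TagList
FROM dbo.Tags t
GROUP BY t.PostID;
```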


Invalid Dates And Power BI

Melissa Coates notes a discrepancy between the Desktop and Service versions of Power BI:

Last week I got involved with a customer issue. A refresh of the data imported to a PBIX always works in Power BI Desktop, but the refresh operation intermittently fails in the Power BI Service. Their workaround had been to refresh the PBIX in Desktop and re-upload the file to the Service. This post is about finding and fixing the root cause of the issue – this is as of March 2018, so this behavior may very well change in the future.

Turns out, the source of the problem was that the customer’s Open Orders table can contain invalid dates – not all rows, just some rows. Since Open Orders data can fluctuate, that explains why it presented as an intermittent refresh issue. Here’s a simple mockup that shows one row which contains an invalid date:

At this point, we have two open questions:
(1) What is causing the refresh error?
(2) Why is the refresh behavior different in the Service than the Desktop tool?

Read on for the explanation of the difference, as well as a fix to prevent refresh errors due to invalid dates.


Row Goals On Anti-Joins

Paul White continues his row goals series:

The optimizer assumes that people write a semi join (indirectly e.g. using EXISTS) with the expectation that the row being searched for will be found. An apply semi join row goal is set by the optimizer to help find that expected matching row quickly.

For anti join (expressed e.g. using NOT EXISTS) the optimizer’s assumption is that a matching row will not be found. An apply anti join row goal is not set by the optimizer, because it expects to have to check all rows to confirm there is no match.

If there does turn out to be a matching row, the apply anti join might take longer to locate this row than it would if a row goal had been used. Nevertheless, the anti join will still terminate its search as soon as the (unexpected) match is encountered.

This is a shorter article but very useful in understanding row goals, along with the rest of his series.
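For reference, the two query shapes Paul is describing look like this; the Customers and Orders tables are stand-ins:

```sql
-- Semi join: customers who have at least one order.
-- The optimizer assumes a match exists, so it sets a row goal to find it quickly.
SELECT c.CustomerID
FROM dbo.Customers c
WHERE EXISTS
(
    SELECT 1
    FROM dbo.Orders o
    WHERE o.CustomerID = c.CustomerID
);

-- Anti join: customers with no orders at all.
-- The optimizer assumes there is no match, so it does not set a row goal.
SELECT c.CustomerID
FROM dbo.Customers c
WHERE NOT EXISTS
(
    SELECT 1
    FROM dbo.Orders o
    WHERE o.CustomerID = c.CustomerID
);
```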


Think Logically When Debugging

Kenneth Fisher explains his debugging technique:

Almost everything is made up of smaller pieces. When you are running across a problem, (well, after you check the stupid things) start breaking up what you are doing into smaller pieces. I’m going to give a simplified version of an actual example of something I did recently. No code examples (sorry), just discussion, and only most of the tests so you can get the idea.

The problem
I’m running a job that runs a bat file that uses SQLCMD to run a stored procedure that references a linked server. I’m getting a double hop error.

There’s some good advice in here.  My slight modification: if you have strong priors as to the probability distribution of causes, check the one or two most likely culprits first; if that doesn’t fix it, then start from the beginning and work through the pieces.  This assumes a reasonable probability distribution, which typically means that you have experienced the problem (or something much like it) before.


Finding A Database’s Availability Group

David Fowler has a quick stored procedure to figure out to which availability group a particular database belongs:

If you happen to be managing SQL Servers with a large number of databases and availability groups, it can sometimes be difficult to keep track of which database belongs to which availability group.

sp_WhatsMyAG will tell you just that.  You can either provide it with a database name, and it’ll tell you the AG that database belongs to, or leave the parameter NULL to get the AG of the database whose context you’re currently in.

Click through for the script.
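If you just need the lookup ad hoc, a query along these lines against the standard catalog views will also do it.  This is only a sketch of the underlying idea, not David’s procedure.

```sql
-- Which availability group (if any) does a given database belong to?
-- Sketch against the AG catalog views; not the sp_WhatsMyAG code itself.
DECLARE @DatabaseName sysname = DB_NAME();  -- default to the current database

SELECT
    adc.database_name,
    ag.name AS availability_group_name
FROM sys.availability_groups ag
    INNER JOIN sys.availability_databases_cluster adc
        ON ag.group_id = adc.group_id
WHERE adc.database_name = @DatabaseName;
```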


Building A SQL Operations Studio Extension

Drew Furgiuele is building a bridge to nowhere in SQL Operations Studio:

So when I sat down to write noWHERE, the concept seemed simple enough: I’ll code some text parsing into extension.js to look at the editor window’s text and look for UPDATE and DELETE statements and then try to see if there’s a missing WHERE clause. Seems reasonable enough, right? Except writing a SQL parser isn’t really something I want to do (or would be ABLE to do, frankly).

Since the module is written in JavaScript, I can leverage node.js modules to do most of the work for me. In this case, I downloaded a module called sqlite-parser. It’s a barebones, no-frills SQL syntax parser. It doesn’t support any dialect-specific stuff for T-SQL, but for this initial project, I only really care about ANSI UPDATE and DELETE statements.

There’s also a T-SQL parser available via PowerShell, though I suppose that one isn’t directly available within SqlOps.  One thing that would make this really good would be to intercept the execute operation and pop up a warning dialog if there’s no WHERE clause.  There are some third-party tools which do this for Management Studio, and a gate like that really saves you in the event of an emergency.
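The general shape of the check Drew describes might look something like the sketch below.  I’m assuming sqlite-parser returns an AST whose statement nodes carry a variant and an optional where property; treat those property names as assumptions on my part rather than a definitive reading of the module.

```javascript
// Rough sketch of a missing-WHERE check using sqlite-parser.
// The AST property names (statement, variant, where) are assumptions.
const sqliteParser = require('sqlite-parser');

function findRiskyStatements(sqlText) {
  let ast;
  try {
    ast = sqliteParser(sqlText);
  } catch (e) {
    return [];  // not parseable as ANSI SQL, so nothing to flag
  }

  return (ast.statement || []).filter(stmt =>
    (stmt.variant === 'update' || stmt.variant === 'delete') && !stmt.where
  );
}

// One risky DELETE, one safe UPDATE:
console.log(findRiskyStatements(
  "DELETE FROM laundry; UPDATE pants SET clean = 1 WHERE id = 3;"
));
```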


Be Wary Of Colliders When Analyzing Data

Keith Goldfeld has an interesting demonstration of a collider variable and how it can lead us to incorrect conclusions during analysis:

In this (admittedly thoroughly made-up though not entirely implausible) network diagram, the test score outcome is a collider, influenced by a test preparation class and socio-economic status (SES). In particular, both the test prep course and high SES are related to the probability of having a high test score. One might expect an arrow of some sort to connect SES and the test prep class; in this case, participation in test prep is randomized so there is no causal link (and I am assuming that everyone randomized to the class actually takes it, a compliance issue I addressed in a series of posts starting with this one.)

The researcher who carried out the randomization had a hypothesis that test prep actually is detrimental to college success down the road, because it de-emphasizes deep thinking in favor of rote memorization. In reality, it turns out that the course and subsequent college success are not related, indicated by the absence of a connection between the course and the long-term outcome.

Read the whole thing.  H/T R-Bloggers
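If you want to see the effect in miniature before clicking through, here is a toy simulation in the same spirit.  The coefficients and cutoff are my own, not Keith’s.

```r
# Toy collider simulation: test prep is randomized, college success depends
# only on SES, and test score is a collider influenced by both.
set.seed(2018)
n <- 10000

ses <- rnorm(n)                           # socio-economic status
test_prep <- rbinom(n, 1, 0.5)            # randomized test prep class
test_score <- 50 + 10 * ses + 5 * test_prep + rnorm(n, sd = 5)
college_success <- 2 * ses + rnorm(n)     # no causal effect of test prep

# Unconditional model: the test prep coefficient is near zero, as it should be.
coef(summary(lm(college_success ~ test_prep)))["test_prep", ]

# Conditioning on the collider (looking only at high scorers) induces a
# spurious negative association between test prep and college success.
high_scorers <- test_score > quantile(test_score, 0.75)
coef(summary(lm(college_success ~ test_prep, subset = high_scorers)))["test_prep", ]
```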


XGBoost In R

Fisseha Berhane explains how to implement Extreme Gradient Boosting in R:

What makes it so popular are its speed and performance. It gives some of the best performance in many machine learning applications. It is an optimized gradient-boosting machine learning library. The core algorithm is parallelizable, so it can use all the processing power of your machine and the machines in your cluster. In R, according to the package documentation, since the package can automatically do parallel computation on a single machine, it could be more than 10 times faster than existing gradient boosting packages.

xgboost shines when we have lots of training data where the features are numeric or a mixture of numeric and categorical fields. It is also important to note that xgboost is not the best algorithm out there when all the features are categorical or when the number of rows is less than the number of fields (columns).

xgboost is a nice complement to neural networks, as they tend to be great at different things.
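For anyone who wants to see the minimal R API before reading the full post, a basic regression fit looks roughly like this.  The synthetic data and parameter values are mine, not from the article.

```r
library(xgboost)

# Minimal xgboost regression on synthetic data; parameters are illustrative.
set.seed(42)
X <- matrix(rnorm(1000 * 10), ncol = 10)
y <- 2 * X[, 1] + X[, 2]^2 + rnorm(1000, sd = 0.1)

dtrain <- xgb.DMatrix(data = X, label = y)

model <- xgb.train(
  params = list(
    objective = "reg:squarederror",
    eta = 0.1,
    max_depth = 4,
    nthread = 2               # parallel tree construction on a single machine
  ),
  data = dtrain,
  nrounds = 100
)

preds <- predict(model, X)
sqrt(mean((preds - y)^2))     # RMSE on the training data
```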


Data Cleansing With R

I continue my series on launching a data science project:

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some comment notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

  • Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.
  • Some people put in thousands separators like commas, or periods (which serve as the thousands separator in countries like Germany). I want to strip them out, particularly because a stray period might interfere with my analysis, turning 100.000 into $100 instead of $100K.
  • Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.

Something I’ve liked about the data professionals survey is that there are a few places with room for data cleansing, but not everything is awful.  It’s neither artificially clean nor beyond repair, so it’s good for use as an example.
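The exact regex is in the post, but the general shape of the cleanup (strip pennies, then remove currency symbols, spaces, and separators before converting to numeric) looks something like this sketch, which is not the post’s code:

```r
# Sketch of the SalaryUSD cleanup idea; not the exact regex from the post.
raw_salary <- c("$100,000", "100.000", "55000.00", " 72 500 ", "95000")

clean_salary <- gsub("\\.[0-9]{1,2}$", "", raw_salary)  # drop trailing pennies
clean_salary <- gsub("[$, .]", "", clean_salary)        # drop $, commas, spaces,
                                                        # and leftover separators
salary_usd <- as.numeric(clean_salary)
salary_usd
# 100000 100000 55000 72500 95000
```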
