Press "Enter" to skip to content

Curated SQL Posts

Invalid Dates And Power BI

Melissa Coates notes a discrepancy between the Desktop and Service versions of Power BI:

Last week I got involved with a customer issue. A refresh of the data imported to a PBIX always works in Power BI Desktop, but the refresh operation intermittently fails in the Power BI Service. Their workaround had been to refresh the PBIX in Desktop and re-upload the file to the Service. This post is about finding and fixing the root cause of the issue – this is as of March 2018, so this behavior may very well change in the future.

Turns out, the source of the problem was that the customer’s Open Orders table can contain invalid dates – not all rows, just some rows. Since Open Orders data can fluctuate, that explains why it presented as an intermittent refresh issue. Here’s a simple mockup that shows one row which contains an invalid date:

At this point, we have two open questions:
(1) What is causing the refresh error?
(2) Why is the refresh behavior different in the Service than the Desktop tool?

Read on for the explanation of the difference, as well as a fix to prevent refresh errors due to invalid dates.

Comments closed

Row Goals On Anti-Joins

Paul White continues his row goals series:

The optimizer assumes that people write a semi join (indirectly e.g. using EXISTS) with the expectation that the row being searched for will be found. An apply semi join row goal is set by the optimizer to help find that expected matching row quickly.

For anti join (expressed e.g. using NOT EXISTS) the optimizer’s assumption is that a matching row will not be found. An apply anti join row goal is not set by the optimizer, because it expects to have to check all rows to confirm there is no match.

If there does turn out to be a matching row, the apply anti join might take longer to locate this row than it would if a row goal had been used. Nevertheless, the anti join will still terminate its search as soon as the (unexpected) match is encountered.

This is a shorter article but very useful in understanding row goals, along with the rest of his series.

Comments closed

Think Logically When Debugging

Kenneth Fisher explains his debugging technique:

Almost everything is made up of smaller pieces. When you are running across a problem, (well, after you check the stupid things) start breaking up what you are doing into smaller pieces. I’m going to give a simplified version of an actual example of something I did recently. No code examples (sorry), just discussion, and only most of the tests so you can get the idea.

The problem
I’m running a job that runs a bat file that uses SQLCMD to run a stored procedure that references a linked server. I’m getting a double hop error.

There’s some good advice in here.  My slight modification to this is that if you have strong priors as to the probability distribution of the cause, hit the top one or two items (in terms of likelihood of being the cause of failure) first, and if that doesn’t fix it, then go from the beginning.  This assumes a reasonable probability distribution, which typically means that you have experienced the problem (or something much like it) before.

Comments closed

Finding A Database’s Availability Group

David Fowler has a quick stored procedure to figure out to which availability group a particular database belongs:

If you happen to be managing SQL Servers with a large number of databases and availability groups, it can sometimes be difficult to keep track of which database belongs to which availability group.

sp_WhatsMyAG will tell you just that.  You can either provide it with the database name and it’ll tell you the AG that your specified database belongs to or you can leave the parameter NULL and you’ll get the AG of the database whose context you’re currently in.

Click through for the script.

Comments closed

Building A SQL Operations Studio Extension

Drew Furgiuele is building a bridge to nowhere in SQL Operations Studio:

So when I sat down to write noWHERE, the concept seemed simple enough: I’ll code some text parsing into extension.js to look at the editor window’s text and look for UPDATE and DELETE statements and then try to see if there’s a missing WHERE clause. Seems reasonable enough, right? Except writing a SQL parser isn’t really something I want to do (or would be ABLE to do, frankly).

Since the module is written in JavaScript, I can leverage node.js modules to do most of the work for me. In this case, I downloaded a module called sqlite-parser. It’s a barebones, no-frills SQL syntax parser. It doesn’t support any dialect-specific stuff for T-SQL, but for this initial project, I only really care about ANSI UPDATE and DELETE statements.

There’s also a T-SQL parser available via Powershell, though I suppose that one isn’t directly available within SqlOps.  One thing that would make this really good would be to intercept the execute operation and pop up a warning dialog if there’s no WHERE clause.  There are some third-party tools which do this for Management Studio and a gate like that really saves you in the event of an emergency.

Comments closed

Be Wary Of Colliders When Analyzing Data

Keith Goldfeld has an interesting demonstration of a collider variable and how it can lead us to incorrect conclusions during analysis:

In this (admittedly thoroughly made-up though not entirely implausible) network diagram, the test score outcome is a collider, influenced by a test preparation class and socio-economic status (SES). In particular, both the test prep course and high SES are related to the probability of having a high test score. One might expect an arrow of some sort to connect SES and the test prep class; in this case, participation in test prep is randomized so there is no causal link (and I am assuming that everyone randomized to the class actually takes it, a compliance issue I addressed in a series of posts starting with this one.)

The researcher who carried out the randomization had a hypothesis that test prep actually is detrimental to college success down the road, because it de-emphasizes deep thinking in favor of wrote memorization. In reality, it turns out that the course and subsequent college success are not related, indicated by an absence of a connection between the course and the long term outcome.

Read the whole thing.  H/T R-Bloggers

Comments closed

XGBoost In R

Fisseha Berhane explains how to implement Extreme Gradient Boosting in R:

What makes it so popular are its speed and performance. It gives among the best performances in many machine learning applications. It is optimized gradient-boosting machine learning library. The core algorithm is parallelizable and hence it can use all the processing power of your machine and the machines in your cluster. In R, according to the package documentation, since the package can automatically do parallel computation on a single machine, it could be more than 10 times faster than existing gradient boosting packages.

xgboost shines when we have lots of training data where the features are numeric or a mixture of numeric and categorical fields. It is also important to note that xgboost is not the best algorithm out there when all the features are categorical or when the number of rows is less than the number of fields (columns).

xgboost is a nice complement to neural networks, as they tend to be great at different things.

Comments closed

Data Cleansing With R

I continue my series on launching a data science project:

Now that we’ve performed some basic analysis, we will clean up the data set. I’m doing most of the cleanup in a single operation, but I do have some comment notes here, particularly around the oddities with SalaryUSD. The SalaryUSD column has a few problems:

  • Some people put in pennies, which aren’t really that important at the level we’re discussing. I want to strip them out.
  • Some people put in delimiters like commas or decimal points (which act as commas in countries like Germany). I want to strip them out, particularly because the decimal point might interfere with my analysis, turning 100.000 to $100 instead of $100K.
  • Some people included the dollar sign, so remove that, as well as any spaces.

It’s not a perfect regex, but it did seem to fix the problems in this data set at least.

Something I’ve liked about the data professionals survey is that there are a few places with room for data cleansing, but not everything is awful.  It’s neither artificially clean nor beyond repair, so it’s good for use as an example.

Comments closed

Dropping Columns With Logstash

Mike Hillwig shows how to ignore columns with Logstash:

Like I said earlier, we have some data that I know I’ll never use. This is flight performance data. The dataset contains diversion information. If a flight gets diverted more than once, it’s tracked here. I don’t care about that, so I’m dropping the diversion information for the second through fifth diversions. I’m also dropping some information about the airports that I believe I won’t need. This is the tricky part. Somewhere down the road, I’m going to need to enhance this data by converting all of the times to UTC.

Mike’s slowly building up to a complete, working example and it’s interesting to watch the progress along the way.

Comments closed

Columnstore And Merge Replication

Niko Neugebauer tests whether merge replicated tables can use columnstore indexes:

Adding this table to the publication will end up with the following, self-explaining error message, being very clear that the Clustered Columnstore Indexes are not supported for the Merge Replication[.]

There is no surprise here, as the same Clustered Columnstore Indexes are not supported for the Transactional Replication, but I feel that a great opportunity is lost and the Replication technology are being quite ignored by the emerged technologies, such as In-Memory & Columnstore, where the scenarios of replicating the Data Warehousing data is something that a lot of people can find very useful.

I wish it would be otherwise, and this would allow to bring more customers to use Columnstore Indexes.

Clustered columnstore indexes aren’t possible, but read on to learn whether non-clustered columnstore indexes are supported.

Comments closed