Month: February 2019

Hiding Work: The Nested Loop Operator

Erik Darling explains that the nested loop operator is like a duck, with more going on beneath the surface than it lets on:

I’m going to talk about my favorite example, because it can cause a lot of confusion, and can hide a lot of the work it’s doing behind what appears to be a friendly little operator.

Something to keep in mind is that I’m looking at the actual plans. If you’re looking at estimated/cached plans, the information you get back may be inaccurate, or may only be accurate for the cached version of the plan. A query plan reused with parameters that require a different amount of work may have very different numbers.

I like nested loop joins a lot, but there’s a big difference between a loop running a few dozen times and a loop running a couple hundred thousand times, even if the operator doesn’t show you that immediately.
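
To make the point about hidden work concrete, here is a minimal Python sketch of a nested loop join that counts how many times the inner input is scanned. It is a toy model rather than how SQL Server implements the operator, and the function name and row shapes are assumptions made for illustration.

```python
def nested_loop_join(outer_rows, inner_rows, key):
    """Join two lists of dicts and count how often the inner side is scanned."""
    inner_executions = 0
    matches = []
    for outer in outer_rows:
        inner_executions += 1  # one pass over the inner input per outer row
        for inner in inner_rows:
            if outer[key] == inner[key]:
                matches.append((outer, inner))
    return matches, inner_executions


inner = [{"id": i} for i in range(5)]
few_outer = [{"id": i} for i in range(24)]        # a few dozen outer rows
many_outer = [{"id": i} for i in range(200_000)]  # a couple hundred thousand

_, few = nested_loop_join(few_outer, inner, "id")
_, many = nested_loop_join(many_outer, inner, "id")
print(few, many)
```

The join logic is identical in both calls; only the number of executions of the inner side changes, which is the figure an actual plan exposes and an estimated plan can only guess at.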

Comments closed

Solving The Monty Hall Problem With R

Miroslav Rajter builds a Monty Hall problem simulator using R:

The original and simplest scenario of the Monty Hall problem is this: You are in a prize contest and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (Car), while behind the others is a loss (Goat). You first choose a door (let’s say door A). The contest host then opens another door behind which is a goat (let’s say door B), and then he asks whether you will stay with your original choice or switch doors. The question behind this is: what is the better strategy?

This is something that puzzled me for a very long time. This is fundamentally a Bayesian problem built around processing new information, and once I understood that, the answer was a lot clearer. H/T R-Bloggers.
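
The linked post uses R; as a rough sketch of the same idea, here is a small simulation in Python. The function names, trial count, and seed are my own choices and do not come from the original post.

```python
import random


def play(switch, rng):
    """Play one round; return True if the contestant wins the car."""
    doors = ["A", "B", "C"]
    prize = rng.choice(doors)
    choice = rng.choice(doors)
    # The host opens a door that is neither the contestant's pick nor the prize.
    opened = rng.choice([d for d in doors if d != choice and d != prize])
    if switch:
        # Move to the one remaining closed door.
        choice = next(d for d in doors if d != choice and d != opened)
    return choice == prize


def win_rate(switch, trials=100_000, seed=2019):
    """Estimate the win probability of a strategy over many rounds."""
    rng = random.Random(seed)
    wins = sum(play(switch, rng) for _ in range(trials))
    return wins / trials


stay_rate = win_rate(switch=False)
switch_rate = win_rate(switch=True)
print(f"stay: {stay_rate:.3f}, switch: {switch_rate:.3f}")
```

Staying wins only when the first pick was right (probability 1/3), so switching should win roughly two times out of three. The host's reveal is the new information that shifts the odds.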

Control Table Keys In cdata

John Mount announces a new feature in the cdata package:

In our cdata R package and training materials we emphasize the record-oriented thinking and how to design a transform control table. We now have an additional exciting new feature: control table keys.
The user can now control which columns of a cdata control table are the keys, including now using composite keys (that is, keys that are spread across more than one column). This is easiest to demonstrate with an example.

Read on for an example of how you can use this.

Comments closed

Using Calendar Tables

I have a post up on using calendar tables:

There’s one problem with picking a SQL Saturday in April: Easter and Passover tend to run right around that time, and nobody wants a SQL Saturday on Passover or the day before Easter. Unfortunately, our calendar table doesn’t include holiday information. So let’s add it!

Working with holidays and working with fiscal years versus calendar years are just two of the uses of calendar tables. But they’re the only two that I show.
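
The post itself works in T-SQL. As a language-neutral sketch of the holiday idea, the Python below builds a tiny April calendar and flags the day before Easter. It computes Western Easter with the well-known anonymous Gregorian algorithm and does not cover Passover; the function and column names are hypothetical.

```python
from datetime import date, timedelta


def easter_sunday(year):
    """Western (Gregorian) Easter via the anonymous Gregorian algorithm."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)


def april_calendar(year):
    """Build one row per April date with a couple of holiday-aware flags."""
    easter = easter_sunday(year)
    rows = []
    day = date(year, 4, 1)
    while day.month == 4:
        rows.append({
            "date": day,
            "is_saturday": day.weekday() == 5,
            "is_day_before_easter": day == easter - timedelta(days=1),
        })
        day += timedelta(days=1)
    return rows


# Saturdays in April 2019 that do not fall on the day before Easter.
candidates = [row["date"] for row in april_calendar(2019)
              if row["is_saturday"] and not row["is_day_before_easter"]]
print(candidates)
```

In a real calendar table these flags would be stored columns, so the holiday logic runs once at load time rather than in every query.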

Finding The Last Non-Null Value With Snowflake

Koen Verbeeck shows how two words make solving a problem with Snowflake a lot easier than with SQL Server:

Sometimes you need to find the previous value in a column. Easy enough, the LAG window function makes this a breeze (available since SQL Server 2012). But what if the previous value cannot be null? You can pass a default, but we actually need the previous value that was not null, even if it is a few rows back. This makes it a bit harder. T-SQL guru Itzik Ben-Gan has written about the solution to this problem: The Last non NULL Puzzle. It’s a bit of a tricky solution.

Click through for the magic words and if you’re on the SQL Server side, upvote this issue to get that functionality in SQL Server too.
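
To pin down what "the previous value that was not null" means, here is a small Python sketch of the semantics. It is neither the Snowflake syntax nor Itzik Ben-Gan's T-SQL solution, and the function name is hypothetical.

```python
def previous_non_null(values):
    """For each row, return the most recent non-None value from earlier rows."""
    result = []
    last_seen = None
    for value in values:
        result.append(last_seen)   # look back only; the current row is excluded
        if value is not None:
            last_seen = value      # remember it for the rows that follow
    return result


readings = [10, None, None, 20, None, 30]
print(previous_non_null(readings))  # [None, 10, 10, 10, 20, 20]
```

A plain LAG would return None for the third and fifth rows here, because the row immediately before each of them is null.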

Syncing Slicers In Power BI

Prathy Kamasani takes us through a recently added feature in Power BI:

As per Microsoft docs:
“This feature lets you create a custom group of slicers to keep synchronized. A default name is provided, but you can use any name you prefer.
The group name provides additional flexibility with slicers. You can create separate groups to sync slicers that use the same field, or put slicers that use different fields into the same group.”

First, let’s look at creating groups to sync slicers that use the same field. For the use case of syncing within a page, we can easily use the group functionality to do this.

Click through for a few demos of increasing complexity.

A Rant About ORMs

Ned Otter is not a fan of ORMs:

I’ve seen a lot of tech come and go in my time, but nothing I’ve seen vexes me more than “framework generated SQL”.  No doubt I’m ignorant about some aspects of it, but its usage continues to confound many a DBA.

To troubleshoot one of these bad boys, you might consider Google Glass, but it will fail you. The first issue is that these crappy frameworks generate a code tsunami that’s almost (or actually) unreadable by humans. The tables you know and love are aliased with names such as “Extent1” and the like. Multiple nestings of that, and it’s all gobbledygook aka spaghetti code.

These work great as long as you have more hardware to throw at the problem.

I would differentiate a micro-ORM like Dapper from the Hibernate or Entity Framework style of ORM that Ned has in mind. Dapper automates the data access layer, but you still write the SQL queries or stored procedures yourself.

Using The ROWVERSION Type For ETL

Max Vernon shows us how to use the ROWVERSION data type to tell how much work you have to do to ETL data over from one table to another:

The OLTP table implements a rowversion column that is automatically updated whenever a row is updated or inserted. The rowversion number is unique at the database level, and increments monotonically for all transactions that take place within the context of that database. The dbo.OLTP_Updates table is used to store the minimum row version available inside the transaction used to copy data from the OLTP table into the OLAP table. Each time this code runs it captures incremental changes. This is far more efficient than comparing all the rows in both tables using a hashing function since this method doesn’t require reading any data other than the source data that is either new, or has changed.

I think this is the first time I’ve seen someone use ROWVERSION types successfully.
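
The pattern is easier to see in miniature. The Python below is a toy model, not Max’s T-SQL: the class, the function, and the simple high-water mark are all assumptions, and it ignores the concurrency concerns that the "minimum row version available inside the transaction" detail is there to handle.

```python
class Database:
    """Toy stand-in for a database with a database-wide rowversion counter."""

    def __init__(self):
        self.rowversion = 0
        self.oltp = {}  # key -> (value, rowversion)

    def upsert(self, key, value):
        self.rowversion += 1  # every insert or update gets a new, larger version
        self.oltp[key] = (value, self.rowversion)


def incremental_load(db, olap, last_version):
    """Copy rows changed since last_version; return (new watermark, rows copied)."""
    changed = {k: v for k, (v, rv) in db.oltp.items() if rv > last_version}
    olap.update(changed)
    return db.rowversion, len(changed)


db, olap = Database(), {}
db.upsert(1, "a")
db.upsert(2, "b")
mark, first_batch = incremental_load(db, olap, 0)      # copies both rows

db.upsert(2, "b2")                                     # update an existing row
db.upsert(3, "c")                                      # insert a new row
mark, second_batch = incremental_load(db, olap, mark)  # copies only the changes

mark, third_batch = incremental_load(db, olap, mark)   # nothing changed
print(first_batch, second_batch, third_batch, olap)
```

Each run looks only at rows whose version is newer than the stored watermark, so the destination table is never read or hashed to work out what changed.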

Effective Identities And Power BI Embedded

Angela Henry shows how you can use Power BI Embedded for row-level security even when the accessing users don’t have Power BI accounts:

Now that you are familiar with Row Level Security in Power BI, how do you make it work when you want to pass in your customer’s identifier because your customers don’t have Power BI accounts? It seems like the only way to make dynamic row level security is to use the Username() DAX function? But wait, doesn’t that require the user to have a Power BI account? Sigh, it seems we are going in circles.

The one thing these articles don’t talk about is that when you are using Power BI Embedded, you can pass in whatever you like for the EffectiveIdentity via the Power BI API and it will “overwrite” the Username() function.  What?!  That’s right, it will completely ignore the Username() function and use whatever you give it.  WooHoo!

Read on for the details.
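
As a rough illustration of "pass in whatever you like", the Python below builds the kind of request body that, as I understand it, the Power BI REST API’s GenerateToken call accepts for an effective identity. It only constructs and prints the JSON without calling the service. The customer identifier, role name, and dataset ID are placeholders, and the exact fields should be checked against the current API reference.

```python
import json


def build_embed_token_request(customer_id, role, dataset_id):
    """Build a GenerateToken-style body carrying an effective identity.

    customer_id is an arbitrary string; it does not need to be a Power BI account.
    """
    return {
        "accessLevel": "View",
        "identities": [
            {
                "username": customer_id,   # the value Username() will return
                "roles": [role],           # RLS role defined in the dataset
                "datasets": [dataset_id],  # dataset the identity applies to
            }
        ],
    }


# Placeholder values for illustration only.
body = build_embed_token_request(
    customer_id="customer-42",
    role="CustomerRole",
    dataset_id="00000000-0000-0000-0000-000000000000",
)
print(json.dumps(body, indent=2))
```

The application that requests the embed token decides what goes into username, which is why the end customer never needs a Power BI login.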

No-Code ML On Cloudera Data Science Workbench

Tim Spann has a post covering ML on the Cloudera Data Science Workbench:

Using Cloudera Data Science Workbench with Apache NiFi, we can easily call functions within our deployed models from Apache NiFi as part of flows. I am working against CDSW on HDP (https://www.cloudera.com/documentation/data-science-workbench/latest/topics/cdsw_hdp.html), but it will work for all CDSW regardless of install type.
In my simple example, I built a Python model that uses TextBlob to run sentiment analysis against a passed-in sentence. It returns Sentiment Polarity and Subjectivity, which we can immediately act upon in our flow.
CDSW is extremely easy to work with and I was up and running in a few minutes. For my model, I created a python 3 script and a shell script for install details. Both of these artifacts are available here: https://github.com/tspannhw/nifi-cdsw.

The “no code” portion was less interesting to me than the scalable ML portion, as “no code” either drops into tedium or ends up being replaced by code.
