Press "Enter" to skip to content

Curated SQL Posts

Principal Component Analysis With Stack Overflow Data

Julia Silge explains Principal Component Analysis and shows us an example using Stack Overflow data:

We have tidy data, both because that’s what I get when querying our databases and because it is useful for exploratory data analysis when preparing for a machine learning algorithm like PCA. To implement PCA, we need a matrix, and in this case a sparse matrix makes most sense. Most developers do not visit most technologies so there are lots of zeroes in our matrix. The tidytext package has a function cast_sparse() that takes tidy data and casts it to a sparse matrix.

sparse_tag_matrix <- tag_percents %>%
    tidytext::cast_sparse(User, Tag, Value)

Several of the implementations for PCA in R are not sparse matrix aware, such as prcomp(); the first thing it will do is coerce the BEAUTIFUL SPARSE MATRIX you just made into a regular matrix, and then you will be sitting there for one zillion years with no RAM left. (That is a precise and accurate estimate from my benchmarking, obviously.) One option that does take advantage of sparse matrices is the irlba package.

This is a great walkthrough of an important topic.

Comments closed

Joining Tables In DAX

Marco Russo bridges a gap in moving from SQL to DAX:

The SQL language offers the following types of JOIN:

  • INNER JOIN
  • OUTER JOIN
  • CROSS JOIN

The result of a JOIN does not depends on the presence of a relationship in the data model. You can use any column of a table in a JOIN condition.

In DAX there are two ways you can obtain a JOIN behavior. First, you can leverage existing relationships in the data model in order to query data included in different tables, just as you wrote the corresponding JOIN conditions in the DAX query. Second, you can write DAX expressions producing a result equivalent to certain types of JOIN. In any case, not all the JOIN operations available in SQL are supported in DAX.

Read on for several examples.

Comments closed

dbachecks Improvements

Rob Sewell shows off some improvements in the dbachecks library:

With the latest release of dbachecks we have added a new check for testing that foreign keys and constraints are trusted thanks to Cláudio Silva b | t

To get the latest release you will need to run

You should do this regularly as we release new improvements frequently.

We have also added better descriptions for the checks which was suggested by the same person who inspired the previous improvement I blogged about here

Click through for more details.

Comments closed

Using xplain To Interpret Model Results

Joachim Zuckarelli walks us through the xplain package in R:

The above XML produces the following output (don’t worry too much about the call of xplain(), we will discuss later on in more detail how to work with the xplain() function):

library(car)
library(xplain)
xplain(call="lm(education ~ young + income + urban, data=Anscombe)", 
xml="http://www.zuckarelli.de/xplain/example_lm_foreach.xml")

##
## Call:
## lm(formula = education ~ young + income + urban, data = Anscombe)
##
## Coefficients:
## (Intercept) young income urban
## -286.83876 0.81734 0.08065 -0.10581
##
##
## Interpreting the coefficients
## —————————–
## Your coefficient ‘(Intercept)’ is smaller than zero.
##
## Your coefficient ‘young’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.82 for
## any increase of 1 in your independent variable ‘young’.
##
## Your coefficient ‘income’ is larger than zero. This means that the
## value of your dependent variable ‘education’ changes by 0.081 for
## any increase of 1 in your independent variable ‘income’.
##
## Your coefficient ‘urban’ is smaller than zero. This means that the
## value of your dependent variable ‘education’ changes by -0.11 for
## any increase of 1 in your independent variable ‘urban’.

I’ll be interested in looking at this in more detail, though my first glance indication is that it’ll be useful mostly in large shops with different teams creating and using models.

Comments closed

Sentiment Analysis Of Hotel California

Sara Locatelli analyzes the lyrics to Hotel California using tidytext:

Sentiment analysis is a method of natural language processing that involves classifying words in a document based on whether a word is positive or negative, or whether it is related to a set of basic human emotions; the exact results differ based on the sentiment analysis method selected. The tidytext R package has 4 different sentiment analysis methods:

  • “AFINN” for Finn Årup Nielsen – which classifies words from -5 to +5 in terms of negative or positive valence
  • “bing” for Bing Liu and colleagues – which classifies words as either positive or negative
  • “loughran” for Loughran-McDonald – mostly for financial and nonfiction works, which classifies as positive or negative, as well as topics of uncertainty, litigious, modal, and constraining
  • “nrc” for the NRC lexicon – which classifies words into eight basic emotions (anger, fear, anticipation, trust, surprise, sadness, joy, and disgust) as well as positive or negative sentiment

Sentiment analysis works on unigrams – single words – but you can aggregate across multiple words to look at sentiment across a text.

To demonstrate sentiment analysis, I’ll use one of my favorite songs: “Hotel California” by the Eagles.

Read the whole thing, though you can’t check out afterward.

Comments closed

Digging Into The SQL Compute Context With R Services

Niels Berglund dives into how the SQL Compute Context works with R Services:

In the code above we use the RxInSqlServer() function to indicate we want to execute in a SQL context. The connectionString property defines where we execute, and the numTasks property sets the number of tasks (processes) to run for each computation, in Code Snippet 4 it is set to 1 which from a processing perspective should match what we do in Code Snippet 3. Before we execute the code in Code Snippet 4 we do what we did before we ran the code in Code Snippet 3:

  • Run Process Explorer as admin.
  • Navigate to the devenv.exe process in Process Explorer.
  • In addition, also look at the Launchpad.exe process in Process Explorer.

When we execute we see that the BxlServer.exe processes under the Microsoft.R.Host.exe processes are idling, but when we look at the Launchpad.exe process we see this:

This is a bit deep but interesting reading.

Comments closed

Creating Choropleths With ggcounty

Sebastian Sauer has a quick example of using ggcounty to plot data on a map of US counties:

This posts shows how easy it can be to build an visually pleasing plot. We will use hrbrmster’s ggcounty, which is an R package at this Github repo. Graphics engine is as mostly in my plots, Hadley Wickhams ggplot. All build on R. Standing on shoulders…

Disclaimer: This example heavily draws on hrbrmster example on this page. All credit is due to Rudy, and those on whose work he built up on.

In just a few lines of code, you can have a pretty nice map.

Comments closed

Getting Month Names Given Numbers In DAX

Philip Seamark shows how to convert month numbers into names in Power BI:

The most common solution I see offered is along the lines of a SWITCH statement that lists 12 conditions (one for each month).   This works, but can also be done using existing functions.

While DAX lacks a dedicated function to convert a number to a text version, such as DATENAME in T-SQL, we can get there in two functions using DATEVALUE wrapped in a FORMAT.

To demonstrate, I will create a simple table with 13 values (1 through 13) using the following calculated table.

This creates a single column table with 13 rows.

Read on for the rest of the story.

Comments closed

Overlaying Visuals In Power BI

Annie Xu gives us two methods for being able to jump between two visuals in the same space:

Disconnected Table method:

This method is more towards PowerBI modelers. Basically, the idea is to have a Field in a independent table (no relationship to other tables) as Slicer with your measure choice and then create a measure using SELECTEDVALUE function to have the measure dynamically switch referring measures based on the choice made on the slicer.

Click through for both methods.

Comments closed

JSON Data In SSIS

Stacia Varga shows a few methods for handling JSON data in SQL Server Integration Services:

And then I had to write about it in my book Introducing Microsoft SQL Server 2016 (which is free to download) when JSON support was added to SQL Server 2016. But I still didn’t have clients using JSON. It was interesting to me that I could use SQL Server to work with JSON data, but it was still theoretical to me rather than practical.

Therefore, I never thought much about how I would handle it in SQL Server Integration Services (SSIS). I just didn’t have a reason.

Until now. This seems to be the year that I am bumping into JSON left and right. It’s everywhere!

Read on for those methods as well as Stacia’s recommendation.

Comments closed