
Curated SQL Posts

NFL Player Stats In Power BI

Dustin Ryan shares his NFL player stats and analysis Power BI desktop file:

I’ve had a lot of people ask me for this over the past few months and it’s finally (mostly) ready! There are still a few things I’d like to do with the data models and reports, but I wanted to go ahead and get the content shared out, since I know many people use this for Fantasy Football drafts, which generally happen during the third week of the NFL preseason.

So here it is. I’ve spent a decent amount of time scraping the data from a few different websites in order to put something together I thought would be useful and fun, so please take a look and enjoy it!

Click through for the file and a YouTube video with more info.


Moving Data Between Data Lakes

Jeffrey Verheul shows us how to use AdlCopy to migrate data from one Azure Data Lake to another:

Migrating data from one Data Lake to the other
We started out with a test version of a Data Lake, and this week I needed to migrate data to the production version of our Data Lake. After a lot of trial and error I couldn’t find a good way to migrate data. In the end I found a tool called AdlCopy. This is a command-line tool that copies files for you. Let me show you how easy it is.

Download & Install
AdlCopy needs to be installed on your machine. You can find the download here. By default the tool will install the files in “C:\Users\<username>\Documents\AdlCopy\”, but this can be changed in the setup wizard.

Click through to see how to use this tool.
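
To give a flavor of the syntax, here is a hedged sketch of a Data-Lake-to-Data-Lake copy using the tool’s /Source and /Dest switches; the account and folder names are made up:

# Run from the AdlCopy install folder noted above; "testlake" and "prodlake" are hypothetical accounts.
cd "C:\Users\<username>\Documents\AdlCopy\"
.\AdlCopy.exe /Source swebhdfs://testlake.azuredatalakestore.net/data/ `
              /Dest swebhdfs://prodlake.azuredatalakestore.net/data/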


Creating Timelines With dbatools

Marcin Gminski shows how to pull SQL Agent and backup history out of SQL Server and display it as a visual history timeline:

Currently, the output from the following commands is supported:

  • Get-DbaAgentJobHistory
  • Get-DbaBackupHistory

You will run the above commands as you normally would, but pipe the output to ConvertTo-DbaTimeline, the same way as you would with any other ConvertTo-* PowerShell function. The output is a string that, most of the time, you will save as a file using the Out-File command in order to open it in a browser.

Then, with the ConvertTo-DbaTimeline cmdlet, you can convert that into an HTML page which looks pretty good.
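
As a quick sketch of that pipeline (the instance name, date range, and output path below are placeholders):

# Render the last week of Agent job history as an HTML timeline.
Get-DbaAgentJobHistory -SqlInstance sql01 -StartDate (Get-Date).AddDays(-7) |
    ConvertTo-DbaTimeline |
    Out-File -FilePath C:\Temp\JobTimeline.html -Encoding utf8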


The Basics Of DAX

Matthew Brice walks us through filters and calculations in DAX:

CALCULATE is somewhat unique in that it evaluates the 2nd, 3rd, …nth parameter first, and evaluates the first parameter last using values from my Filter Context Box. I think it is extremely helpful to list briefly the steps CALCULATE performs whenever it is invoked. (So maybe we are not at 10,000 feet, but 5,000?)

The CALCULATE function performs the following operations:

  1. Create a new filter context by cloning the existing one. (***Important visual step!***)

  2. Move rows in the row context to the new clone filter context box one by one, replacing filters if they reference the same column. (We will ignore this step for this post.)

  3. Evaluate each filter argument to CALCULATE in the old filter context, then add column filters to the new clone filter context box one by one, replacing existing column filters if they reference the same column.

  4. Evaluate the first argument in the newly constructed filter context.

  5. Destroy this newly created, cloned filter context box before moving on to calculating the next “cell.”
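
To make these steps concrete, here is a minimal measure against a hypothetical Sales table. The filter argument is evaluated in the old filter context (step 3), replaces any existing filter on Sales[Color] in the cloned context, and then the SUM is evaluated in that new context (step 4):

Red Sales :=
CALCULATE (
    SUM ( Sales[Amount] ),
    Sales[Color] = "Red"    -- replaces any incoming filter on Sales[Color]
)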

If you’re interested in getting started with DAX, this is a good place to begin.


Dealing With Multicollinearity With R

Chaitanya Sagar explains the concept of multicollinearity in linear regressions and how we can mitigate this issue in R:

Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, to the set of independent variables. This new variable, Z, adds no information beyond what X and Y already provide, and the model cannot determine unique coefficients for variables related this way.

Multicollinearity may arise from several factors. Inclusion or incorrect use of dummy variables in the system may lead to multicollinearity. Another cause is the use of derived variables, i.e., one variable computed from other variables in the system, as in the example at the beginning of the article. Multicollinearity can also arise from including variables which are similar in nature, provide similar information, or are highly correlated with each other.
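
As a quick R illustration of the point above (simulated data; vif() assumes the car package is installed):

set.seed(42)
x <- rnorm(100)
y <- rnorm(100)
z <- 2 * x + 3 * y                # perfect multicollinearity: an exact linear combination
w <- x + rnorm(100, sd = 0.1)     # near-perfect multicollinearity
response <- x + y + rnorm(100)

coef(lm(response ~ x + y + z))    # z's coefficient comes back NA: no unique estimate exists
library(car)
vif(lm(response ~ x + w))         # large variance inflation factors flag the problem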

Multicollinearity can make regression analysis trickier, and it’s worth knowing about.  H/T R-bloggers.


When Cassandra Makes Sense

Anmol Sarna explains the pros and cons of using Apache Cassandra:

But, as we know, nothing is perfect, and neither is the Cassandra database. What I mean by this is that you cannot have a perfect package: if you wish for one brilliant feature, you might have to compromise on other features. In today’s blog, we will go through some of the benefits of selecting Cassandra as your database, as well as the problems/drawbacks that one might face in choosing Cassandra for an application.
I have also written some earlier blogs which you can go through for reference if you want to know what Cassandra is, how to set it up, and how it performs its reads and writes.

The only question is whether or not we should pick Cassandra over the other databases that are available. So let’s start by having a quick look at when to use the Cassandra database. This should give a clear picture to anyone unsure about whether to give Cassandra a try.

This is a level-headed analysis of Cassandra, so check it out.


What To Watch When Using VSS Snapshots

Erik Darling shows us the wait stats associated with the Volume Shadow Copy Service (VSS):

A while back I wrote about the Perils of VSS Snaps.

After working with several more clients having similar issues, I decided it was time to look at things again. This time, I wanted blood. I wanted to simulate a slow VSS snap and see what kind of wait stats I’d have to look out for.

Getting software and rigging stuff up to be slow would have been difficult.

Instead, we’re going to cheat and use some old DBCC commands.

This one almost got the “Wacky Ideas” tag but I’m grading on a curve for that category.


Your Data’s Not That Big

Larry White throws a bit of cold water on the distributed computing movement:

Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because it was “easy,” which it was, if you ignore all the work needed to make it go fast. It seemed pretty clear to me that it could have been written in Java to run on a single machine with a much smaller staff.

One definition of “big data” is “data that is too big to fit on one machine.” By that definition, what is “big data” for one language is plain old “data” for another. Java, with its efficient memory management, high performance, and multi-threading, can get a lot done on one machine. To do data science in Java, however, you need data science tools: Tablesaw is an open-source (Apache 2) Java data science platform that lets users work with data on a single machine. It’s a dataframe and visualization framework. Most data science currently done in clusters could be done on a single machine using Tablesaw paired with a Java machine learning library like Smile.

But you don’t have to take my word for that.

There are some interesting thoughts in this post, but there are limits to what a single machine can do.
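
For a flavor of that single-machine workflow, here is a minimal, untested Tablesaw sketch; the CSV file and column names are made up:

import tech.tablesaw.api.Table;
import static tech.tablesaw.aggregate.AggregateFunctions.mean;

public class SingleMachineDemo {
    public static void main(String[] args) throws Exception {
        // Load a (hypothetical) CSV into an in-memory dataframe.
        Table sales = Table.read().csv("sales.csv");

        // Group and aggregate in a couple of lines, no cluster required.
        System.out.println(sales.summarize("amount", mean).by("region"));
    }
}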


Including R Visuals In Power BI Dashboards

Parker Stevens shows how to include R visuals in a Power BI dashboard:

Let’s finish up this post with a quick example of how to code the elusive line chart with two y-axes. This always seems to be asked in the forums and it’s pretty easy to implement.

Follow the same steps as shown above to bring in a new R visual. Since we need a column to pass into the visual and open up the editor, let’s just throw in the Angle field that we made previously. With the code editor available we can start writing the R script. In this example, we are going to need some data that is available in a specific R package, called “ggplot2.” Go ahead and install the package by typing the following code the same way we installed scatterplot3d:

install.packages("ggplot2")
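
As a hedged sketch of one way to build that two-axis chart (made-up data): ggplot2’s sec_axis() requires the secondary axis to be a transformation of the primary, so the second series is rescaled to fit and the axis labels are scaled back:

library(ggplot2)

df <- data.frame(x  = 1:10,
                 y1 = cumsum(rnorm(10, mean = 10)),
                 y2 = cumsum(rnorm(10, mean = 100)))
scale_factor <- max(df$y1) / max(df$y2)

ggplot(df, aes(x)) +
  geom_line(aes(y = y1), colour = "steelblue") +
  geom_line(aes(y = y2 * scale_factor), colour = "firebrick") +
  scale_y_continuous(name = "y1",
                     sec.axis = sec_axis(~ . / scale_factor, name = "y2"))

Inside a Power BI R visual, the fields you drag in arrive as a data frame named dataset, so you would build df from that instead of generating random data.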

There are two interesting examples here, including one which accepts an external parameter.


Deploying To Power BI Report Server Using Powershell

Rob Sewell shows us how to automate Power BI Report Server deployments:

But I don’t want to have to do this each time and there will be multiple pbix files, so I wanted to automate the solution. The end result was a VSTS or TFS release process so that I could simply drop the pbix into a git repository, commit my changes, sync them, and have the system deploy them automatically.

As with all good ideas, I started with a Google search and found this post by Bill Anton, which gave me a good start. (I could not get the connection string change to work in my test environment, but this was not required, so I didn’t really examine why.)

I wrote a function that I can use via TFS or VSTS by embedding it in a PowerShell script.

Click through for the script.
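
This is not Rob’s script, but as a rough sketch of the upload step, assuming the ReportingServicesTools module (the server and folder names are hypothetical):

Import-Module ReportingServicesTools

# Open a REST session against the report portal and push every pbix in a local folder.
$session = New-RsRestSession -ReportPortalUri 'http://pbirs01/reports'
Get-ChildItem -Path '.\Reports' -Filter '*.pbix' | ForEach-Object {
    Write-RsRestCatalogItem -WebSession $session -Path $_.FullName -RsFolder '/Deployed' -Overwrite
}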
