Press "Enter" to skip to content

Author: Kevin Feasel

Data Cleansing: Hockey Edition

Stacia Varga has a post covering some of the yeoman’s work of data cleansing:

For now, Power BI continues to be my tool of choice for my project. My goals for today’s post are two-fold: 1) to finish my work to address missing venues in the games table and 2) to investigate the remaining anomalies in the games and scores tables, as I noted in my last post.

To recap, I noted the following data values that warranted further investigation:

  • Total Goals minimum of 0 seems odd – because hockey games do not end in ties, I would expect a minimum of 1. I need to determine why this number is appearing.

  • Total Goals maximum of 29 seems high – it implies that either one team really smoked the opposing team or that both teams scored highly. I’d like to see what those games look like and validate the accuracy.

  • Record Losses minimum of 0 seems odd also – that means at least one team has never had a losing season?

  • Similarly, Record Wins minimum of 0 means one team has never won?

  • Record OT minimum of 0 – I’m not sure how to interpret. I need to look.

  • Score minimum of 0 seems to imply the same thing as Total Goals minimum of 0, which I have already noted seems odd.

This is the kind of work that we talk about as taking 80-95% of a data science team’s time:  finding “weird”-looking values, investigating them, and determining whether the input data really was correct or whether there was an issue.
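
Stacia is working in Power BI, but that first pass translates to other tools easily enough. Here is a minimal sketch in pandas, with hypothetical file and column names, that profiles the suspicious measures and pulls the rows behind the Total Goals “minimum of 0” anomaly:

import pandas as pd

# Hypothetical file and column names standing in for the games table.
games = pd.read_csv("games.csv")

# Profile the measures called out above before digging into individual rows.
print(games[["total_goals", "record_wins", "record_losses", "record_ot"]].describe())

# Hockey games can't end 0-0, so look at the games behind the minimum of 0.
suspect = games[games["total_goals"] == 0]
print(suspect[["game_id", "home_team", "away_team", "total_goals"]])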


Tuning Recommendations In SQL Server 2017

Kendra Little shows that even if you don’t want to use automatic tuning in SQL Server 2017, you can still see the tuning recommendations:

Even though automatic tuning wasn’t enabled, SQL Server picked up on the performance changes. I got a recommendation in sys.dm_db_tuning_recommendations.

  • reason: Average query CPU time changed from 2127.84ms to 66291.9ms
  • state: {"currentValue":"Active","reason":"AutomaticTuningOptionNotEnabled"}

The details also include the query id in question, and the plan_id of the “fast plan”.

It’s nice to check those out for a couple of weeks before turning automatic tuning on; that way, you can get more comfortable with the types of changes the tuning engine recommends, and if you happen to have a system which is terrible for automatic tuning, you can know that before turning the feature on.
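
If you want to poke at those recommendations yourself, the query is just a SELECT against sys.dm_db_tuning_recommendations. Here is a minimal sketch wrapping it in Python via pyodbc, with a placeholder connection string:

import pyodbc

# Placeholder connection string; any client that can run T-SQL will do.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=YourDb;Trusted_Connection=yes;"
)

rows = conn.execute(
    "SELECT name, reason, score, state, details "
    "FROM sys.dm_db_tuning_recommendations;"
).fetchall()

for name, reason, score, state, details in rows:
    print(name, score, reason)
    print(state)    # JSON showing, e.g., that the recommendation is Active but not applied
    print(details)  # JSON including the query id and the recommended (fast) plan id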


Using Python Within R

David Smith points out the new reticulate package:

With reticulate, you can:

  • Import objects from Python, automatically converted into their equivalent R types. (For example, Pandas data frames become R data.frame objects, and NumPy arrays become R matrix objects.)

  • Import Python modules, and call their functions from R

  • Source Python scripts from R

  • Interactively run Python commands from the R command line

  • Combine R code and Python code (and output) in R Markdown documents, as shown in the snippet below

The first thing that came to mind when reading this was the implementation of the keras package in R and how it calls out to TensorFlow (written in Python).  The ability to make R vs Python an “and” instead of an “or” proposition is quite powerful.


HDFS Erasure Coding

Ayush Tiwari explains how HDFS Erasure Coding lets us keep the same data durability while improving storage efficiency:

In this particular example, the codeword consists of 6 data cells (the blue cells) and 3 parity cells (the red cells); the parity cells are obtained simply by multiplying the data cells by a generator matrix.
A storage failure can be recovered by multiplying the surviving cells by the inverse of the generator matrix, as long as any ‘k’ out of the ‘k+m’ cells are available.
Therefore, the data durability here is 3, since the layout can tolerate up to 3 simultaneous failures, and the storage efficiency is 67% (6 data cells out of 9 stored cells), which works out to only 50% storage overhead on top of the original data.

It’s a good explanation of one of the biggest improvements to HDFS over the past several years.
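
If you want to see the generator-matrix idea in action, here is a small numeric sketch. It uses plain floating-point linear algebra with NumPy rather than the GF(2^8) arithmetic real Reed-Solomon coding uses, but the shape is the same: encode k data cells into k+m cells, lose up to m of them, and recover by inverting the surviving rows.

import numpy as np

k, m = 6, 3                                  # 6 data cells + 3 parity cells, as in the example
data = np.array([3., 1., 4., 1., 5., 9.])

# Generator matrix: identity rows pass the data cells through untouched;
# Vandermonde rows (nodes 1, 2, 3) produce the parity cells.
vander = np.vander(np.arange(1, m + 1), k, increasing=True).astype(float)
G = np.vstack([np.eye(k), vander])           # shape (k+m, k)

codeword = G @ data                          # the 9 cells actually stored

# Simulate losing any m cells; any k survivors are enough to recover.
lost = [0, 4, 7]
survivors = [i for i in range(k + m) if i not in lost]
G_sub = G[survivors[:k]]
y_sub = codeword[survivors[:k]]

recovered = np.linalg.solve(G_sub, y_sub)    # invert the surviving rows
print(np.allclose(recovered, data))          # True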


Ignoring Case In Table.Distinct With Power Query

Cedric Charlier shows how to ignore case when running Table.Distinct in Power Query:

Note that at the top of the table, I have the value ADO.NET twice, once with the “NET” part in uppercase and once in lowercase. If I apply Table.Distinct, the two values will be considered distinct and the duplicate won’t be removed.

Will it be an issue? If this is your key, and it sits on the one side of a one-to-many relationship in your model, then yes, it will be an issue.

Read on for the solution.
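
The fix in the post is in Power Query M, but the same idea shows up anywhere you deduplicate: compare on a case-normalized key while keeping the original values. As a loose pandas analogue (purely illustrative, with made-up data):

import pandas as pd

df = pd.DataFrame({"label": ["ADO.NET", "ADO.Net", "Python", "python"]})

# A case-sensitive distinct keeps both spellings of ADO.NET...
print(df.drop_duplicates("label"))

# ...so deduplicate on a casefolded key instead, keeping the first original spelling.
deduped = df.loc[~df["label"].str.casefold().duplicated()]
print(deduped)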


Using Cosmos DB For Graph Data

Jose Mendes has an introduction to the Cosmos DB graph engine:

Gremlin is the graph traversal language of Apache TinkerPop, an open source Graph Computing Framework. Gremlin allows the users to write complex queries to traverse their graphs by using a composed sequence of steps, with each step performing an operation on the data stream (further details here). There are 4 fundamental steps:

  • transform: transform the objects in the stream

  • filter: remove objects from the stream

  • sideEffect: pass the object, but yield some side effect

  • branch: decide which step to take

Click through for a quick example showing how to create and populate a graph.
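
To make those four step types a bit more concrete, here are some illustrative Gremlin traversal strings (hypothetical ‘person’ vertices and ‘knows’ edges), grouped by step type. They could be submitted as strings through any Gremlin client against the Cosmos DB Gremlin endpoint:

# Illustrative Gremlin traversals, one per fundamental step type.
traversals = {
    "transform":  "g.V().hasLabel('person').values('name')",                     # reshape the stream
    "filter":     "g.V().has('person', 'age', gt(30))",                          # drop objects
    "sideEffect": "g.V().hasLabel('person').aggregate('seen')",                  # pass through, record
    "branch":     "g.V().choose(has('age', gt(30)), out('knows'), identity())",  # pick a path
}

# Creating and populating a graph is more of the same, e.g.:
create = [
    "g.addV('person').property('id', 'thomas').property('age', 44)",
    "g.addV('person').property('id', 'mary').property('age', 39)",
    "g.V('thomas').addE('knows').to(g.V('mary'))",
]

for name, query in traversals.items():
    print(f"{name}: {query}")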


Will It Bit?

Louis Davidson wants to see what he can cast to a bit type:

There are no other textual/alpha string values that will cast to a bit value, but the numeric values that will cast to a bit are voluminous (even some that are in string format). Consider the following eight statements:

SELECT CAST(100 AS bit);
SELECT CAST(-100 AS bit);
SELECT CAST(99999999999999999999999999999999999999 AS bit);
SELECT CAST(-99999999999999999999999999999999999999 AS bit);
SELECT CAST(88.999999 AS bit);
SELECT CAST('1' AS bit);
SELECT CAST('2' AS bit);
SELECT CAST('999999' AS bit);

Danged if they didn’t all work, and all return 1.

Check out what else Louis tries to cast to a bit type.


All About Power BI Licensing

Reza Rad has a guide for the perplexed when it comes to licensing Power BI:

The pricing table above may scare you off, and you may immediately think of not going down the Embedded path. However, I need to let you know that there are some scenarios in which Power BI Embedded can be a much more cost-effective option than Pro. Here is an example:

Assume that you have 100 users for your Power BI solution, and your users are not all connecting at the same time to use Power BI reports. You may have a maximum of 300 page renders per hour for them if you use Embedded. In that case, Embedded would cost you about $700 USD per month, whereas Power BI Pro for 100 users would be $1,000 USD per month. That means a saving of $3,600 USD per year. This is an example of a scenario in which Power BI Embedded can be more cost-effective than Pro.
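
The arithmetic behind that comparison is straightforward; using the post’s example figures (and the $10 per user per month Pro list price), a quick back-of-the-envelope check:

users = 100
pro_per_user = 10.0          # Power BI Pro, USD per user per month
embedded_monthly = 700.0     # approximate Embedded capacity cost from the example

pro_monthly = users * pro_per_user                    # 1000.0
monthly_saving = pro_monthly - embedded_monthly       # 300.0
print(f"Annual saving: ${monthly_saving * 12:,.0f}")  # Annual saving: $3,600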

Give this a careful reading if you’re looking to implement Power BI in your environment.


Working With forcats

S. Richter-Walsh demonstrates what the forcats R package can do:

Synonymous factor levels

Sometimes a categorical variable may have two or more factor levels that refer to the same group. There may be subtle differences in syntax such as upper case leading letter versus lower case leading letter (GroupA vs. groupA), for example. In this situation, one can use forcats::fct_collapse() to collapse the synonymous levels into one. In our test data, let’s assume that Web and Online refer to the same sales channel and we want to combine both into a factor level called Online….

df$sales <- fct_collapse(df$sales, Online = c("Online", "Web"))

I don’t use forcats that often, but when I do, I definitely appreciate it being here.  H/T R-Bloggers


Tips For Processing Large Data Sets With Python

Julien Heiduk has a few tips for people looking to process large data sets within Python:

In order to aggregate our data, we have to use chunksize. This option of read_csv allows you to load a massive file as small chunks in Pandas. We decided to take 10% of the total length for the chunksize, which corresponds to 40 million rows.
Be careful: it is not necessarily a good idea to take a small value, because the time between iterations can become too long with a small chunksize. To find the best memory/time trade-off, you can try different chunksize values and select the one that consumes the least memory while running the fastest.

Click through for more tips.
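
As a quick illustration of the pattern (hypothetical file and column names), aggregating each chunk and then combining the partial results keeps memory usage bounded:

import pandas as pd

# Hypothetical file and column names; aggregate per chunk, then combine,
# instead of loading the whole CSV into memory at once.
chunks = pd.read_csv("events.csv", chunksize=1_000_000)

partials = [chunk.groupby("user_id")["amount"].sum() for chunk in chunks]
totals = pd.concat(partials).groupby(level=0).sum()
print(totals.head())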
