Press "Enter" to skip to content

Month: March 2019

Pivot Tables and Grouping Sets With data.table in R

Jozef Hajnala shows how you can use data.table to perform fast pivoting and arbitrary grouping of data in R:

Data manipulation and aggregation is one of the classic tasks anyone working with data will come across. We of course can perform data transformation and aggregation with base R, but when speed and memory efficiency come into play, data.table is my package of choice.

In this post we will look at of the fresh and very useful functionality that came to data.table only last year – grouping sets, enabling us, for example, to create pivot table-like reports with sub-totals and grand total quickly and easily.

Grouping sets are also available in SQL dialects and tend to be something people tend not to be aware of. This is a shame because they’re quite powerful, and Jozef shows how powerful they can be in R.

Comments closed

Flowcharts in R

Anisa Dhana builds a sample flowchart with DiagrammeR:

After some search, I found that there are a few packages in R which allow making exemplary flowcharts. The one which I found easy to use was DiagrammeR. The advantage of this packages is that generate diagrams using code within R Markdown syntax.

The taped-glasses nerd in me wants to point out that flow charts use geometric shapes to show flow and that this is more properly labeled a graph (the examples are directed acyclic graphs), but hush, taped-glasses nerd self.

Comments closed

Analyzing Autosteer Data (Or Lack Thereof)

Elliot Williams has an interesting analysis of the NHTSA report on Tesla’s Autosteer capabilities:

But the NHTSA report went a step further. Based on the data that Tesla provided them, they noted that since the addition of Autosteer to Tesla’s confusingly named “Autopilot” suite of functions, the rate of crashes severe enough to deploy airbags declined by 40%. That’s a fantastic result.

Because it was so spectacular, a private company with a history of investigating automotive safety wanted to have a look at the data. The NHTSA refused because Tesla claimed that the data was a trade secret, so Quality Control Systems (QCS) filed a Freedom of Information Act lawsuit to get the data on which the report was based. Nearly two years later, QCS eventually won.

Looking into the data, QCS concluded that crashes may have actually increased by as much as 60% on the addition of Autosteer, or maybe not at all. 

This is a great exercise in statistical analysis and the problem of garbage in, garbage out.

Comments closed

Using the ALL Function in DAX

Reza Rad shows us how we can make use of the ALL function in a DAX query:

Among all the functions in DAX; the behavior of ALL function still seems mysterious for many. Many people, don’t use it at all and end up writing a very complicated calculation for a scenario that only one simple expression can do the same job. Some people, use it, but don’t exactly know how the function works, and get some unexpected results, and call it an error. In this article, I’m going to explain what ALL function is, how it can be used, and what are use cases of using such a function in DAX and Power BI. If you like to learn more about Power BI, read the Power BI book from Rookie to Rock Star.

Read on to see how the function behaves. Reza does a good job getting into the nuance of this function.

Comments closed

Thoughts on Certification

Eugene Meidinger is certifiable:

This being a complex topic, I thought I’d lay out the various factors to give a more comprehensive answer than you can easily fit in a tweet.

So the first two questions we need to answer are “Why do certs exist?” and “Why do people take them?”. Without these, we can’t give a good answer to whether you should take them. Certifications often exist for reasons that have nothing to do with your personal best interest. It is necessary to understand that fact.

Giving the economist’s spin, certifications are imperfect signals of reputation. When you know nothing else about a candidate, business partner, vendor, or ranting homeless person on the street, that cert can let you update your priors about the person. The exclusivity of the certification goes a long way in building credence: the MCM (or MCSM) has such a positive reputation even years after its cancellation because it was so difficult an exam that the only way a person could pass is if that person really knew the topic extremely well. By contrast, the old MCSE certifications from the early 2000s were a joke because anybody could memorize a brain dump, spit out answers, and get a cert.

The economist in me also says that certifications tend to be a net drain because you’re spending time on an imperfect signal when there are probably better imperfect signals out there. Your blog, YouTube/Twitch channel (assuming you’re not just playing Slay the Spire all day), and GitHub repo are going to tell me more about your interests and technical capabilities.

Read what Eugene has to say. I think we agree on the broad strokes but I’m probably more in the “not worth it” camp than he is with the exception of cases where it’s necessary to land a business contract (e.g., needing to be a Microsoft Gold Partner).

1 Comment

Investigating THREADPOOL Waits

Erik Darling has an interesting query against sys.dm_exec_query_stats to help you determine what might cause your THREADPOOL wait problems:

When THREADPOOL strikes, even the best monitoring tools can have a bunch of blank spots hanging around in them.

If you’re on SQL Server 2016 or better, there are some helpful columns in sys.dm_exec_query_stats.

Erik also has advice to help you out if you are running into these waits.

Comments closed

Baselining SQL Server with the First Responder Kit

Ajay Dwivedi has a GitHub project showing a method to collect baseline measures for SQL Server using Brent Ozar’s First Responder Kit:

With sp_BlitzFirst & sp_WhoIsActive in a SQL Agent job with scheduled execution for every 10-15 minutes, you can look back in time in terms of What was running, its execution stats, file stats, wait stats and Perfmon counters. This would help you to answer anybody as why your server was slow at a particular point in time.

Click through for the post and check out the GitHub repo as well. Baselining is extremely important for proper administration.

Comments closed

Converting One Column Into Multiple With Power Query

Imke Feldmann shows us how to pivot a single column into a fixed number of columns using Power Query:

The demand to unstacking a column into a table is not rare (see here for example: PowerBIForum  ) . Also if you copy a table from a post in the Power BI community forum  to the enter-data-section in Power BI, it will show up as such a one-column-table.

Note that this is different from the Entity-Attribute-Value model, as there’s no entity or attribute—it’s just values.

Comments closed

Sample Spark-Submit Config Settings

Leela Prasad shares a few sample configuration settings for Spark-Submit jobs:

Before going further let’s discuss on the below parameters which I have given for a Job.
spark.executor.cores=5 
spark.executor.instances=3
spark.executor.memory=20g
spark.driver.memory=5g 
spark.dynamicAllocation.enabled=true 
spark.dynamicAllocation.maxExecutors=10 

Click through to see what these do and why Leela chose these settings. The Spark documentation has the full list of settings but it’s good to hear explanations from practitioners.

Comments closed