Press "Enter" to skip to content

Day: April 17, 2019

Pivoting Spark DataFrames

Unmesha Sreeveni shows how we can pivot a DataFrame in Apache Spark using one line of code:

A pivot can be thought of as translating rows into columns while applying one or more aggregations.

Let's see how we can achieve the same using the above DataFrame.

We will pivot the data based on the “Item” column.

Click through for the code. This is an area where dropping back into Scala or Python is a lot more lines-of-code efficient than sticking to SQL.
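
For a flavor of what that looks like, here is a minimal PySpark sketch of the pattern; the sample rows and the column names other than “Item” are invented for illustration, not taken from the post.

```python
# A minimal PySpark pivot sketch. The data and column names (other than "Item")
# are made up; this just shows the groupBy().pivot().agg() shape of the one-liner.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pivot-demo").getOrCreate()

df = spark.createDataFrame(
    [("2019-01", "Apples", 10), ("2019-01", "Oranges", 4),
     ("2019-02", "Apples", 7), ("2019-02", "Oranges", 9)],
    ["Month", "Item", "Quantity"],
)

# Each distinct value of Item becomes its own column, aggregated with sum().
pivoted = df.groupBy("Month").pivot("Item").agg(F.sum("Quantity"))
pivoted.show()
```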

Bayes’ Theorem In A Picture

Stephanie Glen gives us the basics of Bayes’ Theorem in a picture:

Bayes’ Theorem is a way to calculate conditional probability. The formula is very simple to calculate, but it can be challenging to fit the right pieces into the puzzle. The first challenge comes from defining your event (A) and test (B); the second challenge is rephrasing your question so that you can work backwards: turning P(A|B) into P(B|A). The following image shows a basic example involving website traffic. For simpler examples, see Bayes Theorem Problems.

Click through for the image and related links.
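
If you want the arithmetic as well as the picture, here is a small worked example in Python; the website-traffic numbers are invented rather than taken from the image.

```python
# A hedged worked example of Bayes' Theorem with made-up website-traffic numbers.
# Event A: a visitor makes a purchase. Evidence B: the visit came from one
# particular referring site. We know P(B|A) and want P(A|B).

p_a = 0.05              # P(A): 5% of all visitors purchase (assumed)
p_b_given_a = 0.40      # P(B|A): 40% of purchasers arrived via the referrer (assumed)
p_b_given_not_a = 0.10  # P(B|not A): 10% of non-purchasers arrived via it (assumed)

# Total probability of the evidence: P(B) = P(B|A)P(A) + P(B|not A)P(not A)
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

# Bayes' Theorem: P(A|B) = P(B|A) * P(A) / P(B)
p_a_given_b = p_b_given_a * p_a / p_b
print(f"P(A|B) = {p_a_given_b:.3f}")  # about 0.174 with these numbers
```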

Tidying Video Game Data

Arvid Kingl has a fun article analyzing data from an open-source video game and applying tidy data principles to it:

You will learn what key principles a tidy data set adheres to, why it is useful to follow them consistently, and how to clean the data you are given. Tidying is also a great way to get to know a new data set.

Finally, in this tutorial you will learn how to write a function that makes your analysis look much cleaner and allows you to execute repetitive elements in your analysis in a very reproducible way. The function will allow you to load the latest version of the data dynamically into a flexible data scheme, which means that large parts of the code will not have to change when new data is added.

Check it out. Bonus point: tidy data is Boyce-Codd Normal Form, which is (potentially) subsequently widened back out to include dimensional information.
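
For a taste of the wide-to-long reshaping that tidying typically involves, here is a minimal pandas sketch; the original tutorial works in R, and these column names and values are made up.

```python
# A small pandas sketch of one tidying step: one observation per row.
# The player/year/wins columns are invented for illustration.
import pandas as pd

wide = pd.DataFrame({
    "player": ["alice", "bob"],
    "wins_2018": [12, 7],
    "wins_2019": [15, 9],
})

# Melt the per-year columns into (player, year, wins) records.
tidy = wide.melt(id_vars="player", var_name="year", value_name="wins")
tidy["year"] = tidy["year"].str.replace("wins_", "", regex=False).astype(int)
print(tidy)
```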

Troubleshooting Spark Performance

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark:

Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing Spark jobs. The main design objectives were to be
– Intuitive and easy – Big data practitioners should be able to navigate and ramp quickly
– Concise and focused – Hide the complexity and scale but present all necessary information in a way that does not overwhelm the end user
– Batteries included – Provide actionable recommendations for a self service experience, especially for users who are less familiar with Spark
– Extensible – To enable additions of deep dives for the most common and difficult scenarios as we come across them

The tool looks pretty interesting and I’m hoping it will be part of the open source suite at Cloudera.

Basic Forensic Accounting Techniques

I continue my series on forensic accounting techniques:

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.

For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).
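
To make the growth-analysis idea concrete, here is a back-of-the-envelope sketch with made-up annual figures rather than the bus data from the post.

```python
# Growth analysis on invented annual figures: revenue growing faster than cost
# is the healthy pattern described above.
import pandas as pd

annual = pd.DataFrame({
    "year": [2016, 2017, 2018, 2019],
    "revenue": [1_000_000, 1_150_000, 1_320_000, 1_500_000],
    "cost": [800_000, 820_000, 850_000, 900_000],
}).set_index("year")

annual["net_margin"] = (annual["revenue"] - annual["cost"]) / annual["revenue"]
growth = annual[["revenue", "cost"]].pct_change()  # year-over-year growth rates

print(annual)
print(growth)
```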

Scaling Power BI Premium Capacity

Matt Allington gives us instructions on how to scale Power BI Premium capacity:

This is the third article in my series about how to make Power BI Premium more affordable for small to medium sized enterprises (SMEs).  In my first article I explained the problem and the logic behind how to configure a workable solution. In my second article I provided step by step instructions on how to configure Flow to start/stop Power BI Premium capacities.  In the article today I am covering a way to scale the capacity up/down either on demand, or on a timed schedule.

The cloud is generally more expensive than on-prem, though it can potentially become cheaper if you are smart about scaling and have more scaling-friendly workloads. Matt even provides a really cool cost analysis to help you figure out what (if anything) you end up saving using this technique.
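
As a rough feel for why the scaling matters, here is a toy cost comparison in Python; the hourly rates and hours are invented, so use Matt's numbers (and current Azure pricing) for anything real.

```python
# Toy comparison of an always-on large capacity versus scaling up only during
# business hours. All rates and hours are illustrative assumptions.
hours_per_month = 730
business_hours = 12 * 22          # ~12 hours a day, 22 working days (assumed)

rate_high = 5.00                  # assumed $/hour for the larger SKU
rate_low = 1.25                   # assumed $/hour for the smaller SKU

always_on_high = hours_per_month * rate_high
scaled = business_hours * rate_high + (hours_per_month - business_hours) * rate_low

print(f"Always on the big SKU: ${always_on_high:,.2f}/month")
print(f"Scaled up/down:        ${scaled:,.2f}/month")
print(f"Estimated saving:      ${always_on_high - scaled:,.2f}/month")
```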

T-SQL Tuesday 113 Roundup

Todd Kleinhans takes us through T-SQL Tuesday #113:

Wow, we had a variety of responses to the April 2019 topic of “What Do YOU Use Databases For?”

I think the overall response to the question and the theme is both mixed and varied.

I have been struggling with the personal use of databases for a long time. Things I wish had been easier just seem to get more complicated over time. Ever heard of GDPR? Although we think we have absolute control and access to data about ourselves, we really do not. The right to be forgotten is NOT the same as having access to all of the data about ourselves in all of the systems before they disappear. Sometimes companies will delete data about you before you ask.

Todd starts out with an essay and then moves on to the roundup.

When Scans are Superior to Seeks

Brent Ozar shows that index seeks are not always better than index scans:

Somewhere along the way in your career, you were told that:
– Index seeks are quick, lightweight operations
– Table scans are ugly, slow operations

And ever since, you’ve kept an eye on your execution plans looking for those performance-intensive clustered index scans. When you see ’em, you go root ’em out, believing you’ve got a performance problem.

Thing is, … they lied to you. Seeks aren’t necessarily good, nor are scans necessarily bad. To straighten you out, we’re going to walk through a series of demos.

The rule of thumb I like to use is: if you need to go through more than 20% of the data, you’re generally better off scanning. If you need to go through less than 0.5% of the data, you’re generally better off seeking. Everything in between is the “it depends” zone.

Clustered Columnstore Index Memory Timeouts

Joe Obbish takes a deep look at a clustered columnstore index insertion scenario:

Why should we care about memory grant timeouts for CCI insert queries? Simply put, lots of bad things can happen when those queries time out, both for serial and for parallel inserts. For serial insert queries, I’ve observed deadlocks, extremely poor performance along with long SLEEP_TASK waits, and extremely long rollbacks. For parallel insert queries, I’ve observed queries that run seemingly forever, poor performance of the SELECT part, and error 8645. You probably don’t want any of that occurring in production. It would be very helpful if it were possible to extend the 25-second time-out for queries that insert into columnstore tables.

Read through as Joe learns the true meaning of Christmas (or at least of a KB article).
