Press "Enter" to skip to content

Author: Kevin Feasel

Tidying Video Game Data

Arvid Kingl has a fun article analyzing data from an open-source video game and applying tidy data principles to it:

You will learn what key principles a tidy data set adheres to, why it is useful to follow them consequently, and how to clean the data you are given. Tidying is also a great way to get to know a new data set.

Finally, in this tutorial you will learn how to write a function that makes your analysis look much cleaner and allows you to execute repetitive elements in your analysis in a very reproducible way. The function will allow you to load the latest version of the data dynamically into a flexible data scheme, which means that large parts of the code will not have to change when new data is added.

Check it out. Bonus point: tidy data is Boyce-Codd Normal Form which is (potentially) subsequently widened back out to include dimensional information.

Comments closed

Troubleshooting Spark Performance

Bikas Saha and Mridul Murlidharan explain some of the basics of performance tuning with Apache Spark:

Our objective was to build a system that would provide an intuitive insight into Spark jobs that not just provides visibility but also codifies the best practices and deep experience we have gained after years of debugging and optimizing Spark jobs. The main design objectives were to be
– Intuitive and easy – Big data practitioners should be able to navigate and ramp quickly
– Concise and focused – Hide the complexity and scale but present all necessary information in a way that does not overwhelm the end user
– Batteries included – Provide actionable recommendations for a self service experience, especially for users who are less familiar with Spark
– Extensible – To enable additions of deep dives for the most common and difficult scenarios as we come across them

The tool looks pretty interesting and I’m hoping it will be part of the open source suite at Cloudera.

Comments closed

Basic Forensic Accounting Techniques

I continue my series on forensic accounting techniques:

Growth analysis focuses on changes in ratios over time. For example, you may plot annual revenue, cost, and net margin by year. Doing this gives you an idea of how the company is doing: if costs are flat but revenue increases, you can assume economies of scale or economies of scope are in play and that’s a great thing. If revenue is going up but costs are increasing faster, that’s not good for the company’s long-term outlook.

For our data set, I’m going to use the following SQL query to retrieve bus counts on the first day of each year. To make the problem easier, I add and remove buses on that day, so we don’t need to look at every day or perform complicated analyses.

I get into quite a bit in this post, including a quick tour of multicollinearity, which is only my second-favorite of the three linear regression amigos (heteroskedasticity being my favorite and autocorrelation the hanger-on).

Comments closed

Scaling Power BI Premium Capacity

Matt Allington gives us instructions on how to scale Power BI Premium capacity:

This is the third article in my series about how to make Power BI Premium more affordable for small to medium sized enterprises (SMEs).  In my first article I explained the problem and the logic behind how to configure a workable solution. In my second article I provided step by step instructions on how to configure Flow to start/stop Power BI Premium capacities.  In the article today I am covering a way to scale the capacity up/down either on demand, or on a timed schedule.

The cloud is generally more expensive than on-prem, though it can potentially become cheaper if you are smart about scaling and have more scaling-friendly workloads. Matt even provides a really cool cost analysis to help you figure out what (if anything) you end up saving using this technique.

Comments closed

T-SQL Tuesday 113 Roundup

Todd Kleinhans takes us through T-SQL Tuesday #113:

Wow, we had a variety of responses to the April 2019 topic of “What Do YOU Use Databases For?

I think the overall response to the question and the theme is both mixed and varied.

I have been struggling with the personal use of databases for a long time. Things I wish would have been easier but seems to just get more complicated over time. Ever heard of GDPR? Although we think we have absolute control and access to data about ourselves, we really do not. The right to be forgotten is NOT the same as having access to all of the data about ourselves in all of the systems before they disappear. Sometimes companies will delete your data about you before you ask.

Todd starts out with an essay and then moves on to the roundup.

Comments closed

When Scans are Superior to Seeks

Brent Ozar shows that index seeks are not always better than index scans:

Somewhere along the way in your career, you were told that:
– Index seeks are quick, lightweight operations
– Table scans are ugly, slow operations

And ever since, you’ve kept an eye on your execution plans looking for those performance-intensive clustered index scans. When you see ’em, you go root ’em out, believing you’ve got a performance problem.

Thing is, … they lied to you. Seeks aren’t necessarily good, nor are scans necessarily bad. To straighten you out, we’re going to walk through a series of demos.

The rule of thumb I like to use is: if you need to go through more than 20% of the data, you’re generally better off scanning. If you need to go through less than 0.5% of the data, you’re generally better off seeking. Everything in between is the “it depends” zone.

Comments closed

Clustered Columnstore Index Memory Timeouts

Joe Obbish takes a deep look at a clustered columnstore index insertion scenario:

Why should we care about memory grant timeouts for CCI insert queries? Simply put, lots of bad things can happen when those queries can time out, both for serial and for parallel inserts. For serial insert queries, I’ve observed deadlocks, extremely poor performance along with long SLEEP_TASK waits, and extremely long rollbacks. For parallel insert queries, I’ve observed queries that run seemingly forever, poor performance of the SELECT part, and error 8645. You probably don’t want any of that occurring in production. It would be very helpful if it was possible to extend the 25 second time-out for queries that insert into columnstore tables.

Read through as Joe learns the true meaning of Christmas a KB article.

Comments closed

Reviewing the Stack Overflow Developer Survey

Michael Toth looks at the recently-released 2019 Stack Overflow Developer Survey:

Since 2011, Stack Overflow has been surveying their users each year to answer questions about the technologies they use, their work experience, their compensation, and their satisfaction at work. Given Stack Overflow’s place in the broader programming world, they are able to draw quite the audience for their annual surveys.

This year, nearly 90,000 developers participated in the survey! There’s a lot in this survey, and I recommend reviewing it yourself, but I wanted to surface some of the key findings that I thought were particularly relevant to data professionals here.

Stack Overflow says they will be releasing the underlying data for this survey in the coming weeks, so I hope to return to this for a deeper analysis once that’s made available. For now, let’s get into the results!

Michael’s lede involves R versus Python in terms of salaries, but for me, the top line is that functional programmers make more money. Clojure, F#, Scala, Elixir, and Erlang make the top 10 on the global list, including positions 1, 2, 4, and 5. Within the US, Scala, Clojure, Erlang, Kotlin, F#, and Elixir make the top 10, including positions 1, 2, and 4. H/T R-Bloggers

Comments closed