Press "Enter" to skip to content

Month: February 2020

Making a Better Pie Chart

Elizabeth Ricks tries the impossible:

A friend called me recently and started our conversation with: “I know you dislike pie charts, but…can you help me create one?” 

Spoiler alert: I don’t hate pie charts. They’ve received a bad rap over the years and with good reason—they are very commonly used when another chart type would be better suited. The appropriate use case for a pie chart is expressing a part-to-whole relationship. Their limitation is that it can be difficult to accurately judge the relative size of and compare the segments. Here are some related articles on our blog: the great pie debate and an updated post on pies

Elizabeth does put together the best possible case, but I’m still in favor of burning pie charts to the ground.

Comments closed

Hive: Shuffle Failed with Too Many Fetch Failures

Dmitry Tolpeko takes us through an ugly error:

On one of the clusters I noticed an increased rate of shuffle errors, and the restart of a job did not help, it still failed with the same error.

The error was as follows:

Error: Error while running task ( failure ) : org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$ShuffleError: error in shuffle in Fetcher at org.apache.tez.runtime.library.common.shuffle.orderedgrouped.Shuffle$RunShuffleCallable.callInternal (Shuffle.java:301)

Caused by: java.io.IOException: Shuffle failed with too many fetch failures and insufficient progress!failureCounts=1, pendingInputs=1, fetcherHealthy=false, reducerProgressedEnough=true, reducerStalled=true

Click through to understand what this error means and what you can do about it.

Comments closed

Distinct Counts in Power Query

Reza Rad shows how you can get a distinct count in Power Query:

You can have a distinct count calculation in multiple places in Power BI, through DAX code, using the Visual’s aggregation on a field, or even in Power Query. If you are doing the distinct count in Power Query as part of a group by operation, however, the existing distinct count is for all columns in the table, not for a particular column. In this article, I’ll show you a method you can use to get the distinct count of a particular column through the Group By transformation in Power Query component of Power BI.

Click through to learn how.

Comments closed

Corruption and Secondary Databases

Paul Randal shares some wisdom on corruption:

We’ve had a few new clients come to use recently after experiencing corruption, and they’ve been worried about whether physical corruption can propagate to secondary databases (like an availability group secondary or log shipping secondary) through the mechanism used to maintain the secondary database in sync with the primary database. I explained how it’s very rare for that to happen, but sometimes it does, and not in a way you’d think. Read on…

I don’t even have to ask you to read on; Paul has even done that. And do read the comments as well.

Comments closed

Preventing Query Timeouts with Power BI Incremental Refresh

Gilbert Quevauvilliers shows how to set the default timeout for a query against SQL Server from Power BI:

This was because on the first refresh it has to process all the data before it can incrementally refresh the dataset.

As per the documentation the default timeout for a SQL Server database is set to 10 minutes, and when I am processing a lot of data it can easily take longer than 10 minutes to return all the data.

Read on to see how you can change that if you need to.

Comments closed

Goodbye, MCSE

John Deardurff helps break the news:

Major Announcement from Microsoft Learning today. As Microsoft continues to invest in role-based learning offerings, the Microsoft Certified Solutions Associate (MCSA), Microsoft Certified Solutions Developer (MCSD), and Microsoft Certified Solutions Expert (MCSE) certifications will be phased out with a final retirement date of June 30th, 2020. Find the entire list of retired certifications here.

On the plus side, at least people who hold the next iteration of the MCSE won’t be confused with people who worked with NT4 anymore…

Comments closed

Loading Data into Delta Lake

Prakash Chockalingam takes us through auto-loading Delta Lake from various sources:

Auto Loader is an optimized file source that overcomes all the above limitations and provides a seamless way for data teams to load the raw data at low cost and latency with minimal DevOps effort. You just need to provide a source directory path and start a streaming job. The new structured streaming source, called “cloudFiles”, will automatically set up file notification services that subscribe file events from the input directory and process new files as they arrive, with the option of also processing existing files in that directory.

This does look interesting.

Comments closed

How Apache Beam Runs on Top of Apache Flink

Maximilian Michels and Markos Sfikas explain why you might want to combine Apache Beam with Apache Flink:

Apache Flink and Apache Beam are open-source frameworks for parallel, distributed data processing at scale. Unlike Flink, Beam does not come with a full-blown execution engine of its own but plugs into other execution engines, such as Apache Flink, Apache Spark, or Google Cloud Dataflow. In this blog post we discuss the reasons to use Flink together with Beam for your batch and stream processing needs. We also take a closer look at how Beam works with Flink to provide an idea of the technical aspects of running Beam pipelines with Flink. We hope you find some useful information on how and why the two frameworks can be utilized in combination. For more information, you can refer to the corresponding documentation on the Beam website or contact the community through the Beam mailing list.

Read on for the full story. If you’re so inclined, you can also check out the full talk as a video.

Comments closed

A Metadata-Driven Framework for ADF Pipelines

Paul Andrew has started a series on metadata-driven Azure Data Factory pipelines:

The concept of having a processing framework to manage our Data Platform solutions isn’t a new one. However, overtime changes in the technology we use means the way we now deliver this orchestration has to change as well, especially in Azure. On that basis and using my favourite Azure orchestration service; Azure Data Factory (ADF) I’ve created an alpha metadata driven framework that could be used to execute all our platform processes. Furthermore, at various community events I’ve talked about bootstrapping solutions with Azure Data Factory so now as a technical exercise I’ve rolled my own simple processing framework. Mainly, to understand how easily we can make it with the latest cloud tools and fully exploiting just how dynamic you can get a set of generational pipelines.

This first post lays out some of the architectural decisions and prep work needed for the series.

Comments closed

When the Optimizer Can Use Batch Mode on Row Store

Erik Darling looks at some internals for us:

Things like Accelerated Database RecoveryOptimize For Sequential Key, and In-Memory Tempdb Metadata are cool, but they’re server tuning. I love’em, but they’re more helpful for tuning an entire workload than a specific query.

The thing with BMOR is that it’s not just one thing. Getting Batch Mode also allows Adaptive Joins and Memory Grant Feedback to kick in.

But they’re all separate heuristics.

Read on to see the extended events around batch mode to help you determine if it’s possible for the optimizer to use it for a given query.

Comments closed