Working With Skewed Data In Pig

Kevin Feasel



Dmitry Tolpeko explains how you can use the Weighted Range Partitioner in Apache Pig to work with highly skewed data:

The problem is that there are 3,000 map tasks are launched to read the daily data and there are 250 distinct event types, so the mappers will produce 3,000 * 250 = 750,000 files per day. That’s too much.

Reducing the number of map tasks is not always an option as it increases the execution time, and the number of files is still high (500 tasks will produce 125,000 files per day), while our goal is to have about 500-700 files daily for this single table.

A well-known solution for this problem is to force a reduce phase. But what if the data is highly skewed, for example, when there are few event types having hundreds of millions rows, while others just have a few million rows?

This is a combination of an interesting problem and an interesting solution.

Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming:

Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot of improvements over that period of time, but there were still various challenges, primarily because it’s a very low-level API.

As a solution to those challenges, Spark Structured Streaming was introduced in Spark 2.0 (and became stable in 2.2) as an extension built on top of Spark SQL. Because of that, it takes advantage of Spark SQL code and memory optimizations. Structured Streaming also gives very powerful abstractions like Dataset/DataFrame APIs as well as SQL. No more dealing with RDD directly!

For me, it’s DataFrames all day. But Yaroslav has a more nuanced answer which is worth reading. There are also a couple of good examples.

Dealing With Massive Database Counts

Edward Harrison describes some of the issues you run into when working on a SQL Server instance with thousands of databases:

One of the headaches we frequently have to deal with is the excessive amount of time it takes for SQL Servers with the discussed design to go through any planned patches. We have seen this while carrying out some planned maintenance for one of our customers and the work simply comprised failing over a SQL Server instance, from the active node, over onto the passive node so that the former active node could be patched. The passive node had had its SQL Server patched earlier in the day while the active node was still on the older patch level.

This SQL Server had 10,000 databases and all that we needed to do was to simply click the buttons to failover the SQL Server and then wait for it come online on the other node. In the end, this failover took nearly four hours to complete because upgrade scripts had to be applied to each of the 10,000 databases.

Even a couple thousand databases on an instance can slow things down considerably. Consider an operation which takes 10 milliseconds like laying out a database in the object explorer in SSMS. With 10K databases, that 10ms activity (note that I’m making up the number) would take 100 seconds. And these are often blocking tasks, meaning you’re sitting there spinning your wheels for 100 seconds before you can do the thing you wanted to do. The pain is real. And for server activities, most administrative functions (backups, CHECKDB, index rebuilding, etc.) is done one at a time, meaning you have a lot of time spent on this stuff.

Installing R From Powershell

Tomaz Kastrun shows us how to install R and RStudio via Powershell:

For the brevity of this post, I will only download couple of R packages from CRAN repository, but this list is indefinite.
There are ways many ways to retrieve the CRAN packages for particular R version using powershell. I will just demonstrate this by using Invoke-WebRequest cmdlet.
Pointing your cmdlet to URL:  where  list of all packages for this version is available. But first we need to extract the HTML tag where information is stored.

There’s quite a bit of code here, but the upside is that you get the ability to automate server installs.

Creating Benchmarks With WorkloadTools

Kevin Feasel



Gianluca Sartori announces a new tool:

Throughout my career, I had to go through the pain of benchmarking often enough to get fed up with all the existing tools and decide to code my own. The result of this endeavor is WorkloadTools: a collection of tools to collect, analyze and replay SQL Server workloads, on premises and in the cloud.

At the moment, the project includes 3 tools:
SqlWorkload – a command line tool to capture, replay and analyze a workload
ConvertWorkload – a command line tool to convert existing workloads (traces and extended events) to the format used by SqlWorkload
WorkloadViewer – a GUI tool to visualize and analyze workload data

Click through for the link to check it out.

Power Query SSIS Source Preview

Joost van Rossum is excited about a new SSIS data source in preview:

First you need to download and install the latest SSDT (Version 15.9.0) and since it is still in preview, you can only use it within SSDT or on a Azure Integration Runtime in Azure Data Factory.

If you drag the new Power Query Source to your Data Flow canvas and edit it, you can paste your Power Query script from Power BI (or Excel) in the query textbox. So there is not yet(?) an editor, but this should simplify a lot of tasks that could were previously could only be solved with .Net scripting.

Read on for more details. Because it’s a preview, there’s still a lot of work yet to do it, it seems. But on the bright side, it’s new functionality in Integration Services.

Power BI Architecture Diagram V4

Dustin Ryan has a new version of the Power BI Architecture Diagram:

First and most importantly, I updated the Power BI logo in the diagram to the latest version of the logo!

Secondly, I included Power BI Dataflows in the diagram tagged #6. Power BI Dataflows are used to ingest, transform, integrate, and enrich big data by defining data source connections, ETL logic, refresh schedules, and more. Data is stored as entities in the Common Data Model in Azure Data Lake Storage Gen2. Dataflow entities can be consumed as a data source in Power BI and by using Power BI Desktop. Read more about Dataflows here.

Click through for a full changelog and a link to download the architecture diagram and legend.

Reference Column Names

Kevin Feasel



Jon Shaulis shows why you want to reference tables when including column names in queries:

If you don’t read the rest of this setup, I want you to take away one thing. 

Always reference your tables with your columns when more than one table is involved in the query!

This post is made primarily with SQL Server in mind, but this behavior is actually ANSI SQL and can be replicated in PostgreSQL, MySQL, and Oracle.

Jon’s example is a case where perfectly valid ANSI SQL logic (which is why you can replicate this across platforms and why it’s not a bug) leads to an unexpected result.


February 2019
« Jan Mar »