Press "Enter" to skip to content

Month: December 2019

Performance Tuning Load of Partitioned Hive Tables on S3 with Spark

Dmitry Tolpeko walks us through a performance problem in Spark:

I have a Spark job that transforms incoming data from compressed text files into Parquet format and loads them into a daily partition of a Hive table. This is a typical job in a data lake; it is quite simple, but in my case it was very slow.

Initially it took about 4 hours to convert ~2,100 input .gz files (~1.9 TB of data) into Parquet, while the actual Spark job took just 38 minutes to run and the remaining time was spent on loading data into a Hive partition.

Let’s see the reason for this behavior and how we can improve the performance.

Read on to see how.
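For a sense of the shape of such a job, here is a minimal PySpark sketch; the bucket, paths, table name, and partition column are illustrative assumptions, not Dmitry’s actual code.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("gz-to-parquet")
         .enableHiveSupport()
         .getOrCreate())

# Gzip-compressed text is not splittable, so each input file becomes one task.
# Paths and names below are made up for illustration.
df = spark.read.text("s3://my-bucket/incoming/2019-12-26/*.gz")

# Write Parquet directly to the daily partition's S3 location...
df.write.mode("overwrite").parquet(
    "s3://my-bucket/warehouse/events/event_date=2019-12-26/")

# ...then register the new partition with the Hive metastore.
spark.sql("ALTER TABLE events ADD IF NOT EXISTS "
          "PARTITION (event_date = '2019-12-26')")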

Repartitioning and Coalescing in Spark

Divyansh Jain contrasts repartitioning and coalescing in Spark:

What is Coalesce?

The coalesce method reduces the number of partitions in a DataFrame. Coalesce avoids a full shuffle: instead of creating new partitions, it shuffles the data (using the default hash partitioner) into existing partitions, which means it can only decrease the number of partitions.

What is Repartitioning?

The repartition method can be used to either increase or decrease the number of partitions in a DataFrame. Repartition is a full-shuffle operation: all of the data is taken out of the existing partitions and distributed evenly across the newly formed partitions.

Read on to learn good reasons to use both.
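As a quick illustration of the difference, here is a sketch assuming an existing SparkSession named spark; the partition counts are arbitrary:

df = spark.range(1_000_000)
print(df.rdd.getNumPartitions())     # starts at the default parallelism

fewer = df.coalesce(4)               # merges existing partitions; no full shuffle
more = df.repartition(200)           # full shuffle; can increase or decrease

print(fewer.rdd.getNumPartitions())  # 4
print(more.rdd.getNumPartitions())   # 200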

Azure Data Factory Notifications

Rayis Imayev walks us through three different techniques for sending notifications in Azure Data Factory:

Working on data integration projects with Azure Data Factory as your main orchestration tool helps you develop strategic, forward-looking thinking about your development tasks: pondering what your data sources are, the destinations where this information will land in a new data model, and the transformation steps needed to shape data from source to destination. It is just like playing chess, where you have to plan several moves ahead.

Along with this structural thinking as you develop and execute your data flows, timely notifications of when something goes wrong (or right) would give you additional peace of mind.

Something I appreciate in this post is that Rayis contrasts the Azure Data Factory techniques with SSIS methods, giving you a solid base for comparison.

Columnstore Indexes in Azure SQL Database

Niko Neugebauer takes us through the columnstore offerings available in Azure SQL Database:

Almost 2 years ago (on the 22nd of March 2018), in Columnstore Indexes – part 121 (“Columnstore Indexes on Standard Tier of Azure SQL DB”), I mentioned that Columnstore Indexes were available in Azure SQL Database in the Standard 3 (S3) tier and higher, yet people I meet keep mentioning and believing that, in order to get Columnstore Indexes, one needs to use the Premium editions.

Since that blog post, a lot of time has passed, and in the meantime we have gotten new tiers: new generations of the provisioned General Purpose tier (Generation 4, Generation 5, FSv2 Series & M Series), plus the Serverless tier and, not to forget, the very promising Hyperscale tier … besides Azure SQL Database Managed Instance, of course, which has been generally available for some time, and the good old Elastic Pools, which were never mentioned in the original article.

It sounds like, on the whole, columnstore is a normal part of Azure SQL Database across the board—it’s not a special add-on feature.

Things to Stop Doing

Tom LaRock has a list of life-altering recommendations:

RESPONDING IMMEDIATELY

Stop doing that. Trust me on this: your response to that email/text/Slack message can wait. Don’t believe me? Try an experiment: for one month, do not reply immediately to your emails. At the end of the month, add up the number of emails you received and the number that required an immediate response. I’m willing to bet that the number is quite low, much lower than you realize. And once you realize just how few require an immediate reply, you’ll never look at email in the same way again.

It’s a fun list, and I definitely agree in most circumstances on at least 3/4 of them.

Short Substrings and Computed Columns

Erik Darling gives us a story about computed columns that turns out not to be about computed columns at all (having thereby subverted our expectations):

The problem is that when I tried to index it:

CREATE INDEX dummy
    ON dbo.Users(DisplayNameComputed);

I got this error:

Msg 537, Level 16, State 3, Line 21
Invalid length parameter passed to the LEFT or SUBSTRING function.

And when I tried to select data from the table, the same error.

Click through to find the real query killer.

Power BI Year in Review

Kasper de Jonge walks us through the biggest changes to Power BI in 2019:

As the year draws to a close, I was reminiscing on what a huge year it has been for Power BI. I work mostly with large organisations, so my view will be slightly skewed towards that.

For me, 2019 has been the year when Power BI gained massive adoption as the standard BI platform in organisations; it went from self-service to also encompassing corporate BI.

The Power BI development model does have its downsides, but without a doubt, they get stuff done.

Diagnosing PolyBase Errors

Niels Berglund takes us through an odd “incorrect syntax” error with PolyBase:

What we see in Figure 1, the incorrect syntax exception, is strange, as I have executed the same code in a SQL Server 2019 Big Data Cluster (BDC) without any issues, and the forum poster executed the same in SQL Server 2019 Enterprise Edition, also without any issues.

Ok, but what about creating an external table against a relational data source – where we do not need to define an external file format?

There is a straightforward answer as to why the specific error message pops up, but I agree with Niels that it’d be nice to have “here’s the problem and here’s the solution” types of error messages. The deeper you get into the product—especially the older Hadoop external data source—the worse the error messages get.

Finding the Largest Profit or Loss with R

David Robinson takes us through a brainteaser:

I recently came across an interview problem from A Cool SQL Problem: Avoiding For-Loops. Avoiding loops is a topic I always enjoy reading about, and the blog post didn’t disappoint. I’ll quote that post’s description of the problem:

You have a table of trading days (with no gaps) and close prices for a stock.

Find the highest and lowest profits (or losses) you could have made if you bought the stock at one close price and sold it at another close price, i.e. a total of exactly two transactions.

You cannot sell a stock before it has been purchased. Your solution can allow buying and selling on the same trading_date (i.e. profit or loss of $0 is always, by definition, an available option); however, for some bonus points, you may write a more general solution for this problem that requires you to hold the stock for at least N days.

The SQL solution is another scenario where window functions save the day. In R, there are a couple of straightforward options (if you happen to know about the functions!) which have radically different performance profiles.
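For comparison, here is the same running-extremes idea sketched in Python rather than R; the function name and sample prices are made up for illustration:

def best_and_worst(prices):
    # Highest and lowest profit from one buy followed by one (possibly same-day)
    # sell; $0 is always attainable by buying and selling on the same day.
    best, worst = 0, 0
    cheapest, dearest = prices[0], prices[0]
    for p in prices:
        cheapest = min(cheapest, p)     # lowest close seen so far (best buy)
        dearest = max(dearest, p)       # highest close seen so far (worst buy)
        best = max(best, p - cheapest)  # sell today against the cheapest buy
        worst = min(worst, p - dearest) # sell today against the dearest buy
    return best, worst

print(best_and_worst([5, 3, 8, 1, 9, 2]))  # (8, -7)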

Mapping in R with mapply

Andrew Treadway shows how to use the Map() and mapply() functions in R:

An older post on this blog talked about several alternative base apply functions. This post will talk about how to apply a function across multiple vectors or lists with Map and mapply in R. These functions are generalizations of sapply and lapply that allow you to loop over multiple vectors or lists simultaneously with less effort.

The idea of Map in functional programming takes a bit of time to really wrap your head around, but once you do, it becomes extremely powerful. H/T R-bloggers
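For comparison, Python’s built-in map behaves much like R’s Map and mapply when given several iterables, pulling one element from each per call; a toy example, not from Andrew’s post:

xs = [1, 2, 3]
ys = [10, 20, 30]

# Element-wise sum across two lists, akin to mapply(`+`, xs, ys) in R.
sums = list(map(lambda x, y: x + y, xs, ys))
print(sums)  # [11, 22, 33]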
