Press "Enter" to skip to content

Day: September 3, 2020

Join Operations in Spark

Swantika Gupta compares hash and merge join operations in Apache Spark:

One of the most frequently used transformations in Apache Spark is Join operation. Joins in Apache Spark allow the developer to combine two or more data frames based on certain (sortable) keys. The syntax for writing a join operation is simple but some times what goes on behind the curtain is lost. Internally, for Joins Apache Spark proposes a couple of Algorithms and then chooses one of them. Not knowing what these internal algorithms are, and which one does spark choose might make a simple Join operation expensive.

While opting for a Join Algorithm, Spark looks at the size of the data frames involved. It considers the Join type and condition specified, and hint (if any) to finally decide upon the algorithm to use. In most of the cases, Sort Merge join and Shuffle Hash join are the two major power horses that drive the Spark SQL joins. But if spark finds the size of one of the data frames less than a certain threshold, Spark puts up Broadcast Join as it’s top contender.

Click through for the comparison, though do note that this comparison doesn’t include nested loop joins, which are possible in Spark as well.

Comments closed

Automatic Retention Periods with Power Query

Gilbert Quevauvilliers has an easy method of constraining data sizes in Power Query:

How cool would it be to not have to manually update your dataset to keep data for last 2 Years (last year and this year to the current date)?

In this blog post I will show you how you can easily filter dates in Power Query to show dates for last year and year to date by using the GUI and not having to hardcode anything.

I must admit the first time I saw this was when I watching my good friend Reid Havens presenting. Thanks Reid!

Read on for the technique.

Comments closed

Data Importation and Exportation with dbatools

Mikey Bronowski continues a series on dbatools functionality vis-a-vis SQL Server Management Studio:

The SSMS offers to script out lots of the SQL Server objects, however it can be limited in some areas. Using Get-Dba* commands and piping them into Export-DbaScript may add few more options. For example SQL Agent jobs:

Click through for just shy of a dozen cmdlets to help you run your data import-export business.

Comments closed

DATETIME2 and Storage Size

Randolph West digs into an issue:

Two years ago I wrote a post that got a lot of traction in the comments at the time. Last month there was renewed interest because one of the commenters noted that the official SQL Server documentation for DATETIME2 disagreed with my assertions, and that I was under-representing the storage requirements.

To remind you, I have been saying for years that you can use DATETIME2(3) as a drop-in replacement for DATETIME, and have better granularity (1ms versus 3ms) for 12.5% less storage (1 byte per column per row). The commenter intimated that because my statement conflicted with the documentation, that I must be wrong. As it turns out the documentation was wrong, but I also learned something new in the process!

It’s an interesting internal look at how difficult it is to get documentation right, even on something which sounds simple.

Comments closed

Restoring the Master Database

Kenneth Igiri walks us through restoring the master database in SQL Server:

The master database contains records of the structure/configuration for both the current instance and all other databases. When you run sp_configure, you are writing data to the master database.  It also contains most of the dynamic management views that are necessary to monitor the instance.

The importance of the master database is crucial. First, it has the information necessary for opening all other databases and has to be opened first. Then, it involves all instance level principals for the current instance.

It is crucial to back up the master database daily. Equally important is to know how to restore the master database to the instance. The most frequent cases are the database crash or the necessity to restore the master database to another instance when no longer use the source instance. In this article, we will examine the specific case of moving the master database to another instance. 

It’s definitely not as easy as restoring other databases, but it is possible.

Comments closed

Azure Data Studio Extension Generator

Anjali Agarwal and Laura Jiang announce a new product:

The release of the Azure Data Studio extension generator is now available. Install the generator through npm and get started with extension development with these Azure Data Studio extension tutorials.

The Azure Data Studio extension generator is a command line tool designed to help extension authors get started with the process of extension development. It includes extension templates that enable users to create and publish extensions with minimal technical knowledge required. In our most recent release, we have added three highly requested extension templates to the generator.

Anything which helps make extension development easier is fine by me.

Comments closed

An UPSERT Pattern to Avoid

Aaron Bertrand doesn’t like a common insert/update pattern:

I think everyone already knows my opinions about MERGE and why I stay away from it. But here’s another (anti-)pattern I see all over the place when people want to perform an upsert (update a row if it exists and insert it if it doesn’t):

IF EXISTS (SELECT 1 FROM dbo.t WHERE [key] = @key) BEGIN
UPDATE dbo.t SET val = @val WHERE [key] = @key;
END
ELSE
BEGIN
INSERT dbo.t([key], val) VALUES(@key, @val);
END

This looks like a pretty logical flow that reflects how we think about this in real life:

Does a row already exist for this key?
YES: OK, update that row.
NO: OK, then add it.

Click through to learn why this is a bad idea.

Comments closed