Spark At Scale

Kevin Feasel



Sital Kedia, Shuojie Wang, and Avery Ching have an example of how Facebook uses (and has improved) Spark for their ranking system:

Debugging at full scale can be slow, challenging, and resource intensive. We started off by converting the most resource intensive part of the Hive-based pipeline: stage two. We started with a sample of 50 GB of compressed input, then gradually scaled up to 300 GB, 1 TB, and then 20 TB. At each size increment, we resolved performance and stability issues, but experimenting with 20 TB is where we found our largest opportunity for improvement.

While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks. Three out of 10 hours of job runtime were spent moving files from the staging directory to the final directory in HDFS. Initially, we considered two options: Either improve batch renaming in HDFS to support our use case, or configure Spark to generate fewer output files (difficult due to the large number of tasks — 70,000 — in this stage). We stepped back from the problem and considered a third alternative. Since the tmp_table2 table we generate in step two of the pipeline is temporary and used only to store the pipeline’s intermediate output, we were essentially compressing, serializing, and replicating three copies for a single read workload with terabytes of data. Instead, we went a step further: Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort.

Maybe it’s just a mindset thing, but the part that impressed me was the number of pull requests for system improvements (and the number which were accepted).

Deleting From Heaps

Uwe Ricken discusses a downside to leaving an active table as a heap:

The simplest way of removing empty data pages from a heap is via the exclusive locking of the table. But keep in mind that there is no such thing as a free lunch. Blocking the table from other processes will lead to a system that will not scale. As long as the DELETE process locks the table, no other process can access the table. The only exception is the “SNAPSHOT ISOLATION” but explaining the pros and cons of this technique is outside the scope of this article.

The next example (with the new created table) demonstrates the behavior.

The section on lock escalation was rather interesting.

Pattern Matching

Kevin Feasel



Aaron Bertrand has an article on performance testing various pattern matching techniques:

I wanted to prove to myself that the above procedure was undeniably better overall for all possible searches than any of the queries that use ORconditionals, regardless of whether they use LIKE, LEFT/RIGHT, SUBSTRING, or CHARINDEX. I took David’s basic query structures and put them in stored procedures (with the caveat that I can’t really test “contains” without his input, and that I had to make his OR logic a little more flexible to get the same number of rows), along with a version of my logic. I also planned to test the procedures with and without an index I would create on the search column, and under both a warm and a cold cache.

The conclusion is ambiguous.  Read on for the full story.

Big Play Alerts

Kevin Feasel



Allison Tharp has a Python script to track extremely important events:

First, we get the game data for the game we want.  In this instance, I am getting game data for the Indianapolis vs Cincinnati game in the 4th week of the 2016 preseason and setting it to the variable g.  Next, we will get the current number of scoring plays (scores0), number of home/away team turnovers (home/awayto0), number of home/away penalties (home/awaypenalty0), and finally, the number of yards that resulted from home/away penalties (home/awaypenyds0).

The rest of the script runs while the game is still in progress.  To check if the game is in progress, we use g.game_over().  If this object is False, the game is ongoing:

I did not know about the nflgame module and I think my life has just become better as a result of learning about this.


Ewald Cress discusses superlatch promotion:

There are quite a few pieces of machinery that are involved in our little drama. First, I’ll introduce some instance-global settings:

  • A flag that controls whether latch promotion is enabled at all. Although I don’t have any information about this, let’s assume that it will be enabled on any system that “warrants it”.
  • A flag that controls whether cycle-based promotion is enabled. Again, I can’t currently tell you what determines this setting.
  • sm_promotionThreshold, the current calculated cycle-based promotion threshold described in Part 3.
  • sm_promotionUpperBoundCpuTicks, used as a ceiling value to prevent outliers from skewing stats. As described in Part 3, this is simply sm_promotionThreshold * 5.
  • Trace flag 844, which lowers the threshold for non-cycle-based promotions.
  • Trace flag 827, which causes each latch promotion to be noted in the SQL Server log (“Latch promotion, page %u:%u in database %u, objid %u.”)

Assume that the first flag is set on our system of interest, otherwise promotions won’t happen and we have nothing to talk about.

Read the whole thing.

Powershell Prompts

Thomas Rushton wants to customize his Powershell prompt:

By default, your PowerShell prompt is “PS <<path>> >”. So normal; so dull. However, that can be easily changed. In my case, I want to add:

  • the current time (well, the time at which the prompt was generated)
  • a reminder which account I’m using (I have two – a “normal user” one, and a “DBA/SU” one
  • a countdown of seconds remaining until my next holiday

Oh, and I still need to see the current path.

Very interesting.  By contrast, here’s how to change a Bash prompt.

Biml Relational Hierarchy

Kevin Feasel



Andy Leonard introduces the Biml Relational Hierarchy:

There are two ways to populate the relational hierarchy:

  • Graphically
  • Programmatically

In this post, I am going to cover graphically populating the relational hierarchy. Scott Currie [Varigence] provided a remarkable example of programmatically populating the relational hierarchy in his article Biml Basics for Relational DBs.

Populating the relational hierarchy graphically is simple; just follow the steps as they are listed from top to bottom in the Logical View’s Relational node in the Mist IDE. Begin with Connections.

Following along does require a copy of Mist/BimlStudio but there’s a free trial.  If your company gets seriously into Biml, Mist is a great purchase.

Azure SQL Database Supports JSON

Kevin Feasel


Cloud, JSON

Jovan Popvic reports that Azure SQL Database now has full JSON support:

JSON is available in all service tiers (basic, standard, and premium) but only in new SQL Database V12. You can see quick  introduction here or more details in Getting Started page. you can also find code samples that JSON functions in Azure Sql Database on official Sql Server/Azure Sql Database GitHub repository.

Note that OPENJSON function requires database compatibility level 130. If all functions work except OPENJSON, you would need to set the latest compatibility level in database.

It will be interesting to see adoption of JSON within Azure SQL Database.  I could see it being a bit more likely due to DocumentDB.

XE Default Sessions

Jason Brimhall goes into the various Extended Events default sessions:

With the release of SQL Server 2016 there is another default session that gets installed. This session is probably the most difficult to find. This is a private hidden session when it is installed. In addition, it is not present by default with SQL 2016 RTM but it was present by default with various CTP versions. The session is specific to the new stretch feature and is called rdaxesession.

Breaking down this particular session reveals pretty easily what feature the session is designed to support. The name rdaxesession breaks down to the following: rda = remote data archive, and then the rest is self explanatory. If I investigate the metadata for the session I will discover that the following events are traced in the session:

Jason also has an earlier blog post about the default sessions I tend to use, like system health.

Date Dimension With DAX

Meagan Longoria shows how to create a date dimension using only DAX:

The fiscal calendar assumes calendar months with the fiscal calendar shifted by some number of months. It also assumes that the fiscal year is ahead of the calendar year. That is, fiscal year 2017 starts in the first day of some month in 2016. In my example script, the fiscal year starts in October. If you have some 4-4-5, 4-5-4, or other calendar, the fiscal calendar calculations in this script won’t help you.

To add this date dimension to your SSAS Tabular project:

  1. Right click on the Model.bim file in the Solution Explorer and choose View Code.

  2. If you have at least one table in the model already, locate the end of the definition of the previous table, inside the table array. Add a comma after the end curly brace for the previous table.

  3. Paste the JSON/TMSL from my Gist.

  4. Save and close the file.

  5. Right click on Model.Bim and choose View Designer.

Click through for the script and additional explanation.


September 2016
« Aug Oct »