Tuning Apache Spark Applications

Vidisha Gupta has a few tips for tuning Apache Spark programs:

Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries –

  • Java Serialization: By default, spark uses Java’s ObjectOutputStream framework which can work with any class that implements java.io.serializable. This serialization is flexible but slow and creates large serialized formats for many classes.

  • Kryo Serialization: Spark can use Kryo library to serialize objects. It is much faster and compact but does not support all serializable types. So we must register those classes which we want to be serialized. Therefore, Kryo uses indices instead of full class names to identify data types which reduce the size of the serialized data thereby increasing performance. We can initialize our spark conf by setting the value of the property spark.serializer to org.apache.spark.serializer.KryoSerializer. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To know more about this serializer, refer  Kryo documentation

There are some good tips in here.

The Lesser-Known Apply Functions In R

Kevin Feasel

2018-11-14

R

Andrew Treadway covers a few of the lesser-known apply functions in R:

rapply

Let’s start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We’ll get to that in a moment. The other use of rapply is to a apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let’s say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables.

Click through for some examples of rapply as well as vapply and eapply.  I’ve used rapply to get cardinality of each feature in a data frame but the other two are new to me.  H/T R-bloggers

New Diagnostics For Synchronous Statistics Updates

Joe Sack announces a new wait type and request command:

Consider the following query execution scenario:

  • You execute a SELECT query that triggers an automatic synchronous statistics update.
  • The synchronous statistics update begins execution and your query waits (is essentially blocked) until the fresh statistics are generated.
  • The query compilation and execution does not resume until the synchronous statistics update operation completes.

During this time, there are no external signs via common troubleshooting channels that the query is specifically waiting for the synchronous statistics update operation to complete.  If the statistics update takes a long time (due to a large table and\or busy system), there is no easy way to determine root cause of the high duration.

This is now available in SQL Server 2019 CTP 2.1.  Read the whole thing.

Deep Dive On Log Buffer Flushes

Itzik Ben-Gan delves into log buffer flushes and how SQL Server maintains durability without giving up too much performance:

The way SQL Server enforces transaction durability, in part, is by ensuring that all of the transaction’s changes are written to the database’s transaction log on disk before returning control to the caller. In a case of a power failure after a transaction’s commit was acknowledged, you know that all those changes were at least written to the on-disk transaction log. That’s the case even if the related data pages were modified only in the data cache (the buffer pool) but not yet flushed to the data files on disk. When you restart SQL Server, during the redo phase of the recovery process, SQL Server uses the information recorded in the log to replay changes that were applied after the last checkpoint and that haven’t made it to the data files. There’s a bit more to the story depending on the recovery model that you’re using and on whether bulk operations were applied after the last checkpoint, but for the purposes of our discussion, suffice to focus on the part that involves hardening the changes to the transaction log.

The tricky part in SQL Server’s logging architecture is that log writes are sequential. Had SQL Server not used some sort of a log buffer to alleviate log writes to disk, write-intensive systems—especially ones that involve lots of small transactions—would quickly run into terrible log-write-related performance bottlenecks.

To alleviate the negative performance impact of frequent sequential log writes to disk, SQL Server uses a log buffer in memory.

Itzik also covers a few technologies which can help if you’re experiencing log buffer flush-related slowness, including enabling delayed durability and purchasing nonvolatile memory (NVDIMM-N) for storage.

Breaking Changes Coming To dbatools

Chrissy LeMaire warns us about breaking changes coming to dbatools with release 1.0:

Sometime in the next month, I’ll also be updating Start-DbaMigration to more closely match the parameters of Export-DbaInstance. Parameters like NoDatabases and NoLogins will be replaced by -Exclude Databases, Logins.

So the functionality won’t necessarily change, but if you have scheduled tasks or scripts that perform migrations, you will need to update your parameters once you update dbatools once these changes are made.

Keep an eye out for all of these changes if you’re a regular dbatools user or have processes scripted.

Display Folder Nesting In Power BI

Chris Webb shows off a new feature to Power BI Desktop and notes a couple of issues with it:

The ability to add columns and measures to display folders in the new Modelling view is one of my favourite features of the November 2018 release of Power BI Desktop: it makes complex models a lot more user-friendly. Being an old-school SSAS developer, I immediately wondered if they worked in the same way that they do in SSAS – and the answer is yes and no.

The good news is that display folders can be nested.

Click through for a demo as well as the issues Chris spotted.

Deploying SQL Server Availability Groups Via Kubernetes

Anthony Nocentino continues his series on Kubernetes:

In this blog post, we’re going to work on deploying a SQL Server Availability Group in a Kubernetes Cluster in on-premises virtual machines. I’m going to walk you through the process as it’s documented by Microsoft at this link here. This document is very good but only shows you how to do it in Azure, we’re going to do it in VMs. I’m going to follow Microsoft’s documentation as much as possible, deviating only when necessary for on-premises deployments. I’m also going to explain the key Kubernetes concepts that you need to know to understand how all these pieces together. This is a long one, buckle up.

He is not kidding about the length of the post.

On Whether Relational Data Belongs In A Data Lake

Melissa Coates debates whether relational data really belongs in a data lake:

For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?

Click through for considerations on both sides of the fence.

Categories

November 2018
MTWTFSS
« Oct Dec »
 1234
567891011
12131415161718
19202122232425
2627282930