Month: November 2018

Tuning Apache Spark Applications

Vidisha Gupta has a few tips for tuning Apache Spark programs:

Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries:

  • Java Serialization: By default, Spark uses Java’s ObjectOutputStream framework, which can work with any class that implements java.io.Serializable. This serialization is flexible but slow, and it creates large serialized formats for many classes.

  • Kryo Serialization: Spark can use the Kryo library to serialize objects. It is much faster and more compact but does not support all serializable types, so we must register the classes we want serialized. Kryo then uses indices instead of full class names to identify data types, which reduces the size of the serialized data and thereby increases performance. We can enable it by setting the spark.serializer property to org.apache.spark.serializer.KryoSerializer in our Spark configuration. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To learn more, refer to the Kryo documentation.

There are some good tips in here.
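
If you want to give Kryo a try, a minimal PySpark sketch might look like the following; the com.example.SensorReading class name is a hypothetical stand-in for whatever type you shuffle or cache.

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    conf = (
        SparkConf()
        # Switch the serializer from the Java default to Kryo.
        .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
        # Register classes up front so Kryo can write compact indices instead
        # of full class names; com.example.SensorReading is a made-up class.
        .set("spark.kryo.classesToRegister", "com.example.SensorReading")
    )

    spark = SparkSession.builder.config(conf=conf).appName("kryo-demo").getOrCreate()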

The Lesser-Known Apply Functions In R

Andrew Treadway covers a few of the lesser-known apply functions in R:

rapply

Let’s start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We’ll get to that in a moment. The other use of rapply is to apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let’s say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables.

Click through for some examples of rapply as well as vapply and eapply.  I’ve used rapply to get cardinality of each feature in a data frame but the other two are new to me.  H/T R-bloggers
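
That numeric-columns-only pattern is easy to mimic outside of R, too. Here is a rough pandas analog (a sketch with made-up sample data, not something from Andrew’s post):

    import pandas as pd

    # A data frame mixing categorical and numeric columns, mirroring the
    # rapply example described above.
    df = pd.DataFrame({
        "name": ["a", "b", "c"],
        "height": [1.2, 3.4, 5.6],
        "weight": [10, 20, 30],
    })

    # Apply a function only to the numeric columns, leaving the rest alone --
    # roughly what rapply(df, f, classes = "numeric", how = "replace") does in R.
    numeric_cols = df.select_dtypes(include="number").columns
    df[numeric_cols] = df[numeric_cols].apply(lambda col: col * 2)
    print(df)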

New Diagnostics For Synchronous Statistics Updates

Joe Sack announces a new wait type and request command:

Consider the following query execution scenario:

  • You execute a SELECT query that triggers an automatic synchronous statistics update.
  • The synchronous statistics update begins execution and your query waits (is essentially blocked) until the fresh statistics are generated.
  • Query compilation and execution do not resume until the synchronous statistics update operation completes.

During this time, there are no external signs via common troubleshooting channels that the query is specifically waiting for the synchronous statistics update operation to complete. If the statistics update takes a long time (due to a large table and/or a busy system), there is no easy way to determine the root cause of the high duration.

This is now available in SQL Server 2019 CTP 2.1.  Read the whole thing.
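
Assuming the names from Joe’s announcement (the WAIT_ON_SYNC_STATISTICS_REFRESH wait type and the SELECT (STATMAN) command), a quick check from Python via pyodbc might look like this; the connection string is illustrative only.

    import pyodbc

    # Adjust driver, server, and authentication for your environment.
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=master;Trusted_Connection=yes;"
    )

    # Sessions currently blocked on a synchronous statistics update surface
    # the new wait type and command in sys.dm_exec_requests.
    rows = conn.execute("""
        SELECT session_id, command, wait_type, wait_time
        FROM sys.dm_exec_requests
        WHERE wait_type = N'WAIT_ON_SYNC_STATISTICS_REFRESH'
           OR command = N'SELECT (STATMAN)';
    """).fetchall()

    for row in rows:
        print(row.session_id, row.command, row.wait_type, row.wait_time)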

Deep Dive On Log Buffer Flushes

Itzik Ben-Gan delves into log buffer flushes and how SQL Server maintains durability without giving up too much performance:

The way SQL Server enforces transaction durability, in part, is by ensuring that all of the transaction’s changes are written to the database’s transaction log on disk before returning control to the caller. In a case of a power failure after a transaction’s commit was acknowledged, you know that all those changes were at least written to the on-disk transaction log. That’s the case even if the related data pages were modified only in the data cache (the buffer pool) but not yet flushed to the data files on disk. When you restart SQL Server, during the redo phase of the recovery process, SQL Server uses the information recorded in the log to replay changes that were applied after the last checkpoint and that haven’t made it to the data files. There’s a bit more to the story depending on the recovery model that you’re using and on whether bulk operations were applied after the last checkpoint, but for the purposes of our discussion, suffice to focus on the part that involves hardening the changes to the transaction log.

The tricky part in SQL Server’s logging architecture is that log writes are sequential. Had SQL Server not used some sort of a log buffer to alleviate log writes to disk, write-intensive systems—especially ones that involve lots of small transactions—would quickly run into terrible log-write-related performance bottlenecks.

To alleviate the negative performance impact of frequent sequential log writes to disk, SQL Server uses a log buffer in memory.

Itzik also covers a few technologies which can help if you’re experiencing log buffer flush-related slowness, including enabling delayed durability and purchasing nonvolatile memory (NVDIMM-N) for storage.
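
On the delayed durability option specifically, here is a minimal sketch of turning it on and opting in per transaction from Python via pyodbc; the database and table names are hypothetical, and the trade-off is a small window of potential data loss at commit time.

    import pyodbc

    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=localhost;DATABASE=SalesDb;Trusted_Connection=yes;",
        autocommit=True,
    )

    # Allow individual transactions to opt into delayed durability, so the
    # log flush happens in the background rather than synchronously at commit.
    conn.execute("ALTER DATABASE SalesDb SET DELAYED_DURABILITY = ALLOWED;")

    # A transaction that accepts the small loss window in exchange for not
    # waiting on the log buffer flush when it commits.
    conn.execute("""
        BEGIN TRANSACTION;
        UPDATE dbo.Orders SET Status = 'Shipped' WHERE OrderId = 42;
        COMMIT TRANSACTION WITH (DELAYED_DURABILITY = ON);
    """)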

Breaking Changes Coming To dbatools

Chrissy LeMaire warns us about breaking changes coming to dbatools with release 1.0:

Sometime in the next month, I’ll also be updating Start-DbaMigration to more closely match the parameters of Export-DbaInstance. Parameters like NoDatabases and NoLogins will be replaced by -Exclude Databases, Logins.

So the functionality won’t necessarily change, but if you have scheduled tasks or scripts that perform migrations, you will need to update your parameters when you upgrade dbatools after these changes are made.

Keep an eye out for all of these changes if you’re a regular dbatools user or have processes scripted.

Display Folder Nesting In Power BI

Chris Webb shows off a new feature to Power BI Desktop and notes a couple of issues with it:

The ability to add columns and measures to display folders in the new Modelling view is one of my favourite features of the November 2018 release of Power BI Desktop: it makes complex models a lot more user-friendly. Being an old-school SSAS developer, I immediately wondered if they worked in the same way that they do in SSAS – and the answer is yes and no.

The good news is that display folders can be nested.

Click through for a demo as well as the issues Chris spotted.

Deploying SQL Server Availability Groups Via Kubernetes

Anthony Nocentino continues his series on Kubernetes:

In this blog post, we’re going to work on deploying a SQL Server Availability Group in a Kubernetes Cluster in on-premises virtual machines. I’m going to walk you through the process as it’s documented by Microsoft at this link here. This document is very good but only shows you how to do it in Azure; we’re going to do it in VMs. I’m going to follow Microsoft’s documentation as much as possible, deviating only when necessary for on-premises deployments. I’m also going to explain the key Kubernetes concepts that you need to know to understand how all these pieces fit together. This is a long one, buckle up.

He is not kidding about the length of the post.

On Whether Relational Data Belongs In A Data Lake

Melissa Coates debates whether relational data really belongs in a data lake:

For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.

Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?

Click through for considerations on both sides of the fence.
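
As a small illustration of that “de-relationalize” point, here is a made-up round trip through csv in Python; the typed data frame stands in for rows extracted from a relational table.

    import pandas as pd

    # A typed extract, standing in for rows pulled from a relational table.
    df = pd.DataFrame({
        "id": [1, 2],
        "loaded_at": pd.to_datetime(["2018-11-01", "2018-11-02"]),
    })
    df["id"] = df["id"].astype("int32")

    df.to_csv("extract.csv", index=False)
    back = pd.read_csv("extract.csv")

    print(df.dtypes)    # id: int32, loaded_at: datetime64[ns]
    print(back.dtypes)  # id: int64, loaded_at: object -- the type metadata is gone

Formats like parquet at least preserve column types, but constraints, foreign keys, and defaults from the source database are lost either way, which is the core of Melissa’s concern.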

Bias Correction In Standard Deviation Estimates

John Mount explains how to perform bias correction and explains why it happens so rarely in practice:

The bias in question is falling off at a rate of 1/n (where n is our sample size). So the bias issue loses what little gravity it may ever have had when working with big data. Most sources of noise will be falling off at a slower rate of 1/sqrt(n), so it is unlikely this bias is going to be the worst feature of your sample.

But let’s pretend the sample size correction indeed is an important point for a while.

Under the “no bias allowed” rubric: if it is so vitally important to bias-correct the variance estimate, would it not be equally critical to correct the standard deviation estimate?

The practical answer seems to be: no. The straightforward standard deviation estimate itself is biased (it has to be, as a consequence of Jensen’s inequality). And pretty much nobody cares, corrects it, or teaches how to correct it, as it just isn’t worth the trouble.

This is a good explanation of the topic as well as the reason people make these corrections so rarely.
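
For reference, the mechanics work out like this: Bessel’s correction makes the sample variance unbiased, Jensen’s inequality then forces the square root to be biased low, and for normally distributed samples the classical fix divides by the constant c_4(n):

    s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad E[s^2] = \sigma^2

    E[s] = E\left[\sqrt{s^2}\right] < \sqrt{E[s^2]} = \sigma    (Jensen’s inequality: the square root is strictly concave)

    c_4(n) = \sqrt{\frac{2}{n-1}}\,\frac{\Gamma(n/2)}{\Gamma((n-1)/2)}, \qquad E\left[\frac{s}{c_4(n)}\right] = \sigma    (normal samples)

In these terms, John’s point is that c_4(n) tends to 1 as n grows, which is exactly why the correction rarely matters in practice.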

Kerberos Authentication In Apache Cassandra

Justin Cameron announces an open source Kerberos authenticator in Apache Cassandra:

In conjunction with the Cassandra authenticator, we have also published an open-source Kerberos authenticator plugin for the Cassandra Java driver.

The plugin supports multiple Kerberos quality of protection (QOP) levels, which may be specified directly when configuring the authenticator. The driver’s QOP level must match the QOP level configured for the server authenticator, and is only used during the authentication exchange. If confidentiality and/or integrity protection is required for all traffic between the client and Cassandra, it is recommended that Cassandra’s built-in SSL/TLS be used (note that TLS also protects the Kerberos authentication exchange, when enabled).

An (optional) SASL authorization ID is also supported. If provided, it specifies a Cassandra role that will be assumed once the Kerberos client principal has authenticated, provided the Cassandra user represented by the client principal has been granted permission to assume the role. Access to other roles may be granted using the GRANT ROLE CQL statement.

Click through for more details and check out the GitHub repo.
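
The plugin Justin describes targets the Java driver. As a rough analog from Python, the DataStax Python driver’s generic SaslAuthProvider can negotiate GSSAPI; this is a sketch only, and the host, service name, and commented-out authorization ID are assumptions.

    from cassandra.cluster import Cluster
    from cassandra.auth import SaslAuthProvider  # requires the pure-sasl package

    # The QOP requested here must match the level configured on the
    # server-side authenticator: 'auth', 'auth-int', or 'auth-conf'.
    sasl_kwargs = {
        "service": "cassandra",    # Kerberos service name (assumption)
        "mechanism": "GSSAPI",
        "qops": ["auth"],
        # "authorization_id": "reporting_role",  # optional SASL authorization ID
    }

    cluster = Cluster(
        ["cassandra.example.com"],
        auth_provider=SaslAuthProvider(**sasl_kwargs),
    )
    session = cluster.connect()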
