Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries –
Java Serialization: By default, spark uses Java’s ObjectOutputStream framework which can work with any class that implements java.io.serializable. This serialization is flexible but slow and creates large serialized formats for many classes.
Kryo Serialization: Spark can use Kryo library to serialize objects. It is much faster and compact but does not support all serializable types. So we must register those classes which we want to be serialized. Therefore, Kryo uses indices instead of full class names to identify data types which reduce the size of the serialized data thereby increasing performance. We can initialize our spark conf by setting the value of the property spark.serializer to org.apache.spark.serializer.KryoSerializer. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To know more about this serializer, refer Kryo documentation
There are some good tips in here.
Let’s start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We’ll get to that in a moment. The other use of rapply is to a apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let’s say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables.
Click through for some examples of
rapply as well as
eapply. I’ve used
rapply to get cardinality of each feature in a data frame but the other two are new to me. H/T R-bloggers
Consider the following query execution scenario:
- You execute a SELECT query that triggers an automatic synchronous statistics update.
- The synchronous statistics update begins execution and your query waits (is essentially blocked) until the fresh statistics are generated.
- The query compilation and execution does not resume until the synchronous statistics update operation completes.
During this time, there are no external signs via common troubleshooting channels that the query is specifically waiting for the synchronous statistics update operation to complete. If the statistics update takes a long time (due to a large table and\or busy system), there is no easy way to determine root cause of the high duration.
This is now available in SQL Server 2019 CTP 2.1. Read the whole thing.
The way SQL Server enforces transaction durability, in part, is by ensuring that all of the transaction’s changes are written to the database’s transaction log on disk before returning control to the caller. In a case of a power failure after a transaction’s commit was acknowledged, you know that all those changes were at least written to the on-disk transaction log. That’s the case even if the related data pages were modified only in the data cache (the buffer pool) but not yet flushed to the data files on disk. When you restart SQL Server, during the redo phase of the recovery process, SQL Server uses the information recorded in the log to replay changes that were applied after the last checkpoint and that haven’t made it to the data files. There’s a bit more to the story depending on the recovery model that you’re using and on whether bulk operations were applied after the last checkpoint, but for the purposes of our discussion, suffice to focus on the part that involves hardening the changes to the transaction log.
The tricky part in SQL Server’s logging architecture is that log writes are sequential. Had SQL Server not used some sort of a log buffer to alleviate log writes to disk, write-intensive systems—especially ones that involve lots of small transactions—would quickly run into terrible log-write-related performance bottlenecks.
To alleviate the negative performance impact of frequent sequential log writes to disk, SQL Server uses a log buffer in memory.
Itzik also covers a few technologies which can help if you’re experiencing log buffer flush-related slowness, including enabling delayed durability and purchasing nonvolatile memory (NVDIMM-N) for storage.
Sometime in the next month, I’ll also be updating Start-DbaMigration to more closely match the parameters of Export-DbaInstance. Parameters like
NoLoginswill be replaced by
-Exclude Databases, Logins.
So the functionality won’t necessarily change, but if you have scheduled tasks or scripts that perform migrations, you will need to update your parameters once you update dbatools once these changes are made.
Keep an eye out for all of these changes if you’re a regular dbatools user or have processes scripted.
The ability to add columns and measures to display folders in the new Modelling view is one of my favourite features of the November 2018 release of Power BI Desktop: it makes complex models a lot more user-friendly. Being an old-school SSAS developer, I immediately wondered if they worked in the same way that they do in SSAS – and the answer is yes and no.
The good news is that display folders can be nested.
Click through for a demo as well as the issues Chris spotted.
In this blog post, we’re going to work on deploying a SQL Server Availability Group in a Kubernetes Cluster in on-premises virtual machines. I’m going to walk you through the process as it’s documented by Microsoft at this link here. This document is very good but only shows you how to do it in Azure, we’re going to do it in VMs. I’m going to follow Microsoft’s documentation as much as possible, deviating only when necessary for on-premises deployments. I’m also going to explain the key Kubernetes concepts that you need to know to understand how all these pieces together. This is a long one, buckle up.
He is not kidding about the length of the post.
For certain types of data, writing it to the data lake really is frequently the best choice. This is often true for low latency IoT data, semi-structured data like logs, and varying structures such as social media data. However, the handling of structured data which originates from a relational database is much less clear.
Most data lake technologies store data as files (like csv, json, or parquet). This means that when we extract relational data into a file stored in a data lake, we lose valuable metadata from the database such as data types, constraints, foreign keys, etc. I tend to say that we “de-relationalize” data when we write it to a file in the data lake. If we’re going to turn right around and load that data to a relational database destination, is it the right call to write it out to a file in the data lake as an intermediary step?
Click through for considerations on both sides of the fence.