Kafka Connect is modular in nature, providing a very powerful way of handling integration requirements. Some key components include:
- Connectors – the JAR files that define how to integrate with the data store itself
- Converters – handling serialization and deserialization of data
- Transforms – optional in-flight manipulation of messages
One of the more frequent sources of mistakes and misunderstanding around Kafka Connect involves the serialization of data, which Kafka Connect handles using converters. Let’s take a good look at how these work, and illustrate some of the common issues encountered.
Read on for a good overview of the topic.
Finally I drag in the Values column and the Hierarchy in a matrix (I also turned on the +/- icons so we can expand collapse which is another November feature). The first thing we see is that the matrix looks weird with blanks and all.
The reason we are seeing this is this is how the data is set up, we have stored totals and grand totals at the lowest level in the hierarchy. Instead we want to show to them at the level where they are correct. Also we don’t want to show the aggregation created by the SUM.
Click through for the full example.
Note the :CONNECT command is used to connect to another server.
Because everything else works pretty much the same, and you get a whole lot of additional options, you might choose to open all your new queries in SQLCMD mode. That’s easy to do.
SQLCMD mode is one of those things where I thought I’d use it a lot, but aside from deploying database projects, I don’t. Granted, this could be a failure of imagination on my part.
I started pondering it for a bit and began to wonder if I could use the new for SQL Server 2016 STRING_SPLIT function to convert a binary string to decimal. The thought process was to split the string into rows of CHAR(1) values, along with an in-string character position. Then I could take the “1” values, calculate the 2ᵡ value, and sum them up for a decimal value.
I quickly realized I’d need to reverse the binary string. I also discovered the STRING_SPLIT function eliminated 0’s adjacent to 1’s. I don’t have much use for this code at the moment, but maybe there’s something here that you can take away.
Given some of the issues Dave ran into, it seems that a tally table-based solution might be a better choice.
I have a client that used Itzik Ben-Gan’s solution of creating a filtered nonclustered columnstore index to achieve batch mode on a rowstore (in fact I proposed that the client consider it). They have an OLTP system, and often perform YTD calculations. When they tested, processing time was reduced by 30 to 50 percent, without touching a single line of application code. If that ain’t low hanging fruit, I don’t know what is —
However, during testing, I noticed some intermittent blocking that didn’t make sense to me. But I couldn’t nail it down, and they went live with the “filtered nonclustered columnstore index” solution.
Once they deployed – and there was a lot of concurrency – I could see what had eluded me during my proof of concept: blocking in tempdb.
Read on for the repro and check out Ned’s UserVoice bug report.
The concept of workload management is a key factor for Azure SQL DW as there is only limited concurrency slots available and depending on the resource class, these slots can fill up pretty quickly. Once the concurrency slots are full, queries are queued until a sufficiently sized slot is opened up. Let’s recap what Resource Classes are and how they affect workload management.
A Resource Class is a pre-configured database role that determines how much resource is allocated to queries coming from users that belong to that role. For example, an ETL service account may use a “large” resource class and be allocated a generous amount of the server, however an analyst may use a “small” resource class and therefore only use up a small amount of the server with their queries. There are actually 2 types of resource class, Dynamic and Static. The Dynamic resource classes will grant a set percentage of memory to a query and actual value of this percentage will vary as the Warehouse scales up and down. The key factor is that an xLargeRc (extra-large resource class) will always take up 70% of the Server and will not allow any other queries to be run concurrently. No matter how much you scale up the Warehouse, queries run with an xLargeRc will run one at a time. Conversely, queries run with a smallrc will only be allocated 4% of the Server and therefore as a Warehouse scales up, this 4% becomes a larger amount of resource and can therefore process data quicker.
This looks like a useful addition. Click through for a few examples of how it will work.
As with similar functions,
DATEADDcan do arithmetic on dates as well as times. The syntax is straightforward:
DATEADD (datepart, number, date)
numberportion must be an integer, and it must be within the acceptable range of values for the date part.
Click through for a few examples.
Data Serialization – Serialization plays an important role in increasing the performance of any application. Spark provides two serialization libraries –
Java Serialization: By default, spark uses Java’s ObjectOutputStream framework which can work with any class that implements java.io.serializable. This serialization is flexible but slow and creates large serialized formats for many classes.
Kryo Serialization: Spark can use Kryo library to serialize objects. It is much faster and compact but does not support all serializable types. So we must register those classes which we want to be serialized. Therefore, Kryo uses indices instead of full class names to identify data types which reduce the size of the serialized data thereby increasing performance. We can initialize our spark conf by setting the value of the property spark.serializer to org.apache.spark.serializer.KryoSerializer. This serializer has a major impact on performance when we are shuffling or caching a large amount of data. To know more about this serializer, refer Kryo documentation
There are some good tips in here.
Let’s start with rapply. This function has a couple of different purposes. One is to recursively apply a function to a list. We’ll get to that in a moment. The other use of rapply is to a apply a function to only those elements in a list (or columns in a data frame) that belong to a specified class. For example, let’s say we have a data frame with a mix of categorical and numeric variables, but we want to evaluate a function only on the numeric variables.
Click through for some examples of
rapply as well as
eapply. I’ve used
rapply to get cardinality of each feature in a data frame but the other two are new to me. H/T R-bloggers
Consider the following query execution scenario:
- You execute a SELECT query that triggers an automatic synchronous statistics update.
- The synchronous statistics update begins execution and your query waits (is essentially blocked) until the fresh statistics are generated.
- The query compilation and execution does not resume until the synchronous statistics update operation completes.
During this time, there are no external signs via common troubleshooting channels that the query is specifically waiting for the synchronous statistics update operation to complete. If the statistics update takes a long time (due to a large table and\or busy system), there is no easy way to determine root cause of the high duration.
This is now available in SQL Server 2019 CTP 2.1. Read the whole thing.