Kafka Streams And Time-Based Batching

Vladimir Vajda provides a warning for people using Kafka Streams:

To completely understand the problem, we will first go into detail how ingestion and processing occur by default in Kafka Streams. For example purposes, the punctuate method is configured to occur every ten seconds, and in the input stream, we have exactly one message per second. The purpose of the job is to parse input messages, collect them, and, in the punctuate method, do a batch insert in the database, then to send metrics.

After running the Kafka Stream application, the Processor will be created, followed by the initmethod. Here is where all the connections are established. Upon successful start, the application will listen to input topic for incoming messages. It will remain idle until the first message arrives. When the first message arrives, the process method is called — this is where transformations occur and where the result is stored for later use. If no messages are in the input topic, the application will go idle again, waiting for the next message. After each successful process, the application checks if punctuate should be called. In our case, we will have ten process calls followed by one punctuate call, with this cycle repeating indefinitely as long as there are messages.

A pretty obvious behavior, isn’t it? Then why is one bolded?

Read on for more, including how to handle this edge case.

Kafka And GDPR

Kevin Feasel

2017-12-11

Hadoop

Ben Stopford has some ideas for using Kafka in a GDPR world:

The simplest way to remove messages from Kafka is to simply let them expire. By default, Kafka will keep data for two weeks, and you can tune this to arbitrarily large periods of time as required. There is also an Admin API that lets you delete messages explicitly if they are older than some specified time or offset. But what if we are keeping data in the log for a longer period of time, say for Event Sourcing architectures or as a source of truth? For this, you can make use of compacted topics, which allow messages to be explicitly deleted or replaced by key.

Data isn’t removed from compacted topics in the same way as in a relational database. Instead, Kafka uses a mechanism closer to those used by Cassandra and HBase where records are marked for removal then later deleted when the compaction process runs. Deleting a message from a compacted topic is as simple as writing a new message to the topic with the key you want to delete and a null value.  When compaction runs the message will be deleted forever.

Click through for more information.

DevOps And Culture

Kevin Feasel

2017-12-11

DevOps

Alex Yates points out a problem in how many people think of DevOps:

There is so much that is wrong with that conversation.

  • We could talk about the bottlenecks and the large amount of work in progress backed up in test – and the ways that could be fixed,

  • We could talk about the infrequent ‘big bang’ release in three months and the manual, error prone heroics that will probably be required to deliver it – and the ways that could be fixed,

  • We could talk about the misguided approach regarding branching strategies or the shared development database – and the ways they could be fixed,

  • We could talk about testing silos and the likelihood of drift and inconsistencies between different environments – and the ways they could be fixed,

  • We could talk about the word “DevOps-ing” – and why it should be burned along with anyone who uses it un-ironically. (And anyone who uses the word “irony” inappropriately or puts their commas at the end of the line.),

But I’m not going to talk about any of those things. I’m not going to talk about any of the things the customer said. I’m not going to talk about any of the technical issues or the possible solutions to those problems.

I’m going to talk about something much, much more important.

Read on to see what is much, much more important:  culture.

Troubleshooting Client Disconnections

Mike Hays looks at a trace flag that can help you troubleshoot why client connections drop:

About once a month, I get support ticket regarding SQL Server dropping an application’s or user’s connection.  The problem is SQL Server does not just randomly drop a connection and continue to work normally.  Some force outside the control of SQL Server breaks the connection.  By default, SQL Server does not record when this event occurs.

In my history of working with SQL Server, only in extreme situations have I ever seen SQL Server drop its connections.  The most common example is when SQL Server is in the process of shutting down.

Click through for the trace flag and details.

Collations Used In NVarchar To Varchar Conversions

Solomon Rutzky digs into collations:

So, the actual question is a bit more specific than would reasonably fit into a title, and it is:

In a WHERE condition (in Microsoft SQL Server, if that’s not obvious), when a string constant containing Unicode characters but not prefixed with a capital-N is compared to a column, which Collation (and hence Code Page) is used to do the implicit conversion of the Unicode characters into VARCHAR? Is it the Database’s default Collation, or the column’s Collation?

The answer has a lot more nuance than you might first think.  And for the record, it’s a duck.

Thinking About Scalable Persistent Memory

Ned Otter has a good post thinking about Scalable Persistent Memory:

There were other potential issues when using Persistent Memory, detailed in this blog post. But what’s not covered in that post is the fact that deploying NVIDMM-N reduced the memory speed and/or capacity, because they are not compatible with LRDIMM. This causes you to use RDIMM, which reduces capacity, and because NVDIMM-N operates at a slower speed than RDIMM, it also affects total memory speed.

HP has since released Gen10 servers, and they have changed the landscape for those seeking reduced latency by storing larger data sets in memory. For one thing, they raise the bar for what’s now referred to as Scalable Persistent Memory, with a total server capacity of 1TB. To be clear, NVDIMM-N is not used in this configuration. Instead, regular DIMMs are used, and they are persisted to flash via a power source (this was also the case for NVDIMM-N, but both the flash, DIMM, and power source were located on the NVDIMM-N).

Check it out.  I’m happy that things are improving, but it sounds like this won’t be a panacea.

When Data Factory Flows Don’t

Kevin Feasel

2017-12-11

Cloud, ETL

Emma Stewart points out an issue that might vex newcomers to Azure Data Factory:

The data within the Data Lake store was organised into a Year and Month hierarchy for the folders, and each days transactions were stored in a file which was named after the day within the relevant month folder. The task then was to create a pipeline which copies the dataset in the Data Lake Store over to the dbo.Orders table in Azure SQL DB every day within the scheduled period (Q1 2016).

After creating all the json scripts and deploying them (with no errors), I clicked on the ‘Monitor and Manage’ tile to monitor the activities, check everything was working as it should be and monitor the progress. After waiting for at least 10 minutes, I started to get frustrated.

Click through for the fix and an explanation.

Categories

December 2017
MTWTFSS
« Nov Jan »
 123
45678910
11121314151617
18192021222324
25262728293031