Press "Enter" to skip to content

Day: December 20, 2019

Sticky Partitioning in Kafka 2.4

Justine Olshan takes us through sticky partitioning in Kafka 2.4:

The sticky partitioner addresses the problem of spreading out records without keys into smaller batches by picking a single partition to send all non-keyed records. Once the batch at that partition is filled or otherwise completed, the sticky partitioner randomly chooses and “sticks” to a new partition. That way, over a larger period of time, records are about evenly distributed among all the partitions while getting the added benefit of larger batch sizes.

It looks like this is an improvement with few downsides.


Fixing the Small File Problem in Hadoop

Guy Shilo takes us through the Hadoop Archive format:

[HDFS] has a hard time handling many small files. The memory footprint of the NameNodes becomes high, as they have to keep track of many small blocks, and the performance of scans goes down.

The best way to fix this situation is, of course, to avoid it in the first place. This can be done when designing the application or the pipeline that inserts the data into HDFS, for example, by bundling many files into one container such as a SequenceFile, an Avro file, or a Hadoop Archive (.har file).

Hadoop Archive is a somewhat overlooked option that I want to demonstrate today. You will see that it can be very useful in some cases but not so great in others.

Read the whole thing before giving it a try, as there are some downsides.


Azure Data Factory Templates and Source Control

Cathrine Wilhelmsen continues a series on Azure Data Factory. First up is source control:

And yeah, I usually recommend that you set up source control early in your project, and not on day 18… However, it does require some external configuration, and in this series I wanted to get through the Azure Data Factory basics first. But by now, you should know enough to decide whether or not to commit to Azure Data Factory as your data integration tool of choice.

Next up is using the template gallery:

You can also create custom templates and share them with your team – or share them externally with others. Custom templates are saved in your code repository and will show up in the template gallery for you and your team. If you want to share them externally, you can easily export them, so others can import them in their Azure Data Factory.

Let’s take a look!

Read on to learn more.


No-Longer-Necessary Trace Flags

Monica Rathbun points out some of the trace flags which are no longer important in SQL Server:

If you have ever attended one of my performance tuning sessions, you know I tend to talk about trace flags. Trace flags can help fix performance issues, and some have become default behavior in later SQL Server versions. In my opinion, when a trace flag's behavior becomes the default in a newer version, you should consider putting it in place in environments that do not yet have it implemented. Below are a few of these trace flags, along with Microsoft's definition of what each one does, taken straight from the Microsoft documentation. I have also included a brief commentary on each one. As with any change, you should be sure to thoroughly test before implementing these trace flags in any production environment.

Read the whole thing, especially because at least one of these trace flags is still optional and off by default (though you can now change the behavior at a different scope).
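
As a concrete starting point, here is a minimal sketch (mine, not Monica's; [YourDatabase] is a placeholder) of checking what is already enabled globally and where the SQL Server 2016+ scoped replacements live:

    -- List trace flags enabled globally on this instance
    DBCC TRACESTATUS(-1);

    -- Enable one globally, e.g. TF 2371 (its behavior became the default
    -- under compatibility level 130 in SQL Server 2016)
    DBCC TRACEON(2371, -1);

    -- On SQL Server 2016+, the behaviors behind TF 1117 and TF 1118
    -- moved to filegroup and database scope, respectively
    ALTER DATABASE [YourDatabase] MODIFY FILEGROUP [PRIMARY] AUTOGROW_ALL_FILES;
    ALTER DATABASE [YourDatabase] SET MIXED_PAGE_ALLOCATION OFF;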


SQL Server Truncating Numbers to Asterisks

Bert Wagner points out that some numeric types handle overflow in a weird way:

Why does SQL Server sometimes error when converting a number into a string, but other times succeeds and returns an asterisk?

I don’t know.

The best (and logical) answer I could find online is from Robert Sheldon, who attributes it to poor error handling practices, “…before error handling got a more reputable foothold.”

This makes it important to check your results. I imagine that there’s somebody who relies upon this exact functionality, but it’s pretty weird.
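
To make the inconsistency concrete, here is a quick repro (my own example, not Bert's exact demo): the same overflow produces two different behaviors depending on the source type.

    -- int to a too-short string: silently returns '*'
    SELECT CAST(123456 AS VARCHAR(3));

    -- decimal to a too-short string: fails with error 8115 (arithmetic overflow)
    SELECT CAST(CAST(123456 AS DECIMAL(9, 0)) AS VARCHAR(3));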


The Fickleness of Batch Mode on Rowstore

Erik Darling points out how difficult it can sometimes be to get batch mode processing on rowstore tables:

I’m excited about this feature. I’m not being negative, here. I just want you, dear reader, to have reasonable expectations about it.

This isn’t a post about it making a query slower, but I do have some demos of that happening. I want to show you an example of it not kicking in when it probably should. I’m going to use an Extended Events session that I first read about on Dmitry Pilugin’s blog. It’ll look something like this.
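
Erik's exact session isn't reproduced in the excerpt, but a minimal sketch of that kind of session, assuming the batch_mode_heuristics Extended Event that Dmitry documented for SQL Server 2019 (confirm the event exists on your build first), might look like this:

    -- Confirm the event exists on this build before relying on it
    SELECT name FROM sys.dm_xe_objects WHERE name LIKE '%batch_mode%';

    -- A bare-bones session capturing batch mode heuristics decisions
    CREATE EVENT SESSION batch_mode_check ON SERVER
        ADD EVENT sqlserver.batch_mode_heuristics
        ADD TARGET package0.ring_buffer;

    ALTER EVENT SESSION batch_mode_check ON SERVER STATE = START;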

Read on for a demonstration of the point.


TRY_PARSE and NaN

Slava Murygin finds a nasty bug in SQL Server:

The database in trouble has a table with a FLOAT column. Its front-end application verifies users’ input and inserts the data into that column using the TRY_PARSE function.
The developer’s intention was that any “Not-a-Number” or “Out-of-Range” values would be automatically converted to NULL, and it would be left to the user’s discretion to verify and fix these values.

However, one of the application’s users was very educated and, instead of an empty space, NULL, or any other bad non-numeric value, supplied “NaN” for empty cells, which simply stands for “Not a Number”.
That action caused database corruption!

Click through for a demo which you should not repeat on a work server.
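
That said, the harmless half of the repro is easy to see for yourself. TRY_PARSE relies on .NET parsing, which treats 'NaN' as a valid float rather than garbage; a sketch, with one possible guard (the special-case list is my suggestion, not Slava's):

    -- Ordinary garbage comes back NULL, as the developer expected...
    SELECT TRY_PARSE('abc' AS float) AS garbage_input;

    -- ...but 'NaN' parses to a NaN float value, not NULL (your client may
    -- even fail to display it); do not insert it into a table
    SELECT TRY_PARSE('NaN' AS float) AS nan_input;

    -- One possible guard: screen out the special-case strings before parsing
    DECLARE @input varchar(50) = 'NaN';
    SELECT CASE WHEN @input IN ('NaN', 'Infinity', '-Infinity') THEN NULL
                ELSE TRY_PARSE(@input AS float)
           END AS safe_value;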
