Press "Enter" to skip to content

Day: August 6, 2021

Helpful Tools for Apache Kafka Developers

Dave Klein has a few tools to make working with Apache Kafka a little easier:

We like to save the best for last, but this tool is too good to wait. So, we’ll start off by covering kafkacat.

kafkacat is a fast and flexible command line Kafka producer, consumer, and more. Magnus Edenhill, the author of the librdkafka C/C++ library for Kafka, developed it. kafkacat is great for quickly producing and consuming data to and from a topic. In fact, the same command will do both, depending on the context. Check this out:
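Dave's example isn't reproduced here, but a minimal sketch of the idea looks something like this (the broker address and topic name are placeholders):

# stdin is a pipe here, so kafkacat runs as a producer and writes each line to the topic
echo 'hello, kafka' | kafkacat -b localhost:9092 -t my-topic

# stdin is a terminal here, so the same command runs as a consumer and prints messages from the topic
kafkacat -b localhost:9092 -t my-topic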

Read on for more information on this tool, as well as several others.


How Spark Determines Task Numbers and Parallelism

The Hadoop in Real World team explains how the Spark engine decides how many tasks to create for a job and how many can run in parallel:

In this post we will see how Spark decides the number of tasks and number of tasks to execute in parallel in a job.

Let’s see how Spark decides on the number of tasks with the below set of instructions.

[… instructions]

Let’s also assume dataset_X has 10 partitions and dataset_Y has 5 partitions.
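As a rough PySpark sketch of that setup (this is not the post's code; the dataset names, sizes, and partition counts are just the assumptions above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("task-count-demo").getOrCreate()

# Two datasets with the partition counts assumed above
dataset_X = spark.range(0, 1_000_000).repartition(10)
dataset_Y = spark.range(0, 500_000).repartition(5)

# Each stage gets one task per partition it reads, so checking partition
# counts is the first step in reasoning about task counts
print(dataset_X.rdd.getNumPartitions())  # 10
print(dataset_Y.rdd.getNumPartitions())  # 5

# A join adds a shuffle stage; how many of its tasks run at once
# depends on the executor cores available to the job
dataset_X.join(dataset_Y, on="id").count()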

Click through for the full explanation.


Shrinking Convolutional Neural Networks for TinyML

Pete Warden writes up a tip:

A colleague recently asked for more details on an approach I recommended, but which she hadn’t seen any documentation for. I realized that it was something I’d learned from talking to model builders at Google, and I wasn’t sure there was anything written up, so in the spirit of leaving a trail of breadcrumbs for anyone coming after, I thought I should put it into a quick blog post.

The summary is that if you have MaxPool or AveragePool after a convolutional layer in a network, and you’re targeting a resource-constrained system like a microcontroller, you should try removing them entirely and replacing them with a stride in the convolution instead. This has two main benefits, but to explain it’s easiest to diagram out the network before and after.
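As a rough Keras sketch of the swap (not Pete's code; the layer sizes are made up):

from tensorflow.keras import layers, models

# Before: a convolution followed by a 2x2 max pool
before = models.Sequential([
    layers.Input(shape=(96, 96, 1)),
    layers.Conv2D(8, 3, padding="same", activation="relu"),
    layers.MaxPool2D(2),
])

# After: the pool is removed and the convolution strides by 2 instead,
# producing the same 48x48x8 output with one less layer to execute
after = models.Sequential([
    layers.Input(shape=(96, 96, 1)),
    layers.Conv2D(8, 3, strides=2, padding="same", activation="relu"),
])

before.summary()
after.summary()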

Click through for the full explanation.


Performance Tips when Working with Large Datasets in R

Mira Celine Klein continues a series on performance tuning R code:

Whether your dataset is “large” not only depends on the number of rows, but also on the method you are going to use. It’s easy to compute the mean or sum of as many as 10,000 numbers, but a nonlinear regression with many variables can already take some time with a sample size of 1,000.

Sometimes it may help to parallelize (see part 3 of the series). But with large datasets, you can use parallelization only up to the point where working memory becomes the limiting factor. In addition, there may be tasks that cannot be parallelized at all. In these cases, the strategies from part 2 of this series may be helpful, and there are some more ways:

Click through for four options.


Diving into Data Mesh

Ust Oldfield has a series on data mesh. First up is a primer on the topic:

What is a Data Mesh? Data Mesh is new and was introduced in a few articles by Zhamak Dehghani, starting in May 2019. The first article, How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh, and the second article, Data Mesh Principles and Logical Architecture, form the foundational thought pieces on data meshes. Very simply, and crudely, a Data Mesh aims to overcome the deficiencies of previous generations of analytical architectures by decentralising the ownership and production of analytical data to the teams who own the data domain. It is a convergence of Distributed Domain Driven Architecture, Self-serve Platform Design, and Product Thinking with Data.

Essentially, it applies the learnings from operational systems, namely domain driven design, so that ownership of data is domain oriented. For example, a CRM team will own the Customer domain and all the data within it. Instead of data from all domains flowing into a central data repository, the individual domains curate and serve their datasets in an easily consumable way. The result is a much closer alignment between operational and analytical data.

Then Ust has a deep dive:

A foundational principle of data mesh is the decentralisation of ownership of data to those closest to it. Those closest to it are those in the business, often using the operational systems and also using analytical data. Responsibility and ownership of the data is devolved from a central function to business units and domains. Therefore any change to how a business domain organises itself is limited to the domain itself rather than impacting the entire organisation. This is referred to as the bounded domain context.

Teams responsible for the business operation, e.g. CRM team responsible for Customers, are also responsible for the ownership and serving of the analytical data relating to their domain.

I probably need to spend more time thinking about it, but this feels like a rickety ladder of incentive compatibility problems. I agree with Ust’s characterization of warehouses as staid and that relatively little has happened since the Kimball model came to prominence, but this sounds like taking the problems of the Kimball model (slow to change, an iceberg of code & ETL, etc.) and effectively saying “Hey, here are slightly different views of the source systems, have fun” instead. That way, everybody gets to experience the joy of non-conforming dimensions, disagreeing facts, working with data in different grains, and seven terms for the same thing.


The Basics of Table Variables

Jared Poche explains some of the value of table variables:

In having a talk reviewed recently, it was suggested I spend more time defining some of the subjects I touched on. It occurred to me that if I should go over (or at least introduce) these ideas during a talk for a SQL Saturday audience, some might find a post on the subject useful. Hence my recent post on key lookups.

Another such topic is table variables. I use table variables frequently at my current job, but they came up very infrequently when I worked at CSS in Microsoft. I remember the conversations about them being very simple at the time, as in, “you should just use temp tables instead.” But there is a lot of utility with table variables, and they could be a useful arrow in your quiver.
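For anyone who hasn't seen the syntax, here's a quick illustrative contrast (table and column names are made up):

-- A table variable, scoped to the current batch or procedure
DECLARE @RecentOrders TABLE
(
    OrderID int NOT NULL PRIMARY KEY,
    OrderDate datetime2 NOT NULL
);

INSERT INTO @RecentOrders (OrderID, OrderDate)
VALUES (1, SYSUTCDATETIME());

-- The temp table equivalent, which gets column statistics and
-- persists for the rest of the session
CREATE TABLE #RecentOrders
(
    OrderID int NOT NULL PRIMARY KEY,
    OrderDate datetime2 NOT NULL
);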

Click through for the full story.


Removing Data from Query Store

Erin Stellato shows us how to do some spring cleaning:

Automatic removal of data from Query Store is based on configuration, notably the MAX_STORAGE_SIZE_MB, CLEANUP_POLICY (STALE_QUERY_THRESHOLD_DAYS), and SIZE_BASED_CLEANUP_MODE settings. While these are detailed in another post, it’s worth reiterating that ideally, size-based cleanup never kicks in. A database’s Query Store should be sized to accommodate N days’ worth of data, where N is set by CLEANUP_POLICY (STALE_QUERY_THRESHOLD_DAYS). When the size of the Query Store exceeds 90% of MAX_STORAGE_SIZE_MB, and if SIZE_BASED_CLEANUP_MODE is set to AUTO, then cleanup will kick in. This algorithm is not efficient, and it runs single-threaded. It looks for queries that are infrequently executed/less important and deletes those, one by one, until the size is less than 80% of MAX_STORAGE_SIZE_MB. Avoid this type of cleanup if at all possible.
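For reference, those settings all live in the database-level Query Store configuration; the values below are placeholders, not recommendations:

ALTER DATABASE [YourDatabase]
SET QUERY_STORE
(
    MAX_STORAGE_SIZE_MB = 1024,
    CLEANUP_POLICY = (STALE_QUERY_THRESHOLD_DAYS = 30),
    SIZE_BASED_CLEANUP_MODE = AUTO
);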

There is a better way. And an even better way thanks to some of my colleagues.


Indexes and Sorts

Chad Callihan reminds us that sort order can matter for indexes:

When you’re working on an index for a query ordering by one column in ascending order and another column in descending order, do you have your index created to match? Did you know you can specify ASC or DESC in an index? Let’s go through a scenario where ordering in an index makes a noticeable difference.
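As a quick illustration of the syntax (hypothetical table and columns):

-- Key order in the index matches the query's mixed ASC/DESC ORDER BY
CREATE NONCLUSTERED INDEX IX_Orders_Customer_OrderDate
ON dbo.Orders (CustomerID ASC, OrderDate DESC);

-- This query can read the index in order, with no Sort operator in the plan
SELECT CustomerID, OrderDate
FROM dbo.Orders
ORDER BY CustomerID ASC, OrderDate DESC;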

This is particularly important for window functions—the optimizer can sometimes be smart enough to recognize that a value is in reverse order and not need to use a sort operator, but as soon as you drop that OVER() clause in, if things aren’t in the exact order they need to be, you get a sort operator thrown in for free. Or, well, the “your query is now a little bit slower” version of free.
