Press "Enter" to skip to content

Author: Kevin Feasel

Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedance mismatch between data engineers, data scientists, and ML production engineers:

Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting-edge APIs, as well as the majority of examples and tutorials, use Python APIs. In addition to Python support, there is typically support for other programming languages, including JavaScript for web integration and Java for platform integration—though oftentimes with fewer features and less maturity. No matter what other platforms are supported, chances are very high that your data scientists will build and train their analytic models with Python.

There is an impedance mismatch between model development using Python and its tool stack, and a scalable, reliable data platform with the low latency, high throughput, zero data loss, and 24/7 availability requirements needed for data ingestion, preprocessing, model deployment, and monitoring at scale. Python in practice is not the most well-known technology for these requirements. However, it is a great client for a data platform like Apache Kafka.

Click through for the full argument as well as where Kafka can help mitigate some of the issues.

Why Increase Cost Threshold For Parallelism

Randolph West explains why the default value of cost threshold for parallelism is too low at 5:

Unfortunately, the default setting of 5 means that queries are likely to hit that threshold more often than not on modern hardware, and the optimizer is forced to look at parallel plans unnecessarily. A side-effect to this is that queries running in parallel will block queries running on a single thread, so short-running queries will be delayed by long-running queries needing the same resources.

Therefore, if I am setting up a SQL Server instance from scratch, I will set this value to 50 by default, and monitor the performance counters. Once a query cost exceeds 50, then it can use all the cores up to the limit defined by the max degree of parallelism.

50 is a reasonable first stab at a default. I’ll be honest, though: I’m not sure I know exactly where to set the cost threshold for parallelism. I agree that 5 is too low and 50 is better, but I don’t have a great feel for when it should move up or down short of “everything’s going parallel in your database, so increase the value” or “nothing’s ever going parallel, so maybe decrease it a skosh.”

Power BI Violin Plots

Meagan Longoria shows off a violin plot custom visual in Power BI:

A violin plot is a nifty chart that shows both distribution and density of data. It’s essentially a box plot with a density plot on each side. Box plots are a common way to show variation in data, but their limitation is that you can’t see frequency of values. In other words, you can see statistics such as min, max, median, mean, or quartiles, but you can’t see the individual values nor how often they occurred.

Read on for a review of the custom visual available for violin plots, including areas where it does well and where it falls short at present.

Triggers In Postgres

Ryan Booz explains how triggers work in Postgres from the standpoint of someone familiar with SQL Server:

They are implemented as Functions (Stored Procedures)
This caught me off guard at first. I’ve been working with and dealing with triggers in SQL Server since day 1. They are first-class citizens… objects that have their own code blocks and rules.

PostgreSQL approaches it differently. Any reusable code block, regardless of its true purpose, is a Function of one type or another. Triggers are no different. Therefore, you write the logic of your trigger in a Function and then call it by adding a trigger to the DML event of a table.

Click through for a few more tips on triggers.

Wait Stats In Azure Data Studio

Paul Randal takes the Wait Stats Report in Azure Data Studio for a spin:

Note that the x-axis is percentage of all waits, not wait count. You’ll see that PREEMPTIVE_OS_FLUSHFILEBUFFERS is the top wait on my Linux instance – that’s by design and I’ll blog about that next. I’ve also submitted a GitHub change to add that wait to the list of waits filtered out by the script the extension uses.

Anyway, you can drill in to the details by clicking the ellipsis at the top-right of the graph and selecting ‘Show Details’. That’ll give you all the waits, and by selecting each one you can see the usual output from my waits script. To get more information on what each wait means, select the bottom cell, right-click on the URL to copy it, and paste it into your favorite browser to go to my waits library. And of course, you can refresh the results via the ellipsis as well.

I like how Azure Data Studio is coming together as a full product. There’s a ways to go yet, but it’s getting there.

How’s My Database?

Daniel Janik has a Windows app for you:

There are actually about 40 things it checks for.

Current limitations are that queries with a cursor or temp table are not analyzed. There’s also a bug where the missing indexes and warnings appear on the wrong node/operator. Since the tool is using estimated plans at the moment, it may not be as accurate.

I’m planning a few new features in the next month: feeding the utility a query plan and displaying the original query. I’m also planning on adding history and the ability to execute a query from the tool. Before we get to those, though, we need to fix some known bugs. I’m hoping that you (yes, you!) can help me identify other bugs to make this a great tool for the SQL community.

The product is in beta, so check it out and send Daniel some feedback.

When Differential Backups Grow Larger Than Fulls

Kenneth Fisher notes that differential backups can end up being larger than full backups of the same database:

The thing about DBA Myths is that they are generally widespread and widely believed. At least I believed this one until I posted What’s a differential Backup? and Randolph West (b/t) pointed out that my belief that differential backups can’t get larger than full backups was incorrect. In fact, differential backups (like FULL backups) contain enough transaction log information to cover transactions that occur while the backup is taking place. So if the amount of data that needs to be backed up combined with transactions requires more space than just the data ….

Read on for a demonstration.

Native Math Libraries And Spark ML

Zuling Kang shares with us how we can use native math libraries in netlib-java to speed up certain machine learning algorithms in Apache Spark:

Spark’s MLlib uses the Breeze linear algebra package, which depends on netlib-java for optimized numerical processing. netlib-java is a wrapper for the low-level BLAS, LAPACK, and ARPACK libraries. However, due to licensing issues with runtime proprietary binaries, neither the Cloudera distribution of Spark nor the community version of Apache Spark includes the netlib-java native proxies by default. So without manual configuration, netlib-java only uses the F2J library, a Java-based math library that is translated from Fortran 77 reference source code.

To check whether you are using native math libraries in Spark ML or the Java-based F2J, use the Spark shell to load and print the implementation library of netlib-java. The following commands return information on the BLAS library and indicate that it is using F2J when the line “com.github.fommil.netlib.F2jBLAS” appears:
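The exact commands didn’t survive the excerpt, but the check is along these lines (a minimal sketch to run from spark-shell, assuming Spark’s MLlib and its netlib-java dependency are on the classpath):

import com.github.fommil.netlib.BLAS

// Print the BLAS implementation that netlib-java resolved at runtime.
// Without native libraries configured, this reports the Java fallback,
// com.github.fommil.netlib.F2jBLAS.
println(BLAS.getInstance().getClass().getName())

If the native proxies are wired up correctly, you should instead see a native implementation reported, such as com.github.fommil.netlib.NativeSystemBLAS.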

In the examples here, you can get about a 2x difference using the native math libraries versus without, so although that’s not an order of magnitude difference, it’s still nothing to sneeze at.

Kafka Cruise Control Frontend

Naresh Kumar Vudutha announces the Kafka Cruise Control Frontend:

For those that may be unfamiliar, Cruise Control features include:

1. Kafka broker resource utilization tracking
2. The ability to query the latest replica state (offline, URP, out of sync) from brokers
3. Goal-based resource distribution
4. Anomaly detection with self-healing
5. Admin operations on Kafka (add/remove/demote brokers, rebalance cluster, run PLE)

In this post, we will take a look at the frontend for Cruise Control, which provides a bird’s-eye view of all the Kafka installations and a single place to manage all of them.

That’s a lot of functionality in one tool.

Case Classes In Scala

Shubham Dangare explains what case classes are in Scala:

A case class is the Scala way to allow pattern matching on an object without requiring a large amount of boilerplate. All you need to do is add a single case keyword modifier to each class that you want to pattern match on. Using this modifier makes the Scala compiler add some syntactic conveniences to your class, along with a companion object (with an apply method).
It adds a factory method with the name of the class. This means that, for instance, you can write StringValue(“X”) to construct a StringValue object instead of using new StringValue(“X”).
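As a quick, minimal sketch of both conveniences (StringValue comes from the quote; the Value trait and IntValue are illustrative additions):

// `case` buys us a companion object with an apply factory, structural
// equality, and an extractor for pattern matching.
sealed trait Value
case class StringValue(s: String) extends Value
case class IntValue(i: Int) extends Value

// Factory method: no `new` required.
val v: Value = StringValue("X")

// Pattern matching works with no extra boilerplate.
val description = v match {
  case StringValue(s) => s"a string: $s"
  case IntValue(i)    => s"an int: $i"
}
// description == "a string: X"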

Given how useful case classes are in Spark, it’s good to know how they operate. For more background, Alessandro Lacava has a post from a few years back describing the topic well.
