Press "Enter" to skip to content

Curated SQL Posts

Running Kafka on Windows (via WSL2)

Jim Galasyn shows how you can try out Apache Kafka on Windows:

Is Windows your favorite development environment? Do you want to run Apache Kafka® on Windows? Thanks to the Windows Subsystem for Linux 2 (WSL 2), now you can, and with fewer tears than in the past. Windows still isn’t the recommended platform for running Kafka with production workloads, but for trying out Kafka, it works just fine. Let’s take a look at how it’s done.

You can also get Kafka to run natively on Windows, though there are bugs around file handling, to the point where if you restart your machine while the Kafka service is running, data in partitions may become permanently inaccessible and force you to delete it before you can start Kafka again. So yeah, it’s better to use WSL or Docker containers for trying out Kafka on Windows machines.

Comments closed

Apache Flink 1.12.0 Released

Marta Paes and Aljoscha Krettek announce a new release of Apache Flink:

– The community has added support for efficient batch execution in the DataStream API. This is the next major milestone towards achieving a truly unified runtime for both batch and stream processing.

Kubernetes-based High Availability (HA) was implemented as an alternative to ZooKeeper for highly available production setups.

– The Kafka SQL connector has been extended to work in upsert mode, supported by the ability to handle connector metadata in SQL DDL. Temporal table joins can now also be fully expressed in SQL, no longer depending on the Table API.

– Support for the DataStream API in PyFlink expands its usage to more complex scenarios that require fine-grained control over state and time, and it’s now possible to deploy PyFlink jobs natively on Kubernetes.

Read on for more details on these as well as other changes.

Comments closed

Moving Away from the Lambda Architecture

Xiang Zhang and Jingyu Zhu talk about migrating a project away from the Lambda architecture:

The Lambda architecture has become a popular architectural style that promises both speed and accuracy in data processing by using a hybrid approach of both batch processing and stream processing methods. But it also has some drawbacks, such as complexity and additional development/operational overheads. One of our features for Premium members on LinkedIn, Who Viewed Your Profile (WVYP), relied on a Lambda architecture for some time. The backend system supporting this feature had gone through a few architectural iterations in the past years: it started as a Kafka client processing a single Kafka topic, and eventually evolved to a Lambda architecture with more complicated processing logic. However, in an effort to pursue faster product iteration and lower operational overheads, we recently underwent a transition to make it Lambda-less. In this blog post, we’ll share some of the lessons learned in operating this system in the Lambda architecture, the decisions made in transitioning to Lambda-less, and the shifts necessary to undergo this transition.

When Lambda was first proposed back in 2015, it was intended as a compromise architecture trying to solve several important problems with the tools available in 2015 (well, 2013 and 2014—it was in a book, after all). I could definitely see the architecture fall into disuse within the next decade, not because it was at all bad, but because the world around it changed to the point that there is a better compromise available.

Comments closed

Batch Mode with Window Functions and Parallelism

Erik Darling has a two-parter on how using batch mode processing when working with window functions can lead to better performance. Part 1 sets the stage:

If you ask people who tune queries why batch mode is often much more efficient with windowing functions, they’ll tell you about the window aggregate operator.

That’s all well and good, but there’s another, often sneaky limitation of fully row mode execution plans with windowing functions.

Let’s go take a look!

Part 2 identifies the culprit:

When queries go parallel, you want them to be fast. Sometimes they are, and it’s great.

Other times they’re slow, and you end up staring helplessly at a repartition streams operator.

Check out both of these posts.

Comments closed

Knowing a Server’s Limits

Pam Mooney shows how you can discover and document limits for a SQL Server instance:

Having taken steps to map your database applications to the databases and address your security and backups, you need to turn your attention to your server’s limits.

What do I mean by limits? Certainly, this is an allusion to how you will monitor your server drive capacity, but I also mean how you will track resource limits, such as latency, CPU, memory, and wait stats. Understanding all of these terms, what normal values are for your server, and what to do to help if the values are not normal, will help to keep your servers as healthy as possible.

These measures are big, in-depth topics in and of themselves. This will only serve to get you started. Links to more in-depth resources are included with each topic, and you will doubtless find others as you progress through your career.

Click through for the process.

Comments closed

EDW: Maintenance is Costlier than Development

Andy Leonard argues that with an Enterprise Data Warehouse, development is the less expensive side of the coin:

Crossing the threshold between “bad data” and “data that is too bad” is somewhat dependent on how the data is being used by the enterprise.

Don’t let that simplistic-sounding response trip you up. Please recognize two truths about the sentence above:

1. The sentence above has nothing to do with math.
2. The sentence above has everything to do with enterprise culture; specifically, enterprise data culture.

Read the whole thing.

Comments closed

Data Professional Salary Survey

Brent Ozar has another year of the Data Professional Salary Survey:

Take the Data Professional Salary Survey now.

The anonymous survey closes Friday, Jan 1, 2021. The results are completely open source, and shared with the community for your analysis. (You can analyze ’em now mid-flight, but I’d wait until the final results come in. I’ll combine them into a single spreadsheet with the past results.)

I’ve had fun analyzing it over the years. If you wouldn’t mind, please fill it out and add some more data points.

Comments closed

Setting up Azure Purview for Power BI

Soheil Bakhshi has a great step-by-step walkthrough for setting up Azure Purview:

Microsoft newly announced a piece of very exciting news that Azure Purview now supports Power BI. This is massive news from a data governance point of view. Azure Purview is the next generation of Azure Data Catalog with more metadata discovery power and the ability to use sensitivity labels. After reading the news, I immediately decided to set up my test environment and give it a go. I followed the steps mentioned in this article on the Microsoft documentation website but I faced some difficulties to get it to work. And here we are, another blog post to help you to set up the Azure Purview for Power BI.

Click through for a detailed walkthrough.

Comments closed

All About Dot Plots

Cole Nussbaumer Knaflic talks about one of my favorite plot types:

The term “dot plot” can be used for any graph that is encoding data in a dot or small circle. There are a few common types that I’ll focus on here. If you’ve ever asked yourself—What is a dot plot? How do I interpret a dot plot? When should I use a dot plot? or What are pros and cons of dot plots?—you’ll find the answers in this post. I’ll also share some tips on creating them and where to find examples that will inform and inspire.

Read the whole thing.

Comments closed