Day: December 22, 2023

The Data Streaming Landscape in 2024

Kai Waehner gives us an overview of where data streaming technologies are at:

The research company Forrester defines data streaming platforms as a new software category in a new Forrester Wave. Apache Kafka is the de facto standard used by over 100,000 organizations. Plenty of vendors offer Kafka platforms and cloud services. Many complementary open source stream processing frameworks like Apache Flink, and related cloud offerings, have emerged. And competitive technologies like Pulsar, Redpanda, or WarpStream try to gain market share by leveraging the Kafka protocol. This blog post explores the data streaming landscape of 2024 to summarize existing solutions and market trends. The end of the article gives an outlook on potential new entrants in 2025.

Kai is Kafka-centric, but this is a good overview of the industry and worth taking the time to read.

Time Travel in Delta Tables

Manish Mishra shows off some of the query capabilities of Delta tables:

Delta Time Travel is a feature provided by Delta Lake. It allows the user to switch to a previous version of the Delta table.

Some of the benefits of Delta Time Travel are:

  • Historical Data Analysis
  • Rollback to the previous version in case the quality of new data is not valid
  • Supports Schema Evolution

Click through for examples of each of these.
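
To make the idea concrete, here is a toy in-memory model of version-based reads and rollback. It is only a sketch of the concept: the `VersionedTable` class and its methods are hypothetical names invented for illustration, not the Delta Lake API, and it stores full snapshots rather than a real transaction log.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class VersionedTable:
    """Toy commit history (hypothetical; NOT the Delta Lake API)."""

    # Each commit stores the full set of rows visible at that version.
    commits: list[list[dict]] = field(default_factory=list)

    def commit(self, rows: list[dict]) -> int:
        """Record a new snapshot and return its version number."""
        self.commits.append([dict(r) for r in rows])
        return len(self.commits) - 1

    def read(self, version: int | None = None) -> list[dict]:
        """Read the latest snapshot, or an earlier one by version number."""
        if not self.commits:
            return []
        idx = len(self.commits) - 1 if version is None else version
        if not 0 <= idx < len(self.commits):
            raise IndexError(f"unknown version {version}")
        return [dict(r) for r in self.commits[idx]]

    def restore(self, version: int) -> int:
        """Roll back by committing an old snapshot as the newest version."""
        return self.commit(self.read(version))


table = VersionedTable()
v0 = table.commit([{"id": 1, "price": 10}])
# A bad load arrives: the second row fails a data quality check.
v1 = table.commit([{"id": 1, "price": 10}, {"id": 2, "price": -5}])
# Historical analysis: look at the table as it was at version 0.
rows_at_v0 = table.read(v0)
# Rollback: the old state becomes a new version, so history is preserved.
v2 = table.restore(v0)
```

Note that the rollback adds a version instead of deleting one, so the bad load at `v1` stays available for inspection.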

Notebook Concurrency in Microsoft Fabric

Ed Oldham takes us through a common problem:

If you are currently using Microsoft Fabric you will have some sort of capacity associated with your account. This will have a large impact on what you can run concurrently. If you are on a Fabric Trial, you will have access to a trial capacity, and if you are paying, you will be on a certain capacity tier based on how much you pay. The following diagram shows information about each level of capacity and the Trial. The Trial resembles F64 capacity but is apparently different in some important ways (more on that later).

Read on to learn more about capacity and what that means for concurrent notebooks and Spark jobs.

Table Results for DBCC PAGE

Andy Yun is pleased:

Am playing around with Always Encrypted for the first time. I was just following along the basic tutorial and encrypted some columns in my AutoDealershipDemo database. But then I decided to go crack open the data page using my friend DBCC PAGE.

Read on to see how you can get the results of DBCC PAGE into a table. My recollection is that there are some limits to what it can write into the table, but it’s pretty good on the whole.

Load Balancing across Azure SQL DBs

Jose Manuel Jurado Diaz scales out:

In today’s data-driven landscape, we are presented with numerous alternatives like Elastic Queries, Data Sync, Geo-Replication, ReadScale, etc., for distributing data across multiple databases. However, in this approach, I’d like to explore a slightly different path: creating two separate databases containing data from the years 2021 and 2022, respectively, and querying them simultaneously to fetch results. This method introduces a unique perspective in data distribution — partitioning by database, which could potentially lead to more efficient resource utilization and enhanced performance for each database. While partitioning within a single database is a common practice, this idea ventures into partitioning across databases.

Click through to see what the code looks like for this.
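
As a rough illustration of the partition-by-database idea, the sketch below fans the same aggregate out to two databases at once and merges the results. It is not the code from the linked post: SQLite files stand in for the two Azure SQL databases, and the `sales` table, the file names, and the helper functions are all assumptions made up for the example.

```python
import sqlite3
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_yearly_db(path: Path, year: int, amounts: list[float]) -> None:
    """Create one stand-in database holding a single year's rows."""
    conn = sqlite3.connect(path)
    try:
        conn.execute("CREATE TABLE sales (sale_year INTEGER, amount REAL)")
        conn.executemany(
            "INSERT INTO sales VALUES (?, ?)", [(year, a) for a in amounts]
        )
        conn.commit()
    finally:
        conn.close()


def total_for_db(path: Path) -> tuple[int, float]:
    """Run the aggregate against one database; each worker opens its own connection."""
    conn = sqlite3.connect(path)
    try:
        query = "SELECT sale_year, SUM(amount) FROM sales GROUP BY sale_year"
        return conn.execute(query).fetchone()
    finally:
        conn.close()


def query_all(paths: list[Path]) -> dict[int, float]:
    """Send the query to every database concurrently and merge the results."""
    with ThreadPoolExecutor(max_workers=len(paths)) as pool:
        return dict(pool.map(total_for_db, paths))


with tempfile.TemporaryDirectory() as tmp:
    db_2021 = Path(tmp) / "sales_2021.db"
    db_2022 = Path(tmp) / "sales_2022.db"
    build_yearly_db(db_2021, 2021, [100.0, 250.0])
    build_yearly_db(db_2022, 2022, [300.0, 50.0])
    totals = query_all([db_2021, db_2022])
    print(totals)
```

Each worker only touches its own year's database, which is the point of the approach: the work is spread across separate resources and the client stitches the answers together.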

Advent of Code Day 6

Kevin Wilkie continues the Advent of Code series. The first part builds a small tally table and a loop:

Today we’re going racing! Sadly, it’s so not F1 or NASCAR racing. Snail racing is more like it since we’re moving millimeters by the end, but at least we’re closer to getting snow back to the elves, so let’s go racing!

Given a few numbers that are times and current record distances, this actually doesn’t look too bad to work with. First, as always, we have to load our data into SQL Server. This time, I loaded all of it into one table.

The second part goes back to the big tally table:

Sadly, this does make our numbers rather large, so we’re back to using the big ole Tally table we created for Walking Through Advent of Code Day 5.

This time I made it a little simpler on myself and just removed all of the spaces myself and placed the data in variables (one for time and one for distance). I thought this was an excellent idea since only one number would come out of all of this work.
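
For readers who want the logic outside of T-SQL, here is a minimal Python sketch of both parts. It assumes the usual reading of the puzzle, where holding the button for `hold` milliseconds covers `hold * (time - hold)` millimeters. The function names and the input numbers are illustrative assumptions, not Kevin's code or data. The first function tries every hold time, much like joining to a tally table; the second narrows the range arithmetically so large inputs stay cheap.

```python
import math


def ways_to_win_bruteforce(time: int, record: int) -> int:
    """Try every hold time, tally-table style, and count the ones that beat the record."""
    return sum(1 for hold in range(time + 1) if hold * (time - hold) > record)


def ways_to_win_fast(time: int, record: int) -> int:
    """Find the smallest winning hold near the quadratic's lower root, then use symmetry."""
    disc = time * time - 4 * record
    if disc <= 0:
        return 0
    lo = (time - math.isqrt(disc)) // 2
    # Nudge the estimate until lo is the smallest hold that strictly beats the record.
    while lo > 0 and (lo - 1) * (time - lo + 1) > record:
        lo -= 1
    while lo <= time // 2 and lo * (time - lo) <= record:
        lo += 1
    if lo > time // 2:
        return 0
    # Winning holds are symmetric around time / 2: lo through time - lo.
    return (time - lo) - lo + 1


# Illustrative inputs (assumed for this example, not taken from the post).
times = [7, 15, 30]
records = [9, 40, 200]

# Part 1: multiply the number of winning options for each race.
part1 = math.prod(ways_to_win_bruteforce(t, r) for t, r in zip(times, records))

# Part 2: remove the spaces so the digits form one big race.
big_time = int("".join(str(t) for t in times))
big_record = int("".join(str(r) for r in records))
part2 = ways_to_win_fast(big_time, big_record)
```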

The Value of Indexing Foreign Key Columns

Etienne Lopes takes us through a scenario:

Let me start this post with a question, “Do you think that it can be beneficial to have a single column index for the foreign key column in the child table?”

Well, I believe I can hear three types of answers to this question:

  • Always!
  • Never!
  • It Depends…

Click through for Etienne’s answer. I’d still prefer these indexes to have multiple uses, which generally means having enough columns on the index to act as a covering index for one or more important queries. But Etienne does show a good use case for this single-column index.
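
To see the basic effect of such an index, here is a small sketch using SQLite as a stand-in engine. The `parent` and `child` tables and the index name are hypothetical, and the plans a different database engine produces will look different, so treat this as an illustration of the general idea and not a reproduction of Etienne's scenario.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE parent (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE child (
        id INTEGER PRIMARY KEY,
        parent_id INTEGER NOT NULL REFERENCES parent(id),
        note TEXT
    );
    """
)


def plan_for(sql: str) -> str:
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)


# The kind of lookup an engine needs when it checks for child rows of one parent.
lookup = "SELECT id, note FROM child WHERE parent_id = 42"

plan_without_index = plan_for(lookup)  # expected to be a full scan of child
conn.execute("CREATE INDEX ix_child_parent_id ON child (parent_id)")
plan_with_index = plan_for(lookup)  # expected to search via the new index

print(plan_without_index)
print(plan_with_index)
```

Because the query also asks for `note`, the single-column index still needs a lookup into the table, which is where the covering-index argument comes in.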

Data Types and Stored Procedures

Erik Darling plays the roles of both Goofus and Gallant here:

All sorts of bad things happen when you do this. You can’t index for this in any meaningful way, and comparing non-string data types (numbers, dates, etc.) with a double wildcard string means implicit conversion hell.

You don’t want to do this. Ever.

Unless you want to hire me.

Click through for good advice on the proper use of data types and input parameters.
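
The excerpt does not show the code in question, so the following is only a guess at the general pattern: a numeric column searched with a double-wildcard string versus a properly typed parameter. SQLite stands in for SQL Server here, and the `orders` table and both helper functions are hypothetical.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT)")
db.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(12, "a"), (112, "b"), (120, "c"), (34, "d")],
)


def find_order_typed(order_id: int) -> list[int]:
    """Pass the value as an integer and compare with equality."""
    rows = db.execute(
        "SELECT order_id FROM orders WHERE order_id = ? ORDER BY order_id",
        (order_id,),
    )
    return [r[0] for r in rows]


def find_order_wildcard(fragment: str) -> list[int]:
    """Anti-pattern: wildcards on both sides force the integer column to be compared as text."""
    rows = db.execute(
        "SELECT order_id FROM orders WHERE order_id LIKE ? ORDER BY order_id",
        (f"%{fragment}%",),
    )
    return [r[0] for r in rows]


typed_matches = find_order_typed(12)
wildcard_matches = find_order_wildcard("12")
print(typed_matches, wildcard_matches)
```

The typed version can seek straight to one key, while the wildcard version has to convert and inspect every row, and it also returns orders nobody asked for.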
