Day: March 21, 2023

An Overview of Kafka Streams

The Instaclustr team explains how stream processing works in Kafka Streams:

Kafka Streams is a client library providing organizations with a particularly efficient framework for processing streaming data. It offers a streamlined method for creating applications and microservices that must process data in real-time to be effective. Using the Streams API within Apache Kafka, the solution fundamentally transforms input Kafka topics into output Kafka topics. The benefits are important: Kafka Streams pairs the ease of utilizing standard Java and Scala application code on the client end with the strength of Kafka’s robust server-side cluster architecture.

Read on for an overview of how it works. And if you haven’t already, check out the prior post on Kafka so that you can experience the same slight mental perturbations I did when reading about “real-time” responses.


Real-Time Data Streaming and Apache Kafka

Kai Waehner explains how Apache Kafka is not real-time:

Real-time data beats slow data. It is that easy! But what is real-time? The term always needs to be defined when discussing a use case. Apache Kafka is the de facto standard for real-time data streaming. Kafka is good enough for almost all real-time scenarios. But dedicated proprietary software is required for niche use cases. Kafka is NOT the right choice if you need microsecond latency! This article explores the architecture of NASDAQ that combines critical stock exchange trading with low-latency streaming analytics.

Kai uses the much more appropriate term “near real-time,” which I agree with. My mental example of “real-time” is software that you’d put on a fighter jet (which was an actual example in my undergrad days of a real-time operating system). If people potentially die because your software takes 4 milliseconds to do a job it needs to do in 100 microseconds, that’s real-time. For most of us, near real-time is certainly enough.

Actually, I’d go one step further: for most of us, not-really-real-time is fine. So many cases of “The users need this data in real time!” boil down to “The users really only look at this once a day and couldn’t act on faster information and some of our data sources only update once a day.” Swap ‘once a day’ with ‘once an hour’ or something like that and you have the large majority of projects which started out with “near real-time” requirements.


Databricks Power Tools in VS Code

Gerhard Brueckl has some tools for us:

As you probably know, we at paiqo have developed our Databricks extension for VSCode over the last years and are constantly adding new features and improving user experience. The most notable features are probably the execution of local notebooks against a Databricks cluster, a nice UI to manage clusters, jobs, secrets, repos, etc. and last but not least also a browser for your workspace and DBFS to sync files locally.

In February 2023 Databricks also published its own official VSCode extension which was definitely long awaited by a lot of customers (blog, extension). It allows you to run a local file on a Databricks cluster and display the results in VSCode again. Alternatively you can also run the code as a workflow. I am sure we can expect many more features in the near future and Databricks investing in local IDE support is already a great step forward!

As you can imagine, I am working very closely with the people at Databricks and we are happy to also announce the next major release of our Databricks VSCode extension 2.0 which now also integrates with the official Databricks extension! To avoid confusion between the two extensions we also renamed ours to Databricks Power Tools so from now on you will see two Databricks icons on the very left bar in VSCode.

Click through to read more in the announcement and some of the things which have changed as a result of version 2.0.


Tips for Scaling Cassandra Clusters

Mario Tavares wants more zoom:

When the use case aligns with the architectural limitations, Cassandra excels at storing and accessing datasets up to petabytes in volume, delivering impressive throughput. As the data or workload volume grows, we expand the cluster linearly, ensuring consistent performance.

However, even when we adhere to the documentation and best practices and create an effective data model, we might encounter underperforming nodes or unexpected challenges with throughput scaling after a cluster expansion—and it’s not always clear what causes the imbalance. Linear scalability relies on the assumption that workload and data are evenly distributed across all nodes in a cluster, and the cluster capacity relates directly to the number of nodes. Sometimes, these conditions aren’t met, affecting linear scalability. So, we strive for scalability and balance and are willing to fulfill the necessary conditions.

Read on for a few common performance issues and what you can do about them.


Finding SQL Server Columns with Defaults

Tom Collins sticks to the defaults:

Do you have a SQL query to check every SQL Server database column and identify if a default value is applied to the column?

Click through for a script which does just that. Tom’s query goes against system views and there’s a separate way to get those details from sys.default_constraints if you prefer to have a second option. If you’re on an older version of SQL Server where CONCAT_WS() doesn’t exist, concatenate it yourself.

-- One row per column that has a default constraint.
SELECT
	CONCAT_WS('.', QUOTENAME(OBJECT_SCHEMA_NAME(c.object_id)), QUOTENAME(OBJECT_NAME(c.object_id))) AS TableName,
	c.name AS ColumnName,
	dc.name AS DefaultConstraintName,
	dc.definition AS DefaultConstraintDefinition
FROM sys.default_constraints dc
	INNER JOIN sys.columns c
		ON dc.parent_object_id = c.object_id
		AND dc.parent_column_id = c.column_id;
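
CONCAT_WS() arrived in SQL Server 2017, so on anything older the table name has to be stitched together by hand. Here is a minimal sketch of that fallback using plain + concatenation. The c.name and dc.name expressions for the column and constraint names are my assumption about the most natural choice, so adjust them to taste.

```sql
-- Fallback for versions before SQL Server 2017 (no CONCAT_WS).
-- Note: + returns NULL if either side is NULL, whereas CONCAT_WS skips NULLs.
SELECT
	QUOTENAME(OBJECT_SCHEMA_NAME(c.object_id)) + N'.' + QUOTENAME(OBJECT_NAME(c.object_id)) AS TableName,
	c.name AS ColumnName,              -- assumed column-name expression
	dc.name AS DefaultConstraintName,  -- assumed constraint-name expression
	dc.definition AS DefaultConstraintDefinition
FROM sys.default_constraints dc
	INNER JOIN sys.columns c
		ON dc.parent_object_id = c.object_id
		AND dc.parent_column_id = c.column_id;
```

Schema and object names should not be NULL for objects in the current database, so the NULL behavior of + is unlikely to matter here, but it is the main difference to keep in mind.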

Best Practices Assessment for Azure Arc-Enabled SQL Server Instances

Ganapathi Varma Chekuri takes us through an assessment:

Best practices assessment provides a mechanism to evaluate the configuration of your SQL Server. Once the best practices assessment feature is enabled, your SQL Server instance and databases are scanned to provide recommendations for things like SQL Server and database configurations, index management, deprecated features, enabled or missing trace flags, statistics, etc. Assessment run time depends on your environment (number of databases, objects, and so on), with a duration from a few minutes, up to an hour.

If you’re familiar with the assessment on Azure VMs, this is quite similar, though it extends to on-premises machines or VMs running in other cloud providers. This does require installing the agent and paying for an Arc-Enabled SQL Server instance, so it’s not free.


Contrasting Azure IoT Hub and Event Hub

Brian Bønk lays out a quick comparison:

When working with Azure Data Explorer and loading data to the storage engine, you might have some streaming devices or services that should land in the engine.

Azure provides two out-of-the-box services:

  1. Azure IoT Hub
  2. Azure Event Hub

At first glance it seems like the two services are doing the exact same thing – sending events through to other services in Azure. But there are some differences.

Read on to see what these differences are.
