Press "Enter" to skip to content

Author: Kevin Feasel

RCSI and ID-Driven ETL

Michael J. Swart shares a warning:

Yesterday, Kendra Little talked a bit about Lost Updates under RCSI. It’s a minor issue that can pop up after turning on RCSI as the default behavior for the Read Committed isolation level. But she doesn’t want to dissuade you from considering the option and I agree with that advice.

In fact, even though we turned RCSI on years ago, by a bizarre coincidence, we only came across our first RCSI-related issue very recently. But it wasn’t update related. Instead, it has to do with an ETL process. To explain it better, consider this demo:

Michael has one example solution. I could also see a “windback” run, where, instead of starting at the very end of the line for ETL, you start a few hundred rows earlier. That way, you can pick up any stragglers. It would add some overhead to the ETL task, but given that ETL jobs should be idempotent, it shouldn’t affect the end results.

Comments closed

Auto-Failover Groups in Azure SQL DB

Etienne Lopes wraps up a series:

So, first of all, what is Auto-failover groups?

The auto-failover groups feature allows you to manage the replication and failover of databases to another Azure region. You can include of a group of databases or all user databases in a logical server to be replicated to another logical server. It is a declarative abstraction on top of the active geo-replication feature, designed to simplify deployment and management of geo-replicated databases at scale.

Read on to see some of the benefits of this, as well as how to enable it.

Comments closed

Lost Updates with RCSI

Kendra Little shares a warning:

There are two isolation levels in SQL Server that use optimistic locking for disk-based tables:

  1. Read Committed Snapshot Isolation (RCSI), which changes the implementation of the default Read Committed Isolation level and enables statement-based consistency.
  2. Snapshot Isolation, which provides high consistency for transactions (which often contain multiple statements). Snapshot Isolation also provides support for identifying update conflicts.

Many folks get pretty nervous about RCSI when they learn that certain timing effects can happen with data modifications that don’t happen under Read Committed. The irony is that RCSI does solve many OTHER timing risks in Read Committed, and overall is more consistent, so sticking with the pessimistic implementation of Read Committed is not a great solution, either.

I don’t recall getting any kinds of update errors with RCSI and I’ve used it in some pretty heavy workloads.

Comments closed

An Intro to Databricks Asset Bundles

Dustin Vannoy covers one technique for CI/CD in Databricks:

Databricks Asset Bundles provides a way to version and deploy Databricks assets – notebooks, workflows, Delta Live Tables pipelines, etc. This is a great option to let data teams setup CI/CD (Continuous Integration / Continuous Deployment). Some of the common approaches in the past have been Terraform, REST API, Databricks command line interface (CLI), or dbx. You can watch this video to hear why I think Databricks Asset Bundles is a good choice for many teams and see a demo of using it from your local environment or in your CI/CD pipeline.

Click through for a video and some sample scripts.

Comments closed

How Kafka Consumers Keep Track of Position

The Big Data in Real World team explains:

Let’s say you have a consumer group which has 3 consumers at the moment consuming messages from a topic. Assume that you had to shut down all 3 consumers in the consumer group for some reason. Now when you restart the consumers in the consumer group, how does the consumers know from which offset they should read from the topic to avoid reading the same messages all over again which were already read before the consumers went down?

Read on for the answer.

Comments closed

Row-Level Security Attack Mitigations

Ben Johnston wraps up a series on row-level security:

As seen in the previous section, there are several ways bad actors can attempt to bypass RLS. Attacks range from removing RLS, getting data from other systems or straight brute-force methods using side-channel attacks. Mechanisms exist for each potential attack that allow you to avoid the attack or monitor for the attack when avoidance isn’t possible. This section covers those mitigations.

RLS can be bypassed or attacked using several broad categories. These include direct attacks, where RLS is modified in a malicious fashion or disabled, indirect attacks where information can be gathered without modifying the underlying RLS, and side-channel attacks that use specially crafted queries to derive data from RLS protected tables. Refer to the previous section of this series, RLS Attacks, for a full explanation of each attack type.

I’ve enjoyed this series and the end is no different.

Comments closed

Visualizing Kusto Graphs with Plotly and Python

Henning Rauch creates some plots:

Graphs are a powerful way to model and analyse complex relationships between entities, such as cybersecurity incidents, network traffic, social networks, and more. Kusto, the query and analytics engine of Azure Data ExplorerMicrosoft Fabric Real-Time Analytics and many more recently introduced a new feature that enables users to contextualize their data using graphs. In this blog post, we will show you how to use graph semantics to create and explore graph data in Kusto, and how to visualize it using Plotly, a popular library for interactive data visualization in Python.

Graph semantics are a set of operators that allow users to work with graph data in Kusto, without the need to use a separate graph database or framework.

Click through for the KQL you’ll need, as well as how to display that in Plotly.

Comments closed

Always Use Explicit Measures in Power BI

Ed Hansberry has some advice for us:

All of your analytics in Power BI are done with measures, and there are two ways to create measures. It is important to understand the difference between the two methods. If you drag a field from the data pane directly into a visual, that is an implicit measure. The model usually wraps it with SUM(), but other aggregations are possible, like COUNT() or AVERAGE().

Read on to see what’s wrong with using implicit measures, as well as how to make these explicit.

Comments closed

Drawing Horizontal Box Plots in R

Steven Sanderson is not limited to one axis:

Boxplots are a great way to visualize the distribution of a numerical variable. They show the median, quartiles, and outliers of the data, and can be used to compare the distributions of multiple groups.

Horizontal boxplots are a variant of the traditional boxplot, where the x-axis is horizontal and the y-axis is vertical. This can be useful for visualizing data where the x-axis variable is categorical, such as species or treatment group.

Click through for an example using base R and ggplot2.

Comments closed