Curated SQL Posts

Updates to Change Data Capture in ADF

Chen Hirsh looks at some updates:

A few months ago, I wrote a post about the new change data capture (CDC) feature in Azure Data Factory (ADF) – https://www.madeiradata.com/post/the-wind-of-change-change-data-capture-in-data-factory

Change data capture, as the name suggests, captures the data changes on one system and replicates them to another. Since this is a task that data engineers perform a lot, this was a very welcome addition to ADF.

In this post, we’ll explore what is new on this front.

Click through for what’s new, though do be cognizant of which items are in GA and which are still in preview.

Finding a Particular Query Plan in Query Store’s UI

Andrea Allred does a search:

I have this problem where I want to see how a newly released query is performing, but it may not be bad enough to make any of the canned reports that SQL Server provides in Query Store. I was able to look up the plan handle, but always struggled to get to the query ID for Query Store, until now.

Click through for a query to retrieve the query ID and then how to find data on that particular query. I’d also recommend QDSToolbox for more detailed query analysis.
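
If you'd like a head start before clicking through, here is a minimal sketch of the general technique (my own reconstruction, not Andrea's exact query): bridge from the plan handle to Query Store by joining on the query hash, which both sys.dm_exec_query_stats and sys.query_store_query expose.

```sql
-- A minimal sketch: map a plan handle to a Query Store query_id
-- by matching on query_hash. Set @plan_handle to the handle you found.
DECLARE @plan_handle varbinary(64);

SELECT qsq.query_id,
       qsp.plan_id,
       qst.query_sql_text
FROM sys.dm_exec_query_stats AS deqs
    INNER JOIN sys.query_store_query AS qsq
        ON qsq.query_hash = deqs.query_hash
    INNER JOIN sys.query_store_plan AS qsp
        ON qsp.query_id = qsq.query_id
    INNER JOIN sys.query_store_query_text AS qst
        ON qst.query_text_id = qsq.query_text_id
WHERE deqs.plan_handle = @plan_handle;
```

With the query_id in hand, the Tracked Queries report in the Query Store UI will show the plans and runtime stats for that specific query.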

Contrasting Spark and Flink for Streaming Use Cases

Deepthi Mohan and Karthi Thyagarajan contrast two products:

Apache Flink and Apache Spark are both open-source, distributed data processing frameworks used widely for big data processing and analytics. Spark is known for its ease of use, high-level APIs, and the ability to process large amounts of data. Flink shines in its ability to process data streams in real time and perform low-latency stateful computations. Both support a variety of programming languages, scalable solutions for handling large amounts of data, and a wide range of connectors. Historically, Spark started out as a batch-first framework and Flink began as a streaming-first framework.

In this post, we share a comparative study of streaming patterns that are commonly used to build stream processing applications, how they can be solved using Spark (primarily Spark Structured Streaming) and Flink, and the minor variations in their approach. Examples cover code snippets in Python and SQL for both frameworks across three major themes: data preparation, data processing, and data enrichment. If you are a Spark user looking to solve your stream processing use cases using Flink, this post is for you. We do not intend to cover the choice of technology between Spark and Flink because it’s important to evaluate both frameworks for your specific workload and how the choice fits in your architecture; rather, this post highlights key differences for use cases that both these technologies are commonly considered for.

Read on for an analysis of the two products.
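
To give a taste of those "minor variations," here is a sketch of my own (not from the post) showing the same tumbling-window count in each framework's SQL dialect, assuming a streaming source registered as clicks with an event_time column:

```sql
-- Spark Structured Streaming (Spark SQL): the window is a grouping function.
SELECT window(event_time, '5 minutes') AS w,
       COUNT(*) AS events
FROM clicks
GROUP BY window(event_time, '5 minutes');

-- Flink SQL: the window is a table-valued function you select from.
SELECT window_start,
       window_end,
       COUNT(*) AS events
FROM TABLE(TUMBLE(TABLE clicks, DESCRIPTOR(event_time), INTERVAL '5' MINUTES))
GROUP BY window_start, window_end;
```

Same outcome, but Spark models the window as a function you group by while Flink models it as a relation you read from, which is a good example of the differing mental models the post walks through.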

Tips for Limiting Redis Failures

Phil Booth provides the ammo and we provide the feet:

Production outages are great at teaching you how not to cause production outages. I’ve caused plenty and hope that by sharing them publicly, it might help some people bypass part one of the production outage learning syllabus. Previously I discussed ways I’ve broken prod with PostgreSQL and with healthchecks. Now I’ll show you how I’ve done it with Redis too.

For the record, I absolutely love Redis. It works brilliantly if you use it correctly. The gotchas that follow were all occasions when I didn’t use it correctly.

My one addition here is to be really careful if you use Redis as persistent storage rather than as a cache. Redis as a cache is easy: if the server goes down or you have trouble, you simply have more database calls than normal. Redis as persistent storage is a much more complicated beast, one that seems to fall over a lot more often and is significantly more finicky about drivers.
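
If you do run Redis as a store of record, remember that durability is opt-in configuration. A minimal redis.conf sketch (illustrative values, not a recommendation for any particular workload):

```
# Enable the append-only file so writes survive a restart.
appendonly yes
# fsync once per second: a common durability/latency trade-off.
appendfsync everysec
# Keep RDB snapshots as a secondary safety net:
# snapshot every 900 seconds if at least 1 key changed.
save 900 1
```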

Mitigating Dynamic Data Masking Side-Channel Attacks

Ben Johnston wraps up a series on dynamic data masking:

This is the fifth and final part of this series on SQL Server Dynamic Data Masking. The first part in the series was a brief introduction to dynamic data masking, competing solutions, and use cases. The second part covered setting up masking and some examples. The third and fourth parts explored side-channel attacks against dynamic data masking.

This final part covers mitigations for side-channel attacks, additional architectural considerations, and an analysis of the overall solution.

Throughout the entire series, Ben has done a good job of laying out exactly what dynamic data masking is good for—and what it isn't good for. I tend to harp a lot on the latter, but Ben keeps a reasonable approach from start to finish.
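
For anyone who hasn't seen dynamic data masking before, the setup itself is just column-level DDL. A minimal illustration (mine, not from Ben's series; the table and user names are hypothetical):

```sql
-- Mask an email column; non-privileged users see an obfuscated value.
ALTER TABLE dbo.Customers
    ALTER COLUMN Email ADD MASKED WITH (FUNCTION = 'email()');

-- Masking is not encryption: anyone granted UNMASK sees the real values.
GRANT UNMASK TO ReportingUser;
```

The side-channel problem Ben describes stems from the fact that masked columns are still filterable and joinable, so a user can infer the underlying values with WHERE clauses without ever seeing them directly.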

Trying Fabric Data Wrangler

Reza Rad looks at a new tool:

There is a tool (or you can consider it an editor) in Fabric for data scientists. As a data scientist, you must work with the data: clean it, group it, aggregate it, and do other data preparation work. This might be needed to understand the data or be part of the process you follow to prepare the data and load it into a table for further analysis. Data Wrangler is a tool that gives you that ability. You can use it to transform and prepare data, and even generate Python code to make this process part of a bigger data analytics project.

Data Wrangler has a simple-to-use graphical user interface that makes the job of a data scientist easier.

Read on for a video as well as a demo in written format.

MAXDOP by Username in Azure SQL DB

Jose Manuel Jurado Diaz comes up with a solution:

Azure SQL Database is a powerful platform that provides managed database services with built-in intelligence and robust resource management. While Azure SQL Database doesn’t have a direct implementation of the traditional Resource Governor feature available in SQL Server, we can explore a pseudo-Resource Governor approach using user-defined functions and custom tables. In this article, we’ll discuss the concept, present a sample implementation using a custom function, and highlight the possibilities it opens up for controlling CPU resources in Azure SQL Database.

Click through for the UDF and how to use it. My first inclination was to say that I couldn't see it working well at all under load, though on second thought, performance won't be as bad as having a UDF execute for each row in a table, so it's probably a manageable overhead.
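
For a sense of the shape of the idea, here is my reconstruction (object names are hypothetical; Jose's actual implementation is in the post). One wrinkle: OPTION (MAXDOP n) requires a literal, so the looked-up value has to be spliced in via dynamic SQL:

```sql
-- A lookup table mapping logins to a MAXDOP cap.
CREATE TABLE dbo.UserMaxdop
(
    UserName sysname NOT NULL PRIMARY KEY,
    Maxdop   tinyint NOT NULL
);
GO

-- Return the cap for the current login; 0 means let the engine decide.
CREATE FUNCTION dbo.GetUserMaxdop()
RETURNS tinyint
AS
BEGIN
    RETURN ISNULL(
        (SELECT Maxdop FROM dbo.UserMaxdop WHERE UserName = ORIGINAL_LOGIN()),
        0);
END;
GO

-- OPTION (MAXDOP ...) takes only a constant, so build the statement dynamically.
DECLARE @sql nvarchar(max) =
    N'SELECT COUNT(*) FROM dbo.SomeLargeTable OPTION (MAXDOP '
    + CAST(dbo.GetUserMaxdop() AS nvarchar(3)) + N');';
EXEC sys.sp_executesql @sql;
```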

Tracking Power BI Import Throughput Variance

Chris Webb continues a series on using Log Analytics with Power BI:

In the second post in this series I discussed a KQL query that can be used to analyse Power BI refresh throughput at the partition level. However, if you remember back to the first post in this series, it's actually possible to get much more detailed information on throughput by looking at the ProgressReportCurrent event, which fires once for every 10,000 rows read during partition refresh.

Here’s yet another mammoth KQL query that you can use to analyse the ProgressReportCurrent event data:

Click through for the KQL query, an explanation of how it works, and some practical examples.

Cumulative Means in R

Steven Sanderson performs a moving average:

The cumulative mean, also known as the running mean or moving average, provides us with a dynamic view of how the average value of a dataset changes as new observations are added incrementally. It is an invaluable tool in time-series analysis, trend identification, and smoothing noisy data.

Imagine you have a series of numeric values, and you want to find the average of the first observation, then the average of the first two observations, followed by the average of the first three, and so on. This iterative process generates the cumulative mean, painting a picture of how the data behaves over time.

Oftentimes, we care about the moving average over a specific window, such as the last n periods. This particular post covers the cumulative case: the average over all of the data up to each point.
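
The post's examples are in R, but since this is Curated SQL, here is the same expanding-window idea expressed as T-SQL (my own analogue, not from the post):

```sql
-- Cumulative mean: row n averages the first n observations.
WITH Obs AS
(
    SELECT *
    FROM (VALUES (1, 10.0), (2, 20.0), (3, 60.0), (4, 10.0)) AS v(id, val)
)
SELECT id,
       val,
       AVG(val) OVER (ORDER BY id
                      ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS cumulative_mean
FROM Obs;
-- Returns 10.0, 15.0, 30.0, 25.0: each row averages everything seen so far.
```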

Full-Text Search in Cosmos DB via Cognitive Services

Hasan Savran performs a search:

Incorporating Full-Text Search functionality into your application can enable users to locate what they are searching for effortlessly. Searching for specific words or phrases within a database has always been a difficulty, particularly for relational databases. Throughout my career, I’ve had countless discussions/arguments with DBAs about the importance of implementing full-text search in a relational database. We are in totally different times, now users want to search by voice, image, or video.

Full-Text Search functionality is not part of Azure Cosmos DB's Database Engine. First, we must set up the Azure Cognitive Search service and link the data from Azure Cosmos DB to the Search Service. The process of setting up Azure Cognitive Search is relatively straightforward. Like other Azure services, you will need to answer similar types of questions beforehand (subscription, resource group, a name for the service, region, and tier).

By the way, Azure Cognitive Search is very similar to Elasticsearch, for those of you familiar with that technology.
