Press "Enter" to skip to content

Curated SQL Posts

Use TOP instead of SET ROWCOUNT

Jared Poche explains why the TOP clause is superior to using SET ROWCOUNT:

I was presenting on how to use the TOP clause to break down large operations into short, fast, bite-sized operations. The mechanics are things I learned from writing processes that do garbage collection, backfill new columns, and anonymize PII data on existing tables. I’ve just posted the slides and example scripts here if you are interested.

ARE THEY THE SAME?

The question was whether the SET ROWCOUNT command would work just the same, and the answer is sometimes yes but largely no.

Read on to see what Jared means.
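As a rough illustration of the batching pattern Jared describes (my sketch, not his scripts), the loop below uses Python and pyodbc against a hypothetical dbo.EventLog table, deleting one small batch per iteration via TOP so each transaction stays short and lock-friendly:

```python
import pyodbc

# Hypothetical connection string and table; adjust for your environment.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};SERVER=localhost;"
    "DATABASE=Sandbox;Trusted_Connection=yes;TrustServerCertificate=yes;"
)
conn.autocommit = True  # let each batch commit on its own

BATCH_SIZE = 5000
while True:
    cursor = conn.cursor()
    # TOP limits only this statement; SET ROWCOUNT would silently cap
    # every subsequent statement in the session until it is reset.
    cursor.execute(
        "DELETE TOP (?) FROM dbo.EventLog "
        "WHERE CreatedDate < DATEADD(DAY, -90, GETDATE());",
        BATCH_SIZE,
    )
    if cursor.rowcount < BATCH_SIZE:
        break  # a final partial batch means there is nothing left to delete
conn.close()
```

The same loop shape works for backfills or PII scrubbing: swap the DELETE for a TOP-limited UPDATE.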


Cosmos DB Continuous Backup

Gauri Mahajan explains how continuous backup works in Cosmos DB:

To follow the exercise in this article, we would need an Azure Account with administrative privileges to operate the Azure Cosmos DB service. It is assumed that such an Azure Account and setup is already available and ready for use. The Continuous backup option is not available for all the editions of Cosmos DB and works only under certain configurations. The focus of this exercise is to understand the use-cases and options under which continuous backup would work in Cosmos DB.

There are four restrictions (as of the draft of this article) which should be kept in view when considering continuous backup.

Read on for those restrictions and then how you can configure continuous backup to run.


Azure Synapse Analytics Updates

Saveen Reddy catalogs what’s new in Azure Synapse Analytics:

Quick Reuse of Spark clusters

By default, every data flow activity spins up a new Spark cluster based upon the Azure Integration Runtime (IR) configuration. Cold cluster start-up time takes a few minutes. If your pipelines contain multiple sequential data flows, you can enable a time-to-live (TTL) value, which keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR during the TTL duration, it will reuse the existing cluster and start up time will be greatly reduced.

Read on for the full list of updates.


Az PowerShell Modules and Users

Rayis Imayev troubleshoots an Azure Data Factory deployment:

To run these scripts, you will need to have Azure PowerShell installed on your DevOps agents (Azure Pipeline Agents). If your pipeline agents are Microsoft-hosted, then you’re good and all maintenance and software updates are taken care of for you. However, when you implement and install self-hosted agents, then additional software and component installation is solely your responsibility to maintain.

Recently, while I was configuring those pre- and post-deployment scripts for my Azure Data Factory deployment, I received the following error message, “Could not find the modules: ‘Az.Accounts’ with Version: ””. 

Rayis does have a working solution, but I do recommend against installing modules in System32 because that directory is supposed to be reserved for Windows. Instead, for multi-user PowerShell modules, I’d drop them in %ProgramFiles%\WindowsPowerShell\Modules, following the general PowerShell guidance.


Log4j and the Microsoft Data Platform

Andreas Wolter has some guidance for us:

Microsoft published guidance regarding the Log4j 2 vulnerability for customers using Azure Data services. Please find the latest information here:

Microsoft’s Response to CVE-2021-44228 Apache Log4j 2 – Microsoft Security Response Center

The published list shows affected products only. 

For SQL Server, even in the components which use log4j, the version is old enough that it is not affected by the series of exploits, bugs, and exploits of the bugs that were introduced to try to fix the prior round of exploits. The big exceptions are Big Data Clusters and any copy of log4j you happened to install on your own.


Diving into Spark Streaming

Tomaz Kastrun continues a series on Spark and is well into a section on Spark Streaming. Part 17 looks at watermarks:

Streaming data is continuously ingested data with a particular frequency and latency. It is considered “big data” and has no discrete beginning or end.

The primary goal of any real-time stream processing system is to process the streaming data within a window frame (think of this as the frequency). Usually this frequency is “as soon as it arrives”. Latency, on the other hand, means the streaming processing model has to deal with all the possible latencies (one second or one minute) and still provide an end-to-end low-latency system. If frequency of analysis sits on the user’s side (the destination), latency is considered on the device’s side (the source).
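To make the watermark idea concrete, here is a minimal PySpark sketch of my own (not Tomaz’s code) using the built-in rate source; the watermark tells the engine how much lateness to tolerate before it finalizes a window and throws away that window’s state:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# The built-in "rate" source continuously emits rows with a timestamp column.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Accept events up to 30 seconds late; anything older is dropped, which
# bounds the amount of state the engine has to keep around.
counts = (
    events
    .withWatermark("timestamp", "30 seconds")
    .groupBy(window(col("timestamp"), "1 minute"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```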

Part 18 enumerates the supported types of windows:

Tumbling windows are fixed-size and static. They are non-overlapping, contiguous intervals. Every ingested record can be (in fact, must be) bound to a single window.

Sliding windows are also fixed-size and static. Windows will overlap when the duration of the slide is smaller than the duration of the window. Ingested data can therefore be bound to two or more windows.

Session windows are dynamic in window length. The size depends on the ingested data. A session starts with an input and expands if the next ingested record falls within the gap duration.
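For reference, the three window types above map to a few lines of PySpark. This is a sketch of my own rather than Tomaz’s code (session_window needs Spark 3.2 or later), and each DataFrame would still need a writeStream sink to actually run:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, session_window, col

spark = SparkSession.builder.appName("window-types").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tumbling: fixed 10-minute buckets, non-overlapping; each row lands in exactly one.
tumbling = events.groupBy(window(col("timestamp"), "10 minutes")).count()

# Sliding: 10-minute windows starting every 5 minutes, so a row can land in two.
sliding = events.groupBy(window(col("timestamp"), "10 minutes", "5 minutes")).count()

# Session: the window stays open until a 5-minute gap with no new rows arrives.
sessions = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(session_window(col("timestamp"), "5 minutes"))
    .count()
)
```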

Part 19 includes good information on how data engineers can work with streams of data:

Streaming data can be used in conjunction with other datasets. You can join streaming data, join data with watermarking, deduplicate, output the data, apply foreach logic, use triggers, and create Stream API tables.

All of the functions are available in Python, Scala, and Java, and some are not available in R. We will be focusing on Python and R.
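As a rough sketch of a few of those pieces in PySpark (my example, not Tomaz’s), the query below deduplicates within the watermark horizon, hands each micro-batch to ordinary batch code via foreachBatch, and fires on a processing-time trigger; the output and checkpoint paths are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sinks").getOrCreate()
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Drop duplicate rows; the watermark bounds how long deduplication state is kept.
deduped = (
    events
    .withWatermark("timestamp", "10 minutes")
    .dropDuplicates(["value", "timestamp"])
)

def write_batch(batch_df, batch_id):
    # foreachBatch exposes each micro-batch to normal batch APIs.
    batch_df.write.mode("append").parquet("/tmp/stream_demo")

query = (
    deduped.writeStream
    .foreachBatch(write_batch)
    .trigger(processingTime="30 seconds")  # run one micro-batch every 30 seconds
    .option("checkpointLocation", "/tmp/stream_demo_chk")
    .start()
)
query.awaitTermination()
```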

Check out all three of these posts.


Simulating Slow Data Sources in Power BI

Chris Webb builds a particular kind of test:

As a postscript to my series on Power BI refresh timeouts (see part 1, part 2, and part 3) I thought it would be useful to document how I was able to simulate a slow data source in Power BI without using large data volumes or deliberately complex M code.

It’s relatively easy to create an M query that returns a table of data after a given delay. For example, this query returns a table with one column and one row after one hour and ten minutes:

Read on for a version of the function which slowly emits rows, as well as some T-SQL which does the same.
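Chris’s examples are in M and T-SQL; purely as an analogue of the same trick (and not his code), here is a Python sketch of a data source that waits before returning anything and then emits rows with a pause between each one:

```python
import time
from datetime import datetime

def slow_table(delay_seconds: float = 10.0):
    """Wait, then return a one-column, one-row 'table'."""
    time.sleep(delay_seconds)
    return [{"CalledAt": datetime.now()}]

def slow_rows(row_count: int = 5, seconds_per_row: float = 2.0):
    """Emit rows one at a time, pausing between each to simulate a slow source."""
    for i in range(row_count):
        time.sleep(seconds_per_row)
        yield {"RowNumber": i + 1, "EmittedAt": datetime.now()}

if __name__ == "__main__":
    print(slow_table(1.0))
    for row in slow_rows(3, 1.0):
        print(row)
```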


Power BI Desktop Hardening

Matthew Roche explains a bit of jargon:

What I am going to do here is talk a little about “desktop hardening,” which is a term I’ve heard hundreds of times in meetings and work conversations, but which I discovered is practically undefined and unmentioned on the public internet. The only place where I could find this term used in context was in this reply from Power BI PM Christian Wade… and if you don’t already know what “desktop hardening” is, seeing it used in context here isn’t going to help too much.

When I hear that term, I definitely do not think about APIs; instead, the first thing which comes to mind is security, which is a totally different story.
