Press "Enter" to skip to content

Curated SQL Posts

Marking Replication Transactions Complete

Andrea Allred spams the “burn it down” button:

Replication is not my favorite, it is kind of far from my favorite. No further than that. Little further.

When it breaks, it can cause havoc and it always seems to break at the worst time. Recently we noticed that our logfile was massive (like 3 times the size of the database) and that was making many of the other processes painful. We didn’t know how long the log hadn’t been clearing so we got to burn it all (kind of).

The first thing I did was tell replication that we were done with all the transactions that had been committed.
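For reference, the usual way to tell transactional replication that everything already committed has been delivered is sp_repldone with the reset flag, run against the published database (Andrea's post has the exact steps; this sketch is an illustration, not necessarily the exact command used there). A minimal Python/pyodbc version, with hypothetical server and database names:

```python
import pyodbc

# Hypothetical publisher connection -- adjust server, database, and authentication.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=PublisherServer;DATABASE=PublishedDb;Trusted_Connection=yes;",
    autocommit=True,
)

# Mark every transaction currently flagged for replication as already delivered,
# so the log reader stops pinning the log and it can finally clear.
conn.execute(
    "EXEC sp_repldone @xactid = NULL, @xact_seqno = NULL, "
    "@numtrans = 0, @time = 0, @reset = 1;"
)
conn.close()
```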

I’d say about 40-50% of the pain of replication is in how difficult it is to troubleshoot. Transactional replication is an order of magnitude easier than merge replication, too, especially on systems of non-trivial size and scale. The single most common question I get is “When will this row be replicated to the other side?” I can’t answer that with merge replication. The second-most common question is, “Why are things slower right now than before?” Can’t answer that either…


Filtering with DAX for Paginated Reports

Adam Aspin takes us through an important topic for paginated report developers:

In the previous article of this short series, you learned the fundamentals of creating datasets using DAX to populate paginated reports delivered using the Power BI Premium service. The next step is to appreciate the practicalities – and subtleties – of how data can be filtered using DAX for paginated report output.

As most, if not all, report developers come from an SQL background, it may seem overkill to devote an entire article to filtering data. However, DAX is very unlike SQL as far as filtering output data is concerned. Something as simple as classic OR logic needs to be handled differently from the techniques you may be used to – either as a SQL or as a Power BI developer. To ensure that you can deliver the report data that you need to populate paginated reports, take a detailed look at how to filter data in DAX datasets using the core SUMMARIZECOLUMNS() function.

Read the whole thing.


Use TOP instead of SET ROWCOUNT

Jared Poche explains why the TOP clause is superior to using SET ROWCOUNT:

I was presenting on how to use the TOP clause to break down large operations into short, fast, bite-sized operations. The mechanics are things I learned from writing processes that do garbage collection, backfill new columns, and anonymize PII data on existing tables. I’ve just posted the slides and example scripts here if you are interested.

ARE THEY THE SAME?

The question was whether the SET ROWCOUNT command would work just the same, and the answer is sometimes yes but largely no.

Read on to see what Jared means.
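As a rough illustration of the batching pattern Jared describes, here is a minimal Python/pyodbc sketch with a hypothetical table (dbo.AuditLog) and purge condition: each iteration deletes at most 5,000 rows, so every transaction stays short and the log gets a chance to truncate between batches.

```python
import pyodbc

# Hypothetical connection string -- the pattern, not the specifics, is the point.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=MyServer;DATABASE=MyDb;Trusted_Connection=yes;",
    autocommit=True,
)

while True:
    # TOP caps the batch size, keeping locks and log usage per transaction small.
    cursor = conn.execute(
        "DELETE TOP (5000) FROM dbo.AuditLog "
        "WHERE CreatedDate < DATEADD(YEAR, -1, GETDATE());"
    )
    if cursor.rowcount < 5000:  # last, partial batch: nothing left to delete
        break

conn.close()
```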


Cosmos DB Continuous Backup

Gauri Mahajan explains how continuous backup works in Cosmos DB:

To follow the exercise in this article, we would need an Azure Account with administrative privileges to operate the Azure Cosmos DB service. It is assumed that such an Azure Account and setup is already available and ready for use. The Continuous backup option is not available for all the editions of Cosmos DB and works only under certain configurations. The focus of this exercise is to understand the use-cases and options under which continuous backup would work in Cosmos DB.

There are four restrictions (as of the draft of this article) which should be kept in view when considering continuous backup.

Read on for those restrictions, and then to see how you can configure continuous backup.


Azure Synapse Analytics Updates

Saveen Reddy catalogs what’s new in Azure Synapse Analytics:

Quick Reuse of Spark clusters

By default, every data flow activity spins up a new Spark cluster based upon the Azure Integration Runtime (IR) configuration. Cold cluster start-up time takes a few minutes. If your pipelines contain multiple sequential data flows, you can enable a time-to-live (TTL) value, which keeps a cluster alive for a certain period of time after its execution completes. If a new job starts using the IR during the TTL duration, it will reuse the existing cluster and start up time will be greatly reduced.

Read on for the full list of updates.


Az Powershell Modules and Users

Rayis Imayev troubleshoots an Azure Data Factory deployment:

To run these scripts, you will need to have Azure PowerShell installed on your DevOps agents (Azure Pipeline Agents). If your pipeline agents are Microsoft-hosted, then you’re good: all maintenance and software updates are taken care of for you. However, when you implement and install self-hosted agents, installing and maintaining additional software and components is solely your responsibility.

Recently, while I was configuring those pre- and post-deployment scripts for my Azure Data Factory deployment, I received the following error message, “Could not find the modules: ‘Az.Accounts’ with Version: ””. 

Rayis does have a working solution, but I do recommend against installing modules in System32 because that directory is supposed to be reserved for Windows. Instead, for multi-user PowerShell modules, I’d drop them in %ProgramFiles%\WindowsPowerShell\Modules, following the general PowerShell guidance.


Log4j and the Microsoft Data Platform

Andreas Wolter has some guidance for us:

Microsoft published guidance regarding Log4j 2 vulnerability for customers using Azure Data services. Please find the latest information here:

Microsoft’s Response to CVE-2021-44228 Apache Log4j 2 – Microsoft Security Response Center

The published list shows affected products only. 

For SQL Server, even in the components which do use log4j, the version is old enough that it is not affected by this series of exploits, the buggy fixes, and the exploits of the bugs introduced while trying to fix the prior round of exploits. The big exceptions are Big Data Clusters and any copies of log4j you happened to install on your own.


Diving into Spark Streaming

Tomaz Kastrun continues a series on Spark and is well into a section on Spark Streaming. Part 17 looks at watermarks:

Streaming data is continuously ingested data with a particular frequency and latency. It is considered “big data” and has no discrete beginning or end.

The primary goal of any real-time stream processing system is to process the streaming data within a window frame (think of this as the frequency). Usually this frequency is “as soon as it arrives”. Latency, on the other hand, is about the stream processing model having the means to deal with all possible latencies (one second or one minute) while still providing an end-to-end low-latency system. If the frequency of data analysis sits on the user’s (destination) side, latency sits on the device’s (source) side.
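To make the watermark idea concrete, here is a minimal PySpark sketch (not from Tomaz’s post) using the built-in rate source as a stand-in for a real stream; the watermark tells Spark how late an event may arrive and still be counted, which bounds the state the engine has to keep.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# The built-in "rate" source emits (timestamp, value) rows -- a stand-in for Kafka, files, etc.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Accept events arriving up to 10 minutes late, then count them in 5-minute event-time windows.
counts = (
    events
    .withWatermark("timestamp", "10 minutes")
    .groupBy(window(col("timestamp"), "5 minutes"))
    .count()
)

query = counts.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```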

Part 18 enumerates the supported types of windows:

Tumbling windows are fixed-sized and static. They are non-overlapping, contiguous intervals, and every ingested record can be (in fact, must be) bound to a single window.

Sliding windows are also fixed-sized and static. Windows will overlap when the duration of the slide is smaller than the duration of the window, so ingested data can be bound to two or more windows.

Session windows are dynamic in length; the size of the window depends on the ingested data. A session starts with an input and expands if the next ingested record falls within the gap duration.
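A short PySpark sketch of the three window types, assuming a streaming DataFrame named events with an event-time column called timestamp (session_window requires Spark 3.2 or later):

```python
from pyspark.sql.functions import window, session_window, col

# Tumbling: fixed 10-minute buckets, non-overlapping; each event lands in exactly one.
tumbling = events.groupBy(window(col("timestamp"), "10 minutes")).count()

# Sliding: 10-minute windows starting every 5 minutes, so an event can land in two windows.
sliding = events.groupBy(window(col("timestamp"), "10 minutes", "5 minutes")).count()

# Session: the window keeps growing as long as new events arrive within a 5-minute gap.
sessions = events.groupBy(session_window(col("timestamp"), "5 minutes")).count()
```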

Part 19 includes good information on how data engineers can work with streams of data:

Streaming data can be used in conjunction with other datasets. You can join streaming data, join data with watermarking, deduplicate, output the data, apply foreach logic, use triggers, and create Stream API tables.

All of the functions are available in Python, Scala, and Java, though some are not available in R. We will be focusing on Python and R.
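To make a few of those operations concrete, here is a hedged PySpark sketch (not taken from the post), assuming a streaming DataFrame events with columns device_id, timestamp, and reading, plus a static lookup DataFrame devices with device_id and site:

```python
# Stream-to-static join: enrich each incoming event with reference data.
enriched = events.join(devices, on="device_id", how="left")

# Deduplication, bounded by a watermark so Spark can eventually drop old state.
deduped = (
    enriched
    .withWatermark("timestamp", "10 minutes")
    .dropDuplicates(["device_id", "timestamp"])
)

# foreachBatch applies arbitrary batch logic (here, a Parquet append) to each micro-batch.
def write_batch(batch_df, batch_id):
    batch_df.write.mode("append").parquet("/tmp/enriched_events")

query = deduped.writeStream.foreachBatch(write_batch).start()
query.awaitTermination()
```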

Check out all three of these posts.
