Press "Enter" to skip to content

Curated SQL Posts

Limitations with Control Flows in Azure Data Factory

Meagan Longoria has a list:

If you’ve been using Azure Data Factory for a while, you might have hit some limitations that don’t exist in tools like SSIS or Databricks. Knowing these limitations up front can help you design better pipelines, so I’m listing a few here of which you’ll want to be aware.

1. You cannot nest For Each activities.
Within a pipeline, you cannot place a For Each activity inside of another For Each activity. If you need to iterate through two datasets, you have two main options. You can combine the two datasets before you iterate over them. Or you can use a parent/child pipeline design where you move the inner For Each activity into the child pipeline. Fun fact: currently the Data Factory UI won’t stop you from nesting For Each activities. You won’t find out until you try to execute the pipeline.

Click through for several other limitations and workarounds.
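If you want to see the parent/child shape in practice, here is a hedged sketch built with the azure-mgmt-datafactory Python SDK. The subscription, resource group, factory, and pipeline names are all placeholders, and ChildPipeline (holding the inner For Each) is assumed to exist already:

```python
# Hedged sketch: the outer For Each fans out to an Execute Pipeline
# activity; the inner For Each lives in the child pipeline instead.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    ExecutePipelineActivity, Expression, ForEachActivity,
    ParameterSpecification, PipelineReference, PipelineResource,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

parent = PipelineResource(
    parameters={"outerItems": ParameterSpecification(type="Array")},
    activities=[
        ForEachActivity(
            name="OuterLoop",
            items=Expression(value="@pipeline().parameters.outerItems"),
            activities=[
                ExecutePipelineActivity(
                    name="RunChild",
                    pipeline=PipelineReference(reference_name="ChildPipeline"),
                    parameters={"outerItem": "@item()"},  # pass the loop item down
                )
            ],
        )
    ],
)

client.pipelines.create_or_update(
    "<resource-group>", "<factory-name>", "ParentPipeline", parent
)
```

The same shape applies if you author in the UI; the Execute Pipeline activity is the only thing the outer loop needs to contain.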

An Introduction to Azure Purview Studio

Rahul Mehta takes a look at Azure Purview’s current user interface:

Metrics in Azure Purview are integrated and reported using Azure Monitor. Click on the Metrics section and it would have a link to Azure Monitor, which would open in a new tab as shown below. From this interface, we can filter and split different data metrics, plot the same on different types of graphs, as well as build custom dashboards to visualize the metrics to evaluate performance. The available metrics depend on the features enabled in the Azure Purview account. Metrics may have different levels of granularity and this may lead to a significant amount of data. The data generated can be aggregated in these dashboards using the available aggregation functions.

Click through for a quick walkthrough.
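If you would rather pull those metrics programmatically than eyeball them in the portal, something like this sketch with the azure-monitor-query Python package should get you started; the resource ID is a placeholder and the metric name is an assumption, so check what your account actually emits:

```python
# Hedged sketch: query a week of Purview metrics through Azure Monitor.
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricAggregationType, MetricsQueryClient

client = MetricsQueryClient(DefaultAzureCredential())

resource_id = (  # placeholder resource ID for the Purview account
    "/subscriptions/<sub-id>/resourceGroups/<rg>"
    "/providers/Microsoft.Purview/accounts/<account-name>"
)

response = client.query_resource(
    resource_id,
    metric_names=["ScanCompleted"],  # assumed metric name; verify in the portal
    timespan=timedelta(days=7),
    aggregations=[MetricAggregationType.TOTAL],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(point.timestamp, point.total)
```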

Last Observation Carried Forward in R

Nathan Eastwood shows how to perform Last Observation Carried Forward in R:

Real life data is often riddled with missing values – or NAs – where no data value is stored for the variable in an observation. Missing data such as this can have a significant effect on the conclusions which can be drawn from the data. For example, individuals may drop out of a study, or subjects may not properly report responses. A common solution to this problem is to fill those NA values with the most recent non-NA value prior to it – this is called the Last Observation Carried Forward (LOCF) method.

This blog post will look at how to implement this particular solution using a combination of {dplyr}, {dbplyr} and {sparklyr}.

Click through for the solution.
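Nathan’s post builds the solution with {dplyr}, {dbplyr}, and {sparklyr}; just to make the LOCF idea itself concrete, here is the same operation sketched in pandas (not the author’s approach), on a made-up dataset:

```python
# Minimal LOCF illustration in pandas: fill each NA with the most recent
# non-NA value, carried forward within each patient's ordered visits.
import pandas as pd

df = pd.DataFrame({
    "patient": ["a", "a", "a", "b", "b", "b"],
    "visit":   [1, 2, 3, 1, 2, 3],
    "weight":  [70.1, None, None, 82.4, 81.9, None],
})

# Sort first so "last observation" is well defined, then forward-fill
# within each group so one patient's values never leak into another's.
df = df.sort_values(["patient", "visit"])
df["weight_locf"] = df.groupby("patient")["weight"].ffill()
print(df)
```

Patient a’s missing visits pick up 70.1 and patient b’s final visit picks up 81.9, while any leading NAs would stay NA, which is the usual LOCF caveat.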

Generating Random Numbers in R

Holger von Jouanne-Diedrich brings the noise:

In data science, we try to find, sometimes well-hidden, patterns (= signal) in often seemingly random data (= noise). Pseudo-Random Number Generators (PRNG) try to do the opposite: hiding a deterministic data generating process (= signal) by making it look like randomness (= noise). If you want to understand some basics behind the scenes of this fascinating topic, read on!

Click through for an explanation of the process.
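To see the trick from the generator’s side, here is a classic linear congruential generator (LCG) in a few lines of Python. The constants are the well-known Numerical Recipes values, not anything taken from Holger’s post:

```python
# A linear congruential generator: a completely deterministic recurrence
# (the "signal") whose scaled output looks like uniform noise.
def lcg(seed: int, n: int, m: int = 2**32, a: int = 1664525, c: int = 1013904223):
    """Yield n pseudo-random floats in [0, 1)."""
    state = seed
    for _ in range(n):
        state = (a * state + c) % m  # the hidden deterministic step
        yield state / m              # scale to [0, 1) so it looks "random"

print(list(lcg(seed=42, n=5)))  # same seed, same "noise", every time
```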

Managing tempdb Growth

Monica Rathbun shares part two in a series:

As you learned in part one of this Mastering TempDB series, TempDB is a global resource used for many operations within SQL Server. You create and allocate temporary user objects such as temporary tables and table variables, rebuild indexes with SORT_IN_TEMPDB=ON, and use it for version stores (RCSI), internal objects (worktables, spools, group by, order by), and even DBCC CHECKDB, just to name a few. All these operations require space to be allocated in the TempDB database. At times, these operations can result in TempDB growing rapidly, which, in turn, can fill up the file system and cause failures. In this article, you will learn how to fix an overgrown TempDB file that has resulted in it running out of space.

It beats my answer, which is to rip the server out of the rack and chuck it in the ocean.
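Before you charter the boat, it helps to know which consumer is actually eating TempDB. Here is a quick diagnostic sketch over sys.dm_db_file_space_usage, run from Python via pyodbc; the connection string is a placeholder:

```python
# Hedged sketch: break down TempDB reservations by consumer (user objects,
# internal objects, version store) before deciding whether to shrink.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>;DATABASE=tempdb;UID=<user>;PWD=<password>;"
)

query = """
SELECT SUM(user_object_reserved_page_count)     * 8 AS user_objects_kb,
       SUM(internal_object_reserved_page_count) * 8 AS internal_objects_kb,
       SUM(version_store_reserved_page_count)   * 8 AS version_store_kb,
       SUM(unallocated_extent_page_count)       * 8 AS free_kb
FROM tempdb.sys.dm_db_file_space_usage;
"""

for row in conn.cursor().execute(query):
    print(row)
```

A ballooning version store points at a long-running transaction under RCSI, which no amount of shrinking will fix until it commits.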

Drift Monitoring with Azure Machine Learning

I take a look at dataset drift monitoring in Azure Machine Learning:

One of the things I like to say about machine learning models is, “shift happens.” By that, I mean that models lose effectiveness over time due to changes in underlying circumstances. Relationships between variables that used to hold no longer do, and so our model quality degrades. This means that we sometimes need to retrain models.

But there’s a cost to retraining models—that work can be computationally expensive and time-consuming. This concern is particularly salient if you’re in a cloud, as you pay directly for everything there. This means that we don’t want to retrain models unless we need to. But when do we know if we should retrain the model? We can watch for model degradation, but there’s another method: drift detection in your datasets.

Read on for a demonstration of how the product works and a couple of things to keep in mind.
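For a flavor of the setup, here is a minimal sketch with the azureml-datadrift package; the workspace config, dataset names, compute target, and thresholds are all placeholders rather than recommendations:

```python
# Hedged sketch: wire a baseline dataset (what the model trained on) to a
# target dataset (what it scores now) and alert when drift exceeds a threshold.
from azureml.core import Workspace, Dataset
from azureml.datadrift import AlertConfiguration, DataDriftDetector

ws = Workspace.from_config()
baseline = Dataset.get_by_name(ws, "training-data")  # placeholder dataset name
target = Dataset.get_by_name(ws, "scoring-data")     # placeholder dataset name

monitor = DataDriftDetector.create_from_datasets(
    ws, "sales-model-drift", baseline, target,
    compute_target="cpu-cluster",  # placeholder compute name
    frequency="Week",
    drift_threshold=0.3,           # alert above this drift magnitude
    alert_config=AlertConfiguration(email_addresses=["dba@example.com"]),
)
monitor.enable_schedule()          # run on the schedule instead of ad hoc
```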

Bug around Parallel Eager Spools and Batch Mode

Paul White digs into a nasty bug:

A more accurate description of the issue would be:

This bug can cause wrong results, incorrect error messages, and statement failure when:

– A data-modification statement requires Halloween Protection.
– That protection is provided by a Parallel Eager Spool.
– The spool is on the probe side of a Batch Mode Hash Join.

This issue affects Azure SQL Database and SQL Server 2014 to 2019 inclusive.

Read on for a repro and Paul’s thoughts. As of March 2021, this is an active problem, so it’s worth keeping an eye on in your environment. I’d wager, though, that this probably doesn’t pop up on its own very frequently.

Granular Permissions for Dynamic Data Masking

John Martin reviews a change:

All the way back with SQL Server 2016, Microsoft released the Dynamic Data Masking feature in the database engine. It seemed like a huge step forward and promised so much, but there were severe limitations around the way that we could control who sees what masked data. It was a case of you either got to see masked data wherever it was configured, or you saw clear data; there was no granularity. I wrote about this and a few other things to do with Dynamic Data Masking all the way back in August of 2016 when I was at SentryOne. You can check that post out here. Also, back then I created several Connect items (blast from the past there), one of which was pulled over to the UserVoice replacement, where I was asking for the UNMASK securable to be made more granular. You can check that out here.

So, why am I writing this post? Well, it seems that our (my?) request has been granted, at least in Azure SQL Database. On March the 17th this year, a little announcement slipped out stating “General availability: Dynamic data masking granular permissions for Azure SQL and Azure Synapse Analytics”. So, has this delivered on what we wanted, to really help this feature live up to its promise?

Read on to see how it works and what John thinks of the whole thing.
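The headline change is that UNMASK can now be granted at the database, schema, table, or individual column level. As a hedged illustration (object and principal names are invented), a column-level grant looks like this, executed from Python via pyodbc:

```python
# Hedged sketch: grant UNMASK on one column only, so this principal sees
# real email addresses while every other masked column stays masked.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<azure-sql-server>;DATABASE=<db>;UID=<admin>;PWD=<password>;"
)
cur = conn.cursor()

# Column-scoped UNMASK, new with the granular permissions announcement.
cur.execute("GRANT UNMASK ON dbo.Customers(Email) TO [SupportUser];")
conn.commit()
```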

Tagging Azure Resources by Policy

Jess Pomfret makes tagging resources in Azure better:

Last week, in Part 1, we talked about how to easily keep track of our resources with tags. There are many strategies for tagging your resources, but I specifically focused on adding a ‘dateCreated’ tag so we could see when resources were created, since this isn’t available by default. In that post, we identified that the biggest issue we had was that we were relying on a human to remember to add the ‘dateCreated’ tag for every resource they created. I’ve got two ideas on how to fix that – today we’ll look at the first option, using Azure Policy.

Azure Policy is a way of comparing your Azure estate to defined requirements. You can either use predefined definitions (of which there are many) or create your own specific rules.  These definitions can be assigned to certain scopes (subscriptions, resource groups). Azure Policy then reports on whether you’re in the expected state and in some cases can alter resources to ensure you are.

Click through to see how to define a policy and then how to apply it to relevant resources.
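As a rough sketch of the pattern (not Jess’s exact definition), here is a modify-effect policy that stamps a dateCreated tag at deployment time, pushed with the azure-mgmt-resource Python SDK. One wrinkle worth knowing: utcNow() is only valid inside a parameter’s default value, which is why the timestamp arrives through a parameter. All the names here are placeholders:

```python
# Hedged sketch: an Azure Policy definition that adds a dateCreated tag
# to any resource missing it, using the modify effect.
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import PolicyClient
from azure.mgmt.resource.policy.models import PolicyDefinition

policy_rule = {
    "if": {"field": "tags['dateCreated']", "exists": "false"},
    "then": {
        "effect": "modify",
        "details": {
            # Contributor, which the modify operation needs for remediation
            "roleDefinitionIds": [
                "/providers/Microsoft.Authorization/roleDefinitions/"
                "b24988ac-6180-42a0-ab88-20f7382dd24c"
            ],
            "operations": [{
                "operation": "add",
                "field": "tags['dateCreated']",
                "value": "[parameters('dateCreated')]",
            }],
        },
    },
}

# utcNow() may only appear in a parameter's defaultValue, so the tag value
# is injected this way at deployment time.
parameters = {"dateCreated": {"type": "String", "defaultValue": "[utcNow()]"}}

client = PolicyClient(DefaultAzureCredential(), "<subscription-id>")
client.policy_definitions.create_or_update(
    "add-dateCreated-tag",
    PolicyDefinition(policy_rule=policy_rule, parameters=parameters, mode="Indexed"),
)
```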
