Press "Enter" to skip to content


Metadata-Driven ADF Pipelines for Synapse

Hope Foley is back:

So what’s the hard thing I want to help make easier in this post? Metadata-driven pipelines in Azure Data Factory! I had the opportunity a while back to work with a customer who was pulling data out of large SQL Servers to eventually land data in Azure Synapse SQL pools, back when they were still Azure SQL DW. We created a couple of load-pattern pipelines that used metadata in Azure SQL DB to load Synapse SQL pool tables from Parquet files in Azure Data Lake Storage (ADLS) Gen2.

Not gonna lie, the pipelines weren’t easy for me to learn to set up initially. Big thanks to Catherine for your blog, which was a life preserver in the hardest parts! So I wanted to see if I could automate it in my old friend PowerShell.

It would also be worth looking at some of the work Paul Andrew has done around ADF.procfwk for another approach to the problem.
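
To give a flavor of the PowerShell angle, here is a minimal sketch of driving a parameterized ADF load pipeline from a metadata control table, assuming recent Az and SqlServer modules. The control table, pipeline, and parameter names are hypothetical stand-ins for whatever your load pattern uses.

# Sketch: fire one parameterized ADF pipeline run per row in a metadata control table.
# Control table, pipeline, and parameter names are hypothetical.
Import-Module Az.DataFactory
Import-Module SqlServer

$token  = (Get-AzAccessToken -ResourceUrl 'https://database.windows.net/').Token
$tables = Invoke-Sqlcmd -ServerInstance 'metasrv.database.windows.net' `
    -Database 'MetadataDb' -AccessToken $token `
    -Query 'SELECT SchemaName, TableName, ParquetPath FROM dbo.LoadControl WHERE Enabled = 1;'

foreach ($t in $tables) {
    # One pipeline run per table in the control list.
    Invoke-AzDataFactoryV2Pipeline -ResourceGroupName 'rg-etl' `
        -DataFactoryName 'adf-loader' -PipelineName 'pl_parquet_to_synapse' `
        -Parameter @{ schema = $t.SchemaName; table = $t.TableName; path = $t.ParquetPath }
}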


Copying an Azure SQL Database

Garry Bargsley gives us two methods for copying an Azure SQL Database:

Copying an Azure SQL Database is a vital skill when managing cloud databases. Recently, a request was received from the “business”. They wanted to create a copy of an Azure SQL Database that was in a development environment. The database has been certified and early testing was accepted. They now want an exact copy in QA to start integration testing. The process of making an Azure SQL Database copy is straightforward. There are several different ways to perform this action.

The two methods chosen, the Azure Portal and PowerShell, will demonstrate how to complete this request.

Click through for the demos.
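
The PowerShell method boils down to a single cmdlet, New-AzSqlDatabaseCopy; all of the names below are placeholders. The resulting copy is transactionally consistent as of the moment the copy operation completes.

# Copy the certified dev database to the QA server; names are placeholders.
New-AzSqlDatabaseCopy -ResourceGroupName 'rg-dev' `
    -ServerName 'sql-dev' -DatabaseName 'AppDb' `
    -CopyResourceGroupName 'rg-qa' -CopyServerName 'sql-qa' `
    -CopyDatabaseName 'AppDb_QA'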


Tips for Optimizing Dedicated SQL Pools in Synapse Analytics

Tsuyoshi Matsuzaki shares some tips for improving query performance when using Dedicated SQL Pools in Azure Synapse Analytics:

In the BROADCAST_MOVE operation above, the rows in the dimension_City table are all copied into a temporary table (called TEMP_ID_3) on every distributed database. (See below.)
Since dimension_City is small, all of its rows are duplicated in every database before joining. Here we join only two tables; however, if a lot of tables need to be joined, this data movement becomes a large overhead for query execution.

The short version is, replicate smaller dimensions and align distribution keys for large tables which get joined together. Both of these minimize the chances of the engine needing to shuffle data between nodes. These sorts of things can make a huge difference when working with Dedicated SQL Pools, cutting query time down by an order of magnitude in some extreme cases.
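
To make those two choices concrete, here is roughly what they look like as CTAS statements in a dedicated SQL pool, run through Invoke-Sqlcmd; table and column names are illustrative.

# Replicate the small dimension and hash-distribute the large fact on the join
# key, so joins between them avoid BROADCAST_MOVE/SHUFFLE_MOVE steps.
$ddl = @"
CREATE TABLE dbo.dimension_City
WITH (DISTRIBUTION = REPLICATE, CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stg.dimension_City;

CREATE TABLE dbo.fact_Sale
WITH (DISTRIBUTION = HASH(CityKey), CLUSTERED COLUMNSTORE INDEX)
AS SELECT * FROM stg.fact_Sale;
"@
Invoke-Sqlcmd -ServerInstance 'myws.sql.azuresynapse.net' -Database 'SqlPool01' -Query $ddl

Two large fact tables hash-distributed on the same join key can likewise be joined without any data movement at all.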


Cosmos DB Serverless

Hasan Savran takes us through a new approach to Cosmos DB pricing:

There used to be two ways for Azure Cosmos DB to bill you for the services it provides: manual provisioned throughput and autoscale provisioned throughput. Provisioned throughput is the number of request units available for your applications to use per second. For example, let’s say you picked 400 Request Units. That means your application’s budget is 400 request units per second. Depending on your needs, you can scale up or down. You can set this manually and scale up and down manually if you like. That is the cheapest option, at $0.008 per hour for each 100 Request Units. Your other option is autoscale, where Cosmos DB scales up and down automatically; all you need to tell Cosmos DB is the highest number it can scale up to. This option costs 50% more than the first one.

These are great options, but they can still be expensive for what you need.

Read on for what we know about Serverless pricing and which APIs currently support the Serverless model.
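
Some back-of-the-envelope math with the rates quoted above; these were list prices at the time of the post, so check the current pricing page before relying on them.

# Manual: $0.008 per 100 RU/s per hour; autoscale billed 50% higher per hour.
$ruPerSec   = 400
$manualHour = ($ruPerSec / 100) * 0.008   # $0.032 per hour
$autoHour   = $manualHour * 1.5           # $0.048 per hour if pegged at the max
$hours      = 730                         # approximate hours in a month
'Manual:    {0:C2} per month' -f ($manualHour * $hours)   # ~$23.36
'Autoscale: {0:C2} per month' -f ($autoHour * $hours)     # ~$35.04, worst case

Autoscale bills each hour on the highest RU/s it actually scaled to, so the second figure is the worst case of running at the maximum all month.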


Projecting Defensive Back Trajectories with SageMaker

Lin Lee Cheong, et al., relay some interesting research:

NFL’s Next Gen Stats (NGS) powered by AWS accurately captures player and ball data in real time for every play and every NFL game—over 300 million data points per season—through the extensive use of sensors in players’ pads and the ball. With this rich set of tracking data, NGS uses AWS machine learning (ML) technology to uncover deeper insights and develop a better understanding of various aspects and trends of the game. To date, NGS metrics have focused on helping fans better appreciate and understand the offense and defense in gameplay through the application of advanced analytics, particularly in the passing game. Thanks to tracking data, it’s possible to quantify the difficulty of passes, model expected yards after catch, and determine the value of various play outcomes. A logical next step with this analytical information is to evaluate quarterback decision-making, such as whether the quarterback has considered all eligible receivers and evaluated tradeoffs accurately.

To effectively model quarterback decision-making, we considered a few key metrics—mainly the probability of different events occurring on a pass, and the value of said events. A pass can result in three outcomes: completion, incompletion, or interception. NGS has already created models that provide probabilities of these outcomes, but these events rely on information that’s available at only two points during the play: when the ball is thrown (termed as pass-forward), and when the ball arrives to a receiver (pass-arrived). Because of this, creating accurate probabilities requires modeling the trajectory of players between those two points in time.

For these probabilities, the quarterback’s decision is heavily influenced by the quality of defensive coverage on various receivers, because a receiver with a closely covered defender has a lower likelihood of pass completion compared to a receiver who is wide open due to blown coverage. Furthermore, defenders are inherently reactive to how the play progresses. Defenses move in completely different ways depending on which receiver is targeted on the pass. This means that a trajectory model for defenders has to similarly be reactive to the specified targeted receiver in a believable manner.

Click through for details on the study.
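
As a toy illustration of the decision metric described above, weighting each pass outcome's value by its modeled probability; every number here is invented.

# Toy expected-value calculation for targeting one receiver; all numbers invented.
$outcomes = @(
    @{ Name = 'Completion';   Prob = 0.62; Value =   8.0 }  # expected yards gained
    @{ Name = 'Incompletion'; Prob = 0.33; Value =   0.0 }
    @{ Name = 'Interception'; Prob = 0.05; Value = -45.0 }  # turnover penalty
)
$ev = ($outcomes | ForEach-Object { $_.Prob * $_.Value } | Measure-Object -Sum).Sum
'Expected value of this target: {0:N2} yards' -f $ev   # 2.71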


Stopping and Starting an Azure Kubernetes Service Cluster

Mohammad Darab wants to save some cash (or at least Azure credits):

I remember when I first started deploying Big Data Clusters, they were on Azure Kubernetes Service utilizing the $200 credit for first time sign ups. By the time I got around to figuring out how to deploy the BDC, not only was my $200 credit gone, but I started to incur cost out of pocket.

If only there was a feature that would allow me to stop the VMs in AKS whenever I wasn’t using them. Well, I’m excited to share that Microsoft AKS (Azure Kubernetes Service) came out with a neat feature (currently in preview at the time of the publishing of this post) that allows you to stop and start your AKS cluster by running a simple command. Of course I had to try it out on BDCs and to my surprise it worked. Well, sort of. Let me explain…

Read on for more information, as well as current limitations.
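
With a recent Az.Aks module, the stop/start pair looks like the following; cluster and resource group names are placeholders, and at the time of the post the feature still required the preview CLI extension.

# Stop the cluster when you are done working on the BDC, start it when you return.
Stop-AzAksCluster  -ResourceGroupName 'rg-bdc' -Name 'aks-bdc'

# ...later:
Start-AzAksCluster -ResourceGroupName 'rg-bdc' -Name 'aks-bdc'

# Confirm the power state before reconnecting (property exposed in newer module versions).
(Get-AzAksCluster -ResourceGroupName 'rg-bdc' -Name 'aks-bdc').PowerState.Code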


Database Mail on Azure SQL Managed Instances

John McCormack shows how you can set up database mail from an Azure SQL Managed Instance:

It’s not too difficult to set up database mail for Azure SQL DB Managed Instance in comparison to SQL Server (on-prem or IaaS); however, there are a few extra things to consider. This post will describe how to set up database mail for Azure SQL DB Managed Instance. I will use SendGrid as the mail provider, but you can follow the same steps for any other mail provider or your company’s SMTP server.

Before I go on, my personal opinion is that including database mail is a massive feature for Managed Instances. The lack of DB Mail on Azure SQL DB Single Database or Amazon RDS is a major blocker to PaaS adoption. Now with Managed Instance, we can have PaaS and database mail.

Read on for the instructions. There’s a little bit more than what you typically would need to do on-premises, but just a little bit.
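
A sketch of the core T-SQL, run here through Invoke-Sqlcmd against msdb. Account and profile names are placeholders; with SendGrid, the username is the literal string apikey and the password is an API key.

$setup = @"
EXEC msdb.dbo.sysmail_add_account_sp
    @account_name    = 'SendGridAccount',
    @email_address   = 'alerts@example.com',
    @mailserver_name = 'smtp.sendgrid.net',
    @port            = 587,
    @username        = 'apikey',
    @password        = '<SendGrid API key>',
    @enable_ssl      = 1;

EXEC msdb.dbo.sysmail_add_profile_sp @profile_name = 'DbMailProfile';

EXEC msdb.dbo.sysmail_add_profileaccount_sp
    @profile_name    = 'DbMailProfile',
    @account_name    = 'SendGridAccount',
    @sequence_number = 1;

-- Send a test message to confirm the relay works.
EXEC msdb.dbo.sp_send_dbmail
    @profile_name = 'DbMailProfile',
    @recipients   = 'dba@example.com',
    @subject      = 'Database Mail test',
    @body         = 'Hello from the Managed Instance.';
"@
Invoke-Sqlcmd -ServerInstance 'mymi.dnszone.database.windows.net' -Database 'msdb' -Query $setup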


Integrating Power BI into Azure Synapse Analytics

Ginger Grant walks us through two methods of integrating Power BI and Azure Synapse Analytics:

From within Synapse you have the ability to access a Power BI workspace so that you can use Power BI from within Synapse. Your Power BI tenant can be in a different data center than the Azure Synapse workspace, but the Power BI workspace and the Synapse workspace must be in the same Power BI tenant. You can use Power BI to look at any data you wish, as the data you use can be from any location. When this blog was written, it was only possible to connect to one Power BI workspace from within Azure Synapse. In order to run Power BI as shown here, first I needed to create a Linked Service from within Synapse.

Read on for more.
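
If you would rather script the linked service than click through Synapse Studio, something like the following should work with the Az.Synapse module. The JSON shape mirrors what the portal generates as best I can tell (property casing may vary by API version), and the IDs are placeholders.

# Hypothetical sketch: register a Power BI workspace as a Synapse linked service.
$definition = @"
{
  "name": "PowerBIWorkspace1",
  "properties": {
    "type": "PowerBIWorkspace",
    "typeProperties": {
      "tenantId": "<tenant-guid>",
      "workspaceId": "<powerbi-workspace-guid>"
    }
  }
}
"@
Set-Content -Path ./pbi-linked-service.json -Value $definition
Set-AzSynapseLinkedService -WorkspaceName 'my-synapse-ws' `
    -Name 'PowerBIWorkspace1' -DefinitionFile ./pbi-linked-service.json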


Adding Row Numbers to ADF Data Flows

Rayis Imayev shows two methods of generating unique, ascending row numbers in Azure Data Factory data flows:

Adding a row number to your dataset can be a trivial task. Both ANSI SQL and Spark SQL have the row_number() window function, which can enrich your data with a unique number across your whole (or partitioned) data recordset.

Recently I had a case of creating a data flow in Azure Data Factory (ADF) where there was a need to add a row number.

Read on for a couple attempts which didn’t work, followed by two that do, including an assist from Joseph Edwards.
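
For reference, this is the plain SQL version of the window function the post starts from, wrapped in Invoke-Sqlcmd; table and column names are invented, and the data flow equivalents are what the post itself works through.

# row_number() over an arbitrary ordering, for when no natural sort key exists.
$query = @"
SELECT
    ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS RowNum,
    s.*
FROM dbo.SourceData AS s;
"@
Invoke-Sqlcmd -ServerInstance 'mysrv.database.windows.net' -Database 'SourceDb' -Query $query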
