Category: Cloud

Microsoft.DataFactory and Storage Event Triggers in Synapse

Cathrine Wilhelmsen troubleshoots an Azure issue:

I ran into an issue today while trying to publish a storage event trigger in Azure Synapse Analytics. After publishing, I got error messages that said “failed to subscribe” and “failed to activate”. The storage event trigger had been published, but it wouldn’t start. Help!

Click through for some resources on documentation, a few things which didn’t work, and what finally resolved the issue.
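
If you hit the same errors, one thing worth checking first (and, given the post’s title, plausibly the culprit here) is whether the Microsoft.DataFactory and Microsoft.EventGrid resource providers are registered on the subscription, since storage event triggers depend on both. A minimal sketch of that check using the azure-mgmt-resource SDK; the subscription ID is a placeholder:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

# Placeholder subscription ID -- substitute your own.
client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Storage event triggers rely on both of these providers being registered.
for namespace in ("Microsoft.DataFactory", "Microsoft.EventGrid"):
    provider = client.providers.get(namespace)
    print(f"{namespace}: {provider.registration_state}")
    if provider.registration_state != "Registered":
        # Registration is asynchronous and can take a few minutes to complete.
        client.providers.register(namespace)
```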

Securely Access VMs with Azure Bastion

I have a post on Azure Bastion:

Azure Bastion is a service which acts as a managed RDP or SSH host, allowing you to use a web browser securely to connect to a virtual machine, even when that virtual machine does not have a public IP address. If you’re new to Azure networking, it may feel a little complicated, but let’s see how to configure and use Bastion.

Click through for a step-by-step guide on how to use the service.
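
If you prefer scripting the setup over clicking through the portal, here is a rough sketch of the Bastion piece with the azure-mgmt-network SDK. It assumes a virtual network that already contains a subnet named AzureBastionSubnet and a Standard-SKU public IP; all resource names are placeholders:

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

subscription_id = "<subscription-id>"  # placeholder
network = NetworkManagementClient(DefaultAzureCredential(), subscription_id)

subnet_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-demo"
    "/providers/Microsoft.Network/virtualNetworks/vnet-demo"
    "/subnets/AzureBastionSubnet"  # Bastion requires this exact subnet name
)
public_ip_id = (
    f"/subscriptions/{subscription_id}/resourceGroups/rg-demo"
    "/providers/Microsoft.Network/publicIPAddresses/pip-bastion"
)

poller = network.bastion_hosts.begin_create_or_update(
    "rg-demo",
    "bastion-demo",
    {
        "location": "eastus",
        "ip_configurations": [
            {
                "name": "bastion-ipconfig",
                "subnet": {"id": subnet_id},
                "public_ip_address": {"id": public_ip_id},
            }
        ],
    },
)
bastion = poller.result()  # provisioning typically takes several minutes
print(bastion.provisioning_state)
```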

Preventing Concurrent Pipeline Execution in Azure Data Factory

Dave Ruijter and Laura de Bruin want to prevent concurrent runs of a pipeline:

For scheduled triggers, there is nothing out-of-the-box that can help you to prevent concurrent pipeline runs. For tumbling window triggers there is a maxConcurrency property, but keep in mind that this will create a queue/backlog of pipeline runs. It will not cancel any pipeline runs. It depends on your use case if you really want that behavior. 

Instead, the two look at a pair of designs and this post is all about the first one.
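
Their design builds the check into the pipeline itself, but as a rough illustration of the underlying idea, here is how you might look for in-progress runs from outside ADF with the azure-mgmt-datafactory SDK before kicking off a new one; the factory and pipeline names are placeholders:

```python
from datetime import datetime, timedelta, timezone

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    RunFilterParameters,
    RunQueryFilter,
    RunQueryFilterOperand,
    RunQueryFilterOperator,
)

adf = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

now = datetime.now(timezone.utc)
runs = adf.pipeline_runs.query_by_factory(
    "rg-demo",
    "adf-demo",
    RunFilterParameters(
        last_updated_after=now - timedelta(days=1),
        last_updated_before=now,
        filters=[
            RunQueryFilter(
                operand=RunQueryFilterOperand.PIPELINE_NAME,
                operator=RunQueryFilterOperator.EQUALS,
                values=["PL_Load_Sales"],
            ),
            RunQueryFilter(
                operand=RunQueryFilterOperand.STATUS,
                operator=RunQueryFilterOperator.IN,
                values=["InProgress", "Queued"],
            ),
        ],
    ),
)

if runs.value:
    print("Another run is already in progress; skipping this one.")
else:
    adf.pipelines.create_run("rg-demo", "adf-demo", "PL_Load_Sales")
```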

Building a Pipeline for External Data Sharing

Hope Foley has data to share:

I worked with a customer recently who had a need to share CSVs for an auditing situation.  They had a lot of external customers that they needed to collect CSVs from for the audit process.  There were a lot of discussions happening on how to best do it, whether we’d pull data from their environment or have them push them into theirs.  Folks weren’t sure on that so I tried to come up with something that would work for both. 

Read on for Hope’s solution to the problem.

The User-Assigned Managed Identity in ADF

Asanka Padmakumara takes a look at the user-assigned managed identity:

If you are familiar with Managed Identity concepts in ADF, each ADF instance comes with its own System Assigned Managed Identity (MI). We can use that MI to control ADF’s access to any data sources which support Azure AD based authentication. This is considered the most secure and recommended way of authenticating ADF with cloud systems. If not, you can use Azure Key Vault to store credentials. Let’s take an example to discuss how a User Assigned Managed Identity helps manage access across multiple ADF environments.

Click through to see how the user assigned managed identity makes life better.
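
The difference between the two identity types is easiest to see in code. Here is a small illustration using the azure-identity library (not ADF itself): a system-assigned identity needs no extra information, while a user-assigned identity is selected by its client ID, which is what lets several factories or environments share one identity and one set of data-source grants. The storage account and client ID below are placeholders:

```python
from azure.identity import ManagedIdentityCredential
from azure.storage.blob import BlobServiceClient

# System-assigned: the identity is tied to the hosting resource itself.
system_cred = ManagedIdentityCredential()

# User-assigned: a standalone identity, picked by its client ID, that any
# number of resources (e.g. dev/test/prod factories) can share.
user_cred = ManagedIdentityCredential(client_id="<user-assigned-identity-client-id>")

blob_service = BlobServiceClient(
    account_url="https://mystorageaccount.blob.core.windows.net",
    credential=user_cred,
)
```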

ElasticMapReduce Serverless

Damon Cortesi, et al, announce serverless EMR is now in preview:

Today we’re happy to announce Amazon EMR Serverless, a new option in Amazon EMR that makes it easy and cost-effective for data engineers and analysts to run petabyte-scale data analytics in the cloud. With EMR Serverless, you can run applications built using open-source frameworks such as Apache Spark, Hive, and Presto, without having to configure, manage, optimize, or secure clusters. EMR Serverless automatically provisions and scales the compute and memory resources required by your applications, and you only pay for the resources that your applications use.

In this post, we discuss the benefits of EMR Serverless, walk you through the core concepts of EMR Serverless and how you can use it, and show you a quick demo.

If you’re already using EMR for ephemeral work—that is, using a Spark cluster to perform data transformations and then shutting it down—this makes a lot of sense as long as there’s not a major difference in cost.
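
For a feel of the programming model, here is a rough boto3 sketch of creating an application and submitting a Spark job. It assumes the emr-serverless API is available in your boto3 version (the service was still in preview when this was announced); the role ARN, bucket, and script paths are placeholders:

```python
import boto3

emr = boto3.client("emr-serverless", region_name="us-east-1")

# An "application" is the serverless stand-in for a cluster.
app = emr.create_application(
    name="nightly-etl",
    releaseLabel="emr-6.6.0",
    type="SPARK",
)

# Jobs are submitted against the application; compute is provisioned on demand.
emr.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EmrServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/scripts/transform.py",
            "entryPointArguments": ["--run-date", "2021-12-06"],
        }
    },
)
```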

Creating an Availability Group on Linux in Azure with Pacemaker

Andrew Pruski slams in all of the exciting nouns:

There are new Ubuntu Pro 20.04 images available in the Azure marketplace with SQL Server 2019 pre-installed so I thought I’d run through how to create a three node pacemaker cluster with these new images in order to deploy a SQL Server availability group.

Disclaimer – The following steps will create the cluster but have not been tested in a production environment. Any HA configuration for SQL Server needs to be thoroughly tested before going “live”.

Click through to see how.

Using the Fail Activity in Azure Data Factory

Rayis Imayev thinks about failure:

Recently, Microsoft introduced a new Fail activity (https://docs.microsoft.com/en-us/azure/data-factory/control-flow-fail-activity) in the Azure Data Factory (ADF) and I wondered about a reason to fail a pipeline in ADF when my internal being tries very hard to make the pipelines successful once and for all. Yes, I understand a documented explanation that this activity can help to “customize both its error message and error code”, but why?

Click through for Rayis’s take. I’ll just be here cracking jokes about how Fail activities are banned in my code because I expect it to have a positive outlook on life.
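
If you do decide to use one, the activity definition is small: per the linked documentation it takes a message and an errorCode, both of which can be dynamic expressions. A hypothetical example (the names and expression are made up), shown as the Python dict you would drop into a pipeline payload:

```python
# Hypothetical Fail activity definition: stop the pipeline with a meaningful
# error when an upstream Lookup finds nothing to load.
fail_activity = {
    "name": "FailWhenSourceIsEmpty",
    "type": "Fail",
    "dependsOn": [
        {"activity": "LookupSourceRowCount", "dependencyConditions": ["Succeeded"]}
    ],
    "typeProperties": {
        "message": "@concat('No rows found for table ', pipeline().parameters.TableName)",
        "errorCode": "500",
    },
}
```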

Building an ETL Pipeline with Airflow and Containers

Nikita Vasilev needs to move some data:

Obviously, we can use one of the many ready-made ETL systems that implement the functions of loading information into the corporate data warehouse. Informatica PowerCenter, Oracle Data Integrator, SAP Data Services, Oracle Warehouse Builder, Talend Open Studio, and Pentaho are just a sliver of the off-the-shelf solutions. However, when it comes to large volumes of data at high speed, with Big Data infrastructure already in place, boxed solutions fall flat.

Therefore, Big Data pipelines require something like Apache Airflow. It’s an open-source set of libraries for developing, scheduling, and monitoring workflows. Airflow is written in Python and allows you to create and configure task chains both visually, through a clear web GUI, and by writing Python code.

Click through for an example using Airflow with AWS’s Elastic Container Service.
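
As a sense of how small the Airflow side can be, here is a minimal DAG that hands the actual ETL work to a container running on ECS Fargate. It assumes a recent apache-airflow-providers-amazon package (older versions call the operator ECSOperator rather than EcsRunTaskOperator); the cluster, task definition, and subnet are placeholders:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.operators.ecs import EcsRunTaskOperator

with DAG(
    dag_id="csv_to_warehouse",
    start_date=datetime(2021, 12, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # The heavy lifting lives in the container image; Airflow only orchestrates.
    run_etl = EcsRunTaskOperator(
        task_id="run_etl_container",
        cluster="etl-cluster",
        task_definition="etl-task:1",
        launch_type="FARGATE",
        overrides={
            "containerOverrides": [
                {"name": "etl", "command": ["python", "etl.py", "--date", "{{ ds }}"]}
            ]
        },
        network_configuration={
            "awsvpcConfiguration": {"subnets": ["subnet-0123456789abcdef0"]}
        },
    )
```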

Building an MLOps Workflow with SageMaker and GitLab

Lauren Mullennex, et al, build out some pipelines:

Machine learning operations (MLOps) are key to effectively transition from an experimentation phase to production. The practice provides you the ability to create a repeatable mechanism to build, train, deploy, and manage machine learning models. To quickly adopt MLOps, you often require capabilities that use your existing toolsets and expertise. Projects in Amazon SageMaker give organizations the ability to easily set up and standardize developer environments for data scientists and CI/CD (continuous integration, continuous delivery) systems for MLOps engineers. With SageMaker projects, MLOps engineers or organization administrators can define templates that bootstrap the ML workflow with source version control, automated ML pipelines, and a set of code to quickly start iterating over ML use cases. With projects, dependency management, code repository management, build reproducibility, and artifact sharing and management become easy for organizations to set up. SageMaker projects are provisioned using AWS Service Catalog products. Your organization can use project templates to provision projects for each of your users.

In this post, you use a custom SageMaker project template to incorporate CI/CD practices with GitLab and GitLab pipelines. You automate building a model using Amazon SageMaker Pipelines for data preparation, model training, and model evaluation. SageMaker projects builds on Pipelines by implementing the model deployment steps and using SageMaker Model Registry, along with your existing CI/CD tooling, to automatically provision a CI/CD pipeline. In our use case, after the trained model is approved in the model registry, the model deployment pipeline is triggered via a GitLab pipeline.

Click through for the step-by-step guide on how to do this.
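
The hand-off point in that flow is the model registry: the deployment pipeline only fires once a model package is approved. A small boto3 sketch of that approval step (the model package group name is a placeholder):

```python
import boto3

sm = boto3.client("sagemaker")

# Grab the most recently registered package in the group.
latest = sm.list_model_packages(
    ModelPackageGroupName="customer-churn-models",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"][0]

# Flipping the approval status is the event the deployment pipeline reacts to.
sm.update_model_package(
    ModelPackageArn=latest["ModelPackageArn"],
    ModelApprovalStatus="Approved",
)
```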
