Press "Enter" to skip to content

Category: Cloud

Recommendations for Implementing Azure Data Factory

Paul Andrew has a nice set of recommendations you should follow when configuring Azure Data Factory:

Building on our understanding of generic datasets, a good Data Factory should include (where possible) generic pipelines; these are driven from metadata to simplify (as a minimum) data ingestion operations. Typically I use an Azure SQLDB to house my metadata, with stored procedures that get called via Lookup activities to return everything a pipeline needs to know.

This metadata-driven approach means deployments to Data Factory for new data sources are greatly reduced; only adding new values to a database table is required. The pipeline itself doesn’t need to be complicated. Copying CSV files from a local file server to Data Lake Storage could be done with just three activities, shown below.

There are several good recommendations here; read the whole thing.
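
Paul’s post has the real pipeline; purely as a hypothetical sketch of the pattern (Python standing in for ADF, with made-up paths and a stand-in for the Copy activity), the shape of it is something like this:

```python
# Hypothetical sketch of metadata-driven ingestion, for illustration only.
# In Data Factory this would be a Lookup activity calling a stored procedure,
# feeding a ForEach of parameterized Copy activities.
metadata_rows = [
    {"source_path": r"\\fileserver\exports\sales.csv",
     "target_container": "raw", "target_folder": "sales"},
    {"source_path": r"\\fileserver\exports\customers.csv",
     "target_container": "raw", "target_folder": "customers"},
]

def copy_file(source_path: str, target_container: str, target_folder: str) -> None:
    """Stand-in for a generic Copy activity parameterized by metadata."""
    print(f"Copying {source_path} -> {target_container}/{target_folder}")

# Onboarding a new source is just a new metadata row; the pipeline itself
# does not change.
for row in metadata_rows:
    copy_file(**row)
```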

Managing On-Prem Data Gateways

Jamie Wick has some advice for managing Azure on-premises data gateways:

When organizations become aware of the On-premises Data Gateway application, there can be an explosion in the number of gateways that are connected to the tenant. Some people will invariably install a ‘temporary’ gateway on their local computer while they investigate/develop a new PowerBI report, PowerApp or Azure Automation process. All of these ‘temporary’ gateway instances can create problems when the systems that use them are moved into production and remain connected to the ‘temporary’ gateways.

In a previous post, I wrote about how to install and use an On-premises Data Gateway to connect on-site data with Azure-based apps (PowerBI, PowerApps, etc). Now we’re going to look at how to curtail the sprawl of gateways that may be occurring and how to remove any existing gateways that are deemed to be unnecessary.

Sprawl happens, but Jamie shows you how to limit it.
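
Most of the cleanup Jamie describes happens in the gateway app and the admin portal, but if you just want a quick inventory of the gateways you can see, one option is the Power BI REST API’s gateways endpoint. A minimal sketch, assuming you already have an Azure AD access token for the Power BI service (acquiring one via MSAL is left out):

```python
# Sketch: list the gateways visible to the caller via the Power BI REST API.
# Assumes a valid Azure AD access token for the Power BI service is in hand.
import requests

ACCESS_TOKEN = "<azure-ad-access-token>"  # placeholder

response = requests.get(
    "https://api.powerbi.com/v1.0/myorg/gateways",
    headers={"Authorization": f"Bearer {ACCESS_TOKEN}"},
    timeout=30,
)
response.raise_for_status()

for gateway in response.json().get("value", []):
    print(gateway["id"], gateway["name"])
```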

Jupyter Notebooks and Cosmos DB

Hasan Savran shows how we can use Jupyter notebooks with Cosmos DB:

After you enable the Notebook options, you are ready to analyze or visualize your data thanks to the Python language and Python packages. Cosmos DB makes it easy to write Python and install custom packages to use with your data. There are a couple of great internal commands and wildcards you should know if you like to use Notebooks in Azure Cosmos DB. The first one I want to introduce is the %%sql command. This command lets you select data from your containers by using the SQL API. You can select data and add it to your Python data frames. You need to define which database and container you want to use before you pass your query. Here is an example: I want to use my database named Stackoverflow and my container named Posts, and then I pass the query.

These are internal notebooks, meaning no separate Jupyter server is required. There’s a separate way of learning the Cosmos API from external notebooks.
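
If you go the external route, a rough equivalent of that %%sql example using the azure-cosmos Python SDK might look like the sketch below (the endpoint and key are placeholders; the Stackoverflow database and Posts container come from Hasan’s example):

```python
# Sketch: query the Posts container from an external notebook using the
# azure-cosmos SDK (pip install azure-cosmos pandas) and load a data frame.
import pandas as pd
from azure.cosmos import CosmosClient

client = CosmosClient("https://<your-account>.documents.azure.com:443/",
                      credential="<your-key>")
container = client.get_database_client("Stackoverflow").get_container_client("Posts")

items = container.query_items(
    query="SELECT TOP 10 * FROM c",
    enable_cross_partition_query=True,
)
df = pd.DataFrame(list(items))  # similar end result to the %%sql magic
print(df.head())
```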

Data Copy & Package Execution in ADF

Cathrine Wilhelmsen continues a series on Azure Data Factory. First, we get to see how to copy data from on-prem SQL Servers:

In the previous post, we looked at the three different types of integration runtimes. In this post, we will first create a self-hosted integration runtime. Then, we will create a new linked service and dataset using the self-hosted integration runtime. Finally, we will look at some common techniques and design patterns for copying data from and into an on-premises SQL Server.

And when I say “on-premises”, I really mean “in a private network”. It can either be a SQL Server on-premises on a physical server, or “on-premises” in a virtual machine.

Then, we learn how to run SSIS packages in Azure Data Factory:

Two posts ago, we looked at the three types of integration runtimes and created an Azure integration runtime. In the previous post, we created a self-hosted integration runtime for copying SQL Server data. In this post, we will complete the integration runtime part of the series. We will look at what SSIS Lift and Shift is, how to create an Azure-SSIS integration runtime, and how you can start executing SSIS packages in Azure Data Factory.

I’m going to guess that the next post will be all about the third integration runtime.

Schiphol Takeoff: Low-Code Automated Deployment

Tim van Cann and Daniel van der Ende have an open source project for automatic deployment on Azure:

To give a bit more insight into why we built Schiphol Takeoff, it’s good to take a look at an example use case. This use case ties a number of components together:

– Data arrives in a (near) real-time stream on an Azure Eventhub.
– A Spark job running on Databricks consumes this data from Eventhub, processes the data, and outputs predictions.
– A REST API is running on Azure Kubernetes Service, which exposes the predictions made by the Spark job.

Conceptually, this is not a very complex setup. However, there are quite a few components involved:

– Azure Eventhub
– Azure Databricks
– Azure Kubernetes Service

Each of these individually has some form of automation, but there is no unified way of coordinating and orchestrating deployment of the code to all at the same time. If, for example, you were to change the name of the consumer group for Azure Eventhub, you could script that. However, you’d also need to manually update your Spark job running on Databricks to ensure it could still consume the data.

This looks pretty nice. I’ll need to dive into it some more.

Choosing the Right Azure VM Type for SQL Server

Glenn Berry walks us through Azure virtual machine classes and picks out good ones for running SQL Server:

For high performance OLTP SQL Server workloads, the memory optimized type of Azure VMs is usually the best choice. According to Microsoft, “Memory optimized VM sizes offer a high memory-to-CPU ratio that are great for relational database servers.” This gives you lower core counts, with more memory, which is usually what you want for SQL Server, to minimize your license costs and still have good performance.

You can go even further down this path with Constrained vCPU capable VM sizes, where you can constrain the VM vCPU count (to one half or one quarter of the original VM size) to reduce the cost of SQL Server licensing, while maintaining the same memory, storage, and I/O bandwidth as a non-constrained VM. These constrained Azure VMs have a suffix in the name that indicates the number of active vCPUs in the VM.

Click through for the comparison.
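
To make the constrained-vCPU arithmetic concrete, here’s a small sketch. The naming pattern is real (for example, Standard_E16-8s_v3 is a 16 vCPU machine constrained to 8 active vCPUs), but the per-core price is a made-up placeholder, not an actual license cost:

```python
# Sketch: pull the total and active vCPU counts out of a constrained VM size
# name (e.g. Standard_E16-8s_v3 = 16 vCPUs, 8 active) and compare how many
# SQL Server cores you would license. The price is a placeholder.
import re

def parse_constrained_size(size_name: str) -> tuple[int, int]:
    """Return (total_vcpus, active_vcpus) for names like Standard_E16-8s_v3."""
    match = re.search(r"_[A-Za-z]+(\d+)-(\d+)", size_name)
    if not match:
        raise ValueError(f"{size_name} does not look like a constrained size")
    return int(match.group(1)), int(match.group(2))

PLACEHOLDER_PRICE_PER_CORE = 1_000  # illustrative only, not a real price

total, active = parse_constrained_size("Standard_E16-8s_v3")
print(f"{total} vCPUs of memory and I/O bandwidth, {active} cores to license")
print(f"  constrained:   {active * PLACEHOLDER_PRICE_PER_CORE}")
print(f"  unconstrained: {total * PLACEHOLDER_PRICE_PER_CORE}")
```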

More with Azure Data Factory

Cathrine Wilhelmsen continues a series on Azure Data Factory. Since we left off, Cathrine has three new posts. First, a look at how we monitor Azure Data Factory pipelines:

In the previous post, we looked at the three different trigger types, as well as how to trigger pipelines on-demand. In this post, we will look at what happens after that. How does monitoring work in Azure Data Factory?

Now, if we want to look at monitoring, we probably need something to monitor first. I mean, I could show you a blank dashboard, but I kind of already did that, and that wasn’t really interesting at all 🤔 So! In the previous post, I created a schedule trigger that runs hourly, added it to my orchestration pipeline, and published it.

Second, using annotations to customize views:

Annotations are additional, informative tags that you can add to specific factory resources: pipelines, datasets, linked services, and triggers. By adding annotations, you can easily filter and search for specific factory resources.

You need to figure out what kind of annotations make sense to you. Maybe you want to filter on the different logical steps of your solution, so you add the tags extract and transform? Perhaps ingest and prepare? Or maybe you want to tag resources with a business unit or a project name? It’s entirely up to you. All I recommend is that you’re consistent 🙂

That’s a problem for me—the only thing I’m consistent about is inconsistency. Third, Cathrine introduces the different runtimes available to us:

An integration runtime (IR) specifies the compute infrastructure an activity runs on or gets dispatched from. It has access to resources in either public networks, or in public and private networks.

Or, in Cathrine-speak, using less precise words: An integration runtime specifies what kind of hardware is used to execute activities, where this hardware is physically located, who owns and maintains the hardware, and which data stores and services the hardware can connect to.

There’s a lot of good material in each of these three posts.

Azure Data Factory Triggers

Cathrine Wilhelmsen continues a series on Azure Data Factory by looking at triggers:

One important thing to note is that all times are in UTC. And since UTC does not observe daylight saving time… Well, let’s just say that if you need to execute pipelines during the workday and you have business users waiting for data, you may want to plan some trigger maintenance on the days when you fall back or spring forward. I know. Ugh 🙂 I’m hoping for better timezone support in the future 🤞🏻

Schedule triggers and pipelines have a many-to-many relationship. That means that one schedule trigger can execute many pipelines, and one pipeline can be executed by many schedule triggers.
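
To see how the UTC-only scheduling bites, here’s a quick illustration (the time zone is just an example) of the same 06:00 UTC trigger landing at different local times on either side of a daylight saving change:

```python
# Illustration of the UTC caveat: a trigger fixed at 06:00 UTC shifts by an
# hour in local wall-clock terms once daylight saving time kicks in.
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

oslo = ZoneInfo("Europe/Oslo")  # example time zone; DST began 2020-03-29

for run in (datetime(2020, 3, 27, 6, tzinfo=timezone.utc),
            datetime(2020, 3, 30, 6, tzinfo=timezone.utc)):
    print(f"{run:%Y-%m-%d %H:%M} UTC -> {run.astimezone(oslo):%H:%M} local")
# Before the change the run lands at 07:00 local time, afterwards at 08:00.
```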

Time-based triggers aren’t the only options, however—Cathrine also looks at the other three possibilities.

Debugging Azure Data Factory Pipelines

Cathrine Wilhelmsen shows us how to debug Azure Data Factory pipelines:

You debug a pipeline by clicking the debug button:

Tadaaa! Blog post done? 😀

I joke, I joke, I joke. Debugging pipelines is a one-click operation, but there are a few more things to be aware of. In the rest of this post, we will look at what happens when you debug a pipeline, how to see the debugging output, and how to set breakpoints.

Turns out there’s more to it than clicking a button.

Orchestrating ADF Pipelines

Cathrine Wilhelmsen continues a series on Azure Data Factory:

The other way to build this solution is by creating an orchestration pipeline with two execute pipeline activities. This gives us a little more flexibility than having a single pipeline, because we can execute each pipeline separately if we want to.

Let’s start by creating a new pipeline and adding two execute pipeline activities to it. In the activity settings, select the pipelines to execute, and check wait on completion:
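
Roughly speaking, that “wait on completion” checkbox maps to a waitOnCompletion property on the Execute Pipeline activity. Below is a simplified, from-memory sketch of the activity definition expressed as a Python dict, with made-up pipeline names, so treat it as an approximation rather than anything copied from Cathrine’s post:

```python
# Rough, simplified sketch of an Execute Pipeline activity definition,
# expressed as a Python dict. The pipeline names here are made up.
execute_ingest = {
    "name": "Execute Ingest Pipeline",
    "type": "ExecutePipeline",
    "typeProperties": {
        "pipeline": {"referenceName": "IngestCsvFiles", "type": "PipelineReference"},
        "waitOnCompletion": True,  # the "wait on completion" checkbox
    },
}
```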

Read on for the demonstration.
