

Debugging Azure Data Factory Data Flows

Mark Kromer takes us through debugging Azure Data Factory Data Flows:

When you are designing your mapping data flows in ADF, you are working against a live Azure Databricks Spark cluster. The size of that cluster is configurable via the Azure Integration Runtime. If you do not configure a custom Azure IR, you will use the default Azure IR, which sets a very small cluster size by default: 4 cores for a single worker node and 4 cores for a single driver node. In most cases, while debugging and using data preview, that should be fine. But when you start exploring your data with column statistics or increase the sampling size in debug settings, you may find that you’ve exceeded the capacity of that small default cluster. Below are the steps you need to take to increase the size of your debug cluster.

Click through for step-by-step instructions.
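As a companion, here is a minimal sketch of scripting a larger debug IR with the azure-mgmt-datafactory Python SDK rather than the portal. This is not taken from Mark's post; every resource name and sizing value below is a placeholder assumption, and model names can vary by SDK version.

```python
# A minimal sketch, assuming the azure-mgmt-datafactory and azure-identity
# packages; every name and sizing value below is a placeholder.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeComputeProperties,
    IntegrationRuntimeDataFlowProperties,
    IntegrationRuntimeResource,
    ManagedIntegrationRuntime,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# A managed (Azure) IR whose data flow cluster is larger than the default.
ir_properties = ManagedIntegrationRuntime(
    compute_properties=IntegrationRuntimeComputeProperties(
        data_flow_properties=IntegrationRuntimeDataFlowProperties(
            compute_type="General",  # or "MemoryOptimized" / "ComputeOptimized"
            core_count=16,           # more cores than the small default
            time_to_live=10,         # minutes to keep the debug cluster warm
        )
    )
)

client.integration_runtimes.create_or_update(
    "<resource-group>",
    "<factory-name>",
    "DataFlowDebugIR",
    IntegrationRuntimeResource(properties=ir_properties),
)
```

Once the custom IR exists, you select it in the data flow debug settings instead of the default Azure IR.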


Creating a SQL Managed Instance

Jess Pomfret takes us through creation of a SQL Managed Instance:

I’ve been thinking about the cloud a lot lately, and I feel it’s an area I would benefit from learning more about. I’ve attended a couple of presentations on SQL Managed Instances and have read enough to be dangerous (or to accidentally spend a lot of money, one of my biggest fears when working in the cloud). However, I always find I learn best and really get to understand a topic by building something.

This post will be the first in at least a two-part series on SQL Managed Instances (MI). My goal in this post is just to deploy an MI and have it ready to use for my next post.

Read on for the step-by-step instructions.
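If you later want to script the same deployment, a rough sketch with the azure-mgmt-sql Python SDK follows. Everything here (names, subnet, SKU, sizing) is a placeholder assumption, and the subnet must already be prepared for Managed Instance.

```python
# A rough sketch, assuming the azure-mgmt-sql and azure-identity packages;
# all names, the subnet, and the SKU are placeholders. The subnet must
# already be delegated/prepared for Managed Instance.
from azure.identity import DefaultAzureCredential
from azure.mgmt.sql import SqlManagementClient
from azure.mgmt.sql.models import ManagedInstance, Sku

client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

poller = client.managed_instances.begin_create_or_update(
    resource_group_name="<resource-group>",
    managed_instance_name="mi-demo",
    parameters=ManagedInstance(
        location="eastus",
        sku=Sku(name="GP_Gen5", tier="GeneralPurpose"),
        administrator_login="miadmin",
        administrator_login_password="<strong-password>",
        subnet_id=(
            "/subscriptions/<sub>/resourceGroups/<rg>/providers"
            "/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>"
        ),
        v_cores=4,
        storage_size_in_gb=32,
    ),
)
managed_instance = poller.result()  # deployment can take several hours
```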


Azure AD Logins for Managed Instances

Mirek Sztajno announces a new feature for Azure SQL Managed Instances:

We are happy to announce the general availability (GA) of Azure AD server principals (Azure AD logins) for SQL Managed Instance (MI). This feature allows Azure AD users to create logins in the master database for MI, grant MI server-level permissions for these logins, and create Azure AD users with logins for individual MI databases.

Additionally, enabling Azure AD logins allows users to execute many MI features supported for SQL logins (see the documentation at the end of this blog).

Read on to learn more about this feature.
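The T-SQL shapes involved are worth seeing. As a hedged sketch of the flow, driven from Python via pyodbc (the instance name and principals below are made up):

```python
# A hedged sketch: the documented T-SQL, driven from Python via pyodbc.
# The instance name and principals below are made-up placeholders.
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=mi-demo.<dns-zone>.database.windows.net;"
    "Database=master;"
    "Authentication=ActiveDirectoryInteractive;",
    autocommit=True,
)
cur = conn.cursor()

# 1. Create a server-level login for an Azure AD principal in master.
cur.execute("CREATE LOGIN [dba@contoso.com] FROM EXTERNAL PROVIDER;")

# 2. Grant server-level permissions by adding the login to a server role.
cur.execute("ALTER SERVER ROLE [sysadmin] ADD MEMBER [dba@contoso.com];")

# 3. In an individual database, create a user mapped to that login.
cur.execute(
    "USE [SalesDb]; CREATE USER [dba@contoso.com] FROM LOGIN [dba@contoso.com];"
)
```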


Using Lenses and GitOps to Migrate Kafka to HDInsight

Andrew Stevenson takes us through migrating from a self-managed Kafka cluster to HDInsight using Lenses and GitOps:

Let’s dig deeper with an example. I have a self-managed Kafka cluster and I want to migrate to HDInsight Kafka.

First, we will concentrate on topics. I may have thousands of topics. How do I ensure that the configuration (the metadata) is migrated efficiently?

I could do this manually, but that is error-prone, time-consuming, and, importantly, lacks governance and auditing. A better approach would be to automate this, which is what we can achieve with Lenses and a GitOps approach.

Click through to see how to automate this.
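Lenses handles this declaratively through Git-backed configuration. Purely to illustrate the underlying idea of treating topic metadata as version-controlled text (this is not the Lenses tooling, and the broker address is a placeholder), a rough Python sketch with confluent-kafka might be:

```python
# A rough sketch of the GitOps idea, not the Lenses tooling: export topic
# metadata from the source cluster as text you could commit to Git.
# The broker address is a placeholder.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "source-broker:9092"})

metadata = admin.list_topics(timeout=10)
for name in sorted(metadata.topics):
    if name.startswith("__"):  # skip internal topics
        continue
    resource = ConfigResource(ConfigResource.Type.TOPIC, name)
    configs = admin.describe_configs([resource])[resource].result()
    print(f"topic: {name}")
    print(f"  partitions: {len(metadata.topics[name].partitions)}")
    for key in ("retention.ms", "cleanup.policy", "min.insync.replicas"):
        print(f"  {key}: {configs[key].value}")
```

Committing output like this to Git gives you the audit trail and review process the manual approach lacks.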


Embedding SSIS Packages in Azure Data Factory Pipelines

Andy Leonard shows us how to embed an SSIS package inside Azure Data Factory pipelines:

The Azure-SSIS Team has done it again; they’ve added more cool SSIS execution functionality to Azure Data Factory!

Click through to see what has Andy excited. I think this is a big thing for ADF as well, especially in shops that have dedicated a lot of time and energy to building SSIS packages for ETL work over the years.
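The functionality in question centers on ADF's Execute SSIS Package activity. As a rough sketch of its JSON shape, written here as a Python dict (the activity name, package path, and IR name are placeholder assumptions):

```python
# A rough sketch of the Execute SSIS Package activity's JSON shape,
# written as a Python dict; names and the package path are placeholders.
execute_ssis_activity = {
    "name": "Run my package",
    "type": "ExecuteSSISPackage",
    "typeProperties": {
        "packageLocation": {
            "type": "SSISDB",
            "packagePath": "MyFolder/MyProject/MyPackage.dtsx",
        },
        "loggingLevel": "Basic",
        "connectVia": {
            "referenceName": "Azure-SSIS-IR",  # your Azure-SSIS IR's name
            "type": "IntegrationRuntimeReference",
        },
    },
}
```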


Offset and Limit with Cosmos DB

Hasan Savran takes us through the OFFSET and LIMIT clauses in Cosmos DB:

The OFFSET LIMIT clause is one of the latest additions to Azure Cosmos DB. Skip/Take functionality was a big request from users, and the Cosmos DB team listened and delivered it. If you think Cosmos DB is missing a feature or you have a new idea, you can use the Feedback Forums to give feedback to the Cosmos DB team.

The OFFSET LIMIT clause lets you skip x results and then take y values from the query. The counts for OFFSET and LIMIT are integers, and both are required. In other words, you must use LIMIT if you use OFFSET.

A common use for this is paging. I’d be interested to see if this shares the issues that the SQL Server version has: you may only return 20 rows, but you’re potentially scanning N + 20 each time.
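A hedged paging sketch with the azure-cosmos Python SDK; the account, key, database, container, and property names are all made up:

```python
# A hedged paging sketch with the azure-cosmos SDK; the account, key,
# database, container, and property names are all placeholders.
from azure.cosmos import CosmosClient

client = CosmosClient("https://<account>.documents.azure.com:443/",
                      credential="<key>")
container = client.get_database_client("mydb").get_container_client("events")

page_size, page_number = 20, 3
query = (
    "SELECT c.id, c.CreatedOn FROM c ORDER BY c.CreatedOn DESC "
    f"OFFSET {(page_number - 1) * page_size} LIMIT {page_size}"
)

# Note: skipped rows are still scanned, so deeper pages cost more RUs.
for item in container.query_items(query=query, enable_cross_partition_query=True):
    print(item["id"], item["CreatedOn"])
```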


Incremental Data Migration to Blob Storage

Ginger Daniel has started a series on data migration into Azure Blob Storage:

Part 1 of this article demonstrates how to upload multiple tables from an on-premises SQL Server to an Azure Blob Storage account as CSV files. I covered these basic steps to get data from one place to the other using Azure Data Factory; however, there are many alternative ways to accomplish this, and many details in these steps that were not covered. For a deep dive into the details you can start here: https://docs.microsoft.com/en-us/azure/data-factory/introduction and https://docs.microsoft.com/en-us/azure/data-factory/quickstart-create-data-factory-portal#create-a-pipeline

Part 1 was chock-full of information, and it looks like Part 2 will be as well.
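Stripped of ADF, the core pattern is just "query a table, write a CSV, upload a blob." A hedged Python sketch of that pattern (connection strings, table, and container names are placeholders, and this is not Ginger's ADF-based approach):

```python
# A bare-bones sketch of the pattern outside ADF; connection strings,
# table, and container names are placeholders.
import csv
import io

import pyodbc
from azure.storage.blob import BlobServiceClient

sql = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=onprem-sql;Database=Sales;Trusted_Connection=yes;"
)
cursor = sql.cursor()
cursor.execute("SELECT * FROM dbo.Customers")

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow([column[0] for column in cursor.description])  # header row
writer.writerows(cursor.fetchall())

blobs = BlobServiceClient.from_connection_string("<storage-connection-string>")
blobs.get_blob_client(container="landing", blob="dbo.Customers.csv").upload_blob(
    buffer.getvalue(), overwrite=True
)
```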


Azure AD Credential Passthrough and Databricks

Anna Shrestinian, et al., explain how Azure Databricks enables Azure Active Directory credential passthrough when working with Azure Data Lake Storage Gen2:

Azure Data Lake Storage (ADLS) Gen2, which became generally available earlier this year, is quickly becoming the standard for data storage in Azure for analytics consumption. ADLS Gen2 enables a hierarchical file system that extends Azure Blob Storage capabilities and provides enhanced manageability, security and performance.

The hierarchical file system provides granular access control to ADLS Gen2. Role-based access control (RBAC) can be used to grant role assignments to top-level resources, and POSIX-compliant access control lists (ACLs) allow for finer-grained permissions at the folder and file level. These features allow users to securely access their data within Azure Databricks using the Azure Blob File System (ABFS) driver, which is built into the Databricks Runtime.

There are some tradeoffs involved, particularly around using High Concurrency clusters (or limiting yourself to one user account), but it’s a nice bit of added value when you’re a heavy Azure user.
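Once passthrough is enabled on a cluster, notebook code needs no keys or service principals. A sketch, with a placeholder storage account, container, and path:

```python
# On a passthrough-enabled Databricks cluster, ABFS reads run under your
# own Azure AD identity; no keys in the notebook. The storage account,
# container, and path are placeholders. `spark` is the SparkSession
# Databricks provides in every notebook.
df = (
    spark.read.format("csv")
    .option("header", "true")
    .load("abfss://data@mystorageacct.dfs.core.windows.net/sales/2019/")
)
df.show(5)
```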


Ordering in Cosmos DB Queries

Hasan Savran shows how you can order data in Cosmos DB queries:

If you need to use multiple properties in your ORDER BY, then you need to define COMPOSITE INDEXES. For example, when I try to run the following query and order the objects by CreatedOn and Score, I end up with an error because I do not have a COMPOSITE INDEX to use with this ORDER BY.

Many parts of Cosmos DB’s SQL syntax are similar to T-SQL, but some of the underlying assumptions, such as what you need to order data, are quite different.
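For reference, the fix lives in the container's indexing policy. A hedged sketch of the composite-index fragment (property names follow Hasan's example; everything else is illustrative) and the ORDER BY it enables:

```python
# A sketch of the composite-index fragment a two-property ORDER BY needs.
# Property names follow Hasan's example; everything else is illustrative.
indexing_policy_fragment = {
    "compositeIndexes": [
        [
            {"path": "/CreatedOn", "order": "descending"},
            {"path": "/Score", "order": "descending"},
        ]
    ]
}

# With that fragment merged into the container's indexing policy, this
# query succeeds; without it, Cosmos DB raises an error instead of sorting.
query = "SELECT c.id FROM c ORDER BY c.CreatedOn DESC, c.Score DESC"
```

Note that the sort directions in the query have to match the composite index (or its exact reverse).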
