Press "Enter" to skip to content

Category: Cloud

Deleting in Azure Data Factory

Meagan Longoria is happy that Azure Data Factory v2 now has a Delete activity:

It is a common practice to load data to blob storage or data lake storage before loading to a database, especially if your data is coming from outside of Azure. We often create a staging area in our data lakes to hold data until it has been loaded to its next destination. Then we delete the data in the staging area once our subsequent load is successful. But before February 2019, there was no Delete activity. We had to write an Azure Function or use a Logic App called by a Web Activity in order to delete a file. I imagine every person who started working with Data Factory had to go and look this up.

But now Data Factory V2 has a Delete activity.

Meagan shows how it works, what kinds of parameters you can set, and a couple of gotchas, so check it out.

Comments closed

Querying CosmosDB with PolyBase

Hasan Savran shows how you can query CosmosDB with T-SQL statements:

SQL Server Polybase Services lets us to pull data from other data resources by using T-SQL queries. SQL Server 2019 introduces new connectors for Polybase services like Oracle, Teradata and MongoDB. In one of the SQL Server 2019 presentations from Bob Ward, I saw the CosmosDB logo when he was talking about the new connectors of SQL Server 2019. CosmosDB already has an ODBC driver and you can use it as a datasource for your Power BI, SSIS or SSMS. SQL Server 2019 makes this connection easier by using the Polybase services.
      
      In this post, I am going to show you how to configure SQL Server 2019 Polybase to connect Azure CosmosDB Mongo databases

I have a bit of a vested interest in this, so it is heartening to see people trying out PolyBase. It’s more heartening when they use it after they try it out.

Comments closed

Restoring Databases From Azure

John Morehouse shows how we can restore a database from Azure Blob Storage:

So how do you restore from Azure storage? You do so from an URL.  Let’s take a look!

When you backup a database to Azure, there are two types of blobs that can be utilized, namely page and block blobs.   Due to price and flexibly, it is recommended to use block blobs.  However, depending on which type you used to perform the backup will dictate how the restores are performed.  Both methods require the use a credential, so that information will need to be known before being able to restore from Azure.

Click through for examples using both page blobs and block blobs.

Comments closed

Azure Data Factory Data Flows

Marlon Ribunal shows how we can perform some amount of data transformation in an Azure Data Factory V2 data flow:

Azure Data Factory (ADF) offers a convenient cloud-based platform for orchestrating data from and to on-premise, on-cloud, and hybrid sources and destinations. But it is not a full Extract, Transform, and Load (ETL) tool. For those who are well-versed with SQL Server Integration Services (SSIS), ADF would be the Control Flow portion.

You can scale out your SSIS implementation in Azure. In fact, there are two (2) options to do this: SSIS On-Premise using the SSIS runtime hosted by SQL Server or On Azure using the Azure-SSIS Integration Runtime.

Azure Data Factory is not quite an ETL tool as SSIS is. There is that transformation gap that needs to be filled for ADF to become a true On-Cloud ETL Tool. The second iteration of ADF in V2 is closing the transformation gap with the introduction of Data Flow.

Despite it not being nearly as complete as SSIS, there are useful data transformations available in Azure Data Factory, as Marlon shows.

Comments closed

Automating Azure Storage To Move Between Tiers

Ryan Adams built a process to save money on storage costs for a customer’s test environment:

One of the best things about Azure, and the cloud in general, is we can automate most anything, and we are going to look at how to automate Azure VM Storage.  This allows us to come up with some outside-of-the-box solutions.  I had a customer with a road block that we were able to work around by automating some things with their Azure Virtual Machines.

Their challenge was that they wanted to move their test and development environments to Azure, but the storage cost was prohibitive.  They needed premium storage to mimic their production environment, but it was not financially viable for test and development so they were going to keep it all on premises.  During our conversations I learned that they only test between 8am and 5pm, Monday through Friday.  My suggestion was that we put their databases on cheaper storage during off times and only premium when they are actively using it.

This doesn’t look like a one-hour task but if you’re in need of some cost savings on storage in non-production environments, check out Ryan’s scripts.

Comments closed

Improving Spark Auto-Scaling On ElasticMapReduce

Udit Mehrotra explains some of the ways Amazon ElasticMapReduce reduces the pain of node loss in Spark jobs:

The Automatic Scaling feature in Amazon EMR lets customers dynamically scale clusters in and out, based on cluster usage or other job-related metrics. These features help you use resources efficiently, but they can also cause EC2 instances to shut down in the middle of a running job. This could result in the loss of computation and data, which can affect the stability of the job or result in duplicate work through recomputing.

To gracefully shut down nodes without affecting running jobs, Amazon EMR uses Apache Hadoop‘s decommissioning mechanism, which the Amazon EMR team developed and contributed back to the community. This works well for most Hadoop workloads, but not so much for Apache Spark. Spark currently faces various shortcomings while dealing with node loss. This can cause jobs to get stuck trying to recover and recompute lost tasks and data, and in some cases eventually crashing the job. 

Auto-scaling doesn’t always mean scaling up.

Comments closed

Learning More About Azure Data Lake Storage Gen2

Melissa Coates shares a compendium of links pertaining to Azure Data Lake Storage Generation 2:

A couple of people have asked me recently about how to ‘bone up’ on the new data lake service in Azure. The way I see it, there are two aspects: A, the technology itself and B, data lake principles and architectural best practices. Below are some links to resources that you should find helpful.

There’s a lot of good stuff there.

Comments closed

Testing Cosmos DB’s REST API

Hasan Savran shows how we can test Cosmos DB’s REST API using Postman:

        You have many options to access to CosmosDB. Rest API is one of these options and it is the low level access way to Cosmos DB. You can customize all options of CosmosDB by using REST API. To customize the calls, and pass the required authorization information, you need to use http headers. There are many headers you can set depending on the operation you want to run in CosmosDB.  I am going to cover only the required headers here.

      In the following example, I am going to try to create a database in CosmosDB emulator by using the REST API. First let’s look at the required header fields for this request. These requirement applies to all other REST API calls too.

It’s a little more complicated than just posting to a URL and Hasan has you covered.

Comments closed

Azure Data Lake Store Gen2

James Serra gives us the low-down on Azure Data Lake Store Gen2 now that it is generally available:

When to use Blob vs ADLS Gen2
New analytics projects should use ADLS Gen2, and current Blob storage should be converted to ADLS Gen2, unless these are non-analytical use cases that only need object storage rather than hierarchical storage (i.e. video, images, backup files), in which case you can use Blob Storage and save a bit of money on transaction costs (storage costs will be the same between Blob and ADLS Gen2 but transaction costs will be a bit higher for ADLS Gen2 due to the overhead of namespaces).

Looks like there are still some things missing from Gen2, so don’t automatically jump on an upgrade. Read the documentation first to make sure you aren’t relying on something which isn’t there yet.

Comments closed

Combining Stream Analytics And Azure ML With Power BI

Brad Llewellyn shows us how to feed Azure ML predictions into Power BI via Azure Stream Analytics:

Today, we’re going to talk about combining Stream Analytics with Azure Machine Learning Studio within Power BI.  If you haven’t read the earlier posts in this series, IntroductionGetting Started with R ScriptsClusteringTime Series DecompositionForecastingCorrelationsCustom R VisualsR Scripts in Query EditorPythonAzure Machine Learning Studio and Stream Analytics, they may provide some useful context.  You can find the files from this post in our GitHub Repository.  Let’s move on to the core of this post, Stream Analytics.

This post is going to build directly on what we created in the two previous posts, Azure Machine Learning Studio and Stream Analytics.  As such, we recommend that you read them before proceeding.

Read on for the demo.

Comments closed