Press "Enter" to skip to content

Category: Cloud

Building an ARM Template for Azure Data Factory

Andy Leonard takes the first steps to building an Azure Data Factory pipeline using Azure Resource Manager Templates:

Azure Resource Manager, or ARM, “allows you to provision your applications using a declarative template.” So says the Azure Quickstart Templates page. ARM templates are JSON and allow administrators to import and export Azure resources using varying management patterns. I really like ARM templates for implementing infrastructure as code in Azure. In this post I show a very simple example of how to use ARM templates to export and then import a basic ADF (Azure Data Factory) pipeline.

The sample code doesn’t do that much by itself, but it does open up a new world of automation.
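If you’d rather script the round trip than click through the portal, here is a rough sketch using the azure-mgmt-resource Python SDK. This is not Andy’s code: the subscription, resource group, and deployment names are placeholders, and the exact method names depend on your SDK version.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

credential = DefaultAzureCredential()
client = ResourceManagementClient(credential, "<subscription-id>")

# Export the resource group containing the Data Factory as an ARM template.
export = client.resource_groups.begin_export_template(
    "rg-adf-demo",
    {"resources": ["*"], "options": "IncludeParameterDefaultValue"},
).result()

# Re-deploy the exported template into another resource group.
client.deployments.begin_create_or_update(
    "rg-adf-demo-copy",
    "adf-pipeline-import",
    {"properties": {"mode": "Incremental", "template": export.template, "parameters": {}}},
).result()
```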

Comments closed

Time Series Modeling with Gluon

Jan Gasthaus, et al, announce a new open source product release:

We are excited to announce the open source release of Gluon Time Series (GluonTS), a Python toolkit developed by Amazon scientists for building, evaluating, and comparing deep learning–based time series models. GluonTS is based on the Gluon interface to Apache MXNet and provides components that make building time series models simple and efficient.

In this post, I describe the key functionality of the toolkit and demonstrate how to apply GluonTS to a time series forecasting problem.

It looks interesting.
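If you want to kick the tires, the quickstart boils down to a few lines. This is a hedged sketch against the toolkit’s early API (module paths have moved around in later releases), using one of the bundled sample datasets:

```python
from gluonts.dataset.repository.datasets import get_dataset
from gluonts.model.deepar import DeepAREstimator
from gluonts.trainer import Trainer

# Grab a bundled benchmark dataset and train a DeepAR model on it.
dataset = get_dataset("m4_hourly")
estimator = DeepAREstimator(
    freq=dataset.metadata.freq,
    prediction_length=dataset.metadata.prediction_length,
    trainer=Trainer(epochs=5),
)
predictor = estimator.train(dataset.train)

# Generate probabilistic forecasts for the test series.
forecasts = list(predictor.predict(dataset.test))
print(forecasts[0].mean)
```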

Comments closed

An Introduction to Azure Databricks

Brad Llewellyn has an introduction to Azure Databricks:

So, what is Azure Databricks?  To answer this question, let’s start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

It’s Hadoop turtles all the way down.
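Once you get past the turtles, the day-to-day experience is mostly just Spark. Here is a tiny PySpark sketch of the sort of thing you would run in a Databricks notebook; the file path is a placeholder, and in a notebook the spark session is already created for you:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In Databricks the `spark` session already exists; building one here keeps the
# snippet runnable outside a notebook as well.
spark = SparkSession.builder.appName("demo").getOrCreate()

# Placeholder path: point this at a mounted storage location or sample dataset.
sales = spark.read.csv("/mnt/demo/sales.csv", header=True, inferSchema=True)

sales.groupBy("region").agg(F.sum("amount").alias("total_amount")).show()
```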

Comments closed

Using Notebooks with ElasticMapReduce

Vignesh Rajamani and Nikki Rouda show off ElasticMapReduce Notebooks:

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a single node. This gives you a budget-friendly sandbox environment to develop your Spark application.

Notebooks are everywhere. And for good reason.
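Because each notebook gets its own Spark context once it is attached to a cluster, a cell can be as simple as the sketch below. The S3 path is a placeholder, and spark is injected by the PySpark kernel rather than created by you:

```python
# Runs inside an EMR notebook attached to a cluster; `spark` is provided by the kernel.
events = spark.read.json("s3://my-bucket/events/2019/06/")  # placeholder path
events.createOrReplaceTempView("events")

spark.sql("""
    SELECT event_type, COUNT(*) AS event_count
    FROM events
    GROUP BY event_type
    ORDER BY event_count DESC
""").show()
```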

Comments closed

Unique Key Constraints in Cosmos DB

Hasan Savran shows how you can set unique key constraints on Cosmos DB containers:

Unique key names are case-sensitive; I know this from experience. If your unique key is in lowercase letters but your data has the field in uppercase, Cosmos DB will insert a null value into the unique key the first time, and you will get an error the second time when it tries to insert null again. Cosmos DB does not support sparse unique keys. If your unique key is /SSN, you can have only one null value in this field.

If you want to use unique keys in Azure Cosmos DB, you have to define them when you create your containers. You cannot add a unique key to an existing container; the only way is to create a new container and move your data from the old container to the new one. Also, just like partition keys, unique keys cannot be updated. Picking the wrong unique key can be an expensive error.

Looks like you’ll need to have a bit of foresight when choosing keys (or choosing not to use keys).
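To see what that foresight looks like in code, here is a hedged sketch with the azure-cosmos Python SDK; the account URI, key, and names are placeholders. The unique key policy can only be supplied when the container is created:

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
database = client.create_database_if_not_exists("demo")

# The unique key path is case-sensitive: "/SSN" only constrains a field named exactly SSN.
container = database.create_container(
    id="People",
    partition_key=PartitionKey(path="/country"),
    unique_key_policy={"uniqueKeys": [{"paths": ["/SSN"]}]},
)
```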

Comments closed

Data Classifications on Azure SQL DW

Meagan Longoria takes us through data classifications on Azure SQL Data Warehouse:

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.

But read the whole thing, as Meagan sees a problem with it when you use a popular loading technique.
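If you would rather script the labels than click through the portal, classifications can also be applied in T-SQL with ADD SENSITIVITY CLASSIFICATION. Here is a hedged sketch via pyodbc; the connection string, table, and column are placeholders, and you should confirm the statement is supported on your SQL DW instance:

```python
import pyodbc

conn = pyodbc.connect(
    "Driver={ODBC Driver 17 for SQL Server};"
    "Server=<server>.database.windows.net;Database=<dw>;UID=<user>;PWD=<password>"
)

# Label a column with an information type and sensitivity level.
conn.execute("""
    ADD SENSITIVITY CLASSIFICATION TO dbo.Customer.EmailAddress
    WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');
""")
conn.commit()
```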

Comments closed

Populating a Data Vault Model with Azure Data Factory

Rayis Imayev gives us an example of ELT into a Data Vault model using Azure Data Factory:

To make a full transition from the existing  DW model to an alternative Data Vault I removed all Surrogate Keys and other attributes that are only necessary to support Kimball data warehouse methodology. Also, I needed to add necessary Hash keys to all my Hub, Link and Satellite tables. The target environment for my Data Vault would be SQL Azure database and I decided to use a built-in crc32 function of the Mapping Data Flow to calculate hash keys (HK) of my business data sourcing keys and composite hash keys of satellite tables attributes (HDIFF).

Data Vault is somewhere on my list of things to learn. It’s not at the top of the list, but that’s not a slight against it.
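As a rough illustration of the hash key idea (this is not Rayis’s Mapping Data Flow expression), here is what CRC32-style keys look like in plain Python: one hash over the business key for the hub, and one over the descriptive attributes for the satellite’s HDIFF column:

```python
import zlib

def crc32_key(*parts):
    # Normalize each part (trim, upper-case), join with a delimiter, then CRC32,
    # analogous in spirit to the crc32() expression in Mapping Data Flows.
    normalized = "||".join(str(p).strip().upper() for p in parts)
    return zlib.crc32(normalized.encode("utf-8"))

customer_hk = crc32_key("CUST-00042")                     # hub hash key
customer_hdiff = crc32_key("Jane Doe", "Seattle", "WA")   # satellite attribute hash
print(customer_hk, customer_hdiff)
```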

Comments closed

Auditing Azure Analysis Services

Kasper de Jonge shows how you can audit an Azure Analysis Services cube:

So the question was: how can I see who connected to my Azure AS database and what queries were sent? Initially I thought of ways I used to do this in the on-premises world: capture profiler traces or XEvents by writing code and then store them somewhere for processing. It looks like I was not alone in this; even the AS team itself had ways to capture XEvents and store them: https://azure.microsoft.com/en-us/blog/using-xevents-with-azure-analysis-services/

But it turns out it is much smoother, simpler, and more elegant to leverage Azure’s own products. In this case we will be using Azure Log Analytics. It is already documented in the official documentation here.

Click through for a demo.
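Once the diagnostics are flowing into Log Analytics, you can also pull them back out programmatically. Here is a hedged sketch with the azure-monitor-query SDK; the workspace ID is a placeholder and the AzureDiagnostics column names are assumptions, so check your own workspace’s schema first:

```python
from datetime import timedelta

from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Column names below are assumptions; inspect your AzureDiagnostics schema first.
query = """
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.ANALYSISSERVICES"
| project TimeGenerated, OperationName, TextData_s
| take 50
"""

response = client.query_workspace("<workspace-id>", query, timespan=timedelta(days=1))
for table in response.tables:
    for row in table.rows:
        print(row)
```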

Comments closed

Amazon Redshift ETL Tips

The Blendo team shares a few tips around ETL’ing data to Amazon Redshift:

2. The WLM Method
Use Amazon Redshift’s WLM (workload management) for defining a dedicated queue for the ETL process. Configuring the ETL queue with a small number of slots will help in avoiding excessive COMMITs. Also, avoid COMMITing separately for each transaction since commits are expensive.
Instead, surround multiple steps of the ETL process by a BEGIN…END statement. You can perform COMMIT only after all transformation logic is executed.

Click through for the set of tips.
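The single-COMMIT advice is easy to picture in code. Here is a hedged sketch of the pattern via psycopg2; the cluster endpoint, IAM role, and table names are placeholders, and the routing to a dedicated ETL queue happens in your cluster’s WLM configuration rather than in this script:

```python
import psycopg2

conn = psycopg2.connect(
    host="<cluster>.redshift.amazonaws.com", port=5439,
    dbname="analytics", user="<user>", password="<password>",
)
cur = conn.cursor()

# psycopg2 opens a transaction on the first statement and nothing is committed
# until conn.commit(), so the whole load and transform runs under a single COMMIT.
cur.execute("COPY staging_orders FROM 's3://my-bucket/orders/' IAM_ROLE '<role-arn>' CSV;")
cur.execute("DELETE FROM orders USING staging_orders WHERE orders.order_id = staging_orders.order_id;")
cur.execute("INSERT INTO orders SELECT * FROM staging_orders;")

conn.commit()  # the only COMMIT in this ETL step
cur.close()
conn.close()
```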

Comments closed

SQL Server Settings Blade in Azure

Dave Bermingham notes a recent change to the Azure Portal when creating a new VM with SQL Server pre-installed:

As you slide the IOPS slider to the right you will see the number of data disks increase, the Storage Size increase, and the Throughput increase. You will be limited to the max number of IOPS and disks supported by that instance size. You see in the screenshot below I am able to go as high as 80,000 IOPS when provisioning storage for a Standard E64-16s_v3 instance.

It sounds like they did a pretty good job of things there.

Comments closed