Press "Enter" to skip to content

Category: Cloud

An Introduction to Azure Databricks

Brad Llewellyn has an introduction to Azure Databricks:

So, what is Azure Databricks?  To answer this question, let’s start all the way at the bottom of the hole and climb up.  So, what is Hadoop?  Apache Hadoop is an open-source, distributed storage and computing ecosystem designed to handle incredibly large volumes of data and complex transformations.  It is becoming more common as organizations are starting to integrate massive data sources, such as social media, financial transactions and the Internet of Things.  However, Hadoop solutions are extremely complex to manage and develop.  So, many people have worked together to create platforms that layer on top of Hadoop to provide a simpler way to solve certain types of problems.  Apache Spark is one of these platforms.  You can read more about Apache Hadoop here and here.

It’s Hadoop turtles all the way down.

Using Notebooks with ElasticMapReduce

Vignesh Rajamani and Nikki Rouda show off ElasticMapReduce Notebooks:

One of the useful features of EMR Notebooks is the separation of the notebook environment from your underlying cluster infrastructure. The separation makes it easy for you to execute notebook code against transient clusters without worrying about deploying or configuring your notebook infrastructure every time you bring up a new cluster. You can create multiple serverless notebooks from the AWS Management Console for EMR and access the notebook UI without spending time setting up SSH access or configuring your browser for port-forwarding. Each notebook you create is launched instantly with its own Spark context. This capability enables you to attach multiple notebooks to a single shared cluster and submit parallel jobs without fear of job conflicts in a multi-tenant environment. This way you make efficient use of your clusters.

You can also connect EMR Notebooks to an EMR cluster as small as a single node. This gives you a budget-friendly sandbox environment to develop your Spark applications.
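
To give a flavor of what the excerpt describes, here is the sort of Spark SQL you could run from an EMR notebook attached to even a single-node cluster. The table name and S3 path are made up for illustration.

-- Register a table over data already sitting in S3 (hypothetical bucket and path)
CREATE TABLE IF NOT EXISTS web_logs (
    event_time TIMESTAMP,
    user_id    STRING,
    page_url   STRING
)
USING PARQUET
LOCATION 's3://my-example-bucket/web-logs/';

-- Each notebook runs with its own Spark context, so this query is isolated
-- from whatever other notebooks attached to the same cluster are doing.
SELECT date_trunc('HOUR', event_time) AS event_hour,
       COUNT(*)                       AS hits
FROM web_logs
GROUP BY date_trunc('HOUR', event_time)
ORDER BY event_hour;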

Notebooks are everywhere. And for good reason.

Unique Key Constraints in Cosmos DB

Hasan Savran shows how you can set unique key constraints on Cosmos DB containers:

Unique key names are case-sensitive; I have first-hand experience with this. If your unique key is defined in lowercase letters but your data has the field in uppercase, CosmosDB will insert a null value into the unique key the first time, and you will get an error the second time when it tries to insert null again. CosmosDB does not support sparse unique keys: if your unique key is /SSN, you can have only one null value in that field.

If you want to use unique keys in Azure CosmosDB, you have to define them when you create your containers. You cannot add a unique key to an existing container; the only way to add one is to create a new container and move your data from the old container to the new one. Also, just as with partition keys, you cannot update unique keys. Picking the wrong unique key can be an expensive error.

Looks like you’ll need to have a bit of foresight when choosing keys (or choosing not to use keys).

Data Classifications on Azure SQL DW

Meagan Longoria takes us through data classifications on Azure SQL Data Warehouse:

Data classifications in Azure SQL DW entered public preview in March 2019. They allow you to label columns in your data warehouse with their information type and sensitivity level. There are built-in classifications, but you can also add custom classifications. This could be an important feature for auditing your storage and use of sensitive data as well as compliance with data regulations such as GDPR. You can export a report of all labeled columns, and you can see who is querying sensitive columns in your audit logs. The Azure Portal will even recommend classifications based upon your column names and data types. You can add the recommended classifications with a simple click of a button.
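
The portal isn't the only way in, by the way. At least in Azure SQL Database (and, as far as I know, in the SQL DW preview as well), you can apply and inspect classifications with T-SQL. A rough sketch against a made-up table and column:

-- Label a column with an information type and sensitivity level (hypothetical table/column)
ADD SENSITIVITY CLASSIFICATION TO dbo.DimCustomer.EmailAddress
WITH (LABEL = 'Confidential', INFORMATION_TYPE = 'Contact Info');

-- List every labeled column in the database
SELECT o.name AS table_name,
       c.name AS column_name,
       sc.label,
       sc.information_type
FROM sys.sensitivity_classifications AS sc
JOIN sys.objects AS o
    ON o.object_id = sc.major_id
JOIN sys.columns AS c
    ON c.object_id = sc.major_id
   AND c.column_id = sc.minor_id;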

But read the whole thing, as Meagan sees a problem with it when you use a popular loading technique.

Populating a Data Vault Model with Azure Data Factory

Rayis Imayev gives us an example of ELT into a Data Vault model using Azure Data Factory:

To make a full transition from the existing DW model to an alternative Data Vault, I removed all Surrogate Keys and other attributes that are only necessary to support the Kimball data warehouse methodology. Also, I needed to add the necessary Hash keys to all my Hub, Link, and Satellite tables. The target environment for my Data Vault would be an Azure SQL Database, and I decided to use the built-in crc32 function of the Mapping Data Flow to calculate hash keys (HK) of my business data source keys and composite hash keys of satellite table attributes (HDIFF).
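
Rayis does this with the Mapping Data Flow's crc32 function. If you were doing the same work in T-SQL instead, an analogous (not equivalent) sketch with HASHBYTES and entirely made-up staging columns might look like this:

-- Hub hash key (HK) from the business key, plus a hash diff (HDIFF) over the
-- satellite attributes. This is not the crc32 data flow approach from the post,
-- just an analogous T-SQL illustration with hypothetical names.
SELECT
    CONVERT(CHAR(64), HASHBYTES('SHA2_256',
        UPPER(LTRIM(RTRIM(CustomerCode)))), 2)             AS HK_Customer,
    CONVERT(CHAR(64), HASHBYTES('SHA2_256',
        CONCAT_WS('||', CustomerName, City, Country)), 2)  AS HDIFF_Customer,
    CustomerCode,
    CustomerName,
    City,
    Country
FROM staging.Customer;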

Data Vault is somewhere on my list of things to learn. It’s not at the top of the list, but that’s not a slight against it.

Auditing Azure Analysis Services

Kasper de Jonge shows how you can audit an Azure Analysis Services cube:

So the question was: how can I see who connected to my Azure AS database and what queries were sent? Initially I thought of the ways I used to do this in the on-premises world: capture profiler traces or XEvents by writing code and then store them somewhere for processing. It looks like I was not alone in this; even the AS team itself had ways to capture XEvents and store them: https://azure.microsoft.com/en-us/blog/using-xevents-with-azure-analysis-services/

But it turns out it is much smoother, simpler, and more elegant to leverage Azure's own products. In this case we will be using Azure Log Analytics. It is already documented in the official documentation here.

Click through for a demo.

Amazon Redshift ETL Tips

The Blendo team shares a few tips around ETL’ing data to Amazon Redshift:

2. The WLM Method
Use Amazon Redshift’s WLM (workload management) to define a dedicated queue for the ETL process. Configuring the ETL queue with a small number of slots will help avoid excessive COMMITs. Also, avoid COMMITing separately for each transaction, since commits are expensive.
Instead, surround multiple steps of the ETL process with a BEGIN…END statement and perform the COMMIT only after all transformation logic has executed.
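
A bare-bones sketch of that single-transaction pattern, with made-up staging and target tables, would be something like:

-- One transaction around the whole load, so there is a single commit at the end
BEGIN;

    -- Stage the incoming rows (hypothetical tables)
    CREATE TEMP TABLE stage_sales (LIKE sales);
    INSERT INTO stage_sales SELECT * FROM landing_sales;

    -- Upsert: remove the rows being replaced, then insert the new versions
    DELETE FROM sales USING stage_sales
    WHERE sales.sale_id = stage_sales.sale_id;

    INSERT INTO sales SELECT * FROM stage_sales;

END;  -- END closes the transaction, committing all of the steps together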

Click through for the set of tips.

SQL Server Settings Blade in Azure

Dave Bermingham notes a recent change to the Azure Portal when creating a new VM with SQL Server pre-installed:

As you slide the IOPS slider to the right you will see the number of data disks increase, the Storage Size increase, and the Throughput increase. You will be limited to the max number of IOPS and disks supported by that instance size. You see in the screenshot below I am able to go as high as 80,000 IOPS when provisioning storage for a Standard E64-16s_v3 instance.

It sounds like they did a pretty good job of things there.

Quick Hits on Managed Instance Backup / Restore

Jovan Popovic has some pieces of advice for backing up and restoring databases on Azure SQL Managed Instances:

Managed Instance takes automatic backups (full backups every week, differential every 12 hours, and log backups every 5-10 minutes) that you can use to restore a database to a point in time within the retention period, or to restore an accidentally deleted database. For more information, see Automated backups. Managed Instance also enables you to restore a database from a backup file placed in Azure Blob Storage and to back up a database to Azure Blob Storage. Managed Instance currently doesn't support backup retention longer than 35 days, but you can use backups to blob storage as an alternative.

If you are experiencing some issues with any backup or restore operation, the following troubleshooting steps might help you to identify the issue.
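
For reference, the T-SQL involved is short. A rough sketch with a made-up storage account, container, and database (the SAS token is elided):

-- Credential named after the container URL (hypothetical storage account/container)
CREATE CREDENTIAL [https://mystorageaccount.blob.core.windows.net/backups]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token>';

-- Managed Instance backups to URL must be COPY_ONLY
BACKUP DATABASE WideWorldImporters
TO URL = 'https://mystorageaccount.blob.core.windows.net/backups/wwi.bak'
WITH COPY_ONLY;

-- Native restore from a backup file sitting in blob storage
RESTORE DATABASE WideWorldImporters
FROM URL = 'https://mystorageaccount.blob.core.windows.net/backups/wwi.bak';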

Click through for those hints.

Azure SQL Database and Extended Events

Dave Bland shows how to set up and read an extended event file on Azure SQL Database:

The first step when using T-SQL to read Extended Event files that are stored in an Azure Storage Account is to create a database credential.  Of course, the credential will provide essential security information to connect to the Azure Storage Account.  The first data point you will need is the URL to a blob storage container in your storage account.  If you look below, you can see where you would place your storage account name and the blob storage container name.
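
As a rough sketch of what that looks like (the storage account, container, and file name are all made up, and the SAS token is elided):

-- A database master key is needed before creating a credential with a secret
CREATE MASTER KEY ENCRYPTION BY PASSWORD = '<strong password>';

-- The credential name must match the blob container URL
CREATE DATABASE SCOPED CREDENTIAL [https://mystorageaccount.blob.core.windows.net/xe-files]
WITH IDENTITY = 'SHARED ACCESS SIGNATURE',
     SECRET = '<SAS token, without the leading ?>';

-- Read the extended event file straight from blob storage
SELECT CAST(event_data AS XML) AS event_data
FROM sys.fn_xe_file_target_read_file(
    'https://mystorageaccount.blob.core.windows.net/xe-files/MySession_0_1234.xel',
    NULL, NULL, NULL);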

Dave gives us the grand tour of the configuration process, including where things differ between on-prem SQL Server and Azure SQL Database (which is quite a bit).
