Category: Cloud

Deploying a Big Data Cluster with Azure Data Studio

Mohammad Darab shows how you can deploy a Big Data Cluster to Azure Kubernetes Service using Azure Data Studio:

A few months ago I posted a blog on deploying a BDC using the built-in ADS notebook. This blog post will go a bit deeper into deploying a Big Data Cluster on AKS (Azure Kubernetes Service) using Azure Data Studio (version 1.13.0). In addition, I’ll go over the pros and cons and dive deeper into the reasons why I recommend going with AKS for your Big Data Cluster deployments.

AKS does make it pretty easy. The toughest part for me was figuring out which VM sizes were supported: I tried a few that would have saved me money, and they weren’t available. I do like that they added a check to view availability before completing the notebook; that wasn’t in the preview version.


New Features in Azure Synapse Analytics

James Serra gives us a bullet list of new features in Azure Synapse Analytics:

Almost lost in all the announcements from Ignite was a bunch of amazing new features that were added to the Provisioned Resources/SQL Pool (read: SQLDW functionality) side of Azure Synapse Analytics (formerly called Azure SQL Data Warehouse).

One of the more interesting options is ordered clustered columnstore indexes, which seems like something that would be nice to have on-prem. Segment elimination works on-prem today, but ordering is accidental at best: SQL Server happens to load data into a CCI roughly in the order in which you insert it, but that behavior is not guaranteed and could change in the future.
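
As a minimal sketch of the new option (the table, columns, and distribution key here are hypothetical), an ordered CCI is declared at table creation time in a Synapse SQL pool:

-- Hypothetical fact table in an Azure Synapse SQL pool.
-- The ORDER clause pre-sorts rows by SaleDate as they are loaded into
-- the clustered columnstore index, which makes segment elimination
-- effective for date-range predicates.
CREATE TABLE dbo.FactSales
(
    SaleDate date          NOT NULL,
    StoreId  int           NOT NULL,
    Amount   decimal(18,2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH (StoreId),
    CLUSTERED COLUMNSTORE INDEX ORDER (SaleDate)
);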


Errors with SQL Server TDE and Azure Key Vault

Amit Banerjee takes us through troubleshooting issues when using Azure Key Vault as the key storage mechanism for Transparent Data Encryption:

The first one was a 404 error. When I looked at the application event log, I saw the following error:

Operation: getKeyByName
Key Name: ContosoRSAKey0
Message: [error:112, info:404, state:0] The server responded 404, because the key name was not found. Please make sure the key name exists in your vault.

The simple reason for the above error is that I was using an incorrect key name or the key didn’t exist in my Azure Key Vault. So the remediation is to check if the key exists in your Azure Key Vault. If not, then create the key.
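
For context, here is a minimal sketch of the T-SQL that performs this key lookup, assuming the Azure Key Vault EKM cryptographic provider has already been registered; the provider name is an assumption, while ContosoRSAKey0 is the key name from the error above:

-- Assumes a cryptographic provider for Azure Key Vault has already been
-- registered under the (hypothetical) name AzureKeyVault_EKM_Provider.
-- PROVIDER_KEY_NAME must exactly match a key that exists in the vault;
-- otherwise the getKeyByName operation fails with the 404 shown above.
CREATE ASYMMETRIC KEY ContosoRSAKey0
FROM PROVIDER AzureKeyVault_EKM_Provider
WITH PROVIDER_KEY_NAME = 'ContosoRSAKey0',
     CREATION_DISPOSITION = OPEN_EXISTING;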

Read on for additional errors you might run into, as well as a link to an Azure Data Studio notebook to set this up yourself.


Notes on Wrangling Data Flows

Rayis Imayev calculates the distance between two geographical points in an Azure Data Factory Wrangling data flow:

Brian Donovan and Dan Work from the University of Illinois have pointed out that this dataset “contains a large number of errors. For example, there are several trips where the reported meter distances are significantly shorter than the straight-line distance, violating Euclidean geometry.” That triggered my interest in adding a column to this dataset with the straight-line distance between the pickup and dropoff geo-points, and that’s where I wanted Wrangling Data Flows to help me.
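
Rayis builds the calculation inside the Wrangling data flow itself (in Power Query M), but purely as an illustration of the same idea, here is a straight-line (geodesic) distance computed in T-SQL with the geography type, using made-up pickup and dropoff coordinates:

-- geography::Point takes (latitude, longitude, SRID); SRID 4326 is WGS 84.
-- STDistance returns the geodesic distance in meters for this SRID.
DECLARE @pickup  geography = geography::Point(40.7580, -73.9855, 4326);
DECLARE @dropoff geography = geography::Point(40.6413, -73.7781, 4326);
SELECT @pickup.STDistance(@dropoff) / 1000.0 AS StraightLineKm;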

Read on for Rayis’s demonstration, as well as a long list of observations (positive and negative) about the current state of Wrangling data flows.


Securing Data on ElasticMapReduce

Duncan Chen takes us through data encryption options when using ElasticMapReduce:

Data encryption is an effective solution to bolster data security. You can make sure that only authorized users or applications read your sensitive data by encrypting your data and managing access to the encryption key. One of the main reasons that customers from regulated industries such as healthcare and finance choose Amazon EMR is because it provides them with a compliant environment to store and access data securely.

This post provides a detailed walkthrough of two new encryption options to help you secure your EMR cluster that handles sensitive data. The first option is native EBS encryption to encrypt volumes attached to EMR clusters. The second option is Amazon S3 encryption, which allows you to use different encryption modes and customer master keys (CMKs) for individual S3 buckets with Amazon EMR.

Click through for more details on each.


Databricks + Azure Synapse Analytics

David Meyer and Clinton Ford explain how you can integrate Azure Databricks with Azure Synapse Analytics:

In the last two years since it first became available, thousands of companies have adopted Azure Databricks, making it one of the fastest growing data and AI services on Microsoft Azure. Customers now process over 2 exabytes per month with millions of server-hours spinning up every day. All of this is driven by organizations like Electrolux, Shell, and renewables.AI that are using Azure Databricks to process data at massive scale for data science and analytics.

Within this amazing adoption is a specific solution architecture to highlight called the Modern Data Warehouse (MDW). Earlier this year we wrote about the performance and scale benefits of this solution, and part of the pattern’s success has been our close integration with Azure SQL Data Warehouse through a high-performance connector that was jointly engineered to make it fast and easy to move data between the two services.

Something interesting about Synapse is that its implementation of Spark is not the same as the Databricks implementation (perhaps for licensing reasons). But that doesn’t stop us from using Databricks to process and curate data for Synapse Analytics.


Azure Synapse Analytics, Née Azure SQL DW

John Macintire explains Azure Synapse Analytics:

A cloud native, distributed SQL processing engine is at the foundation of Azure Synapse and is what enables the service to support the most demanding enterprise data warehousing workloads. This week at Ignite we introduced a number of exciting features to make data warehousing with Azure Synapse easier and allow organizations to use SQL for a broader set of analytics use cases.

There’s a fair amount of marketing-speak in here, but the gist is that Azure Synapse Analytics is Azure SQL Data Warehouse + Spark + on-demand serverless queries (so you can, among other things, write T-SQL against your HDFS data). I think it has a better chance of long-lasting success than Azure SQL Data Warehouse did.
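
As a rough sketch of what those serverless queries look like (the storage account and path here are made up), you can query Parquet files in the data lake directly:

-- Serverless query over files sitting in the data lake; nothing needs
-- to be loaded into a table first. The storage URL is hypothetical.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://contosolake.dfs.core.windows.net/raw/trips/*.parquet',
    FORMAT = 'PARQUET'
) AS trips;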


Debugging Azure Data Factory Data Flows

Mark Kromer takes us through debugging Azure Data Factory Data Flows:

When you are designing your mapping data flows in ADF, you are working against a live Azure Databricks Spark cluster. The size of that cluster is configurable via the Azure Integration Runtime. If you do not configure a custom Azure IR, then you will use the default Azure IR, which sets a very small cluster size: 4 cores for a single worker node and 4 cores for a single driver node. In most cases, while debugging and using data preview, that should be fine. But when you start exploring your data with column statistics or increase the sampling size in debug settings, you may find that you’ve exceeded the capacity of that small default cluster. Below are the steps you need to take to increase the size of your debug cluster.

Click through for step-by-step instructions.
