Press "Enter" to skip to content

Category: Cloud

Provisioning An Azure SQL Managed Instance

Frank Gill walks us through the process of provisioning an Azure SQL Managed Instance:

Once you have created the prerequisites, you are ready to create your first Managed Instance.  As of now, Managed Instance is only available in the following subscription types:

  • Pay-As-You-Go
  • Enterprise Agreement
  • Cloud Service Provider

Information about subscription and resource limitations can be found here.  I will update this with any changes.
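Frank does this through the portal; if you would rather script the creation step, here is a rough sketch using the azure-mgmt-sql Python package. Every resource name below is a placeholder, and model and method names vary by SDK version, so treat this as an outline rather than Frank's walkthrough:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    # Assumes the prerequisite VNet/subnet from Frank's earlier post already exist
    client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

    poller = client.managed_instances.begin_create_or_update(
        "my-rg", "my-managed-instance",
        {
            "location": "eastus",
            "subnet_id": "/subscriptions/<sub>/resourceGroups/my-rg/providers/"
                         "Microsoft.Network/virtualNetworks/mi-vnet/subnets/mi-subnet",
            "administrator_login": "miadmin",
            "administrator_login_password": "<password>",
            "sku": {"name": "GP_Gen5", "tier": "GeneralPurpose"},
            "v_cores": 8,
            "storage_size_in_gb": 256,
        },
    )
    instance = poller.result()  # Managed Instance provisioning can take hours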

Frank has a series of screenshots to show you the way.


Change Data Capture With Databricks Delta

Ameet Kini and Denny Lee show how to use Databricks Delta to handle change data capture from different processes:

With Databricks Delta, the CDC pipeline is now streamlined and can be refreshed more frequently: Informatica => S3 => Spark Hourly Batch Job => Delta. In this scenario, Informatica writes change sets directly to S3 using Informatica’s Parquet writer. Databricks jobs run at the desired sub-nightly refresh rate (e.g., every 15 min, hourly, every 3 hours, etc.) to read these change sets and update the target Databricks Delta table.

With minor changes, this pipeline has also been adapted to read CDC records from Kafka, so the pipeline there would look like Kafka => Spark => Delta. In the rest of this section, we elaborate on this process and on how Databricks Delta is used as a sink for these CDC workflows.

With one of our customers, we implemented these CDC techniques on their largest and most frequently refreshed ETL pipeline. In this customer scenario, Informatica writes a change set to S3 every 15 minutes for each of its 65 tables that has any changes. While the change sets themselves are fairly small (< 1,000 records), their target tables can become quite large. Of the 65 tables, roughly half a dozen are in the 50m-100m row range, and the rest are smaller than 50m. In Oracle, this pipeline would have run every 15 minutes, keeping in sync with Informatica. In Databricks Delta, we thought this would take close to an hour due to S3 latencies, but we ended up pleasantly surprised with a 30- and even 15-minute refresh period, depending on cluster size.
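To give a flavor of that merge step, here is a simplified PySpark sketch of upserting one change set into a Delta table. The table, columns, and S3 path are hypothetical, not the customer's actual pipeline:

    from pyspark.sql import SparkSession

    # On Databricks the session already exists as `spark`; this line just keeps
    # the sketch self-contained elsewhere.
    spark = SparkSession.builder.appName("cdc-merge").getOrCreate()

    # Read the latest change set landed in S3 (hypothetical path)
    changes = spark.read.parquet("s3a://landing/cdc/customers/latest/")
    changes.createOrReplaceTempView("changes")

    # Upsert the change set into the target Delta table (assumed to exist)
    spark.sql("""
        MERGE INTO customers AS t
        USING changes AS c
          ON t.customer_id = c.customer_id
        WHEN MATCHED THEN
          UPDATE SET t.name = c.name, t.city = c.city
        WHEN NOT MATCHED THEN
          INSERT (customer_id, name, city) VALUES (c.customer_id, c.name, c.city)
    """)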

Click through for the rest of the story.


MariaDB On Azure Now In Preview

Chris Seferlis fills us in on Microsoft’s latest Platform-as-a-Service database offering:

First, a little history on MariaDB. MariaDB is a community-developed fork of MySQL. Essentially, when Oracle purchased MySQL from Sun, some of the MySQL developers were concerned that the acquisition would lead to changes, or eventually to a product that was no longer open source.

So, they went ahead and forked off to MariaDB with the intent to maintain high compatibility with MySQL. Also, contributors are required to share their copyright with the MariaDB Foundation, which in a nutshell means the foundation wants MariaDB to always remain open source.

This joins SQL Server, MySQL, and Postgres as PaaS database offerings in Azure.


Using IO Cache To Speed Up Spark Jobs

Chris Seferlis looks at what the HDInsight team has done to speed up Apache Spark jobs:

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

There are many open source caching products in the ecosystem: Alluxio, Ignite, and RubiX, to name a few big ones. IO Cache is based on RubiX, and what differentiates RubiX from other comparable caching products is its approach of using SSDs and eliminating the need for explicit memory management, whereas other comparable products reserve operating memory for caching the data.

Read on for more details.


Creating Firewall Rules With Azure Cloud Shell

Kellyn Pot’vin-Gorman shows how you can add a firewall rule for Azure SQL Database from the Azure Cloud Shell:

With my use of scripting and Azure Cloud Shell, I'm automating and building my environment, including SQL Database resources, and then need to access and build the logical objects.  This means that I need a firewall rule built for the Azure Cloud Shell I'm working from.  The IP for this cloud shell is unique to the session I'm running at that moment.

The requirements for adding this enhancement to my script are:

  1. Capture and read the IP address for the Azure Cloud Shell session.

  2. Populate a firewall rule with that IP address.

  3. Log into the new SQL Server database that was created as part of the bash script and then execute SQL scripts.
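Here is a rough Python rendering of those three steps, using the azure-identity and azure-mgmt-sql packages. Kellyn's script does this in bash with the az CLI, so treat every name below as a placeholder and expect SDK method names to vary by version:

    import requests
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    # 1. Capture the Cloud Shell session's outbound IP address
    my_ip = requests.get("https://api.ipify.org").text.strip()

    # 2. Populate a firewall rule with that IP
    client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")
    client.firewall_rules.create_or_update(
        "my-rg", "my-sql-server", "cloud-shell-session",
        {"start_ip_address": my_ip, "end_ip_address": my_ip},
    )

    # 3. Connect to the new database (e.g., with pyodbc) and run the SQL scripts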

Click through for instructions.


Azure SQL Database Hyperscale Tier

Chris Seferlis looks at a new service tier offering for Azure SQL Database:

The Hyperscale service tier provides the following capabilities:

  • Support for up to 100 terabytes of database size (and this will grow over time)

  • Faster large database backups which are based on file snapshots

  • Faster database restores (also based on file snapshots)

  • Higher overall performance due to higher log throughput and faster transaction commit time regardless of the data volumes

  • The ability to rapidly scale out. You can provision one or more read-only nodes for offloading your read workload or for use as hot standbys.

  • You can rapidly scale up your compute resources (in constant time) to accommodate heavy workloads, and scale compute up and down as needed, just like Azure SQL Data Warehouse.
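As a hedged illustration of that last bullet, here is a sketch of provisioning a Hyperscale database and later scaling its compute, using the azure-mgmt-sql Python package. The HS_Gen5_4 SKU name means Gen5 hardware with four vCores; resource names are placeholders and method names vary by SDK version:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.sql import SqlManagementClient

    client = SqlManagementClient(DefaultAzureCredential(), "<subscription-id>")

    # Create a 4-vCore Hyperscale database
    client.databases.begin_create_or_update(
        "my-rg", "my-server", "my-hyperscale-db",
        {"location": "eastus", "sku": {"name": "HS_Gen5_4", "tier": "Hyperscale"}},
    ).result()

    # Later, scale compute to 8 vCores without touching the data
    client.databases.begin_update(
        "my-rg", "my-server", "my-hyperscale-db",
        {"sku": {"name": "HS_Gen5_8", "tier": "Hyperscale"}},
    ).result()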

At what cost?  I like Chris’s “not inexpensive” understatement here.


Spark Streaming On Azure Databricks

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from queues or files, doing some processing in the middle (querying, filtering, aggregation), and then forwarding the result to a sink, all with minimal latency. This is in direct contrast to batch processing, which usually occurs on an hourly or daily basis. Often, the two will need to be combined to create a new data set.

In terms of options for real-time stream processing on Azure you have the following:

  • Azure Stream Analytics

  • Spark Streaming / Storm on HDInsight

  • Spark Streaming on Databricks

  • Azure Functions
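To make the consume-process-forward pattern from the quote concrete, here is a minimal Structured Streaming sketch in PySpark. The path, schema, and in-memory sink are hypothetical, not from Tristan's post:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, window

    # `spark` is predefined in Databricks notebooks; created here for portability
    spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

    # Consume: read JSON messages as they land (hypothetical path and schema)
    events = (spark.readStream
              .schema("device STRING, reading DOUBLE, ts TIMESTAMP")
              .json("/mnt/events/"))

    # Process: filter, then aggregate over five-minute windows
    averages = (events.filter(col("reading") > 0)
                .groupBy(window(col("ts"), "5 minutes"), col("device"))
                .avg("reading"))

    # Forward: write the running aggregates to a sink (in-memory for the demo)
    query = (averages.writeStream
             .outputMode("complete")
             .format("memory")
             .queryName("device_averages")
             .start())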

Click through for more.


Azure SQL Managed Instance Prerequisites

Frank Gill has started a series on Azure SQL Managed Instances and has two posts up already.  First, an introduction:

The drawbacks of Azure SQL Database make it difficult to migrate existing applications, because of the number of application changes required.  Azure SQL Database is designed to be used for new development in Azure and for multi-tenant environments, where each tenant requires their own copy of a database.

The benefits of SQL Server on an Azure VM make it much easier to migrate an existing application to Azure.  However, the VMs underlying the application still have to be managed by the client.  This fails to take advantage of the management of resources in Azure, and uses Azure as a VM host.

A third option, Azure SQL Managed Instance, was released at the beginning of October 2018.  Managed Instance combines the best of the previous options.  With Managed Instance, the infrastructure is fully managed and the majority of the SQL Server feature set is available.  The full list of differences between a traditional install of SQL Server and Managed Instance can be found here.  A number of the most dramatic differences are listed below.

Then a post covering prerequisites:

Before creating an Azure SQL Managed Instance, a number of prerequisite resources must be provisioned.  These are:

  • An Azure Virtual Network

  • A dedicated subnet for Managed Instances

  • A route table
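For a sense of what those three prerequisites look like in code, here is a rough sketch with the azure-mgmt-network Python package; the names and address ranges are my own assumptions, not Frank's:

    from azure.identity import DefaultAzureCredential
    from azure.mgmt.network import NetworkManagementClient

    net = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")
    rg, loc = "my-rg", "eastus"

    # An Azure Virtual Network to hold the Managed Instance
    net.virtual_networks.begin_create_or_update(rg, "mi-vnet", {
        "location": loc,
        "address_space": {"address_prefixes": ["10.0.0.0/16"]},
    }).result()

    # A route table for the Managed Instance subnet
    route_table = net.route_tables.begin_create_or_update(rg, "mi-route-table", {
        "location": loc,
    }).result()

    # A dedicated subnet that references the route table
    net.subnets.begin_create_or_update(rg, "mi-vnet", "mi-subnet", {
        "address_prefix": "10.0.1.0/24",
        "route_table": {"id": route_table.id},
    }).result()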

It looks like this is part of a longer series Frank is building out, so stay tuned.


112 Million Cab Rides In Azure SQL Data Warehouse

Derik Hammer wants a real test of Azure SQL Data Warehouse:

The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.

I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares, with pickup/drop-off and distance information, between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for-hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GB), and it downloads as easy-to-import comma-separated values (CSV) files.

Read on to see how you can set it up yourself.  As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions: rate code, location, and payment type.
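If you want to follow along, the load target is the interesting part. Here is a hypothetical pyodbc sketch of the kind of table you might land the Yellow cab CSVs in, with the column list heavily trimmed and all names assumed:

    import pyodbc

    # Hypothetical connection string and table definition
    conn = pyodbc.connect(
        "DRIVER={ODBC Driver 17 for SQL Server};"
        "SERVER=mydw.database.windows.net;DATABASE=mydw;UID=loader;PWD=<password>"
    )
    conn.execute("""
        CREATE TABLE dbo.YellowTrips (
            pickup_datetime  DATETIME2      NOT NULL,
            dropoff_datetime DATETIME2      NOT NULL,
            rate_code        INT,
            payment_type     INT,
            trip_distance    DECIMAL(9, 2),
            total_amount     DECIMAL(9, 2)
        )
        WITH (DISTRIBUTION = ROUND_ROBIN, CLUSTERED COLUMNSTORE INDEX);
    """)
    conn.commit()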


Looking At Databricks Cluster Pricing

Tristan Robinson takes a look at Azure Databricks pricing:

The use of Databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and it has made its way into most of our recent modern data architecture proposals, whether that be PaaS warehouses or data science platforms.

To run any type of workload on the platform, you will need to set up a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes (give it a name, select a runtime, select the type of VMs you want, and away you go), for production workloads a bit more thought needs to go into the configuration and cost.  In the following blog I'll start by looking at the pricing in a bit more detail, which will aim to provide a cost element to the cluster configuration process.

There are a few complicating factors in figuring out cluster price, but rest assured that it will be costly.
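As a back-of-the-envelope illustration of how the pieces add up (VM cost plus DBU cost, per node, per hour), here is a sketch with purely illustrative rates; check current Azure pricing for real numbers:

    # Azure Databricks billing has two components per node-hour: the VM itself
    # and the DBUs it consumes. All rates below are made up for illustration.
    vm_per_hour = 0.60    # hypothetical VM rate in USD per hour
    dbu_per_node = 1.5    # hypothetical DBUs consumed per node per hour
    dbu_price = 0.40      # hypothetical price per DBU (varies by workload tier)

    nodes = 4 + 1         # four workers plus a driver
    hours = 8             # an eight-hour workload

    total = nodes * hours * (vm_per_hour + dbu_per_node * dbu_price)
    print(f"Estimated cost: ${total:.2f}")  # -> Estimated cost: $48.00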
