Category: Cloud

The big news here is the recently released preview of HDInsight IO Cache, which is a new transparent data caching feature that provides customers with up to 9X performance improvement for Spark jobs, without an increase in costs.

There are many open source caching products in the ecosystem: Alluxio, Ignite, and RubiX, to name a few big ones. IO Cache is based on RubiX. What differentiates RubiX from comparable caching products is its approach of using SSDs and eliminating the need for explicit memory management, whereas other comparable caching products reserve operating memory for caching the data.

Read on for more details.

Creating Firewall Rules With Azure Cloud Shell

Published 2018-10-26 by Kevin Feasel

Kellyn Pot’vin-Gorman shows how you can add a firewall rule for Azure SQL Database from the Azure Cloud Shell:

With my use of scripting and Azure Cloud Shell, I’m automating and building my environment, including SQL Database resources, and then have a requirement to access and build the logical objects. This means that I need a firewall rule built for the Azure Cloud Shell I’m working from. The IP for this cloud shell is unique to the session I’m running at that moment.

The requirement to add this enhancement to my script is:

Capture and read the IP Address for the Azure Cloud shell session.
Populate the IP Address to a Firewall rule.
Log into the new SQL Server database that was created as part of the bash script and then execute SQL scripts.
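
The second step above can be sketched roughly as follows. This is a minimal illustration, not Kellyn's script (hers is written in bash): it only builds the `az sql server firewall-rule create` argument list for a single address and does not run it. The resource group, server, and rule names are hypothetical placeholders, and the IP address is a documentation-range example standing in for whatever the Cloud Shell session reports.

```python
import ipaddress


def build_firewall_rule_command(resource_group, server, rule_name, ip):
    """Return the az CLI argument list that opens one IP on a logical server."""
    # Validate early so a malformed lookup result never reaches the CLI.
    # ip_address() raises ValueError for anything that is not a valid address.
    address = str(ipaddress.ip_address(ip.strip()))
    return [
        "az", "sql", "server", "firewall-rule", "create",
        "--resource-group", resource_group,
        "--server", server,
        "--name", rule_name,
        # A single-address rule uses the same value for both ends of the range.
        "--start-ip-address", address,
        "--end-ip-address", address,
    ]


if __name__ == "__main__":
    # Hypothetical values; in a real session the IP would come from a lookup
    # made inside the Cloud Shell session itself.
    cmd = build_firewall_rule_command(
        "my-rg", "my-sqlserver", "cloud-shell-session", "203.0.113.10\n"
    )
    print(" ".join(cmd))
```

Because the Cloud Shell IP changes per session, a rule like this would likely need to be recreated (or updated) each time the script runs.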

Click through for instructions.

Azure SQL Database Hyperscale Tier

Published 2018-10-26 by Kevin Feasel

Chris Seferlis looks at a new service tier offering for Azure SQL Database:

The Hyperscale service tier provides the following capabilities:

Support for up to 100 terabytes of database size (and this will grow over time)
Faster large database backups which are based on file snapshots
Faster database restores (also based on file snapshots)
Higher overall performance due to higher log throughput and faster transaction commit time regardless of the data volumes
The ability to rapidly scale out. You can provision one or more read-only nodes for offloading your read workload and for use as hot standbys.
You can rapidly scale up your compute resources (in constant time) to accommodate heavy workloads, so you can scale compute up and down as needed just like Azure SQL Data Warehouse

At what cost? I like Chris’s “not inexpensive” understatement here.

Spark Streaming On Azure Databricks

Published 2018-10-25 by Kevin Feasel

Tristan Robinson shows us how to run Spark Streaming within Azure Databricks:

Real-time stream processing is becoming more prevalent on modern day data platforms, and with a myriad of processing technologies out there, where do you begin? Stream processing involves the consumption of messages from either queues or files, doing some processing in the middle (querying, filtering, aggregation) and then forwarding the result to a sink – all with minimal latency. This is in direct contrast to batch processing which usually occurs on an hourly or daily basis. As is often the case, both of these will need to be combined to create a new data set.
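
The consume, process, and forward shape described above can be illustrated with a toy pipeline in plain Python. This is only a sketch of the pattern and does not use Spark Streaming or any Azure service. The message fields (`ts`, `key`, `value`) are hypothetical, and the aggregation assumes messages arrive in timestamp order.

```python
from collections import defaultdict


def consume(messages):
    """Source: yield messages one at a time, as a queue reader would."""
    for message in messages:
        yield message


def keep_valid(stream):
    """Filter: drop messages that carry no reading."""
    for message in stream:
        if message.get("value") is not None:
            yield message


def tumbling_sum(stream, window_seconds):
    """Aggregate: emit (window_start, key, total) whenever a window closes.

    Assumes timestamps are integers in seconds and arrive in order.
    """
    current_window = None
    totals = defaultdict(float)
    for message in stream:
        window = message["ts"] - message["ts"] % window_seconds
        if current_window is not None and window != current_window:
            # The previous window is complete, so forward its totals.
            for key in sorted(totals):
                yield current_window, key, totals[key]
            totals.clear()
        current_window = window
        totals[message["key"]] += message["value"]
    # Flush the final, still-open window when the input ends.
    if current_window is not None:
        for key in sorted(totals):
            yield current_window, key, totals[key]


if __name__ == "__main__":
    events = [
        {"ts": 0, "key": "a", "value": 1.0},
        {"ts": 5, "key": "a", "value": 2.0},
        {"ts": 7, "key": "b", "value": None},
        {"ts": 12, "key": "a", "value": 4.0},
    ]
    # Sink: here the results are simply printed.
    for result in tumbling_sum(keep_valid(consume(events)), window_seconds=10):
        print(result)
```

Each stage is a generator, so a message flows through the whole chain as soon as it arrives rather than waiting for a batch to fill.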

In terms of options for real-time stream processing on Azure you have the following:

Azure Stream Analytics
Spark Streaming / Storm on HDInsight
Spark Streaming on Databricks
Azure Functions

Click through for more.

Azure SQL Managed Instance Prerequisites

Published 2018-10-25 by Kevin Feasel

Frank Gill has started a series on Azure SQL Managed Instances and has two posts up already. First, an introduction:

The drawbacks of Azure SQL Database make it difficult to migrate existing applications, because of the number of application changes required. Azure SQL Database is designed to be used for new development in Azure and for multi-tenant environments, where each tenant requires their own copy of a database.

The benefits of SQL Server on an Azure VM make it much easier to migrate an existing application to Azure. However, the VMs underlying the application still have to be managed by the client. This fails to take advantage of the management of resources in Azure, and uses Azure as a VM host.

A third option, Azure SQL Managed Instance, was released at the beginning of October 2018. Managed Instance combines the best of the previous options. With Managed Instance, the infrastructure is fully managed and the majority of the SQL Server feature set is available. The full list of differences between a traditional install of SQL Server and Managed Instance can be found here. A number of the most dramatic differences are listed below.

Then a post covering prerequisites:

Before creating an Azure SQL Managed Instance, a number of prerequisite resources must be provisioned. These are:

An Azure Virtual Network
A dedicated subnet for Managed Instances
A route table

It looks like this is part of a longer series Frank is building out, so stay tuned.

112 Million Cab Rides In Azure SQL Data Warehouse

Published 2018-10-23 by Kevin Feasel

Derik Hammer wants a real test of Azure SQL Data Warehouse:

The method that I liked the most and finally settled on was to use a public dataset. I wanted data which was skewed in real ways and did not require a lot of work to massage. Microsoft has a great listing of public datasets here.

I decided to go with the NYC Taxi and Limousine Commission (TLC) Trip Record Data. Data is available for most taxi and limousine fares with pickup/drop-off and distance information between January 2009 and June 2018. This includes data for Yellow cab, Green cab, and for-hire vehicles. Just the Yellow cab data from 01/2016 – 06/2018 is over 112,000,000 records (24 GB) and they download into easy-to-import comma-separated values (CSV) files.

Read on to see how you can set it up yourself. As Derik points out at the end, though, this is still one big table, but there are a few columns which can lead to dimensions, things like rate code, location, and payment type.
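
To make the point about dimensions concrete, here is a small sketch of deriving a dimension from one column of a wide fact table. It is not taken from Derik's post: the rows are made-up samples, and the column name `payment_type` is an assumption about how the trip data is laid out.

```python
def build_dimension(rows, column):
    """Map each distinct value of `column` to a surrogate key, in first-seen order."""
    dimension = {}
    for row in rows:
        # setdefault only assigns a new key the first time a value appears.
        dimension.setdefault(row[column], len(dimension) + 1)
    return dimension


def to_fact_rows(rows, column, dimension):
    """Replace the raw column value with its surrogate key."""
    key_column = column + "_key"
    return [
        {**{k: v for k, v in row.items() if k != column}, key_column: dimension[row[column]]}
        for row in rows
    ]


if __name__ == "__main__":
    # Hypothetical sample rows standing in for the trip records.
    trips = [
        {"trip_distance": 1.2, "payment_type": "card"},
        {"trip_distance": 3.4, "payment_type": "cash"},
        {"trip_distance": 0.8, "payment_type": "card"},
    ]
    payment_dim = build_dimension(trips, "payment_type")
    print(payment_dim)
    print(to_fact_rows(trips, "payment_type", payment_dim))
```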

Looking At Databricks Cluster Pricing

Published 2018-10-19 by Kevin Feasel

Tristan Robinson takes a look at Azure Databricks pricing:

The use of Databricks for data engineering or data analytics workloads is becoming more prevalent as the platform grows, and has made its way into most of our recent modern data architecture proposals – whether that be PaaS warehouses, or data science platforms.

To run any type of workload on the platform, you will need to set up a cluster to do the processing for you. While the Azure-based platform has made this relatively simple for development purposes, i.e. give it a name, select a runtime, select the type of VMs you want and away you go – for production workloads, a bit more thought needs to go into the configuration/cost. In the following blog I’ll start by looking at the pricing in a bit more detail which will aim to provide a cost element to the cluster configuration process.

There are a few complicating factors in figuring out cluster price, but rest assured that it will be costly.
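
As a rough illustration of how the pieces add up: Azure Databricks bills each node for the underlying VM plus a per-hour Databricks Unit (DBU) charge. The sketch below combines the two. All of the rates in it are hypothetical placeholders rather than real prices, and it ignores the complicating factors Tristan covers, such as workload type and autoscaling.

```python
def cluster_hourly_cost(nodes, vm_price_per_hour, dbu_per_node_hour, dbu_price):
    """Estimate the hourly cost of a cluster of identical nodes.

    nodes counts the driver as well as the workers.
    """
    # Each node pays for its VM and for the DBUs it consumes in that hour.
    per_node = vm_price_per_hour + dbu_per_node_hour * dbu_price
    return nodes * per_node


def monthly_cost(hourly_cost, hours_per_day, days_per_month=30):
    """Scale an hourly estimate up to a month of scheduled usage."""
    return hourly_cost * hours_per_day * days_per_month


if __name__ == "__main__":
    # Hypothetical rates: 1 driver + 2 workers, running 8 hours a day.
    hourly = cluster_hourly_cost(
        nodes=3, vm_price_per_hour=0.50, dbu_per_node_hour=1.5, dbu_price=0.40
    )
    print(f"hourly:  {hourly:.2f}")
    print(f"monthly: {monthly_cost(hourly, hours_per_day=8):.2f}")
```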

Automating Azure SQL Database Scaling

Published 2018-10-11 by Kevin Feasel

Arun Sirpal shows how to use Azure Logic Apps to auto-scale Azure SQL Database:

When I was presenting my Azure SQL Database session at DataRelay (used to be SQLRelay) I was asked (over coffee) about auto scaling capabilities. Quite simply there is nothing out of the box to achieve this. The idea of auto scaling would be good where you would need a burst to fulfill higher demand in terms of workload for a time duration, you know, something like “end of the day, Friday night sale” for your database.

Classically you would probably go down the PowerShell route via a runbook, but I am different.

In this case, the automation is timer-based rather than load-based.
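
The decision at the heart of a timer-based approach can be sketched in a few lines. This is not Arun's Logic App; it only shows the schedule check that picks which tier should be in effect. The busy window (Friday evening) and the `S0` and `S3` service objectives are illustrative assumptions.

```python
from datetime import datetime


def target_service_objective(now, busy_weekday=4, busy_start_hour=18, busy_end_hour=23):
    """Pick a service objective from the clock alone, not from load.

    busy_weekday follows datetime.weekday(), where Monday is 0 and Friday is 4.
    """
    in_busy_window = (
        now.weekday() == busy_weekday
        and busy_start_hour <= now.hour < busy_end_hour
    )
    # Scale up for the scheduled burst, back down the rest of the time.
    return "S3" if in_busy_window else "S0"


if __name__ == "__main__":
    print(target_service_objective(datetime(2018, 10, 12, 19, 0)))  # a Friday evening
    print(target_service_objective(datetime(2018, 10, 11, 19, 0)))  # a Thursday evening
```

A load-based variant would replace the clock check with a metric such as recent DTU consumption, at the cost of needing somewhere to read that metric from.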

Deploying An Azure Container Within A Virtual Network

Published 2018-10-05 by Kevin Feasel

Andrew Pruski shows us that you can now deploy an Azure container running SQL Server within an Azure virtual network:

Up until now Azure Container Instances only had one option to allow us to connect. That was assigning a public IP address that was directly exposed to the internet.

Not really great, as exposing SQL Server on port 1433 to the internet is generally a bad idea.

Now I know there’s a lot of debate about whether or not you should change the port that SQL is listening on to prevent this from happening. My personal opinion is that if someone wants to get into your SQL instance, changing the port isn’t going to slow them down much. However, a port change will stop opportunistic hacks (such as the above).

But now we have another option. The ability to deploy an ACI within a virtual network in Azure! So let’s run through how to deploy.

Click through for those instructions.

Azure Data Factory Or Integration Services?

Published 2018-10-04 by Kevin Feasel

Teo Lachev contrasts use cases for Integration Services versus Azure Data Factory V2:

So, ADF was incorrectly positioned as “SSIS for the Cloud” and unfortunately once that message made it out there was a messaging problem that Microsoft has been fighting ever since. Like Azure ML, on the glory road to the cloud things that were difficult with SSIS (installation, projects, deployment) became simple, and things that were simple became difficult. Naturally, Microsoft took a lot of criticism from the customers and community, including from your humble correspondent. ADF, of course, has nothing to do with SSIS, thus leaving many data integration practitioners with a difficult choice: should you take the risk and take the road less traveled with ADF, or continue with the tried-and-true SSIS for data integration on Azure?

To Microsoft’s credit, ADF v2 has made significant enhancements in features, usability, and maintainability. There is also a “lift and shift” option to run SSIS inside ADF but since this architecture requires a VM, I consider it a narrow case scenario, such as when you need to extend ADF with SSIS features that it doesn’t have. Otherwise, why would you start new development with SSIS hosted under ADF, if you could provision and license the VM yourself and have full control over it?

All in all, Teo is not the biggest fan of ADF at this point and leans heavily toward SSIS; read on for the reasoning.