Category: Cloud

Azure Data Lake Analytics Updates

Published 2018-06-12 by Kevin Feasel

Michael Rys has a boatload of new updates for Azure Data Lake:

The top items include expanding our built-in support for standard file formats with native Parquet support for extractors and outputters (in public preview) and ORC (in private preview)!

In addition, since the fast file set feature now has been generally released, we can consume hundreds of thousands of such files in bulk in a single EXTRACT statement. We will publish a blog at a later date to give you much more detailed information on how this capability helps you to process so many files efficiently in a scalable way.

Important aspects of processing files at scale include:

the ability to generate many files from a rowset in a single statement, providing a way to dynamically partition the data for future use with Hadoop or Spark, or to provide individual files for customers. This has been our top customer ask on the ADL Feedback forum –and now it is in private preview!
the ability to handle many small files. We recommend that you make your files large enough for the processing to be efficient (300MB to 4GB is a good range), but often, your file formats (e.g., images) or data ingestion pipelines (e.g., EventHub archives) are not able to reach that size. Thus, we are adding the ability to group several files into a vertex to increase efficiency and lower cost of your job (we have seen 10 to 30 times improvement in some customer jobs!).

Read on for the full changelog.

Comments closed

Flattening JSON Data With Databricks

Published 2018-06-11 by Kevin Feasel

Ivan Vazharov gives us a Databricks notebook to parse and flatten JSON using PySpark:

With Databricks you get:

An easy way to infer the JSON schema and avoid creating it manually

Subtle changes in the JSON schema won’t break things

The ability to explode nested lists into rows in a very easy way (see the Notebook below)

Speed!

Following is an example Databricks Notebook (Python) demonstrating the above claims. The JSON sample consists of an imaginary JSON result set, which contains a list of car models within a list of car vendors within a list of people. We want to flatten this result into a dataframe.

Click through for the notebook.

Comments closed

Figuring Out Azure Analysis Services Costs

Published 2018-06-11 by Kevin Feasel

Chris Webb explains that Azure Analysis Services might not be quite as expensive as you’d first think:

What does this mean for the cost of Azure Analysis Services? Basically, if you’re taking advantage of these features you won’t pay one of the monthly prices quoted on the pricing page linked to at the top of this post. Instead you may do things like:

Scale up for one hour every day when you need to process your SSAS database, just to get the extra memory and QPUs needed, then scale down when processing has finished

Scale out only on certain days, or certain times of day, to handle increased numbers of users

Pause your instance when you are sure that no-one needs to run queries

How do you then calculate the likely cost? For my Azure Analysis Services precon at SQLBits a few months ago I built an Excel workbook that shows how to go about this.

There are some good questions in the comments section, so check those out as well.

Comments closed

Restoring Point-In-Time To Another Azure SQL Managed Instance

Published 2018-06-08 by Kevin Feasel

Jovan Popovic announces an improvement to Azure SQL Database Managed Instances:

Azure SQL Database Managed Instance enables you to create a database as a copy of another database at some point in time in the past. This is known as point-in-time restore feature, and up till now you could perform point-in-time restore only within the same instance.

The latest release of Azure SQL Database Managed Instance enables you to perform point-in-time restore of a database from one instance to another. This might be useful if you need to be sure that you could easily restore a database to another instance if there is some issue on the original instance, or if you need a database for testing or auditing purposes on the test instance and you want to use copy of some of the existing database on another server.

Click through for the current requirements and limitations, as well as a sample.

Comments closed

Polybase Rejected Row Location

Published 2018-06-08 by Kevin Feasel

Casey Karst announces a nice improvement to Polybase on Azure SQL Data Warehouse:

Every row of your data is an insight waiting to be found. That is why it is critical you can get every row loaded into your data warehouse. When the data is clean, loading data into Azure SQL Data Warehouse is easy using PolyBase. It is elastic, globally available, and leverages Massively Parallel Processing (MPP). In reality clean data is a luxury that is not always available. In those cases you need to know which rows failed to load and why.

In Azure SQL Data Warehouse the Create External Table definition has been extended to include a Rejected_Row_Location parameter. This value represents the location in the External Data Source where the Error File(s) and Rejected Row(s) will be written.

This is a big improvement, one that I hope to see on the on-prem product.

Comments closed

Event Hub Performance Tips

Published 2018-06-07 by Kevin Feasel

Vincent-Philippe Lauzon has a few tips for improving Azure Event Hub performance:

Here are some recommendations in the light of the performance and throughput results:

If we send many events: always reuse connections, i.e. do not create a connection only for one event. This is valid for both AMQP and HTTP. A simple Connection Pool pattern makes this easy.

If we send many events & throughput is a concern: use AMQP.

If we send few events and latency is a concern: use HTTP / REST.

If events naturally comes in batch of many events: use batch API.

If events do not naturally comes in batch of many events: simply stream events. Do not try to batch them unless network IO is constrained.

If a latency of 0.1 seconds is a concern: move the call to Event Hubs away from your critical performance path.

Let’s now look at the tests we did to come up with those recommendations.

Read the whole thing.

Comments closed

Databricks MLflow

Published 2018-06-06 by Kevin Feasel

Matai Zaharia announces a new Databricks offering:

MLflow is inspired by existing ML platforms, but it is designed to be open in two senses:

Open interface: MLflow is designed to work with any ML library, algorithm, deployment tool or language. It’s built around REST APIs and simple data formats (e.g., a model can be viewed as a lambda function) that can be used from a variety of tools, instead of only providing a small set of built-in functionality. This also makes it easy to add MLflow to your existing ML code so you can benefit from it immediately, and to share code using any ML library that others in your organization can run.

Open source: We’re releasing MLflow as an open source project that users and library developers can extend. In addition, MLflow’s open format makes it very easy to share workflow steps and models across organizations if you wish to open source your code.

Mlflow is still currently in alpha, but we believe that it already offers a useful framework to work with ML code, and we would love to hear your feedback. In this post, we’ll introduce MLflow in detail and explain its components.

Even in alpha, it looks nice.

Comments closed

The Basics Of Azure Stream Analytics

Published 2018-06-06 by Kevin Feasel

Chris Seferlis gives us an overview of Azure Stream Analytics:

Here’s how it works. It starts with a data source such as Event Hub, IoT Hub or Azure Blob Storage, and it uses SQL-like query language that allows transformation on the fly. It helps you process operations like filtering, sorting, aggregating and joining the data together to make it more useable—turning data into information.

From there, when you identify the data that you want/need to use, you can then send that data downstream to be sent to a queue for triggering workflows or further processing of the data. You can also send that data to Power BI for real-time visualization. For example, let’s say you’re looking at a data quality stream and you want to pull certain key words out of Twitter to see how they’re used and watch how that’s being done. By connecting to the Twitter API, you can capture that data, stream it, and then report from it with a Power BI report.

Chris also has a video which you can watch.

Comments closed

Lookups And Conditionals In Azure Data Factory V2

Published 2018-06-06 by Kevin Feasel

Alex Whittles shows us how to perform lookups and operations with IF clauses in Azure Data Factory V2:

Azure Data Factory v2 (ADFv2) has some significant improvements over v1, and we now consider ADF as a viable platform for most of our cloud based projects. But things aren’t always as straightforward as they could be. I’m sure this will improve over time, but don’t let that stop you from getting started now.

This post provides a walk through of using the ‘Lookup’ and ‘If Condition’ activities to do some basic conditional logic depending on the results of a database query.

Assumptions: You already have an ADF pipeline created. If you want to hook into SSIS then you’ll also need the SSIS Integration Runtime set up – although this is not relevant just for the if condition.

Read on for an example.

Comments closed

Connecting To Azure SQL Database From On-Prem

Published 2018-06-06 by Kevin Feasel

Arun Sirpal shows how to set up a linked server instance between an on-prem SQL Server instance and Azure SQL Database:

You may (or may not) have a requirement to setup a linked server to Azure SQL Database from a locally installed SQL Server. One reason could be to pull down some reports from an Azure SQL Database to a local file share. Whatever your reason is hopefully you will find this blog post useful because I ran into some complications on the way.

This is what your linked server creation screens in SSMS (SQL Server Management Studio) should look like.

Take advantage of Arun’s hard-earned experience and read his post.

Comments closed

M	T	W	T	F	S	S
	1	2	3	4	5	6
7	8	9	10	11	12	13
14	15	16	17	18	19	20
21	22	23	24	25	26	27
28	29	30	31