Category: Cloud

Azure Data Lake Updates

Michael Rys has the October updates for Azure Data Lake:

We seem to be just cranking out new stuff :). Here are the October 2016 Updates for Azure Data Lake U-SQL!

The main takeaway is that the October refresh has removed the old, deprecated syntax for the items we announced over the last couple of release notes!

Thanks to those who volunteered to test the new version of the more scalable file sets. Please contact us if you want to try it and help us validate it.

Click through for the release notes.

Azure Data Lake Analytics Units

Yan Li explains the Azure Data Lake Analytics Unit:

An Azure Data Lake Analytics Unit, or AU, is a unit of computation resources made available to your U-SQL job. Each AU gives your job access to a set of underlying resources like CPU and memory. Currently, an AU is the equivalent of 2 CPU cores and 6 GB of RAM. As we see how people want to use the service, we may change the definition of an AU or add more options for controlling CPU and memory usage.

How AUs are used during U-SQL Query Execution

When you submit a U-SQL script for execution, the U-SQL compiler parallelizes it into hundreds or even thousands of tasks called vertices. Each vertex is allocated to one AU. The AU is dynamically allocated to the task and released once that particular task is completed.

I appreciate the ADL team’s transparency in how they define a unit.  It’s much nicer to be able to tell someone that an AU is 2 CPU cores + 6 GB of RAM, rather than saying it’s some fuzzy measure of CPU + memory + I/O which has no direct bearing on your operations.
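
That definition also makes capacity planning simple arithmetic. Here is a quick back-of-the-envelope sketch in Python using the 2-core / 6 GB figure quoted above (which, as the post notes, Microsoft may change):

```python
# Back-of-the-envelope resource math for an AU allocation, based on the
# definition quoted above (1 AU = 2 CPU cores + 6 GB RAM at time of writing).
CORES_PER_AU = 2
RAM_GB_PER_AU = 6

def resources_for(aus: int) -> str:
    """Describe the raw compute a job gets for a given AU allocation."""
    return f"{aus} AUs = {aus * CORES_PER_AU} cores, {aus * RAM_GB_PER_AU} GB RAM"

# A job allocated 10 AUs can run at most 10 vertices concurrently,
# each on its own 2-core / 6 GB slice.
print(resources_for(10))  # 10 AUs = 20 cores, 60 GB RAM
```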

Database Throughput Units

Randolph West looks at the Azure Database Throughput Unit Calculator:

The DTU Calculator, a third-party service created by Justin Henriksen (a Microsoft employee), calculates the DTU requirements for an on-premises database we want to migrate to Azure: it first captures a few performance monitor counters, then performs a calculation on those results to recommend a service tier for the database.

Justin provides a command-line application or PowerShell script to capture these performance counters:

  • Processor – % Processor Time

  • Logical Disk – Disk Reads/sec

  • Logical Disk – Disk Writes/sec

  • Database – Log Bytes Flushed/sec
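
Below is a minimal sketch of capturing those four counters from Python by shelling out to Windows' built-in typeperf utility. The official route is Justin's command-line app or PowerShell script, and the SQL Server counter object name varies by instance, so treat this purely as an illustration:

```python
# Sketch: sample the four perf counters the DTU Calculator needs, once per
# second for an hour, into a CSV file. Assumes a default SQL Server instance;
# for a named instance the counter object is "MSSQL$<instance>:Databases".
import subprocess

counters = [
    r"\Processor(_Total)\% Processor Time",
    r"\LogicalDisk(_Total)\Disk Reads/sec",
    r"\LogicalDisk(_Total)\Disk Writes/sec",
    r"\SQLServer:Databases(_Total)\Log Bytes Flushed/sec",
]

subprocess.run(
    ["typeperf", *counters,
     "-si", "1",          # sample interval: 1 second
     "-sc", "3600",       # sample count: one hour's worth
     "-f", "CSV",
     "-o", "dtu-counters.csv"],
    check=True,
)
```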

For more details on DTUs, John Sterrett looks at the math.

Identity As A Service

Cristian Satnic argues that we should look at Identity as a Service solutions for our applications:

What exactly is Azure Active Directory B2C?

  • Cloud identity service with support for social accounts and app-specific (local) accounts

  • For enterprises and ISVs building consumer-facing web, mobile & native apps

  • Builds on Azure Active Directory – a global identity service serving hundreds of millions of users and billions of sign-ins per day (same directory system used by Microsoft online properties – Office 365, XBox Live and so on)

  • Worldwide, highly-available, geo-redundant service – globally distributed directory across all of Microsoft Azure’s datacenters

I am a big fan of OAuth and making it easy for line-of-business developers to deal with authentication (lest they get harebrained ideas like rolling their own encryption algorithms).

Azure Data Lake Updates

Saveen Reddy points out a few updates to Azure Data Lake Store & the Azure Data Lake Analytics portal:

Use Custom Delimiters when Previewing Files

Previously, we had supported comma, colon, space, tab, ampersand, and bar delimiters. With the many different kinds of files used in Azure Data Lake Store and Azure Storage, we’ve added a “Custom” delimiter option for you to define your own delimiter.

To change the delimiter on the Azure Portal:

  1. Open the file you want to preview using Data Explorer.

  2. Click on Format.

  3. Under Delimiter, click the dropdown and change it to Custom.

  4. A new Custom Delimiter field will appear; type your delimiter there.

  5. Click OK.
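
For files you pull down locally, a rough analogue of that preview is easy in pandas; the file name and delimiter below are made up for illustration:

```python
# Rough local analogue of the portal's file preview with a custom delimiter.
# "clicks.log" and "~" are placeholders; substitute your own file and separator.
import pandas as pd

preview = pd.read_csv(
    "clicks.log",
    sep="~",            # the custom delimiter you'd type into the portal
    engine="python",    # the python engine accepts arbitrary separators
    nrows=10,           # preview only the first few rows
)
print(preview)
```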

Read on for more updates.

Starting Azure Stream Analytics Jobs From Code

Hylke Peek wants to kick off an Azure Stream Analytics job from a Universal Windows Platform application:

I had one of those feelings while working with Azure Stream Analytics (ASA). My solution worked, but there was one ‘elementary and simple’ thing I wanted: start the ASA jobs from my C# code. That shouldn’t be hard, and there’s some documentation. But no, I needed to combine several disparate solutions into a new one to make it possible.

In this post, I briefly explain how you can start ASA jobs from your C# UWP application:

  • I explain which components are involved in the authentication process and which parameters you need.

  • Example code is provided. You only need to enter your parameter values.
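
The post’s example is C#, but the same two steps translate to any language that can talk to Azure AD and the Azure Resource Manager REST API. Here is a rough Python sketch under those assumptions; every ID and name below is a placeholder, and the API version reflects what was current at the time:

```python
# Sketch: start an Azure Stream Analytics job via the ARM REST API.
# 1) get a client-credentials token from Azure AD,
# 2) POST to the job's /start endpoint.
# All values in angle brackets are placeholders.
import requests

tenant_id = "<tenant-id>"
client_id = "<app-registration-client-id>"
client_secret = "<app-registration-secret>"
subscription_id = "<subscription-id>"
resource_group = "<resource-group>"
job_name = "<asa-job-name>"

# Step 1: authenticate against Azure AD for the management API.
token_resp = requests.post(
    f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    data={
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
        "resource": "https://management.azure.com/",
    },
)
access_token = token_resp.json()["access_token"]

# Step 2: ask Azure Resource Manager to start the job.
start_url = (
    f"https://management.azure.com/subscriptions/{subscription_id}"
    f"/resourceGroups/{resource_group}/providers/Microsoft.StreamAnalytics"
    f"/streamingjobs/{job_name}/start?api-version=2015-10-01"
)
resp = requests.post(start_url,
                     headers={"Authorization": f"Bearer {access_token}"},
                     json={})
resp.raise_for_status()  # 200/202 means the start request was accepted
```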

Click through for the code.

Hadoop’s S3 Support

Steve Loughran and Sanjay Radia give us a history lesson on Hadoop’s support for Amazon S3:

Hadoop’s ability to work with Amazon S3 storage goes back to 2006 and the issue HADOOP-574, “FileSystem implementation for Amazon S3”. This filesystem client, “s3://”, implemented an inode-style filesystem atop S3: it could support bigger files than S3 itself could at the time, and some of its operations (directory rename and delete) were fast. The s3 filesystem allowed Hadoop to be run in Amazon’s EMR infrastructure, using S3 as the persistent store of work. This piece of open source code predated Amazon’s release of EMR (“Elastic MapReduce”) by over two years. It’s also notable as the piece of work which gained Tom White, author of “Hadoop: The Definitive Guide”, committer status.

It’s interesting to see how this project has matured over the past decade.

Kafka Plus Spark Streaming

Prasad Alle shows how to integrate Kafka with Spark Streaming on AWS:

Stream processing walkthrough

The entire pattern can be implemented in a few simple steps:

  1. Set up Kafka on AWS.

  2. Spin up an EMR 5.0 cluster with Hadoop, Hive, and Spark.

  3. Create a Kafka topic.

  4. Run the Spark Streaming app to process clickstream events (a minimal sketch follows this list).

  5. Use the Kafka producer app to publish clickstream events into the Kafka topic.

  6. Explore the clickstream event data with Spark SQL.
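
As a rough illustration of step 4, a bare-bones PySpark consumer for the clickstream topic might look like the sketch below. The broker address and topic name are placeholders, and it assumes the Kafka 0.8 direct-stream integration available with Spark 2.0 on EMR 5.0 (added via the spark-streaming-kafka-0-8 package at submit time):

```python
# Minimal Spark Streaming consumer for a Kafka clickstream topic.
# Submit with the spark-streaming-kafka-0-8 package on the classpath.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="ClickstreamProcessor")
ssc = StreamingContext(sc, batchDuration=10)   # 10-second micro-batches

stream = KafkaUtils.createDirectStream(
    ssc,
    topics=["clickstream"],                                     # placeholder topic
    kafkaParams={"metadata.broker.list": "kafka-broker:9092"},  # placeholder broker
)

# Each record is a (key, value) pair; count events per batch as a smoke test.
stream.map(lambda kv: kv[1]).count().pprint()

ssc.start()
ssc.awaitTermination()
```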

This is a pretty easy-to-follow walkthrough with some good tips at the end.

SKLearn To Azure ML

David Crook shows how to build a model using Python’s scikit-learn library and then operationalize it in Azure ML:

Why Model Outside Azure ML?

Sometimes you run into limitations around speed or data size, or perhaps you just iterate better on your own workstation.  I find myself significantly faster on my workstation, or in a Jupyter notebook that lives on a big ol’ server, when doing my experiments.  Modeling outside Azure ML allows me to use the full capabilities of whatever infrastructure and framework I want for training.

So Why Operationalize with Azure ML?

Azure ML has several benefits, such as auto-scale, token generation, high-speed Python execution modules, API versioning, sharing, and tight PaaS integration with things like Stream Analytics, among many others.  This really does make life easier for me.  Sure, I can deploy a Flask app via Docker somewhere, but then I need to worry about things like load balancing and security, and I really just don’t want to do that.  I want to build a model, deploy it, and move on to the next one.  My value is A.I., not web management, so the more time I spend delivering my value, the more impactful I can be.
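
To make that split concrete, here is a hedged sketch of both halves: train locally with scikit-learn, then call the model after it has been operationalized as a classic Azure ML web service. The endpoint URL, API key, feature names, and data are all placeholders; the request body follows the classic Azure ML request-response format:

```python
# Sketch: model on your own workstation, score through Azure ML once deployed.
# Everything in angle brackets is a placeholder.
import requests
from sklearn.linear_model import LogisticRegression

# 1) Train locally with whatever framework and hardware you like.
X_train = [[0.1, 1.2], [0.4, 0.9], [1.5, 0.3], [1.7, 0.2]]
y_train = [0, 0, 1, 1]
model = LogisticRegression().fit(X_train, y_train)
print(model.predict([[0.2, 1.1]]))   # local sanity check

# 2) After operationalizing the model in Azure ML, score new rows over HTTP.
endpoint = ("https://<region>.services.azureml.net/workspaces/<workspace-id>"
            "/services/<service-id>/execute?api-version=2.0")
api_key = "<api-key>"

payload = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["feature1", "feature2"],
            "Values": [[0.2, 1.1]],
        }
    },
    "GlobalParameters": {},
}
resp = requests.post(endpoint, json=payload,
                     headers={"Authorization": f"Bearer {api_key}"})
print(resp.json())
```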

Read the whole thing.
