
Category: U-SQL

Overlapping Ranges Using U-SQL

Michael Rys explains how to merge overlapping ranges of data using U-SQL:

If you look at the problem, you will at first notice that you want to define something like a user-defined aggregation to combine the overlapping time intervals. However, if you look at the input data, you will notice that since the data is not ordered, you will either have to maintain the state for all possible intervals and then merge disjoint intervals as bridging intervals appear, or you need to preorder the intervals for each user name to make the merging of the intervals easier.

The ordered aggregation is simpler to scale out, but U-SQL does not provide ordered user-defined aggregators (UDAGGs) yet. In addition, UDAGGs normally produce one row per group, while in this case, I may have multiple rows per group if the ranges are disjoint.

Luckily, U-SQL provides a scalable user-defined operator called a reducer which gives us the ability to aggregate a set of rows based on a grouping key set using custom code.

There are some good insights here, so read the whole thing.
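
To make that concrete, here is a rough sketch of the pattern (not Michael's exact code; the rowset, column, and class names are invented): a REDUCE expression with a PRESORT clause, plus a C# code-behind reducer that walks each user's presorted intervals and merges the overlapping ones.

```
// U-SQL script: reduce each user's rows, presorted by interval start.
@intervals =
    EXTRACT user string, begin DateTime, end DateTime
    FROM "/input/intervals.csv"
    USING Extractors.Csv();

@merged =
    REDUCE @intervals
    PRESORT begin
    ON user
    PRODUCE user string, begin DateTime, end DateTime
    USING new MergeExample.RangeReducer();

OUTPUT @merged TO "/output/merged.csv" USING Outputters.Csv();
```

```
// C# code-behind (.usql.cs): merge the presorted intervals within one group.
using System;
using System.Collections.Generic;
using Microsoft.Analytics.Interfaces;

namespace MergeExample
{
    [SqlUserDefinedReducer(IsRecursive = false)]
    public class RangeReducer : IReducer
    {
        public override IEnumerable<IRow> Reduce(IRowset input, IUpdatableRow output)
        {
            string user = null;
            DateTime? curBegin = null, curEnd = null;

            foreach (var row in input.Rows)      // already ordered by begin thanks to PRESORT
            {
                user = row.Get<string>("user");
                var begin = row.Get<DateTime>("begin");
                var end = row.Get<DateTime>("end");

                if (curEnd == null)              // first interval in the group
                {
                    curBegin = begin; curEnd = end;
                }
                else if (begin <= curEnd)        // overlaps or touches the current range: extend it
                {
                    if (end > curEnd) curEnd = end;
                }
                else                             // disjoint: emit the finished range, start a new one
                {
                    output.Set("user", user);
                    output.Set("begin", curBegin.Value);
                    output.Set("end", curEnd.Value);
                    yield return output.AsReadOnly();
                    curBegin = begin; curEnd = end;
                }
            }

            if (curEnd != null)                  // emit the final range
            {
                output.Set("user", user);
                output.Set("begin", curBegin.Value);
                output.Set("end", curEnd.Value);
                yield return output.AsReadOnly();
            }
        }
    }
}
```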


U-SQL Parallelism

Saveen Reddy discusses parallelism with U-SQL jobs:

Previously, a Data Lake Analytics account was limited to a total of 60 units of parallelism available to 3 concurrent U-SQL jobs, and each U-SQL job was limited to using a maximum of 20 units of parallelism. However, sometimes developers want to run a U-SQL job that uses more parallelism. For example, they might want two concurrently running jobs that each use 30 units. Today, we are removing the per-job limit. Now you can concurrently run jobs that use any amount of parallelism, as long as the total for the running jobs doesn’t go beyond the maximum for your account (currently 60 units of parallelism).

If you are already running U-SQL jobs, there’s no need to worry. This change doesn’t require any action on your part. Your jobs run just as they did before. But now, if you want to you can take advantage of all the parallelism in your account.

This doesn’t change the available performance units, so there’s no billing change.


Recalculating Days

Brian Mitchell shows how to re-calculate prior days in Azure Data Lake using partitioning:

The question is what is the right time period to use? The answer is it depends on the size of your partitions. Generally, for managed tables in U-SQL, you want to target about 1 GB per partition. So, if you are bringing in, say, 800 MB per day, then daily partitions are about right. If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.

In this post, I’d like to take a look at two common scenarios that people run into. The first is a full re-compute of a partition’s data and the second is a partial re-compute of a partition. The examples I will be using are based off of the U-SQL Ambulance Demos on GitHub and will be added to the solution for ease of your consumption.

The ability to reprocess data is vital in any ETL or ELT process.
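
As a rough illustration of the pattern (the table, column, and path names below are made up rather than taken from Brian's scripts), a full re-compute of one day's partition boils down to emptying that partition and reloading it from the raw files:

```
// Recompute a single day's partition in a date-partitioned managed table.
DECLARE @day DateTime = new DateTime(2016, 7, 1);

// Empty out the partition being recomputed (assumed to already exist)...
ALTER TABLE dbo.DriverEvents TRUNCATE PARTITION (@day);

// ...then reload just that day from the raw files, using the date encoded in the path.
@raw =
    EXTRACT vehicle_id int,
            event_time DateTime,
            event_date DateTime   // virtual column derived from the file path
    FROM "/raw/driverevents/{event_date:yyyy}/{event_date:MM}/{event_date:dd}/{*}.csv"
    USING Extractors.Csv();

INSERT INTO dbo.DriverEvents (vehicle_id, event_time, event_date)
SELECT vehicle_id, event_time, event_date
FROM @raw
WHERE event_date == @day;
```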


U-SQL Movie Recommender

Dave Ballantyne introduces us to U-SQL via a movie recommender:

What follows is an overview of my experiments that I have published into a GitHub repo. The “Examples” folder is what I would term “simple learnings”, and “Full Scripts” are scripts that to a lesser or greater extent do something “useful”. I’m also not suggesting that anything here is “best practice” or that method A performs better than method B; I simply do not have the required size of data to make that call. My aim was to learn the language.

TLDR: Check out the script MovieLens09-CosineSimilarityFromCSVWithMax.usql for a U-SQL movie recommender.

U-SQL was introduced last year, but word of mouth about the language has been quite limited to date.  I’ll be interested in seeing what other examples pop up over the next few months.
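
If you just want the flavor of the approach, here is a rough, hand-rolled sketch of item-item cosine similarity over a ratings file (not Dave's actual script; paths and column names are invented, and the file is assumed to have no header row):

```
// Load user/movie/rating triples.
@ratings =
    EXTRACT UserId int, MovieId int, Rating double
    FROM "/MovieLens/ratings.csv"
    USING Extractors.Csv();

// Dot product of each movie pair's rating vectors (over users who rated both).
@dot =
    SELECT a.MovieId AS MovieA,
           b.MovieId AS MovieB,
           SUM(a.Rating * b.Rating) AS Dot
    FROM @ratings AS a
         INNER JOIN @ratings AS b ON a.UserId == b.UserId
    WHERE a.MovieId < b.MovieId
    GROUP BY a.MovieId, b.MovieId;

// Length (norm) of each movie's rating vector.
@norm =
    SELECT MovieId, Math.Sqrt(SUM(Rating * Rating)) AS Norm
    FROM @ratings
    GROUP BY MovieId;

// Cosine similarity = dot product divided by the product of the norms.
@similarity =
    SELECT d.MovieA, d.MovieB, d.Dot / (na.Norm * nb.Norm) AS Cosine
    FROM @dot AS d
         INNER JOIN @norm AS na ON d.MovieA == na.MovieId
         INNER JOIN @norm AS nb ON d.MovieB == nb.MovieId;

OUTPUT @similarity TO "/output/movie-similarity.csv" USING Outputters.Csv();
```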


Get Started With U-SQL

Microsoft is pushing U-SQL pretty hard.  Here’s a tutorial by Jonathan Gao to whet your appetite:

U-SQL is a language that unifies the benefits of SQL with the expressive power of your own code to process all data at any scale. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across relational stores such as Azure SQL Database. It enables you to process unstructured data by applying schema on read, insert custom logic and UDFs, and includes extensibility to enable fine-grained control over how to execute at scale. To learn more about the design philosophy behind U-SQL, please refer to this Visual Studio blog post.

You do need Data Lake Tools for Visual Studio, but it looks like you can run it locally.
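
If you want a quick taste before digging into the tutorial, the canonical first script has roughly this shape: apply a schema on read to a TSV file, aggregate, and write the result back out (paths here assume the SearchLog sample data that ships with the tutorial):

```
@searchlog =
    EXTRACT UserId int,
            Start DateTime,
            Region string,
            Query string,
            Duration int,
            Urls string,
            ClickedUrls string
    FROM "/Samples/Data/SearchLog.tsv"
    USING Extractors.Tsv();

// Total query duration by region, computed with ordinary SQL-style aggregation.
@rgn =
    SELECT Region, SUM(Duration) AS TotalDuration
    FROM @searchlog
    GROUP BY Region;

OUTPUT @rgn
TO "/output/SearchLog-duration-by-region.csv"
USING Outputters.Csv();
```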

The VS blog had something a month ago on the topic.  I’m not saying get on it…yet…
