Overlapping Ranges Using U-SQL

Michael Rys explains how to merge overlapping ranges of data using U-SQL:

If you look at the problem, you will at first notice that you want to define something like a user-defined aggregation to combine the overlapping time intervals. However, if you look at the input data, you will notice that since the data is not ordered, you will either have to maintain the state for all possible intervals and then merge disjoint intervals as bridging intervals appear, or you need to preorder the intervals for each user name to make the merging of the intervals easier.

The ordered aggregation is simpler to scale out, but U-SQL does not provide ordered user-defined aggregators (UDAGGs) yet. In addition, UDAGGs normally produce one row per group, while in this case, I may have multiple rows per group if the ranges are disjoint.

Luckily, U-SQL provides a scalable user-defined operator called a reducer which gives us the ability to aggregate a set of rows based on a grouping key set using custom code.
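To make the "preorder, then merge" idea concrete, here is a minimal sketch in Python rather than U-SQL. It is not the code from the original post. It only illustrates the logic a reducer could apply to each group once that group's rows are sorted by start time. The `Interval` type, the `merge_overlapping` function, and the use of plain integers for times are all assumptions made for illustration, as is the choice to treat intervals that merely touch as overlapping.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple


class Interval(NamedTuple):
    """Hypothetical row shape: a user name plus a begin/end pair."""
    user: str
    begin: int
    end: int


def merge_overlapping(rows: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping intervals per user; may emit several rows per user."""
    # Group rows by user, mirroring the grouping key set of a reducer.
    by_user = defaultdict(list)
    for row in rows:
        by_user[row.user].append(row)

    merged: List[Interval] = []
    for user in sorted(by_user):
        current_begin = None
        current_end = None
        # Pre-sort by begin so a single pass can merge the intervals.
        for row in sorted(by_user[user], key=lambda r: r.begin):
            if current_end is None:
                # First interval for this user.
                current_begin, current_end = row.begin, row.end
            elif row.begin <= current_end:
                # Overlaps (or touches, by assumption) the running interval.
                current_end = max(current_end, row.end)
            else:
                # Disjoint: emit the running interval and start a new one.
                merged.append(Interval(user, current_begin, current_end))
                current_begin, current_end = row.begin, row.end
        if current_end is not None:
            merged.append(Interval(user, current_begin, current_end))
    return merged


rows = [
    Interval("ABC", 5, 8),
    Interval("ABC", 1, 3),
    Interval("ABC", 2, 6),
    Interval("XYZ", 10, 12),
    Interval("XYZ", 20, 25),
]
result = merge_overlapping(rows)
```

Because the rows are sorted before the pass, only one running interval per user has to be kept in memory, and a user with disjoint ranges (such as the hypothetical "XYZ" above) yields more than one output row, which is the behavior a one-row-per-group aggregator cannot provide.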

There are some good insights here, so read the whole thing.
