Recalculating Days

Brian Mitchell shows how to re-calculate prior days in Azure Data Lake using partitioning:

The question is: what is the right time period to use? The answer is that it depends on the size of your partitions. Generally, for managed tables in U-SQL, you want to target about 1 GB per partition. So, if you are bringing in, say, 800 MB per day, then daily partitions are about right. If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.
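To make the sizing rule concrete, here is a rough sketch of the arithmetic in Python. It is not from Brian's post: the function names, the list of candidate granularities, the 30-day month, and the "closest to target on a log scale" rule are all assumptions chosen for illustration. Only the 1 GB target and the two sample volumes come from the quote above.

```python
import math

# Target partition size from the quoted guidance (about 1 GB per partition).
TARGET_PARTITION_GB = 1.0

# Hypothetical candidate granularities, as partitions per day.
# "monthly" assumes a 30-day month purely for illustration.
PARTITIONS_PER_DAY = {
    "monthly": 1 / 30,
    "daily": 1,
    "hourly": 24,
}


def partition_size_gb(daily_gb: float, granularity: str) -> float:
    """Average size of one partition for a given daily ingest volume."""
    return daily_gb / PARTITIONS_PER_DAY[granularity]


def suggest_granularity(daily_gb: float) -> str:
    """Pick the granularity whose partition size is closest to the target.

    Closeness is measured on a log scale so that being 2x too big and
    2x too small count the same (an assumption, not a U-SQL rule).
    """
    return min(
        PARTITIONS_PER_DAY,
        key=lambda g: abs(math.log(partition_size_gb(daily_gb, g) / TARGET_PARTITION_GB)),
    )


if __name__ == "__main__":
    for daily_gb in (0.8, 20.0):
        choice = suggest_granularity(daily_gb)
        size = partition_size_gb(daily_gb, choice)
        print(f"{daily_gb} GB/day -> {choice} partitions (~{size:.2f} GB each)")
```

With the two volumes from the quote, this picks daily partitions for 0.8 GB per day and hourly partitions for 20 GB per day, which matches the guidance. Treat it as a starting point; real data rarely arrives evenly across the day.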

In this post, I’d like to take a look at two common scenarios that people run into. The first is a full re-compute of a partition's data and the second is a partial re-compute of a partition. The examples I will be using are based on the U-SQL Ambulance Demos on GitHub and will be added to the solution for ease of consumption.

The ability to reprocess data is vital in any ETL or ELT process.

