Recalculating Days

Brian Mitchell shows how to re-calculate prior days in Azure Data Lake using partitioning:

The question is what is the right time period to use? The answer is it depends on the size of your partitions.  Generally, for managed tables in U-SQL, you want to target about 1 GB per partition.  So, if you are bringing in say 800 mb per day then daily partitions are about right.  If instead you are bringing in 20 GB per day, you should look at hourly partitions of the data.

In this post, I’d like to take a look at two common scenarios that people run into.  The first is full re-compute of partitions data and the second is a partial re-compute of a partition.  The examples I will be using are based off of the U-SQL Ambulance Demo’s on Github and will be added to the solution for ease of your consumption.

The ability to reprocess data is vital in any ETL or ELT process.

Related Posts

Security Improvements In Kafka And Confluent Platform

Vahid Fereydouny demonstrates a number of security improvements made to Apache Kafka 2.0 as well as Confluent Platform 5.0: Over the past several quarters, we have made major security enhancements to Confluent Platform, which have helped many of you safeguard your business-critical applications. With the latest release, we increased the robustness of our security feature […]

Read More

SparkSession Versus SparkContext

Abhishek Baranwal explains the differences between the SparkSession object and the SparkContext object when writing Spark code: Prior to spark 2.0, SparkContext was used as a channel to access all spark functionality. The spark driver program uses sparkContext to connect to the cluster through resource manager. SparkConf is required to create the spark context object, […]

Read More

Categories

May 2016
MTWTFSS
« Apr Jun »
 1
2345678
9101112131415
16171819202122
23242526272829
3031