Re-Shaping Data Flows

Maneesh Varshney explains some methods to trim the fat out of analytical data flows:

Big data comes in a variety of shapes. The Extract-Transform-Load (ETL) workflows are more or less stripe-shaped (left panel in the figure above) and produce an output of a similar size to the input. Reporting workflows are funnel-shaped (middle panel in the figure above) and progressively reduce the data size by filtering and aggregating.

However, a wide class of problems in analytics, relevance, and graph processing have a rather curious shape of widening in the middle before slimming down (right panel in the figure above). It gets worse before it gets better.

In this article, we take a deeper dive into this exploding middle shape: understanding why it happens, why it’s a problem, and what can we do about it. We share our experiences of real-life workflows from a spectrum of fields, including Analytics (A/B experimentation), Relevance (user-item feature scoring), and Graph (second degree network/friends-of-friends).

The examples relate directly to Hadoop, but are applicable in other data platforms as well.

Related Posts

Kafka Streams Basics

Anuj Saxena walks through Kafka Streams and provides a quick example: The features provided by Kafka Streams: Highly scalable, elastic, distributed, and fault-tolerant application. Stateful and stateless processing. Event-time processing with windowing, joins, and aggregations. We can use the already-defined most common transformation operation using Kafka Streams DSL or the lower-level processor API, which allow us […]

Read More

R For Apache Impala

Ian Cook describes implyr, an R interface for Apache Impala: dplyr provides a grammar of data manipulation, consisting of set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed […]

Read More


June 2017
« May Jul »