Concurrency In Hadoop Using ZooKeeper

Kevin Feasel



Garima Dosi discusses an architecture using ZooKeeper to introduce some limited protections for concurrent access in HDFS:

The ZooKeeper nodes topology as per the design looks like this. ZooKeeper works like a filesystem starting with a root directory followed with several nodes (analogous to folders) and finally the data nodes (analogous to files). The circles in the image represent the name of a property/folder that we are trying to maintain and the rounded boxes are the values/files for those properties/folders.

So, the image above shows that the “global version” is 100 and there are 10 & 20 read requests being executed on versions 98 and 99 respectively and since there is a write request in progress, no other write request would be taken up until it completes.

This feels a little overly complicated to me.

Related Posts

Stream-To-Stream Joins In Spark

Ayush Tiwari shows how to join a pair of streams in Apache Spark 2.3: In Spark 2.3, it added support for stream-stream joins, i.e, we can join two streaming Datasets/DataFrames and in this blog we are going to see how beautifully spark now give support for joining the two streaming dataframes. I this example, I […]

Read More

Spark: DataFrame To RDD For Data Cleansing

Gilad Moscovitch walks us through a common data cleansing problem with Spark data frames: A problem can arise when one of the inner fields of the json, has undesired non-json values in some of the records. For instance, an inner field might contains HTTP errors, that would be interpreted as a string, rather than as a […]

Read More


August 2016
« Jul Sep »