HDFS Erasure Coding

Kevin Feasel

2018-04-02

Hadoop

Ayush Tiwari explains how HDFS Erasure Coding lets us keep the same data durability while improving storage efficiency:

In this particular example, when you look at the codeword, actual data cells are 6(blue cells) & 3(red cells) are the parity cells which are simply obtained by multiplied our data cells to generation matrix.
Storage failure can be recovered by the multiplying inverse of generator matrix with the extended codewords as long as ‘k’ out of ‘k+m’ cells are available.
Therefore, here Data Durability is 3 as it can handle 2 simultaneous failure, Storage Efficiency is 67% (as we are using only 1 extra block i.e. 6/9) & only we need to store half number of cells as compared to original number of cells, we can conclude that we also have only 50% overhead in it.

It’s a good explanation of one of the biggest improvements to HDFS over the past several years.

Related Posts

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications: When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running […]

Read More

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]

Read More

Categories

April 2018
MTWTFSS
« Mar May »
 1
2345678
9101112131415
16171819202122
23242526272829
30