Understanding HDFS Disk Checks

Xiao Chen explains how the HDFS Disk Checker works for data nodes:

The function of block scanner is to scan block data to detect possible corruptions. Since data corruption may happen at any time on any block on any DataNode, it is important to identify those errors in a timely manner. This way, the NameNode can remove the corrupted blocks and re-replicate accordingly, to maintain data integrity and reduce client errors. On the other hand, we don’t want to utilize too many resources, so that disk I/O can still serve actual requests.

Therefore, block scanner needs to make sure that suspicious blocks are scanned relatively quickly, and other blocks are scanned every once in awhile, at a relatively lower frequency, without significant I/O usage.

This is a nice article for operations folks who own Hadoop clusters.

Related Posts

Erasure Coding In Hadoop

Guy Shilo explains erasure coding, a new feature in Hadoop 3: The benefits are, of course, space-saving, and for large files also improved performance (blocks striped across datanodes can be read in parallel, and less blocks are written because there is no x3 replication). The larger the file the more notable is the performance gain. […]

Read More

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More

Categories

December 2016
MTWTFSS
« Nov Jan »
 1234
567891011
12131415161718
19202122232425
262728293031