Hadoop: DAS Or NAS?

Jagdish Mirani asks whether you should prefer Direct Attached Storage (DAS) or Network Attached Storage (NAS) for your Hadoop cluster:

If you want to spin up an Apache Hadoop cluster, you need to grapple with the question of how to attach your disks. Historically, this decision has favored direct attached storage (DAS). This approach is in keeping with the fundamental Hadoop principle of moving processing to a where the data lives, thereby taking advantage of disk locality to optimize performance. Disk locality is so core to Hadoop that virtually any description of Hadoop starts with this.

The alternative is to use network attached storage (NAS). In contrast to DAS, NAS separates the compute and storage layers so that storage can be shared across a number of servers by shipping data over the network. Historically, this heavy dependence on the network made NAS an order of magnitude slower. Remember, the state of the art was 1GbE networks, and switches were slower and more expensive. I/O requirements for demanding Hadoop-based applications could only be met by DAS.

This is a very interesting discussion.  In my limited experience, I’ve had trouble selling operations teams on DAS, given the increased ops effort required to keep a bunch of attached disks going.  Hat tip Ari Amster.

Related Posts

It’s All ETL (Or ELT) In The End

Robin Moffatt notes that ETL (and ELT) doesn’t go away in a streaming world: In the past we used ETL techniques purely within the data-warehousing and analytic space. But, if one considers why and what ETL is doing, it is actually a lot more applicable as a broader concept. Extract: Data is available from a source system Transform: We […]

Read More

Flint: Time Series With Spark

Li Jin and Kevin Rasmussen cover the concepts of Flint, a time-series library built on Apache Spark: Time series analysis has two components: time series manipulation and time series modeling. Time series manipulation is the process of manipulating and transforming data into features for training a model. Time series manipulation is used for tasks like data […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031