Measuring HDFS Cache Performance Gains

Kevin Feasel

2019-04-15

Hadoop

Guy Shilo tries out the HDFS centralized cache:

HDFS offers a caching mechanism that takes advantage of the data nodes' memory. Blocks are loaded into memory and pinned there so that when a client requests those blocks, they can be served directly from memory, which is much faster than disk. There are some third-party products out there that do the same, but this option comes with Hadoop out of the box.

Hadoop has a special set of commands for managing this cache – the cacheadmin commands.

You must explicitly cache a directory or a file. If you cache a directory, the caching is not recursive, so subdirectories will not be cached automatically. The full documentation can be found here. I was curious to see if Cloudera has integrated cache commands into their Cloudera Manager, but was surprised to see that their documentation about it is basically a copy of the Apache Hadoop guide and you still have to use the command-line cacheadmin.
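Because directory caching is not recursive, each directory you want pinned needs its own cache directive. The following is a minimal sketch in Python that only builds the `hdfs cacheadmin` command lines (`-addPool`, `-addDirective`, `-listDirectives`) and prints them rather than running them. The pool name `demo_pool`, the paths, and the helper function names are hypothetical, chosen purely for illustration; they do not come from Guy's post.

```python
import shlex


def add_pool_cmd(pool):
    """Build the command that creates a cache pool."""
    return ["hdfs", "cacheadmin", "-addPool", pool]


def add_directive_cmd(path, pool, replication=None):
    """Build the command that caches a single file or directory."""
    cmd = ["hdfs", "cacheadmin", "-addDirective", "-path", path, "-pool", pool]
    if replication is not None:
        # Optional: number of cached replicas to keep per block.
        cmd += ["-replication", str(replication)]
    return cmd


def cache_plan(pool, paths):
    """Return the ordered commands: create the pool, add one directive
    per path (subdirectories must be listed explicitly, since caching
    is not recursive), then list the directives with statistics."""
    plan = [add_pool_cmd(pool)]
    plan += [add_directive_cmd(p, pool) for p in paths]
    plan.append(["hdfs", "cacheadmin", "-listDirectives", "-stats", "-pool", pool])
    return plan


def render(cmd):
    """Quote each argument so the line can be pasted into a shell."""
    return " ".join(shlex.quote(arg) for arg in cmd)


if __name__ == "__main__":
    # Hypothetical pool and paths; replace with your own.
    for cmd in cache_plan("demo_pool", ["/data/sales", "/data/sales/2019"]):
        print(render(cmd))
```

Printing the commands instead of executing them keeps the sketch safe to run on a machine without a Hadoop client. On a cluster, you could review the output and run each line by hand.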

Click through to see how it performed in Guy’s scenario.

