HBase Performance Tips

Ashish Thapliyal has nine tips for optimizing HBase performance:

Does your RowKey’s looks like 1,2,3…….. or 00000001, 00000002, 00000003, or do you have Row Key that starts with date-time (starting with the year)? If you answered yes, bad news is that HBase will not scale for you, you have so many options to improve the HBase performance but there is nothing that will compensate for the bad rowkey design.

When rowkey is in sorted order, all the writes go to the same region and other regions will sit ideal doing nothing. you will see one of your node is very stressed trying to cope up with all the writes where as other nodes are thanking you for not giving them enough work. So, always salt your keys by adding random numbers or characters to the row key prefix.

If you are using Phoenix on top of HBase, Phoenix provides a way to transparently salt the row key with a salting byte for a particular table. You need to specify this in table creation time by specifying a table property “SALT_BUCKETS” typical practice is to set the value of SALT_BUCKET =number of region server

I think the biggest one is to design your data structures correctly.  This is particularly important if you’re coming at it from a relational background and are thinking in terms of what makes relational databases fast.

Related Posts

Testing Kafka Streams Applications

Yeva Byzek continues her series on testing Kafka-based streaming applications: When you create a stream processing application with Kafka’s Streams API, you create a Topologyeither using the StreamsBuilder DSL or the low-level Processor API. Normally, the topology runs with the KafkaStreams class, which connects to a Kafka cluster and begins processing when you call start(). For testing though, connecting to a running […]

Read More

Auto ML With SQL Server 2019 Big Data Clusters

Marco Inchiosa has a model scenario for using Big Data Clusters to scale out a machine learning problem: H2O provides popular open source software for data science and machine learning on big data, including Apache SparkTM integration. It provides two open source python AutoML classes: h2o.automl.H2OAutoML and pysparkling.ml.H2OAutoML. Both APIs use the same underlying algorithm implementations, […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930