Kafka Partitioning Strategies

Kevin Feasel

2018-03-21

Hadoop

Amy Boyle shares some thoughts on Kafka partitioning strategy:

If you have enough load that you need more than a single instance of your application, you need to partition your data. The producer clients decide which topic partition data ends up in, but it’s what the consumer applications will do with that data that drives the decision logic. If possible, the best partitioning strategy to use is random.

However, you may need to partition on an attribute of the data if:

  • The consumers of the topic need to aggregate by some attribute of the data.

  • The consumers need some sort of ordering guarantee.

  • Another resource is a bottleneck and you need to shard data.

  • You want to concentrate data for the efficiency of storage and/or indexing.

Good advice.

Related Posts

Comparing Performance: HBase1 vs HBase2

Surbhi Kochhar takes us through performance improvements between HBase version 1 and HBase version 2: We are loading the YCSB dataset with 1000,000,000 records with each record 1KB in size, creating total 1TB of data. After loading, we wait for all compaction operations to finish before starting workload test. Each workload tested was run 3 […]

Read More

The Transaction Log in Delta Tables

Burak Yavuz, et al, explain how the transaction log works with Delta Tables in Apache Spark: When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit […]

Read More

Categories

March 2018
MTWTFSS
« Feb Apr »
 1234
567891011
12131415161718
19202122232425
262728293031