Connecting Spark and Riak

Kevin Feasel

2016-08-15

Riak, Spark

Pavel Hardak discusses the Riak Connector for Apache Spark:

Modeled using principles from the “AWS Dynamo” paper, Riak KV buckets are good for scenarios which require frequent, small data-sized operations in near real-time, especially workloads with reads, writes, and updates — something which might cause data corruption in some distributed databases or bring them to “crawl” under bigger workloads. In Riak, each data item is replicated on several nodes, which allows the database to process a huge number of operations with very low latency while having unique anti-corruption and conflict-resolution mechanisms. However, integration with Apache Spark requires a very different mode of operation — extracting large amounts of data in bulk, so that Spark can do its “magic” in memory over the whole data set. One approach to solve this challenge is to create a myriad of Spark workers, each asking for several data items. This approach works well with Riak, but it creates unacceptable overhead on the Spark side.

This is interesting in that it ties together two data platforms whose strengths are almost the opposite:  one is great for fast, small writes of single records and the other is great for operating on large batches of data.

Related Posts

Summarizing Improvements In Spark 2.4

Anmol Sarna summarizes Apache Spark 2.4 and pushes his meme game at the same time: The next major enhancement was the addition of a lot of new built-in functions, including higher-order functions, to deal with complex data types easier.Spark 2.4 introduced 24 new built-in functions, such as  array_union, array_max/min, etc., and 5 higher-order functions, such as transform, filter, etc.The entire […]

Read More

Working With Images In Spark 2.4

Tomas Nykodym and Weichen Xu give us an update on working with images in the most recent version of Apache Spark: An image data source addresses many of these problems by providing the standard representation you can code against and abstracts from the details of a particular image representation.Apache Spark 2.3 provided the ImageSchema.readImages API (see Microsoft’s post […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031