The Spark Ecosystem

Kevin Feasel



Frank Evans gives an overview of what the Apache Spark ecosystem looks like:

The built-in machine learning library in Spark is broken into two parts: MLlib and KeystoneML.

  • MLlib: This is the principal library for machine learning tasks. It includes both algorithms and specialized data structures. Machine learning algorithms for clustering, regression, classification, and collaborative filtering are available. Data structures such as sparse and dense matrices and vectors, as well as supervised learning structures that act like vectors but denote the features of the data set from its labels, are also available. This makes feeding data into a machine learning algorithm incredibly straightforward and does not require writing a bunch of code to denote how the algorithm should organize the data inside itself.

  • KeystoneML: Like the oil pipeline it takes its name from, KeystoneML is built to help construct machine learning pipelines. The pipelines help prepare the data for the model, build and iteratively test the model, and tune the parameters of the model to squeeze out the best performance and capability.

Whereas Hadoop’s ecosystem is large and sprawling, the Spark ecosystem tends to be more tightly constrained.  The nice part about Spark is that it plays nicely with the Hadoop ecosystem—you can have a cluster or architecture with Spark and Hadoop-centric technologies (Storm, Kafka, Hive, Flume, etc. etc.) working together quite nicely.

Related Posts

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More

Converting Spark RDDs To DataFrames

Franco Cano shows us how we can convert a Resilient Distributed Dataset in Spark to a DataFrame: Sometimes you need transform you RRDs to DataFrames because DataFrames have a lot optimization options.Let’s see how this is done. This works, but there is a performance cost to converting a large RDD to a DataFrame (or vice versa). […]

Read More


September 2016
« Aug Oct »