The Spark Ecosystem

Kevin Feasel

2016-09-02

Spark

Frank Evans gives an overview of what the Apache Spark ecosystem looks like:

The built-in machine learning library in Spark is broken into two parts: MLlib and KeystoneML.

  • MLlib: This is the principal library for machine learning tasks. It includes both algorithms and specialized data structures. Machine learning algorithms for clustering, regression, classification, and collaborative filtering are available. Data structures such as sparse and dense matrices and vectors, as well as supervised learning structures that act like vectors but denote the features of the data set from its labels, are also available. This makes feeding data into a machine learning algorithm incredibly straightforward and does not require writing a bunch of code to denote how the algorithm should organize the data inside itself.

  • KeystoneML: Like the oil pipeline it takes its name from, KeystoneML is built to help construct machine learning pipelines. The pipelines help prepare the data for the model, build and iteratively test the model, and tune the parameters of the model to squeeze out the best performance and capability.
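To make the MLlib data structures described above a little more concrete, here is a minimal plain-Python sketch of the idea. It does not use the MLlib API: the `SparseVector` and `LabeledPoint` classes and the `dot` helper below are stand-ins written for illustration, and the weights and feature values are made up. The point is that a sparse vector stores only its non-zero entries, and a labeled point keeps the label apart from the features, so an algorithm never has to be told which column is which.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SparseVector:
    """Illustrative stand-in: stores only the non-zero entries of a vector."""
    size: int
    values: Dict[int, float]  # index -> non-zero value

    def to_dense(self) -> List[float]:
        # Expand to a plain list, filling unspecified positions with 0.0.
        return [self.values.get(i, 0.0) for i in range(self.size)]


@dataclass
class LabeledPoint:
    """Illustrative stand-in: pairs a label with its feature vector."""
    label: float
    features: SparseVector


def dot(weights: List[float], point: LabeledPoint) -> float:
    # Only the stored (non-zero) feature entries contribute to the sum.
    return sum(weights[i] * v for i, v in point.features.values.items())


# One training example: label 1.0, four features, two of them non-zero.
point = LabeledPoint(label=1.0, features=SparseVector(4, {0: 2.0, 3: 5.0}))
dense = point.features.to_dense()
score = dot([0.5, 1.0, 1.0, 0.1], point)
```

Because the label travels with the features in a single object, a scoring or training routine can take a collection of such points directly, with no extra code describing how the data is laid out.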

Whereas Hadoop’s ecosystem is large and sprawling, the Spark ecosystem tends to be more tightly constrained. Spark also plays well with the Hadoop ecosystem: you can have a cluster or architecture in which Spark and Hadoop-centric technologies (Storm, Kafka, Hive, Flume, etc.) work together.

