Spark Overview

Jen Underwood provides an overview of the Apache Spark project:

Spark provides a comprehensive framework to manage big data processing with a variety of data set types including text and graph data. It can also handle batch pipelines and real-time streaming data. Using Spark libraries, you can create big data analytics apps in Java, Scala, Clojure, and popular R and Python languages.

Spark brings analytics pros an improved MapReduce type query capability with more performant data processing in memory or on disk. It can be used with datasets that are larger than the aggregate memory in a cluster. Spark also has savvy lazy evaluation of big data queries which helps with workflow optimization and reuse of intermediate results in memory. TheSpark API is easy to learn.

One of my taglines is, Spark is not the future of Hadoop; Spark is the present of Hadoop.  If you want to get into this space, learn how to work with Spark.

Related Posts

Avro Schemas In Kafka

Stephane Maarek explains the value of using Apache Avro as a schema structure for your Kafka topics: Avro has support for primitive types ( int, string, long, bytes, etc…), complex types (enum, arrays, unions, optionals), logical types (dates, timestamp-millis, decimal), and data record (name and namespace). All the types you’ll ever need. Avro has support for embedded documentation. Although documentation is optional, in my workflow I […]

Read More

When Spark Meets Hive

Anna Martin and Rosaria Silipo look at combining HiveQL and SparkQL: We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and […]

Read More

Categories

October 2016
MTWTFSS
« Sep Nov »
 12
3456789
10111213141516
17181920212223
24252627282930
31