Basic Spark Terminology

Kevin Feasel

2016-06-28

Spark

Denny Lee and Jules Damji explain some of the key terms and concepts around Apache Spark:

At the core of Apache Spark is the notion of data abstraction as distributed collection of objects. This data abstraction, called Resilient Distributed Dataset (RDD), allows you to write programs that transform these distributed datasets.

RDDs are immutable distributed collection of elements of your data that can be stored in memory or disk across a cluster of machines. The data is partitioned across machines in your cluster that can be operated in parallel with a low-level API that offers transformations and actions. RDDs are fault tolerant as they track data lineage information to rebuild lost data automatically on failure.

Some of these concepts are new to Spark 2.0, but all are worth learning.

Related Posts

Spark Streaming Using DStreams Or DataFrames?

Yaroslav Tkachenko contrasts the two methods for operating on data with Spark Streaming: Spark Streaming went alpha with Spark 0.7.0. It’s based on the idea of discretized streams or DStreams. Each DStream is represented as a sequence of RDDs, so it’s easy to use if you’re coming from low-level RDD-backed batch workloads. DStreams underwent a lot […]

Read More

Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2: Delta Time TravelTime Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.Sample codeSELECT count() FROM events TIMESTAMP AS OF timestamp_expressionSELECT count() FROM events VERSION AS OF version Time travel looks a bit like temporal tables in SQL Server.

Read More

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930