Basics Of Spark

Kevin Feasel

2016-11-01

Spark

Jen Underwood gives a quick explanation of Spark as well as an introduction to SparkSQL and PySpark:

Spark’s distributed data-sharing concept is called “Resilient Distributed Datasets,” or RDDs. RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. RDDs are created by applying operations called “transformations,” such as map, filter, and groupBy. They can persist in memory for rapid reuse. If an RDD’s data does not fit in memory, Spark will spill it to disk.
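To make the idea of partitioned collections and transformations a little more concrete, here is a toy sketch in plain Python. It is not Spark's API and does not need Spark installed: `ToyRDD` and its internals are hypothetical names invented for illustration, and the method names `parallelize`, `map`, `filter`, and `collect` are borrowed only to mirror the PySpark calls of the same names. The sketch assumes a simplified model in which transformations are merely recorded and only run, partition by partition, when results are requested.

```python
class ToyRDD:
    """Hypothetical, single-process stand-in for an RDD (not Spark's class)."""

    def __init__(self, partitions, lineage=()):
        self.partitions = partitions  # list of lists, one per pretend cluster node
        self.lineage = lineage        # recorded (kind, function) steps, not yet run

    @classmethod
    def parallelize(cls, data, num_partitions):
        # Split the data round-robin across the requested number of partitions.
        return cls([list(data[i::num_partitions]) for i in range(num_partitions)])

    def map(self, fn):
        # A transformation returns a new ToyRDD with one more recorded step.
        return ToyRDD(self.partitions, self.lineage + (("map", fn),))

    def filter(self, predicate):
        return ToyRDD(self.partitions, self.lineage + (("filter", predicate),))

    def compute_partition(self, index):
        # Replay the recorded steps on one partition. Because the steps are
        # kept, a lost partition's result could be rebuilt the same way.
        items = self.partitions[index]
        for kind, fn in self.lineage:
            if kind == "map":
                items = [fn(x) for x in items]
            else:
                items = [x for x in items if fn(x)]
        return items

    def collect(self):
        # Only here is any work done: each partition is computed independently.
        results = []
        for index in range(len(self.partitions)):
            results.extend(self.compute_partition(index))
        return results


numbers = ToyRDD.parallelize(list(range(10)), num_partitions=3)
evens_squared = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(sorted(evens_squared.collect()))  # prints [0, 4, 16, 36, 64]
```

Each partition is processed on its own, which is what lets a real cluster do the work in parallel; the toy version simply loops over the partitions in one process.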

If you’re not familiar with Spark, now’s as good a time as any to learn.
