Basics Of Spark

Kevin Feasel

2016-11-01

Spark

Jen Underwood gives a quick explanation of Spark as well as an introduction to SparkSQL and PySpark:

Spark’s distributed data-sharing concept is called “Resilient Distributed Datasets,” or RDD. RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. RDDs are created by applying operations called “transformations” with map, filter, and groupBy clauses. They can persist in memory for rapid reuse. If an RDD data does not fit in memory, Spark will overflow it to disk.

If you’re not familiar with Spark, now’s as good a time as any to learn.

Related Posts

The Transaction Log in Delta Tables

Burak Yavuz, et al, explain how the transaction log works with Delta Tables in Apache Spark: When a user creates a Delta Lake table, that table’s transaction log is automatically created in the _delta_log subdirectory. As he or she makes changes to that table, those changes are recorded as ordered, atomic commits in the transaction log. Each commit […]

Read More

Spark for .NET Developers

Ed Elliott has a long-form post covering spark-dotnet: The .NET driver is made up of two parts, and the first part is a Java JAR file which is loaded by Spark and then runs the .NET application. The second part of the .NET driver runs in the process and acts as a proxy between the […]

Read More

Categories

November 2016
MTWTFSS
« Oct Dec »
 123456
78910111213
14151617181920
21222324252627
282930