Jen Underwood gives a quick explanation of Spark as well as an introduction to SparkSQL and PySpark:
Spark’s distributed data-sharing concept is called “Resilient Distributed Datasets,” or RDDs. RDDs are fault-tolerant collections of objects partitioned across a cluster that can be queried in parallel and used in a variety of workload types. RDDs are created by applying operations called “transformations,” such as map, filter, and groupBy. They can persist in memory for rapid reuse. If an RDD does not fit in memory, Spark will overflow it to disk.
If you’re not familiar with Spark, now’s as good a time as any to learn.