Spark Data Structures

Kevin Feasel

2017-07-31

Spark

Shubham Agarwal explains the difference between three Spark data structures:

DataFrame(DF) – 

DataFrame is an abstraction which gives a schema view of data. Which means it gives us a view of data as columns with column name and types info, We can think data in data frame like a table in the database.

Like RDD, execution in Dataframe too is lazy triggered.

Read on to learn more about Resilient Distributed Datasets, DataFrames, and DataSets.

Related Posts

Handling Missing Data In Spark

Igor Sorokin explains how to implement DataFrameNaFunctions: Unfortunately, C&P comes in to play, therefore, if at some point in time a default value for ‘trackLength’ is also required, you may end up changing both of these methods. Another disadvantage is that if another similar method, which requires the same default values, is added, code duplication […]

Read More

Diving Into Spark’s Cost-Based Optimizer

Ron Hu, et al, explain how Spark’s cost-based optimizer works: At its core, Spark’s Catalyst optimizer is a general library for representing query plans as trees and sequentially applying a number of optimization rules to manipulate them. A majority of these optimization rules are based on heuristics, i.e., they only account for a query’s structure and ignore […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31