Spark Data Structures

Kevin Feasel

2017-07-31

Spark

Shubham Agarwal explains the difference between three Spark data structures:

DataFrame(DF) – 

DataFrame is an abstraction which gives a schema view of data. Which means it gives us a view of data as columns with column name and types info, We can think data in data frame like a table in the database.

Like RDD, execution in Dataframe too is lazy triggered.

Read on to learn more about Resilient Distributed Datasets, DataFrames, and DataSets.

Related Posts

Understanding A Spark Streaming Workflow

Himanshu Gupta continues a series on structured streaming using Spark Streaming: Here we can clearly see that if new data is pushed to the source, Spark will run the “incremental” query that combines the previous running counts with the new data to compute updated counts. The “Input Table” here is the lines DataFrame which acts as a […]

Read More

Calculating TF-IDF Using Apache Spark

Arseniy Tashoyan shows us how to calculate Term Frequency-Inverse Document Frequency using Apache Spark: TF-IDF is used in a large variety of applications. Typical use cases include: Document search. Document tagging. Text preprocessing and feature vector engineering for Machine Learning algorithms. There is a vast number of resources on the web explaining the concept itself […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31