Tomaz Kastrun continues a series on working with Apache Spark. Part 10 looks at the DataFrame construct:
We have looked at datasets and seen that a dataset is a distributed collection of data. A dataset can be constructed from JVM objects and later manipulated with transformation operations (e.g.: filter(), map(), …). The Dataset API is available in Scala and in Java, but in both Python and R you can also access the columns or rows of a dataset.
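As a rough illustration (my own sketch, not from the original post), here is what those transformation operations look like on a distributed collection in PySpark; the values and lambdas are invented for the example:

```python
# A minimal sketch (not from the original post): lazy transformations on an RDD.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("transformations-sketch").getOrCreate()

# parallelize() distributes a local collection across the cluster
rdd = spark.sparkContext.parallelize(range(10))

# filter() and map() are lazy transformations; collect() triggers the computation
result = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * 10).collect()
print(result)  # [0, 20, 40, 60, 80]
```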
On the other hand, a DataFrame is a dataset organised into named columns. It offers much better optimizations and computations and still resembles a typical table (as we know it from the database world). DataFrames can be constructed from arrays or matrices, from a variety of files, from SQL tables, and from existing datasets (RDDs). The DataFrame API is available in all flavours: Java, Scala, R, and Python, hence its popularity.
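To make those construction routes concrete, here is a hedged PySpark sketch; the file path and column names are hypothetical:

```python
# A hedged sketch of building DataFrames; the path and column names are made up.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dataframe-sketch").getOrCreate()

# From an in-memory collection of rows with named columns
df = spark.createDataFrame([(1, "apple"), (2, "banana")], ["id", "fruit"])

# From an existing RDD of tuples
rdd = spark.sparkContext.parallelize([(3, "cherry"), (4, "plum")])
df_from_rdd = rdd.toDF(["id", "fruit"])

# From a file (hypothetical path), letting Spark infer the schema
df_from_csv = spark.read.csv("/tmp/fruit.csv", header=True, inferSchema=True)

df.show()
```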
Part 11 looks at external R and Python packages and DataFrame support:
When you install Spark, the scope for extension, not only to other languages but also to other packages and systems, is huge. For example, with R you not only harvest the capabilities of distributed and parallel computation, you can also extend the use of the R language itself.
Spark SQL is one of the Spark modules for structured data processing and analysis. Spark provides Spark SQL and also an API for the execution of SQL queries. Spark SQL can read data from a Hive instance, but also from datasets and DataFrames. The communication between Spark SQL and the execution engine will always result in a dataset or DataFrame.
These formats are interchangeable, so interacting with SQL against the result of a different API is possible, and vice versa. Plugging in the Java JDBC or standard ODBC drivers will also give your SQL interface access to different sources. This unification means that developers can easily switch back and forth between the different APIs based on which provides the most natural way to express a given transformation.
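As a brief sketch of that interchangeability (my own example, not the author's), the same aggregation can be expressed as a SQL query or with the DataFrame API, and both return a DataFrame; the table and column names below are made up:

```python
# A hedged sketch of the SQL <-> DataFrame interchange; names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sql-dataframe-interop").getOrCreate()

df = spark.createDataFrame(
    [("books", 12.5), ("toys", 7.0), ("books", 3.2)],
    ["category", "price"]
)

# Register the DataFrame so SQL can see it, then query it with SQL
df.createOrReplaceTempView("sales")
sql_result = spark.sql("SELECT category, SUM(price) AS total FROM sales GROUP BY category")

# The same aggregation expressed with the DataFrame API; both results are DataFrames
api_result = df.groupBy("category").agg(F.sum("price").alias("total"))

sql_result.show()
api_result.show()
```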
With API unification, a user can access Spark SQL using the Scala spark-shell, the Python pyspark shell, or the R sparkR shell.
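For reference, launching those shells from a Spark installation directory typically looks roughly like this (exact paths depend on your setup):

```bash
./bin/spark-shell   # Scala shell
./bin/pyspark       # Python shell
./bin/sparkR        # R shell
```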
DataFrames are so popular that they’ve become the de facto standard for working with data in Spark, and .NET languages only work with DataFrames, not with the raw RDDs.