Using rquery On Databricks

Kevin Feasel

2018-07-27

Hadoop, R, Spark

Nina Zumel and John Mount talk about rquery, a relational data transformation engine for R which runs on Spark:

rquery is based on an appreciation of Codds’ relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Previous, hierarchical, databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).

Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:

  • select (row selection)
  • project (column selection/aggregation)
  • Cartesian product (table joins, row binding, and set difference)
  • extend (derived columns, keyword was in Tutorial-D).

One of the earliest and still most common implementation of Codd’s algebra is SQL. Formally Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.

rquery is another realization of the Codd algebra that implements the above operators, some higher-order operators, and emphasizes a right to left pipe notation. This gives the Spark user an additional way to work effectively.

They include a fairly lengthy example and give a great introduction to the tool.  It’s now officially on my list of stuff to try out.

Related Posts

Databricks Runtime 5.2 Released

Nakul Jamadagni announces Databricks Runtime 5.2: Delta Time TravelTime Travel, released as an Experimental feature, adds the ability to query a snapshot of a table using a timestamp string or a version, using SQL syntax as well as DataFrameReader options for timestamp expressions.Sample codeSELECT count() FROM events TIMESTAMP AS OF timestamp_expressionSELECT count() FROM events VERSION AS OF version Time travel looks a bit like temporal tables in SQL Server.

Read More

Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedence mismatch between data engineers, data scientists, and ML production engineers: Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. […]

Read More

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031