Using rquery On Databricks

Kevin Feasel


Hadoop, R, Spark

Nina Zumel and John Mount talk about rquery, a relational data transformation engine for R which runs on Spark:

rquery is based on an appreciation of Codds’ relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Previous, hierarchical, databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).

Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:

  • select (row selection)
  • project (column selection/aggregation)
  • Cartesian product (table joins, row binding, and set difference)
  • extend (derived columns, keyword was in Tutorial-D).

One of the earliest and still most common implementation of Codd’s algebra is SQL. Formally Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.

rquery is another realization of the Codd algebra that implements the above operators, some higher-order operators, and emphasizes a right to left pipe notation. This gives the Spark user an additional way to work effectively.

They include a fairly lengthy example and give a great introduction to the tool.  It’s now officially on my list of stuff to try out.

Related Posts

Linear Regression in Power BI

Joseph Yeates shows how to implement linear regression in Power BI: The goal of a simple linear model is to fit a line onto this plot to summarize the shape of the data using the equation above. The “a” value is the slope of the fitted line (rise over run) and the “b” value is […]

Read More

Calculating YARN Utilization Metrics

Dmitry Tolpeko shows how you can calculate per-second cluster utilization measures from YARN’s resource manager logs: But even if you query YARN REST API every second it still can only provide a snapshot of the used YARN resources. It does not show which application allocates or releases containers, their memory and CPU capacity, in which […]

Read More


July 2018
« Jun Aug »