Using rquery On Databricks

Kevin Feasel

2018-07-27

Hadoop, R, Spark

Nina Zumel and John Mount talk about rquery, a relational data transformation engine for R which runs on Spark:

rquery is based on an appreciation of Codds’ relational algebra. Codd’s relational algebra is a formal algebra that describes the semantics of data transformations and queries. Previous, hierarchical, databases required associations to be represented as functions or maps. Codd relaxed this requirement from functions to relations, allowing tables that represent more powerful associations (allowing, for instance, two-way multimaps).

Codd’s work allows most significant data transformations to be decomposed into sequences made up from a smaller set of fundamental operations:

  • select (row selection)
  • project (column selection/aggregation)
  • Cartesian product (table joins, row binding, and set difference)
  • extend (derived columns, keyword was in Tutorial-D).

One of the earliest and still most common implementation of Codd’s algebra is SQL. Formally Codd’s algebra assumes that all rows in a table are unique; SQL further relaxes this restriction to allow multisets.

rquery is another realization of the Codd algebra that implements the above operators, some higher-order operators, and emphasizes a right to left pipe notation. This gives the Spark user an additional way to work effectively.

They include a fairly lengthy example and give a great introduction to the tool.  It’s now officially on my list of stuff to try out.

Related Posts

Dealing With Multicollinearity With R

Chaitanya Sagar explains the concept of multicollinearity in linear regressions and how we can mitigate this issue in R: Perfect multicollinearity occurs when one independent variable is an exact linear combination of other variables. For example, you already have X and Y as independent variables and you add another variable, Z = a*X + b*Y, […]

Read More

Your Data’s Not That Big

Larry White throws a bit of cold water on the distributed computing movement: Someone recently told me about a data analysis application written in Python. He managed five Java engineers who built the cluster management and pipeline infrastructure needed to make the analysis run in the 12 hours allotted. They used Python, he said, because […]

Read More

Categories

July 2018
MTWTFSS
« Jun Aug »
 1
2345678
9101112131415
16171819202122
23242526272829
3031