Checking Functional Dependencies In R Data Frames

John Mount shows us how to use the psagg function in wrapr to ensure that functional dependencies are valid:

Notice only grouping columns and columns passed through an aggregating calculation (such as max()) are passed through (the column zis not in the result). Now because y is a function of x no substantial aggregation is going on, we call this situation a “pseudo aggregation” and we have taught this before. This is also why we made the seemingly strange choice of keeping the variable name y (instead of picking a new name such as max_y), we expect the y values coming out to be the same as the one coming in- just with changes of length. Pseudo aggregation (using the projection y[[1]]) was also used in the solutions of the column indexing problem.

Our wrapr package now supplies a special case pseudo-aggregator (or in a mathematical sense: projection): psagg(). It works as follows.

In this post, John calls the act of grouping functional dependencies (where we can determine the value of y based on the value of x, for any number of columns in y or x) pseudo-aggregation.

Related Posts

Kafka And The Differing Aims Of Data Professionals

Kai Waehner argues that there is an impedence mismatch between data engineers, data scientists, and ML production engineers: Data scientists love Python, period. Therefore, the majority of machine learning/deep learning frameworks focus on Python APIs. Both the stablest and most cutting edge APIs, as well as the majority of examples and tutorials use Python APIs. […]

Read More

Solving The Monty Hall Problem With R

Miroslav Rajter builds a Monty Hall problem simulator using R: The original and most simple scenario of the Monty Hall problem is this: You are in a prize contest and in front of you there are three doors (A, B and C). Behind one of the doors is a prize (Car), while behind others is […]

Read More


November 2018
« Oct Dec »