R For Apache Impala

Kevin Feasel

2017-07-24

Hadoop, R

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of set of verbs (including mutate()select()filter()summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

It looks like this project is off to a good start already.

Related Posts

Kafka Topic Reuse

Martin Kleppmann walks through the trade-offs of reusing Apache Kafka topics for different event types: The common wisdom (according to several conversations I’ve had, and according to a mailing list thread) seems to be: put all events of the same type in the same topic, and use different topics for different event types. That line of […]

Read More

The Theory Behind cdata

John Mount has a video explaining the concepts behind cdata: We also have two really nifty articles on the theory and methods: Fluid data reshaping with cdata Coordinatized Data: A Fluid Data Specification Please give it a try! Click through for the video, which I found very helpful in tying together a number of data […]

Read More

Categories

July 2017
MTWTFSS
« Jun Aug »
 12
3456789
10111213141516
17181920212223
24252627282930
31