R For Apache Impala

Kevin Feasel


Hadoop, R

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of a set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

It looks like this project is off to a good start already.
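
To get a feel for the dplyr-to-SQL translation Ian describes, here is a minimal sketch that needs no Impala cluster. It does not use implyr itself; instead it assumes dbplyr's simulated Impala backend (`simulate_impala()`), which only renders SQL and never executes it. The `flights` table and its columns are hypothetical, made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table definition: the columns and values are illustrative only.
# simulate_impala() gives dbplyr a stand-in Impala connection, so the pipeline
# below is translated to SQL but never sent to a server.
flights <- lazy_frame(
  carrier   = "AA",
  dep_delay = 5,
  distance  = 700,
  con       = simulate_impala()
)

# An ordinary dplyr pipeline built from the verbs mentioned in the quote.
query <- flights %>%
  filter(distance > 500) %>%
  group_by(carrier) %>%
  summarise(avg_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(avg_delay))

# Print the SQL that dbplyr generates for the simulated Impala backend.
show_query(query)
```

Against a real cluster, the same pipeline would presumably start from a table reference obtained through an implyr connection (for example via `src_impala()` and `tbl()`) rather than from `lazy_frame()`; the connection details depend on your driver and environment.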


July 2017