R For Apache Impala

Kevin Feasel


Hadoop, R

Ian Cook describes implyr, an R interface for Apache Impala:

dplyr provides a grammar of data manipulation, consisting of a set of verbs (including mutate(), select(), filter(), summarise(), and arrange()) that can be used together to perform common data manipulation tasks. The implyr package helps dplyr translate this grammar into Impala-compatible SQL commands. This gives R users access to Impala’s scale and speed on large distributed datasets while using the same familiar dplyr syntax that they are accustomed to using on local data frames and other data sources. R users can also choose to directly write SQL commands and execute them on Impala using implyr.

implyr builds upon recent work from RStudio and other contributors, including major updates to the packages dplyr and DBI, and new packages dbplyr and odbc. implyr together with these packages enables data scientists and data engineers to more easily interact with Impala through self-service data science tools like Cloudera Data Science Workbench.

It looks like this project is off to a good start already.
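
To get a feel for the verb-to-SQL translation described in the quote, here is a minimal sketch that needs no cluster. It is not implyr itself: it assumes dbplyr's built-in simulated Impala connection (simulate_impala()) as a stand-in for a live one, and the flights table with its carrier and dep_delay columns is made up for illustration.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table definition; no data is read and no Impala cluster is contacted.
# simulate_impala() only tells dbplyr which SQL dialect to generate.
flights <- lazy_frame(
  carrier = "AA",
  dep_delay = 5,
  con = simulate_impala()
)

# Ordinary dplyr verbs, exactly as they would be written for a local data frame.
query <- flights %>%
  filter(dep_delay > 0) %>%
  group_by(carrier) %>%
  summarise(mean_delay = mean(dep_delay, na.rm = TRUE)) %>%
  arrange(desc(mean_delay))

# Print the SQL that dbplyr generates for the pipeline above.
show_query(query)
```

Against a real cluster, the lazy_frame() call would presumably give way to a table reference obtained through an implyr connection (the package exposes src_impala() for this), with the dplyr pipeline itself unchanged.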


July 2017