CSV Import Speeds With H2O

Kevin Feasel



WenSui Liu benchmarks three CSV loading methods in R:

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

I wonder whether there are cases where these results would differ significantly; regardless, when reading a large data file, parallel processing does tend to be faster.
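If you want to check this against your own files, here is a minimal sketch of such a timing comparison. It is not WenSui Liu's benchmark: the synthetic data, the `time_loader()` helper, and the row count are all assumptions made up for illustration, and the SparkR `read.df()` leg is left out because it needs a Spark installation. The H2O leg only runs if the h2o package and a Java runtime are available.

```r
# Write a synthetic CSV so the sketch is self-contained.
# The columns and row count are arbitrary, hypothetical choices.
set.seed(1)
n <- 50000
csv_path <- tempfile(fileext = ".csv")
df <- data.frame(
  id = seq_len(n),
  x  = rnorm(n),
  g  = sample(letters, n, replace = TRUE)
)
write.csv(df, csv_path, row.names = FALSE)

# Hypothetical helper: run a loader once and return elapsed seconds.
time_loader <- function(loader) {
  unname(system.time(loader())["elapsed"])
}

# Baseline: the generic single-threaded reader.
timings <- c(read.csv = time_loader(function() read.csv(csv_path)))

# Optional H2O leg. Skipped when the package is missing, and any
# startup failure (for example, no Java) is reported rather than fatal.
if (requireNamespace("h2o", quietly = TRUE)) {
  tryCatch({
    h2o::h2o.init(nthreads = -1)  # use all available cores
    timings["h2o.importFile"] <-
      time_loader(function() h2o::h2o.importFile(path = csv_path))
    h2o::h2o.shutdown(prompt = FALSE)
  }, error = function(e) message("Skipping H2O: ", conditionMessage(e)))
}

print(timings)
```

A file this small mostly measures fixed overhead, so treat the numbers as a smoke test. To see the effect described in the quote, point `csv_path` at a genuinely large file and repeat each timing several times.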


