CSV Import Speeds With H2O

Kevin Feasel



WenSui Liu benchmarks three CSV loading methods in R:

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

I’d wonder if there are cases where this would vary significantly; regardless, for reading a large data file, parallel processing does tend to be faster.

Related Posts

Probabilities And Poker

Steve Miller has a notebook on 5-card draw probabilities: The population of 5 card draw hands, consisting of 52 choose 5 or 2598960 elements, is pretty straightforward both mathematically and statistically. So of course ever the geek, I just had to attempt to show her how probability and statistics converge. In addition to explaining the […]

Read More

Principal Component Analysis With Stack Overflow Data

Julia Silge explains Principal Component Analysis and shows us an example using Stack Overflow data: We have tidy data, both because that’s what I get when querying our databases and because it is useful for exploratory data analysis when preparing for a machine learning algorithm like PCA. To implement PCA, we need a matrix, and […]

Read More


June 2017
« May Jul »