CSV Import Speeds With H2O

Kevin Feasel

2017-06-26

R

WenSui Liu benchmarks three CSV loading methods in R:

The importFile() function in H2O is extremely efficient due to the parallel reading. The benchmark comparison below shows that it is comparable to the read.df() in SparkR and significantly faster than the generic read.csv().

I wonder whether there are cases where these results would vary significantly; regardless, when reading a large data file, parallel processing does tend to be faster.
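If you want to check this against your own files, here is a minimal sketch of how such a timing comparison might look. It is not the benchmark from the linked post: the synthetic file, its size, and the column names are all made up for illustration, and the SparkR `read.df()` leg is left out to keep the example self-contained. The H2O step only runs if the `h2o` package (and a Java runtime) is available.

```r
# Sketch: time base R's read.csv() against H2O's importFile() on a synthetic file.
# The data, row count, and column names are assumptions for illustration only.
set.seed(42)
n_rows <- 100000
csv_path <- tempfile(fileext = ".csv")
write.csv(
  data.frame(id = seq_len(n_rows), x = rnorm(n_rows), y = runif(n_rows)),
  csv_path,
  row.names = FALSE
)

# Baseline: the generic base R reader.
base_time <- system.time(df_base <- read.csv(csv_path))
print(base_time)

# H2O: skipped if the package is not installed or the cluster cannot start.
if (requireNamespace("h2o", quietly = TRUE)) {
  tryCatch({
    h2o::h2o.init(nthreads = -1)  # start or connect to a local cluster, all cores
    h2o::h2o.no_progress()        # suppress progress bars in the timing output
    h2o_time <- system.time(df_h2o <- h2o::h2o.importFile(path = csv_path))
    print(h2o_time)
    h2o::h2o.shutdown(prompt = FALSE)
  }, error = function(e) message("H2O step skipped: ", conditionMessage(e)))
}
```

On a file this small, cluster startup and transfer overhead could easily outweigh any gain from parallel reading, so treat the numbers as a starting point and rerun with a file closer to your real workload.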

