Ten Notes On SparkR

Kevin Feasel

2017-01-03

R, Spark

Neil Dewar has a notebook with ten important things when migrating from R to SparkR:

  1. Apache Spark Building Blocks. A high-level overview of Spark describes what is available for the R user.

  2. SparkContext, SQLContext, and SparkSession. In Spark 1.x, SparkContext and SQLContext let you access Spark. In Spark 2.x, SparkSession becomes the primary method.

  3. A DataFrame or a data.frame? Spark’s distributed DataFrame is different from R’s local data.frame. Knowing the differences lets you avoid simple mistakes.

  4. Distributed Processing 101. Understanding the mechanics of Big Data processing helps you write efficient code—and not blow up your cluster’s master node.

  5. Function Masking. Like all R libraries, SparkR masks some functions.

  6. Specifying Rows. With Big Data and Spark, you generally select rows in DataFrames differently than in local R data.frames.

  7. Sampling. Sample data in the right way, and use it as a tool for converting between big and small data.

  8. Machine Learning. SparkR has a growing library of distributed ML algorithms.

  9. Visualization.It can be hard to visualize big data, but there are tricks and tools which help.

  10. Understanding Error Messages. For R users, Spark error messages can be daunting. Knowing how to parse them helps you find the relevant parts.

I highly recommend checking out the notebook.

Related Posts

Data Science And Data Engineering In HDP 3.0

Saumitra Buragohain, et al, show off some of the things added to the Hortonworks Data Platform for data scientists and data engineers: We leverage the power of HDP 3.0 from efficient storage (erasure coding), GPU pooling to containerized TensorFlow and Zeppelin to enable this use case. We will the save the details for a different […]

Read More

Using wrapr For A Consistent Pipe With ggplot2

John Mount shows how you can use the wrapr pipe to perform data processing and building a ggplot2 visual: Now we can run a single pipeline that combines data processing steps and ggplot plot construction. data.frame(x = 1:20) %.>% mutate(., y = cos(3*x)) %.>% ggplot(., aes(x = x, y = y)) %.>% geom_point() %.>% geom_line() %.>% ggtitle("piped ggplot2") Check […]

Read More

Categories

January 2017
MTWTFSS
« Dec Feb »
 1
2345678
9101112131415
16171819202122
23242526272829
3031