Getting Started With Spark

I discuss getting up and running with Databricks Community Edition:

There are a couple of notes with these clusters:

  1. These are not powerful clusters.  Don’t expect to crunch huge data sets with them.  Notice that the cluster has only 6 GB of RAM, so you can expect to get maybe a few GB of data max.

  2. The cluster will automatically terminate after one hour without activity.  The paid version does not have this limitation.

  3. You interact with the cluster using notebooks rather than opening a command prompt.  In practice, this makes interacting with the cluster a little more difficult, as a good command prompt can provide features such as auto-complete.

Databricks Community Edition has a nice interface, is very easy to get up and running and—most importantly—is free.  Read the whole thing.

Related Posts

Installing Jupyter Notebook Kernels

Nigel Meakins continues his Jupyter series by showing how to install various kernels: Jupyter-Scala This can be downloaded from here. Unzip and run the jupyter-scala.ps1 script on windows using elevated permissions in order to install. The kernel files will end up in <UserProfileDir>\AppData\Roaming\jupyter\kernels\scala-develop and the kernel will appear in Jupyter with the default name of ‘Scala (develop)’. You […]

Read More

Installing Apache Mesos On EC2

Anubhav Tarar has a guide for setting up Apache Mesos along with Spark and Hadoop on EC2: Apache Mesos is open source project for managing computer clusters originally developed at the University Of California. It sits between the application layer and operating system to manage the application works efficiently on the large-scale distributed environment. In […]

Read More


June 2016
« May Jul »