Spark 2.0 Out

Apache Spark 2.0 has officially been released. Vinay Shukla gives us some highlights:

Performance
Project Tungsten has completed another major phase, and with completely new whole-stage code generation, significant performance improvements have been delivered. Parquet and ORC file processing have also seen performance improvements.

Databricks Community Edition offers (tiny) free clusters with Spark 2.0 on top of Scala 2.10 and Scala 2.11.
