Going From Pig To Spark

Philippe de Cuzey introduces Spark to people already familiar with Pig:

I like to think of Pig as a pipeline of high-level Map/Reduce commands. As a former SQL programmer, I find it quite intuitive, and at my organization, our Hadoop jobs are still mostly developed in Pig.

Pig has a lot of qualities: it is stable, scales very well, and integrates natively with the Hive metastore through HCatalog. By describing each step atomically, it minimizes the conceptual bugs that you often find in complicated SQL code.

But Pig has limitations that can sometimes make it a poor programming paradigm for your needs.

Philippe includes a couple of examples in Pig, PySpark, and Spark SQL. Even if you aren't familiar with Pig, this is a good article for getting acquainted with Spark.
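
To give a rough sense of what that translation looks like, here is a minimal sketch (mine, not taken from Philippe's post) of a simple Pig-style load/filter/group/sum pipeline expressed with PySpark DataFrames. The file path, column names, and schema are hypothetical, chosen only to illustrate the mapping from Pig operators to DataFrame calls.

# Rough Pig equivalent of the pipeline below (hypothetical columns):
#   events = LOAD 'events.csv' USING PigStorage(',') AS (user:chararray, amount:double);
#   big    = FILTER events BY amount > 100.0;
#   byuser = GROUP big BY user;
#   totals = FOREACH byuser GENERATE group AS user, SUM(big.amount) AS total;
#   STORE totals INTO 'totals_out';

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("pig-to-spark-sketch").getOrCreate()

# LOAD ... AS (user, amount)
events = (spark.read
          .option("header", "false")
          .option("inferSchema", "true")
          .csv("events.csv")
          .toDF("user", "amount"))

# FILTER ... BY amount > 100.0
big = events.filter(F.col("amount") > 100.0)

# GROUP ... BY user, then SUM(amount) per group
totals = big.groupBy("user").agg(F.sum("amount").alias("total"))

# STORE ... INTO 'totals_out'
totals.write.mode("overwrite").csv("totals_out")

spark.stop()

Each Pig relation maps onto an intermediate DataFrame, so the step-by-step, atomic style Philippe praises in Pig carries over fairly directly.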
