Maximum Temperatures With Spark Languages

Kevin Feasel



Praveen Sripati has a two-part series on getting aggregates by year in various Spark languages.  In part one, he looks at Python:

Hadoop – The Definitive Guide revolves around the example of finding the maximum temperature for a particular year from the weather data set. The code for the same is here and the data here. Below is the Spark code implemented in Python for the same.
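To give a rough idea of the shape of that job, here's a minimal sketch of my own (not Praveen's code). It assumes the fixed-width NCDC record layout used in the book: year in columns 15-19, signed temperature in tenths of a degree Celsius in columns 87-92, a quality code in column 92, and `+9999` meaning a missing reading. The function names `parse_record` and `max_temperature_by_year` are hypothetical.

```python
import re

# Assumed NCDC conventions from the book's example: "+9999" marks a
# missing reading, and quality codes 0, 1, 4, 5, 9 mark usable ones.
MISSING = "+9999"
GOOD_QUALITY = re.compile(r"[01459]")


def parse_record(line):
    """Return (year, temperature) for a usable record, otherwise None."""
    year, temp, quality = line[15:19], line[87:92], line[92:93]
    if temp != MISSING and GOOD_QUALITY.match(quality):
        return year, int(temp)
    return None


def max_temperature_by_year(lines):
    """Reduce an RDD of raw record strings to (year, max temperature) pairs.

    `lines` is expected to be a Spark RDD, for example the result of
    sc.textFile(path) on an existing SparkContext.
    """
    return (lines.map(parse_record)
                 .filter(lambda pair: pair is not None)
                 .reduceByKey(max))
```

With a `SparkContext` named `sc` in hand, something like `max_temperature_by_year(sc.textFile(path)).collect()` would bring the per-year maximums back to the driver, where `path` points at wherever you've stored the weather files.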

In part two, he looks at Spark SQL:

In the previous blog, we looked at how to find out the maximum temperature of each year from the weather dataset. Below is the code for the same using Spark SQL which is a layer on top of Spark. SQL on Spark was supported using Shark which is being replaced by Spark SQL. Here is a nice blog from Databricks on the future of SQL on Spark.
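The Spark SQL version boils down to a `GROUP BY`. Here's a minimal sketch of my own (not Praveen's code), assuming the records have already been parsed into (year, temperature) pairs and that you're on the `SparkSession` API from Spark 2.0 or later. The view name `weather`, the column names, and the function name are all hypothetical.

```python
# Hypothetical view and column names, chosen for illustration only.
MAX_TEMPERATURE_QUERY = """
    SELECT year, MAX(temperature) AS max_temperature
    FROM weather
    GROUP BY year
    ORDER BY year
"""


def max_temperature_sql(spark, rows):
    """Run the aggregate over (year, temperature) pairs with Spark SQL.

    `spark` is an existing SparkSession; `rows` is a list (or RDD) of
    (year, temperature) tuples.
    """
    df = spark.createDataFrame(rows, ["year", "temperature"])
    # Expose the DataFrame to SQL under the name used in the query.
    df.createOrReplaceTempView("weather")
    return spark.sql(MAX_TEMPERATURE_QUERY)
```

Calling `max_temperature_sql(spark, rows).show()` would print one row per year. On the Spark 1.x releases current when these posts were written, you'd go through `SQLContext` rather than `SparkSession`.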

Neither post includes a Scala example, but the Scala version is pretty straightforward as well.
