PySpark With MapR

Justin Brandenburg has a tutorial on combining Python and Spark on the MapR platform:

Looking at the first 5 records of the RDD

kddcup_data.take(5)
This output is difficult to read. This is because we are asking PySpark to show us data that is in the RDD format. PySpark has a DataFrame functionality. If the Python version is 2.7 or higher, you can utilize the pandas package. However, pandas doesn’t work on Python versions 2.6, so we use the Spark SQL functionality to create DataFrames for exploration.

The full example is a fairly simple k-means clustering process, which is a great introduction to PySpark.

Related Posts

Database-First or Kafka-First for Event Streaming

Gwen Shapiro takes us through a scenario where database-first writes for event streaming makes the most sense: Note that the DB does quite a lot for you: it enforces serializability, locks, your logical constraints, etc. If the DB is distributed (Vitesse, Cockroach, Spanner, Yugabyte), it does even more. If you were to go Kafka-first… well, […]

Read More

Accessing Azure Event Hubs with Python

Neil Gelder shows us how you can write Python code to work with Azure Event Hubs: I’ve supplied these two python scripts in my github repo at the following link. First we need to open the install the relevant python libraries so you’ll need to issue the below pip command in whatever command tool you use, […]

Read More

Categories

August 2016
MTWTFSS
« Jul Sep »
1234567
891011121314
15161718192021
22232425262728
293031