Vincent-Philippe Lauzon shows how to perform data frame transformations using PySpark:
We wanted to look at some more Data Frames, with a bigger data set, more precisely some transformation techniques. We often say that most of the leg work in Machine Learning is data cleansing. Similarly, we can affirm that a clever & insightful aggregation query performed on a large dataset can only be executed after a considerable amount of work has gone into formatting, filtering & massaging the data: data wrangling.
Here, we’ll look at an interesting dataset, the H-1B Visa Petitions 2011-2016 (from Kaggle) and find some good insights with just a few queries, but also some data wrangling.
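To give a flavour of what that looks like before diving into the notebook, here is a minimal sketch of a load-and-wrangle step in PySpark. The file path and the column names (CASE_STATUS, PREVAILING_WAGE, YEAR) are assumptions about the Kaggle CSV layout, not taken from the article's notebook.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# On Databricks a session named `spark` already exists; this line is only
# needed when running the sketch on another Spark distribution.
spark = SparkSession.builder.appName("h1b-wrangling").getOrCreate()

# Load the Kaggle CSV (path and file name are assumptions -- adjust to
# wherever you uploaded the dataset).
df = spark.read.csv(
    "/FileStore/tables/h1b_kaggle.csv",
    header=True,
    inferSchema=True,
)

# A first bit of wrangling before any aggregation: keep only certified
# petitions and cast the prevailing wage to a numeric type.
certified = (
    df.filter(F.col("CASE_STATUS") == "CERTIFIED")
      .withColumn("PREVAILING_WAGE", F.col("PREVAILING_WAGE").cast("double"))
)

# One of the "few queries": petition counts per year.
certified.groupBy("YEAR").count().orderBy("YEAR").show()
```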
It is important to note that just about everything in this article isn't specific to Azure Databricks and would work with any distribution of Apache Spark.
The notebook used for this article is persisted on GitHub.
Read on for the explanation, or check out the notebook to work through it at your own pace.