Adding An Index To A Spark RDD

Kevin Feasel

2016-06-13

Spark

Arijit Tarafdar gives us a good method for adding an index column to a Spark data frame based on a non-unique value:

The basic idea is to create a lookup table of distinct categories indexed by unique integer identifiers. The way to avoid is to collect the unique categories to the driver, loop through them to add the corresponding index to each to create the lookup table (as Map or equivalent) and then broadcast the lookup table to all executors. The amount of data that can be collected at the driver is controlled by the spark.driver.maxResultSize configuration which by default is set at 1 GB for Spark 1.6.1. Both collect and broadcast will eventually run into the physical memory limits of the driver and the executors respectively at some point beyond certain number of distinct categories, resulting in a non-scalable solution.

The solution is pretty interesting:  build out a new RDD of unique results, and then join that set back.  If you’re using SQL (including Spark SQL), I would use the DENSE_RANK() window function.

Related Posts

Notebooks in Azure Databricks

Brad Llewellyn takes us through Azure Databricks notebooks: Azure Databricks Notebooks support four programming languages, Python, Scala, SQL and R.  However, selecting a language in this drop-down doesn’t limit us to only using that language.  Instead, it makes the default language of the notebook.  Every code block in the notebook is run independently and we […]

Read More

Reading and Writing CSV Files with spark-dotnet

Ed Elliott continues a series on Spark for .NET: How do you read and write CSV files using the dotnet driver for Apache Spark? I have a runnable example here:https://github.com/GoEddie/dotnet-spark-examples Specifcally:https://github.com/GoEddie/dotnet-spark-examples/tree/master/examples/split-csv The quoted links will take you straight to the code, but click through to see Ed’s commentary.

Read More

Categories

June 2016
MTWTFSS
« May Jul »
 12345
6789101112
13141516171819
20212223242526
27282930