Text Normalization With Spark

Engineers at Treselle Systems have put together a two-part series on text normalization using Apache Spark.  First, they walk through normalizing the text:

We have used Spark shared variable “broadcast” to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached in executors. “stopwords_en.txt” is not a large dataset but we have used in our use case to make use of that feature.

What are Broadcast Variables?
Broadcast variables in Apache Spark is a mechanism for sharing variables across executors that are meant to be read-only. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead. However, with broadcast variables, they are shipped once to all executors and are cached for future reference.

From there, they dig into details on what the Spark engine did and why we see what we do:

Note: Stage 2 has both reduceByKey() and sortByKey() operations and as indicated in job summary “saveAsTextFile()” action triggered Job 2. Do you have any guess whether Stage 2 will be further divided into other stages in Job 2? The answer is: yes Job 2 DAG: This job is triggered due to saveAsTextFile() action operation. The job DAG clearly indicates the list of operations used before the saveAsTextFile() operations.Stage 2 in Job 1 is further divided into another stage as Stage 2. In Stage 2 has both reduceByKey() and sortByKey() operations and both operations can shuffle the data so that Stage 2 in Job 1 is broken down into Stage 4 and Stage 5 in Job 2. There are three stages in this job. But, Stage 3 is skipped. The answer for the skipped stage is provided below “What does “Skipped Stages” mean in Spark?” section.

There’s some good information here if you want to become more familiar with how Spark works.

Related Posts

Picking A Python IDE

Kevin Jacobs reviews a few Python IDEs from the perspective of a data scientist: Ladies and gentlemens, this is one of the most perfect IDEs for editing your Python code! At least in my opinion. Jupyter notebook is a web based code editor and can quickly generate visualizations. You can mix up code and text […]

Read More

Handling Imbalanced Data

Tom Fawcett shows us how to handle a tricky classification problem: The primary problem is that these classes are imbalanced: the red points are greatly outnumbered by the blue. Research on imbalanced classes often considers imbalanced to mean a minority class of 10% to 20%. In reality, datasets can get far more imbalanced than this. […]

Read More


April 2017
« Mar May »