Text Normalization With Spark

Engineers at Treselle Systems have put together a two-part series on text normalization using Apache Spark.  First, they walk through normalizing the text:

We have used Spark shared variable “broadcast” to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached in executors. “stopwords_en.txt” is not a large dataset but we have used in our use case to make use of that feature.

What are Broadcast Variables?
Broadcast variables in Apache Spark is a mechanism for sharing variables across executors that are meant to be read-only. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead. However, with broadcast variables, they are shipped once to all executors and are cached for future reference.

From there, they dig into details on what the Spark engine did and why we see what we do:

Note: Stage 2 has both reduceByKey() and sortByKey() operations and as indicated in job summary “saveAsTextFile()” action triggered Job 2. Do you have any guess whether Stage 2 will be further divided into other stages in Job 2? The answer is: yes Job 2 DAG: This job is triggered due to saveAsTextFile() action operation. The job DAG clearly indicates the list of operations used before the saveAsTextFile() operations.Stage 2 in Job 1 is further divided into another stage as Stage 2. In Stage 2 has both reduceByKey() and sortByKey() operations and both operations can shuffle the data so that Stage 2 in Job 1 is broken down into Stage 4 and Stage 5 in Job 2. There are three stages in this job. But, Stage 3 is skipped. The answer for the skipped stage is provided below “What does “Skipped Stages” mean in Spark?” section.

There’s some good information here if you want to become more familiar with how Spark works.

Related Posts

Methods To Improve Model Accuracy

Tristan Robinson shows how to go back to the drawing board when your model’s accuracy isn’t cutting it: One of the reoccurring principles that appears with machine learning is that of Ockham’s razor, which states that the best models are simple models that fit the data well; this is not an irrefutable principle of logic, but […]

Read More

JupyterLab Now Available

Project Jupyter announces the general availability of JupyterLab: JupyterLab is an interactive development environment for working with notebooks, code and data. Most importantly, JupyterLab has full support for Jupyter notebooks. Additionally, JupyterLab enables you to use text editors, terminals, data file viewers, and other custom components side by side with notebooks in a tabbed work area. JupyterLab […]

Read More

Categories

April 2017
MTWTFSS
« Mar May »
 12
3456789
10111213141516
17181920212223
24252627282930