Text Normalization With Spark

Engineers at Treselle Systems have put together a two-part series on text normalization using Apache Spark.  First, they walk through normalizing the text:

We have used Spark shared variable “broadcast” to achieve distributed caching. Broadcast variables are useful when large datasets need to be cached in executors. “stopwords_en.txt” is not a large dataset but we have used in our use case to make use of that feature.

What are Broadcast Variables?
Broadcast variables in Apache Spark is a mechanism for sharing variables across executors that are meant to be read-only. Without broadcast variables, these variables would be shipped to each executor for every transformation and action, which can cause network overhead. However, with broadcast variables, they are shipped once to all executors and are cached for future reference.

From there, they dig into details on what the Spark engine did and why we see what we do:

Note: Stage 2 has both reduceByKey() and sortByKey() operations and as indicated in job summary “saveAsTextFile()” action triggered Job 2. Do you have any guess whether Stage 2 will be further divided into other stages in Job 2? The answer is: yes Job 2 DAG: This job is triggered due to saveAsTextFile() action operation. The job DAG clearly indicates the list of operations used before the saveAsTextFile() operations.Stage 2 in Job 1 is further divided into another stage as Stage 2. In Stage 2 has both reduceByKey() and sortByKey() operations and both operations can shuffle the data so that Stage 2 in Job 1 is broken down into Stage 4 and Stage 5 in Job 2. There are three stages in this job. But, Stage 3 is skipped. The answer for the skipped stage is provided below “What does “Skipped Stages” mean in Spark?” section.

There’s some good information here if you want to become more familiar with how Spark works.

Related Posts

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More

Converting Spark RDDs To DataFrames

Franco Cano shows us how we can convert a Resilient Distributed Dataset in Spark to a DataFrame: Sometimes you need transform you RRDs to DataFrames because DataFrames have a lot optimization options.Let’s see how this is done. This works, but there is a performance cost to converting a large RDD to a DataFrame (or vice versa). […]

Read More

Categories

April 2017
MTWTFSS
« Mar May »
 12
3456789
10111213141516
17181920212223
24252627282930