Using Spark For Investigation

Sean Owen tries to unravel the Tamam Shud mystery:

Several people have approached these letters as a cryptographic cipher. The odd circumstances of death do sound like something out of a John Le Carré spy novel. Some of the best attempts, however, fail to produce anything but truly convoluted parsings.

Another possibility may already have occurred to you: Are they the first letters of words in a sentence (aninitialism)? Some suspect this death was a suicide, and that the message is merely some form of final note. With this morbid scenario in mind, it’s easy to imagine many phrases, like “My Life Is All But Over,” that fit the letters because indeed their frequency seems to match that of English text.

This lead has been picked up a few times. These writeups (example) present indications that the message is indeed an initialism. However, they don’t apply what is arguably the clear statistical tool for this job. And they don’t take advantage of big data. So, let’s do both.

Read on for Chi Square testing and book parsing examples using Spark.  Spoiler alert:  Sean doesn’t solve the mystery, but it’s still a fun read.

Related Posts

Converting CSV To ORC

Mark Litwintschik investigates whether Spark is faster at converting CSV files to ORC format than Hive or Presto: Spark, Hive and Presto are all very different code bases. Spark is made up of 500K lines of Scala, 110K lines of Java and 40K lines of Python. Presto is made up of 600K lines of Java. […]

Read More

Converting Spark RDDs To DataFrames

Franco Cano shows us how we can convert a Resilient Distributed Dataset in Spark to a DataFrame: Sometimes you need transform you RRDs to DataFrames because DataFrames have a lot optimization options.Let’s see how this is done. This works, but there is a performance cost to converting a large RDD to a DataFrame (or vice versa). […]

Read More


September 2016
« Aug Oct »