Using Spark For Investigation

Sean Owen tries to unravel the Tamam Shud mystery:

Several people have approached these letters as a cryptographic cipher. The odd circumstances of death do sound like something out of a John Le Carré spy novel. Some of the best attempts, however, fail to produce anything but truly convoluted parsings.

Another possibility may already have occurred to you: Are they the first letters of words in a sentence (aninitialism)? Some suspect this death was a suicide, and that the message is merely some form of final note. With this morbid scenario in mind, it’s easy to imagine many phrases, like “My Life Is All But Over,” that fit the letters because indeed their frequency seems to match that of English text.

This lead has been picked up a few times. These writeups (example) present indications that the message is indeed an initialism. However, they don’t apply what is arguably the clear statistical tool for this job. And they don’t take advantage of big data. So, let’s do both.

Read on for Chi Square testing and book parsing examples using Spark.  Spoiler alert:  Sean doesn’t solve the mystery, but it’s still a fun read.

Related Posts

When Spark Meets Hive

Anna Martin and Rosaria Silipo look at combining HiveQL and SparkQL: We set our goal here to investigate the age distribution of Maine residents, men and women, using SQL queries. But the question is… on Apache Hive or on Apache Spark? Well, why not both? We could use SparkSQL to extract men’s age distribution and […]

Read More

Taking A Random Walk

Dan Goldstein describes the basics of Brownian motion: I was sitting in a bagel shop on Saturday with my 9 year old daughter. We had brought along hexagonal graph paper and a six sided die. We decided that we would choose a hexagon in the middle of the page and then roll the die to […]

Read More

Categories

September 2016
MTWTFSS
« Aug Oct »
 1234
567891011
12131415161718
19202122232425
2627282930