Using Spark For Investigation

Sean Owen tries to unravel the Tamam Shud mystery:

Several people have approached these letters as a cryptographic cipher. The odd circumstances of death do sound like something out of a John Le Carré spy novel. Some of the best attempts, however, fail to produce anything but truly convoluted parsings.

Another possibility may already have occurred to you: Are they the first letters of words in a sentence (an initialism)? Some suspect this death was a suicide, and that the message is merely some form of final note. With this morbid scenario in mind, it’s easy to imagine many phrases, like “My Life Is All But Over,” that fit the letters because indeed their frequency seems to match that of English text.

This lead has been picked up a few times. These writeups (example) present indications that the message is indeed an initialism. However, they don’t apply what is arguably the clear statistical tool for this job. And they don’t take advantage of big data. So, let’s do both.

Read on for chi-squared testing and book-parsing examples using Spark. Spoiler alert: Sean doesn’t solve the mystery, but it’s still a fun read.
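
To give a feel for the statistical idea, here is a minimal sketch of a Pearson chi-squared goodness-of-fit statistic for first letters, in plain Python rather than Spark. It is not Sean's code: the function names, the add-one smoothing, and the toy reference sentence are all assumptions made for illustration, and the sample message is hypothetical rather than the actual letters from the case.

```python
from collections import Counter
import string

def first_letter_distribution(text):
    """Estimate how often each letter starts a word in `text`.

    Add-one smoothing (an assumption of this sketch) keeps every
    expected probability above zero.
    """
    words = [w for w in text.upper().split() if w[0] in string.ascii_uppercase]
    counts = Counter(w[0] for w in words)
    total = len(words) + 26
    return {c: (counts[c] + 1) / total for c in string.ascii_uppercase}

def chi_squared_statistic(letters, expected_probs):
    """Pearson chi-squared statistic: sum of (observed - expected)^2 / expected.

    Characters that are not keys of `expected_probs` are ignored.
    """
    observed = Counter(letters.upper())
    n = sum(observed[c] for c in expected_probs)
    return sum((observed[c] - n * p) ** 2 / (n * p)
               for c, p in expected_probs.items())

# Toy reference text; a real analysis would use a large corpus of books.
reference = "my life is all but over and it is time to say a last goodbye"
probs = first_letter_distribution(reference)

# Hypothetical message, NOT the real letters from the case.
print(chi_squared_statistic("MLIABO", probs))
```

A smaller statistic means the observed letters sit closer to the reference distribution. Turning it into a p-value requires the chi-squared distribution with (number of categories - 1) degrees of freedom, for example via `scipy.stats.chi2.sf`. With a corpus too large for one machine, the counting step is the part that would move to Spark.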



September 2016