Solving Real-Life Mysteries with Big Data and Apache Spark
Can using simple statistical techniques in combination with big data help solve the Tamam Shud mystery?
Everyone loves a good real-life mystery. That’s why the three most popular TV shows of the 80s and 90s were Jack Palance’s reboot of Ripley’s Believe It or Not!, Unsolved Mysteries with Robert Stack, and Beyond Belief: Fact or Fiction hosted by Commander Riker. (Well…they were in my house, anyway.) At Cloudera, the highly skilled support team has gotten good at cracking actual stranger-than-fiction cases like “Why doesn’t this Kerberos ticket renew?” or “Who deleted that table?”
In this spirit, on a recent random walk through Wikipedia links, I found a fascinating 68-year-old unsolved mystery known as the Tamam Shud case (aka the “Mystery of the Somerton Man”). If you enjoyed Serial, then the story itself is worth a read, and worth watching. However, for anyone who touches data for a living, the most intriguing part of the story will undoubtedly be this:
Facts of the Case
A man is found dead on a beach near Adelaide, Australia, in December 1948. Well-dressed and in good shape, he seems to have died from poisoning. His mundane possessions include no identification, but do include a scrap of paper with the words “Tamám Shud.” This phrase turns out to be the closing words (in Persian) of the Rubaiyat of Omar Khayyam, meaning “finished.”
Soon after, the very book from which it was torn is located. Inside its cover is scrawled the unlisted phone number of a local woman, along with this mysterious text:
To this day, nobody has conclusively explained its meaning, and the dead man has never been identified.
Several people have approached these letters as a cryptographic cipher. The odd circumstances of death do sound like something out of a John le Carré spy novel. Even the best attempts, however, fail to produce anything but truly convoluted parsings.
Another possibility may already have occurred to you: Are they the first letters of words in a sentence (an initialism)? Some suspect this death was a suicide, and that the message is merely some form of final note. With this morbid scenario in mind, it’s easy to imagine many phrases, like “My Life Is All But Over,” that fit the letters. Indeed, the frequency of the letters seems to match that of initial letters in English text.
This lead has been picked up a few times. These writeups (example) present indications that the message is indeed an initialism. However, they don’t apply what is arguably the obvious statistical tool for this job, and they don’t take advantage of big data. So, let’s do both.
The Chi-Squared Test
Does the frequency of letters in the Tamam Shud text resemble the frequency of first letters of English text? If so, that would be evidence that the text is an English initialism. But any given English sentence’s frequency of initials won’t exactly match English text as a whole. Rather, it will vary a bit depending on what sentences are chosen.
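To make "frequency of initials" concrete, here is a minimal Python sketch that counts the first letter of each word in a piece of text. The function names `initial_counts` and `initial_frequencies` are hypothetical helpers invented for this illustration, and the word-splitting rule (runs of ASCII letters, with an optional apostrophe) is a simplifying assumption.

```python
import re
from collections import Counter

# Assumption: a "word" is a run of ASCII letters, optionally containing one
# apostrophe (straight or curly), so that "don't" counts as a single word.
WORD_PATTERN = re.compile(r"[A-Za-z]+(?:['’][A-Za-z]+)?")

def initial_counts(text):
    """Count the first letter of each word, ignoring case."""
    return Counter(word[0].upper() for word in WORD_PATTERN.findall(text))

def initial_frequencies(text):
    """Convert initial-letter counts into relative frequencies."""
    counts = initial_counts(text)
    total = sum(counts.values())
    return {letter: n / total for letter, n in sorted(counts.items())}

# The candidate phrase mentioned earlier in the article.
phrase = "My Life Is All But Over"
print(initial_counts(phrase))
print(initial_frequencies(phrase))
```

A single short sentence like this one gives a very coarse picture, which is why the frequencies of any one sentence will differ from those of English text as a whole.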
The well-known chi-squared (χ²) test can help. It takes as input the expected frequencies of discrete things occurring (like letters starting words) and the actual observed counts of those things. It then quantifies the probability that a sample genuinely drawn from the expected frequencies would deviate from them at least as much as the observed sample does. That probability is the p-value. Although you may hear about degrees of freedom and the chi-squared statistic in this context, these are details that libraries will take care of. This usage of the chi-squared test is known as a goodness-of-fit test.
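The definition of the p-value above can be sketched directly in Python. This is an illustration, not the method used for the analysis in this article: it computes the usual chi-squared statistic, then estimates the p-value by simulation (repeatedly drawing samples from the expected frequencies) rather than from the chi-squared distribution, which is what a library routine such as SciPy's `scipy.stats.chisquare` would typically use. The function names, the four categories, and their frequencies are all made up for the example.

```python
import random

def chi_squared_statistic(observed, expected_probs):
    """Sum of (observed - expected)^2 / expected over all categories."""
    n = sum(observed)
    return sum((o - n * p) ** 2 / (n * p)
               for o, p in zip(observed, expected_probs))

def simulated_p_value(observed, expected_probs, trials=10_000, seed=0):
    """Estimate the share of samples drawn from the expected frequencies
    whose statistic is at least as large as the observed one."""
    rng = random.Random(seed)  # fixed seed so that runs are repeatable
    n = sum(observed)
    categories = range(len(expected_probs))
    actual = chi_squared_statistic(observed, expected_probs)
    at_least_as_extreme = 0
    for _ in range(trials):
        counts = [0] * len(expected_probs)
        for c in rng.choices(categories, weights=expected_probs, k=n):
            counts[c] += 1
        if chi_squared_statistic(counts, expected_probs) >= actual:
            at_least_as_extreme += 1
    return at_least_as_extreme / trials

# Toy data (assumed): four categories with invented expected frequencies.
expected = [0.4, 0.3, 0.2, 0.1]
close_sample = [8, 6, 4, 2]  # counts that track the expected frequencies
far_sample = [2, 4, 6, 8]    # counts that run in the opposite direction

print(simulated_p_value(close_sample, expected))
print(simulated_p_value(far_sample, expected))
```

The first sample should yield a p-value near 1 and the second a p-value near 0. Simulation is slow and only approximate, but it mirrors the wording of the definition closely; in practice a library's chi-squared test gives the same kind of answer far more cheaply.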
The possibility that the observed counts come from the expected distribution is called the null hypothesis. Under this hypothesis, the sample is not from a different distribution, and any deviation from the expected counts is due to chance alone.
A low p-value means that the observed counts would be unlikely if the null hypothesis were true. It’s evidence that the given initials did not come from a source whose frequencies match the expected frequencies. A high p-value is not quite evidence that the initials came from the same source; it just means that actual samples from the expected distribution would regularly show deviations from the expected counts that are as large, or larger. Although it’s consistent with the initials having come from the expected distribution, it’s not the probability that they did.