The sinking of the RMS Titanic is probably the most tragic event in modern maritime history. It was not the deadliest, but it became a symbol of human hubris about technological progress. In 1912, the Titanic was the largest and most luxurious ocean liner ever built. She was claimed to be unsinkable, yet she sank on her maiden voyage to New York.
Today, modern data analytics can model existing data to predict likely outcomes. The sinking of the Titanic is heavily documented, and much of the data can be easily extracted and sorted. This is why one of the most popular challenges on Kaggle, a data science community, is to build a predictive model that estimates each passenger's chances of survival. By examining factors such as class, sex, and age, machine learning algorithms can predict whether a given passenger would have survived the disaster.
Data analytics enthusiast Patrick Triest wrote a detailed description of his own Kaggle kernel, built in Python. He relied on 13 variables to feed his data model, including name, sex, age, the number of siblings aboard, ticket number, fare, cabin, survival, and lifeboat number.
The input data shows that you had just a 38% chance of survival if you were among the Titanic's passengers or crew. The ocean liner carried just 20 lifeboats, 4 of them collapsible, with a combined capacity of 1,178 people. The Titanic had 2,224 people on board, counting both passengers and crew. Even in the best-case scenario, with every lifeboat seat filled, the survival rate would have been about 53% (1,178 out of 2,224). Unfortunately, the actual outcome was far worse.
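The best-case figure follows directly from the two counts quoted above. A quick check in Python, assuming nothing beyond those numbers (the variable names are only illustrative), gives a rate just under 53%:

```python
# Figures quoted in the text above.
lifeboat_capacity = 1178  # total seats across all 20 lifeboats
people_on_board = 2224    # passengers and crew combined

# Best case: every single lifeboat seat is filled.
best_case_survival = lifeboat_capacity / people_on_board

print(f"Best-case survival rate: {best_case_survival:.2%}")
```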
Survival statistics show that you had the highest chance of survival, 62%, if you held a 1st class ticket, compared with 25.5% for 3rd class passengers. Split the data set by gender and you'll find that a woman with a 1st class ticket had an almost 100% chance of survival. At the opposite end of the social scale, being a man with a 3rd class ticket reduced your chances to well below 20%.
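A breakdown like this amounts to grouping passengers by class and sex and computing the share of survivors in each group. The sketch below illustrates the idea with the Python standard library only. The records are a hypothetical toy sample, not the real Titanic data, and `survival_rates` is an illustrative helper rather than anything taken from Triest's kernel.

```python
from collections import defaultdict

# Hypothetical toy records: (passenger class, sex, survived).
# These are NOT real Titanic rows; they only illustrate the grouping.
passengers = [
    (1, "female", 1), (1, "female", 1), (1, "male", 1), (1, "male", 0),
    (3, "female", 1), (3, "female", 0), (3, "male", 0), (3, "male", 0),
]

def survival_rates(records):
    """Return {(pclass, sex): fraction of that group that survived}."""
    counts = defaultdict(lambda: [0, 0])  # group -> [survivors, total]
    for pclass, sex, survived in records:
        counts[(pclass, sex)][0] += survived
        counts[(pclass, sex)][1] += 1
    return {group: s / n for group, (s, n) in counts.items()}

rates = survival_rates(passengers)
for (pclass, sex), rate in sorted(rates.items()):
    print(f"class {pclass}, {sex}: {rate:.0%}")
```

On the real data set the same grouping is usually done with a data-frame library, but the underlying computation is the same.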
Patrick Triest maps all known variables for each passenger to that passenger's actual outcome. His next step is to run a machine learning algorithm to detect patterns in how different attribute values affect the outcome. After that, he builds a decision tree classifier. Its first branch splits by sex, the second splits by class, and the third level holds the actual prediction.
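To make the shape of such a tree concrete, here is a minimal pure-Python sketch. It is not Triest's code. It assumes the split order is fixed in advance (sex first, then class), whereas a real decision tree learner chooses its splits from the data. The training rows are hypothetical toy data, and `fit_tree` and `predict` are illustrative names.

```python
from collections import Counter, defaultdict

# Hypothetical toy training rows: (sex, passenger class, survived).
# NOT real Titanic data.
training_rows = [
    ("female", 1, 1), ("female", 1, 1),
    ("female", 3, 1), ("female", 3, 0), ("female", 3, 1),
    ("male", 1, 1), ("male", 1, 0), ("male", 1, 0),
    ("male", 3, 0), ("male", 3, 0),
]

def fit_tree(rows):
    """Build a two-level tree: sex -> class -> majority outcome (the leaf)."""
    outcomes = defaultdict(Counter)
    for sex, pclass, survived in rows:
        outcomes[(sex, pclass)][survived] += 1
    tree = {}
    for (sex, pclass), counter in outcomes.items():
        # The leaf predicts whichever outcome was most common in the group.
        tree.setdefault(sex, {})[pclass] = counter.most_common(1)[0][0]
    return tree

def predict(tree, sex, pclass):
    """Walk the tree: first branch by sex, second by class, then the leaf."""
    return tree[sex][pclass]

tree = fit_tree(training_rows)
print(predict(tree, "female", 1))  # prediction for a 1st class woman
print(predict(tree, "male", 3))    # prediction for a 3rd class man
```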
This way, Triest achieved an accuracy of 77%, verified against the test data set. He also tried neural networks, which are better than traditional machine learning at finding patterns in unstructured data such as images and natural language. Across all of his experiments, accuracy reached up to 80%.
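Accuracy here means the share of passengers in the held-out test set whose predicted outcome matches the actual one. A minimal sketch of that calculation follows. The prediction and label lists are hypothetical, and `accuracy` is an illustrative helper, not a function from Triest's kernel.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the actual outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not actual:
        raise ValueError("cannot compute accuracy on an empty test set")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

# Hypothetical model output and true outcomes for ten test passengers
# (1 = survived, 0 = did not survive). NOT real results.
predicted = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]
actual = [1, 0, 1, 1, 0, 0, 0, 0, 1, 0]

score = accuracy(predicted, actual)
print(f"Accuracy on the test set: {score:.0%}")
```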
Probably the most interesting part of Patrick Triest's data exercise is the comparison between the model's output for each passenger and that passenger's actual fate. For instance, the death of almost all members of the wealthy Allison family is surprising, since Betsy Allison (wife) and Loraine (daughter) had almost 100% survival chances due to their gender and first class tickets. It turns out that the Allison family wasn’t able to find their youngest son Trevor and was unwilling to leave the ship without him. In fact, little Trevor Allison was already safe in a lifeboat with his nanny and became the only member of the family to survive the sinking of the RMS Titanic.
When working with such data, we shouldn’t think of human behavior as just data points. These were actual people, and people are predictable only to some extent. The human side of the story cannot always be captured by statistical analysis and complicated math. Still, these tools can give us valuable insights into trends, patterns, and correlations that help us understand what exactly happened and why.