I’ve thought of and discarded a number of ideas about what to work on for my data set project. I started looking through lists of publicly available data sets hoping something would catch my interest or inspiration would strike.
At https://github.com/caesar0301/awesome-public-datasets I came across a CSV file with Titanic passenger survival data. The file listed passenger names, sex, cabin class, and ticket price, as well as other information. I thought it could be an interesting set to work on.
The problem with pulling data from a collection called “Awesome Public Datasets” was that there was absolutely no information about who created this file or where it came from. After some more digging I found a similar data set posted by the Department of Biostatistics at Vanderbilt University, complete with information about who created it and where they found their information.
While the history of the Titanic and its passengers is well covered, I like the idea of testing some of my assumptions about this data and considering interesting or unexpected questions to explore.
For example, I assume there will be a positive correlation between how much a passenger paid for their ticket and whether or not they survived. I also expect to see a similar relationship with gender, with the assumption that women were more likely to survive than men. If both of those assumptions hold up after examining the data, then I’m curious whether wealth or gender will turn out to be the more substantial factor.
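As a first pass at checking these assumptions, something like the following might work. It is only a rough sketch: the column names ("sex", "fare", "survived") are my guesses at how the file is laid out, the helper names are ones I made up, and the sample rows are invented placeholders rather than real passenger records. It groups passengers by gender and by whether their fare was above the median, then computes the share of each group that survived.

```python
from collections import defaultdict
from statistics import median


def survival_rate_by(rows, key):
    """Return {group: fraction survived} for the groups produced by key(row)."""
    totals = defaultdict(int)
    survivors = defaultdict(int)
    for row in rows:
        group = key(row)
        totals[group] += 1
        survivors[group] += int(row["survived"])
    return {group: survivors[group] / totals[group] for group in totals}


def fare_band(rows):
    """Build a key function labelling each row as above or at/below the median fare."""
    cutoff = median(float(row["fare"]) for row in rows)
    return lambda row: "higher fare" if float(row["fare"]) > cutoff else "lower fare"


# Invented placeholder rows, NOT real passenger records. The column names
# ("sex", "fare", "survived") are assumptions about the file layout.
sample = [
    {"sex": "female", "fare": "80.0", "survived": "1"},
    {"sex": "female", "fare": "8.0", "survived": "1"},
    {"sex": "male", "fare": "90.0", "survived": "1"},
    {"sex": "male", "fare": "7.5", "survived": "0"},
    {"sex": "male", "fare": "8.5", "survived": "0"},
    {"sex": "female", "fare": "7.0", "survived": "0"},
]

# Share of survivors within each gender, and within each fare band.
by_sex = survival_rate_by(sample, key=lambda row: row["sex"])
by_fare = survival_rate_by(sample, key=fare_band(sample))

print("By gender:", by_sex)
print("By fare band:", by_fare)
```

With the real file, the rows could be read with `csv.DictReader` instead of typed in by hand, as long as the column names are adjusted to match. To compare wealth against gender directly, the `key` function could return a pair such as `(row["sex"], band(row))`, which would give a survival rate for each combination of the two.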