So, before I do anything else, I want to get a sense of the data I’m looking at to see if it’s suitable for this project. I download the data, and because it’s gzipped, I use 7-zip to open it up on windows 10, or Windows explorer on Windows 11. In either case, the first thing I notice is the huge size disparity. When compressed, it is a quarter of a gigabyte. Uncompressed, it’s about 10 GB. This tells us something.
Read on to learn more about the dataset and how Eugene tackled some of the exploratory data analysis.
I also agree completely with Eugene’s point about serendipity. Keeping your metaphorical eyes open will increase the likelihood that you’ll just happen upon something that can help you later, or something that serves a need you didn’t know you had. I used to wander around the library back in my university days because I didn’t know what I didn’t know about topics (that is, the “unknown unknown” quadrant), so I’d just pick up some books that caught my eye. Not all of them are hits, though enough were to make the strategy worthwhile.