Wednesday, September 3, 2014

Dark data and the distribution of birth years

I'm working on a project that makes use of Day's Biographical Dictionary of the History of Technology as a source. What I wanted to do required the book's information to be in a database format, though. How do you go from "dark data" to something a computer can use? First, hope there is an ebook.

In this case, there is. I was able to get a PDF of the text, but it was still a book meant to be read and understood by humans. After thinking about it, I realized the book's format was consistent enough to allow for automated processing.

Each entry began with the name of an inventor and ended with the initials of the editor (or editors) who worked on that section. In between those START and STOP markers were well-defined details on things like birth (day, month, year, location), death (day, month, year), etc. There was also a page at the beginning of the book that listed all the editors and the abbreviations used for them at the end of each entry. Enter Python, though practically any language could have parsed this. Without going into a lot of detail:
  • Read the PDF's editor index, stored it as a list, and saved it to a file.
  • Read in the book's index of names, stored the names as a list, and saved it to a file.
  • Read in the body of the PDF and stored it as text.
  • Using the editor list and the name list, split the main body of the text into a giant list where each entry began with the name of someone in the index and ended with the initials of one of the editors.
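The splitting step can be sketched roughly as follows. This is not the script I actually ran, just a minimal illustration under some assumptions: each name sits on its own line, the editor initials sit on their own line, and the function name split_entries, the sample text, the names, and the initials are all made up for the example.

```python
import re

def split_entries(body, names, initials):
    """Split body text into entries that start with a known name
    and end with a line holding one editor's initials."""
    # Try longer alternatives first so a short name cannot shadow a longer one.
    name_pat = "|".join(re.escape(n) for n in sorted(names, key=len, reverse=True))
    init_pat = "|".join(re.escape(i) for i in sorted(initials, key=len, reverse=True))
    pattern = re.compile(
        rf"^(?P<name>{name_pat})[ \t]*$"       # START marker: a name from the index
        rf"(?P<text>.*?)"                      # entry details, matched lazily
        rf"^(?P<editor>{init_pat})[ \t]*$",    # STOP marker: editor initials
        re.MULTILINE | re.DOTALL,
    )
    return [
        {"name": m["name"], "text": m["text"].strip(), "editor": m["editor"]}
        for m in pattern.finditer(body)
    ]

# Made-up sample text standing in for the extracted book body.
sample = """Able, Ann
b. 1 May 1700 Sometown, England
d. 2 June 1770
Built a hypothetical machine.
XY
Baker, Bob
b. 1800
Another made-up entry.
QZ
"""

entries = split_entries(sample, ["Able, Ann", "Baker, Bob"], ["XY", "QZ"])
for entry in entries:
    print(entry["name"], "|", entry["editor"])
```

Each dictionary in the resulting list is one entry, ready for further parsing of the birth and death details.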
Of course, it wasn't that easy. For example, I discovered numerous instances where a name was spelled one way in the text and another way in the index. After numerous rounds of cleanup to account for differences in names or oddities in formatting, I ended up with data I could write to a CSV for other processing.
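One way to catch spelling mismatches like these is fuzzy string matching. The sketch below uses difflib from Python's standard library; it is an illustration rather than the cleanup I actually did, and the helper name match_index_name, the cutoff of 0.85, and the names themselves are assumptions made up for the example.

```python
import difflib

def match_index_name(text_name, index_names, cutoff=0.85):
    """Return the index spelling closest to text_name,
    or None if nothing is similar enough."""
    hits = difflib.get_close_matches(text_name, index_names, n=1, cutoff=cutoff)
    return hits[0] if hits else None

# Made-up index spellings, not taken from the book.
index_names = ["Smythe, John", "Doe, Jane"]

close = match_index_name("Smyth, John", index_names)       # differs by one letter
missing = match_index_name("Zzyzx, Quentin", index_names)  # nothing similar in the index
print(close, missing)
```

A cutoff this high is deliberately conservative. Anything that comes back as None still needs a human to look at it, which is where the rounds of manual cleanup come in.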

Tonight, out of curiosity, I sat down and looked at the distribution of birth years for all inventors in my database born in or after 1690.
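For anyone who wants to do the same with their own data, a summary like this takes only a few lines. The sketch below is an assumption about how one might do it, not my exact code: the function name birth_year_summary is hypothetical and the list of years is invented for illustration, not pulled from the database.

```python
from collections import Counter
from statistics import mean, median

def birth_year_summary(years, start=1690):
    """Summarize birth years at or after `start`: count, mean, median,
    and a per-decade tally suitable for plotting as a histogram."""
    kept = sorted(y for y in years if y >= start)
    by_decade = Counter((y // 10) * 10 for y in kept)
    return {
        "n": len(kept),
        "mean": mean(kept),
        "median": median(kept),
        "by_decade": dict(sorted(by_decade.items())),
    }

# Invented years for illustration only; 1650 falls before the 1690 cutoff.
made_up_years = [1650, 1690, 1736, 1791, 1834, 1847, 1847, 1903]
summary = birth_year_summary(made_up_years)
print(summary["n"], summary["median"])
print(summary["by_decade"])
```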


I am sure there are errors in the database since I have not thoroughly gone over it and checked entries, but this first pass showed N=1065 individuals, with a mean birth year of 1830 (median of 1834). I was somewhat surprised at how the numbers fell off as you get into the 20th century, but I have a thought as to why that might be. The book is a biographical take on the history of technology. For that reason, it is necessarily biased toward individuals rather than the work of teams of people. For example, who invented the atomic bomb? Yes, Leo Szilard famously came up with the idea and patented it, but he certainly didn't build one in his back shed. A team built The Gadget.

Don't misunderstand me to be on the "the age of the lone inventor is dead" bandwagon. I'm not weighing in on that. I'm simply trying to explain what I'm seeing and guessing at the bias in this one source.