Archiving sources

Later this week I'll be participating on a panel at the New Media in American Literary History Interdisciplinary Symposium. I plan to talk about bibliography and bit rot.

When I began my historical study of Wikipedia I was sad to note that much early material was lost to bit rot. Hence, I was pleased to find remnants of the mailing lists of Nupedia (Wikipedia's predecessor) on the Wayback Machine and to create an archive -- by way of wget -- for others. Similarly, the first edits to Wikipedia were considered lost until Tim Starling found old log files from Wikipedia. I used these to reconstruct the first ten thousand contributions to Wikipedia, about six weeks' worth of edits.

These were nice finds. But these were two steps forward in an otherwise persistent jog backwards. For instance, in 2008 I noted that the references to contemporary sources in my own writing were quickly rotting:

So, doing a quick check-link analysis of the largest mindmap I find the following: 941 of those resources are "OK"; 21 are "404" (no longer there); and 10 "Timeout". So, just within a few years ~2% aren't readily available. For example, the link to Sanger's 2005 information about his (then) new Digital Universe project is already broken; but I must say news sites are the worst.
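A link check of this kind is easy to approximate. What follows is only a minimal sketch in Python, not the tool I used back then: the function names (`check_link`, `tally`, `share`), the use of HEAD requests, and the ten-second timeout are all assumptions made for illustration.

```python
from collections import Counter
import socket
import urllib.error
import urllib.request


def check_link(url, timeout=10):
    """Classify one URL as 'OK', an HTTP status code such as '404',
    'Timeout', or 'Error'. (Hypothetical helper, for illustration.)"""
    try:
        # HEAD asks for headers only; some servers reject it, so a
        # fallback to GET may be needed in practice.
        request = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(request, timeout=timeout):
            return "OK"
    except urllib.error.HTTPError as err:
        # The server answered, but with an error status (404, 410, 500, ...).
        return str(err.code)
    except urllib.error.URLError as err:
        # Connection-level failure; a timeout may be wrapped in err.reason.
        if isinstance(err.reason, (socket.timeout, TimeoutError)):
            return "Timeout"
        return "Error"
    except (socket.timeout, TimeoutError):
        return "Timeout"
    except ValueError:
        # Malformed URL, e.g. a missing scheme.
        return "Error"


def tally(urls, checker=check_link):
    """Count how many URLs fall into each status category."""
    return Counter(checker(url) for url in urls)


def share(counts, status):
    """Fraction of all checked links that have the given status."""
    total = sum(counts.values())
    return counts[status] / total if total else 0.0
```

Feeding the counts quoted above into `share` (941 "OK", 21 "404", 10 "Timeout") gives 21/972 for the 404s, or about 2.2%.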

Hence, when I finished my dissertation I took the step of crawling all my sources to create and share an archive.

Others have begun to note this problem as well. Earlier this year Zittrain, Albert, and Lessig wrote of their work in the legal domain:

We found that half of the links in all Supreme Court opinions no longer work. And more than 70% of the links in such journals as the Harvard Law Review (in that case measured from 1999 to 2012), currently don't work. As time passes, the number of non-working links increases.

Hence a number of legal libraries have launched a service that aims to make the process of archiving and citing online sources much easier and more consistent than my own efforts with wget. I hope this effort spreads well beyond the legal discipline.

