Bibliography and Bitrot

Joseph Reagle

Three stories of bits

found
lost
and preserved

Bits found

Nupedia archives (2006)

wget --exclude-domains gizmology.net -e robots=off -nH --cut-dirs=3 \
  --base=http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/ \
  -r -l 4 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/
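
The archived address in the command above is a 14-digit capture timestamp followed by the original URL, both appended to the Wayback Machine prefix. A minimal sketch of composing such an address follows; the function name wayback_url is hypothetical and is not part of wget or of any Internet Archive tool.

```shell
# wayback_url TIMESTAMP URL
# Hypothetical helper: joins a Wayback Machine capture timestamp
# (YYYYMMDDhhmmss) and an original URL into an archived address.
wayback_url() {
    printf 'http://web.archive.org/web/%s/%s\n' "$1" "$2"
}
```

Calling wayback_url 20030822044803 http://www.nupedia.com/pipermail/nupedia-l/ prints the address that the wget command retrieves.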

Wikipedia 10K (2010)

Bits lost

Digital posterity (2008)

So, doing a quick check-link analysis of the largest mindmap I find the following: 941 of those resources are “OK”; 21 are “404” (no longer there); and 10 “Timeout”. So, just within a few years ~2% aren’t readily available. For example, the link to Sanger’s 2005 information about his (then) new Digital Universe project is already broken; but I must say news sites are the worst.
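
The quoted post does not name the tool behind these counts, so the following is only a sketch of how such a tally might be reproduced, assuming curl and awk are available. The function names link_status and unavailable_share are hypothetical. With the counts quoted above, the 404s alone come to about 2.2% of the 972 links, and 404s plus timeouts to about 3.2%.

```shell
# link_status URL
# Hypothetical helper: prints OK, Timeout, or the HTTP status code.
# curl exits with status 28 when --max-time is exceeded.
link_status() {
    code=$(curl -s -o /dev/null -L --max-time 10 -w '%{http_code}' "$1")
    rc=$?
    if [ "$rc" -eq 28 ]; then
        echo Timeout
    elif [ "$code" = 200 ]; then
        echo OK
    else
        echo "$code"
    fi
}

# unavailable_share OK NOTFOUND TIMEOUT
# Hypothetical helper: prints the share of links that returned 404,
# and the share that returned 404 or timed out, to one decimal place.
unavailable_share() {
    awk -v ok="$1" -v nf="$2" -v to="$3" 'BEGIN {
        total = ok + nf + to
        if (total == 0) exit 1
        printf "404: %.1f%%, 404 or timeout: %.1f%%\n", 100 * nf / total, 100 * (nf + to) / total
    }'
}
```

Running unavailable_share 941 21 10 prints "404: 2.2%, 404 or timeout: 3.2%". The link_status function could be run over a list of URLs to produce the three counts in the first place.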

Legal links (2013)

We found that half of the links in all Supreme Court opinions no longer work. And more than 70% of the links in such journals as the Harvard Law Review (in that case measured from 1999 to 2012), currently don’t work. As time passes, the number of non-working links increases.

Bits preserved

Wikipedia sources (2008)

wget --restrict-file-names=windows -c --recursive --level=1 \
  --span-hosts --convert-links --execute robots=off -t 4 \
  https://reagle.org/joseph/2008/02/wp-srcs/field-note-cats.html

perma.cc

perma.cc helps authors and journals create permanent archived citations in their published work

What to do?

What can we learn from physical media archival practices?

What archival practices should digital scholars be learning?

Should efforts like perma.cc be expanded?