Bibliography and Bitrot

Joseph Reagle

Three stories of bits

  • found
  • lost
  • and preserved

Bits found

Nupedia archives (2006)

wget --exclude-domains gizmology.net -e robots=off -nH --cut-dirs=3 \
  --base=http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/ \
  -r -l 4 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/

Wikipedia 10K (2010)

Bits lost

Digital posterity (2008)

Doing a quick link-check analysis of the largest mindmap, I find the following: 941 of the resources are “OK”; 21 are “404” (no longer there); and 10 are “Timeout”. So, within just a few years, ~2% are gone outright (~3% if the timeouts are counted as well). For example, the link to Sanger’s 2005 information about his (then) new Digital Universe project is already broken; news sites, I must say, are the worst.
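
A link check of this kind can be roughly sketched in shell. This is only an illustration, not the tool used for the 2008 analysis: the function names (classify_status, check_link, percent) and the 10-second timeout are assumptions, and curl is assumed to be installed. With the counts quoted above, 21 of 972 links is about 2.2% returning 404, and 31 of 972 is about 3.2% once timeouts are included.

```shell
#!/bin/sh
# Hypothetical helper: map an HTTP status code, as printed by curl,
# to one of the buckets used in the tally above.
classify_status() {
    case "$1" in
        2??|3??) echo "OK" ;;
        404)     echo "404" ;;
        # curl prints 000 when no HTTP response arrived (timeout, DNS
        # failure, refused connection); all are lumped together here.
        000)     echo "Timeout" ;;
        *)       echo "Other" ;;
    esac
}

# Hypothetical helper: fetch one URL (following redirects, giving up
# after an assumed 10 seconds) and print its bucket.
check_link() {
    code=$(curl -s -o /dev/null -L --max-time 10 -w '%{http_code}' "$1")
    classify_status "$code"
}

# Hypothetical helper: print part/total as a percentage, one decimal place.
percent() {
    awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f", 100 * part / total }'
}
```

Assuming a file with one URL per line (urls.txt is a placeholder name), running check_link on each line and piping the results through sort and uniq -c would give a tally in the same shape as the one reported. Calling percent 21 972 gives the share of 404s.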

Bits preserved

Wikipedia sources (2008)

wget --restrict-file-names=windows -c --recursive --level=1 \
  --span-hosts --convert-links --execute robots=off -t 4 \
  http://reagle.org/joseph/2008/02/wp-srcs/field-note-cats.html

perma.cc

perma.cc helps authors and journals create permanent archived citations in their published work

What to do?

What can we learn from physical media archival practices?

What archival practices should digital scholars be learning?

Should efforts like perma.cc be expanded?