Joseph Reagle
wget --exclude-domains gizmology.net -e robots=off -nH --cut-dirs=3 \
  --base=http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/ \
  -r -l 4 -N -k -p -R js -Gbase \
  http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/
So, running a quick link check on the largest mindmap, I find the following: 941 of those resources are “OK”; 21 are “404” (no longer there); and 10 are “Timeout”. So, within just a few years, ~2% are already gone (about 3% if the timeouts are also counted as unavailable). For example, the link to Sanger’s 2005 information about his (then) new Digital Universe project is already broken; but I must say news sites are the worst.
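A tally like the one above could be reproduced roughly as follows. This is only a sketch, not the tool actually used for the numbers in this post: `classify`, `check_links`, and `percent` are hypothetical helper names, `urls.txt` is an assumed file with one URL per line, and the 10-second timeout is an arbitrary choice.

```sh
#!/bin/sh
# Sketch only: helper names are hypothetical, not the checker used above.

# Map an HTTP status code (as printed by curl) to a tally label.
classify() {
  case "$1" in
    000) echo "Timeout" ;;  # curl prints 000 when no HTTP response arrived
                            # (timeouts, but also DNS or connection failures)
    404) echo "404" ;;
    2??) echo "OK" ;;
    *)   echo "Other" ;;
  esac
}

# Read URLs (one per line) on stdin; print "label<TAB>url" for each.
check_links() {
  while IFS= read -r url; do
    code=$(curl -s -o /dev/null -L --max-time 10 -w '%{http_code}' "$url")
    printf '%s\t%s\n' "$(classify "$code")" "$url"
  done
}

# Share of links in one category, as a percentage of the total.
percent() {
  awk -v part="$1" -v total="$2" 'BEGIN { printf "%.1f\n", 100 * part / total }'
}

# Usage (urls.txt is an assumed file name):
#   check_links < urls.txt | cut -f1 | sort | uniq -c
```

With the counts reported above (941 + 21 + 10 = 972 links), `percent 21 972` gives 2.2 for the 404s alone, and `percent 31 972` gives 3.2 when the timeouts are included.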
We found that half of the links in all Supreme Court opinions no longer work. And more than 70% of the links in such journals as the Harvard Law Review (in that case measured from 1999 to 2012), currently don’t work. As time passes, the number of non-working links increases.
wget --restrict-file-names=windows -c --recursive --level=1 \
  --span-hosts --convert-links --execute robots=off -t 4 \
  https://reagle.org/joseph/2008/02/wp-srcs/field-note-cats.html
perma.cc helps authors and journals create permanent archived citations in their published work.