For many years now, I’ve printed out PDFs and scribbled annotations on them. I then dictate my annotations (i.e., excerpts and comments) into a text file that I can transform and include in my bibliographic mindmap system (see extract-dictation.py in Thunderdell).
With the purchase of an iPad—I gave up on waiting for a decent Android tablet—I’m now annotating PDFs via the GoodReader app. Of course, the accuracy of the text highlighted is only as good as the PDF. The copyable text, generated by OCR, can have conjoined words or suffer from errors resulting from misunderstood ligatures, accents, or cruft. Also, the actual page number of the PDF probably doesn’t correspond to the document’s pagination.
With the short Python script extract-goodreader.py, I use a dictionary to correct OCR errors and convert the GoodReader format into the one used by extract-dictation.py. This doesn’t correct everything (e.g., words with capitals) and can introduce a few errors itself, but it’s a great improvement on the original OCR. The --number argument also lets you correct the page numbers by an offset. If there’s a recent DOI or a 13-digit ISBN, it’ll grab the bibliographic information as well.
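
To give a feel for the idea, here is a minimal sketch of dictionary-based cleanup plus a page offset. It is not the code in extract-goodreader.py: the function names, the tiny KNOWN_WORDS set, and the splitting heuristic are all assumptions made for illustration, and a real run would load a full word list. It expands ligatures with Unicode normalization, splits a lowercase token in two when both halves are dictionary words, and adds an offset to a PDF page number.

```python
import re
import unicodedata

# Hypothetical, tiny word list for illustration; a real run would load
# a full dictionary file instead.
KNOWN_WORDS = {"the", "of", "field", "is", "in", "efficient", "theory"}


def fix_ligatures(text: str) -> str:
    """Expand typographic ligatures (e.g. U+FB01 'fi') into plain letters.

    NFKC normalization also folds other compatibility characters,
    such as non-breaking spaces, into their plain equivalents.
    """
    return unicodedata.normalize("NFKC", text)


def split_conjoined(word: str, known: set[str]) -> str:
    """Split a token like 'ofthe' into 'of the' if both halves are known.

    Tokens containing capitals are left alone, so errors in capitalized
    words are not corrected by this sketch.
    """
    if word in known or not word.islower():
        return word
    for i in range(1, len(word)):
        left, right = word[:i], word[i:]
        if left in known and right in known:
            return f"{left} {right}"
    return word


def correct_text(text: str, known: set[str]) -> str:
    """Apply ligature expansion, then try to split each alphabetic token."""
    text = fix_ligatures(text)
    return re.sub(
        r"[A-Za-z]+",
        lambda m: split_conjoined(m.group(0), known),
        text,
    )


def offset_page(pdf_page: int, offset: int) -> int:
    """Map a PDF page index to the document's printed page number."""
    return pdf_page + offset


if __name__ == "__main__":
    print(correct_text("ofthe \ufb01eld", KNOWN_WORDS))  # of the field
    print(correct_text("Ofthe theory", KNOWN_WORDS))     # Ofthe theory
    print(offset_page(3, 14))                            # 17
```

A heuristic like this can also introduce errors of its own. With the word list above, a legitimate word that is missing from the dictionary, such as "inefficient", would be split into "in efficient", so the quality of the result depends heavily on how complete the dictionary is.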