Improving PDF Annotations from GoodReader

For many years now, I’ve printed out PDFs and scribbled annotations on them. I then dictate my annotations (i.e., excerpts and comments) into a text file that I can transform and include in my bibliographic mindmap system (see extract-dictation.py in Thunderdell).

With the purchase of an iPad—I gave up on waiting for a decent Android tablet—I’m now annotating PDFs via the GoodReader app. Of course, the accuracy of the text highlighted is only as good as the PDF. The copyable text, generated by OCR, can have conjoined words or suffer from errors resulting from misunderstood ligatures, accents, or cruft. Also, the actual page number of the PDF probably doesn’t correspond to the document’s pagination.

With the short Python script extract-goodreader.py, I use a dictionary to correct OCR errors and transform the GoodReader format into the one used by extract-dictation.py. This doesn’t correct everything (e.g., words with capitals) and can introduce a few errors itself, but it greatly improves on the original OCR. The --number argument also lets you correct the page numbers by an offset. If there’s a recent DOI or 13-digit ISBN, it’ll grab the bibliographic details as well.
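To give a feel for the dictionary approach, here is a minimal sketch. It is not the code from extract-goodreader.py: the word list, the function names (expand_ligatures, split_conjoined, correct_text, correct_page), and the sign convention for the offset are all assumptions made for illustration. It expands ligature characters, splits a conjoined lowercase token when both halves are dictionary words, and shifts a PDF page number by an offset.

```python
import re

# Hypothetical, tiny word list; a real run would load a full dictionary file.
WORDS = {"the", "text", "of", "a", "page", "is", "final", "office"}

# Ligature characters that copied PDF text often carries over from typesetting.
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace single-character ligatures with their plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def split_conjoined(token, words=WORDS):
    """Split a lowercase token into two dictionary words, if possible.

    Tokens with capitals are left alone, and the first valid split wins,
    so this can occasionally introduce an error of its own.
    """
    if not token.islower() or token in words:
        return token
    for i in range(1, len(token)):
        left, right = token[:i], token[i:]
        if left in words and right in words:
            return f"{left} {right}"
    return token


def correct_text(text, words=WORDS):
    """Expand ligatures, then try to split every alphabetic token."""
    text = expand_ligatures(text)
    return re.sub(
        r"[A-Za-z]+",
        lambda match: split_conjoined(match.group(0), words),
        text,
    )


def correct_page(pdf_page, offset):
    """Assumed convention: printed page = PDF page + offset."""
    return pdf_page + offset
```

With this toy word list, correct_text("The ﬁnal pageof thetext") returns "The final page of the text", and correct_page(3, 12) returns 15. Taking the first valid split keeps the sketch short, but it is also the sort of shortcut that can produce a wrong split now and then.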

2020-07 Update: the GoodReader script is now part of Thunderdell itself and joined by extract-kindle.py and extract-instapaper.py.

Open Codex Code & Culture
