Improving PDF Annotations from GoodReader

For many years now, I’ve printed out PDFs and scribbled annotations on them. I then dictate my annotations (i.e., excerpts and comments) into a text file that I can transform and include in my bibliographic mindmap system (see de.py in thunderdell).

With the purchase of an iPad (I gave up on waiting for a decent Android tablet), I’m now annotating PDFs via the GoodReader app. Of course, the highlighted text is only as accurate as the PDF’s underlying text layer. That copyable text, generated by OCR, can have conjoined words or errors resulting from misread ligatures, accents, or cruft. Also, the page number within the PDF file probably doesn’t correspond to the document’s printed pagination.

With the short Python script gr-fix.py, I use a dictionary to correct OCR errors and transform the GoodReader format into the one used by de.py. This doesn’t correct everything (e.g., words with capitals) and can introduce a few errors itself, but it’s a great improvement on the original OCR. The --number argument also lets you correct the page numbers by an offset.
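
To give a feel for the dictionary approach, here is a minimal sketch. It is not the code in gr-fix.py: the names VOCAB, expand_ligatures, split_conjoined, fix_text, and shift_page are all hypothetical, the tiny word set stands in for a real word list, and I am assuming the page offset is simply added to the PDF's page number. It expands a few ligature characters, splits a lowercase token into two known words when the token itself is not in the dictionary, and leaves capitalized words alone.

```python
import re

# Hypothetical stand-in vocabulary; a real run would load a full word list.
VOCAB = {"the", "public", "sphere", "of", "is", "a", "first", "define"}

# Unicode ligature characters that OCR sometimes leaves in copied text.
LIGATURES = {"\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl"}


def expand_ligatures(text):
    """Replace single-character ligatures with their plain letters."""
    for ligature, plain in LIGATURES.items():
        text = text.replace(ligature, plain)
    return text


def split_conjoined(word, vocab):
    """Split an unknown lowercase word into two known words, if possible."""
    if word in vocab or not word.islower():
        return word  # known words and words with capitals are left alone
    for i in range(1, len(word)):
        head, tail = word[:i], word[i:]
        if head in vocab and tail in vocab:
            return head + " " + tail
    return word  # no two-word split found; keep the original


def fix_text(text, vocab=VOCAB):
    """Expand ligatures, then try to split each alphabetic token."""
    text = expand_ligatures(text)
    return re.sub(
        r"[A-Za-z]+",
        lambda match: split_conjoined(match.group(0), vocab),
        text,
    )


def shift_page(pdf_page, offset):
    """Map a PDF page number to the printed page (assumed: simple addition)."""
    return pdf_page + offset


print(fix_text("thepublic sphere"))
print(shift_page(15, -12))
```

Taking the first split that yields two known words is also how a fix like this can introduce errors of its own: a legitimate word missing from the dictionary may be broken apart if both halves happen to be words.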
