With Zotero and Mendeley providing excellent web-centric bibliographic tools, I don’t imagine I’ll ever get converts to Thunderdell. Still, it and BusySponge are integral to how I work, and I enjoy the occasional bit of Python programming. Here are some disjointed thoughts from the bibliographic screen-scraping front.
- I’m envious of Zotero’s many user-contributed translators, each of which scrapes a particular website for bibliographic data. I wish there were a command-line tool or API I could call to take advantage of them.
- CrossRef now offers a DOI-to-bibliography Web service; BusySponge now uses this (via the JSON format).
- Even if one finds bibliographic data, it is often dirty. For instance, is the item’s title in title or sentence case? One can easily convert sentence to title case, but not vice versa. Hence, I’m quite pleased with change_case’s ability to detect title case and convert it to sentence case (i.e., detecting proper nouns that should remain capitalized).
- Because the New York Times redirects-with-a-cookie and httplib2 can’t easily handle that, I now make use of the awesome Requests (“HTTP for Humans”) – and web_little is now little indeed!
- I’m not yet satisfied with my generic scraper’s ability to identify an article’s author; I wonder if there are any tricks out there for this?
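
On the author question, one trick that may help is to check a page’s `<meta>` tags before falling back to guessing from the body text. Many sites declare authors under names such as `author`, `citation_author`, or `DC.creator`, or under the Open Graph property `article:author`. Below is a minimal sketch using only the standard library. The key list, the class, and the `find_authors` helper are illustrative assumptions, not what BusySponge does.

```python
from html.parser import HTMLParser

# Meta keys that often carry an author name. This list is an assumption
# for illustration; real pages vary and it is far from exhaustive.
AUTHOR_KEYS = ("citation_author", "dc.creator", "author", "article:author")


class AuthorMetaParser(HTMLParser):
    """Collect author names declared in <meta> tags."""

    def __init__(self):
        super().__init__()
        self.authors = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # The key may be in either `name` or `property` (Open Graph).
        key = (attrs.get("name") or attrs.get("property") or "").lower()
        content = (attrs.get("content") or "").strip()
        if key not in AUTHOR_KEYS or not content:
            return
        # Some sites put a profile URL here rather than a name; skip those.
        if content.lower().startswith("http"):
            return
        if content not in self.authors:
            self.authors.append(content)


def find_authors(html):
    """Return author names found in the page's meta tags, in document order."""
    parser = AuthorMetaParser()
    parser.feed(html)
    return parser.authors
```

Calling `find_authors(page_source)` returns a possibly empty list, so a generic scraper could try this first and only then fall back to byline heuristics.
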
(I’ve been meaning to post an update on Thunderdell and an interesting conversation with Nathan Matias about his work prompted me to do so.)
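
The title-case detection mentioned in the list above can be sketched roughly as follows. This is not the actual `change_case` code. It is a simplified illustration whose small-word list, proper-noun set, threshold, and function names are all assumptions. It treats a title as title case when most of its significant words are capitalized, then lowercases everything except the first word, the first word after a colon or sentence-ending punctuation, known proper nouns, and words with internal capitals (such as acronyms).

```python
import re

WORD = re.compile(r"[A-Za-z][A-Za-z'’]*")

# Words that stay lowercase even in title case (illustrative, incomplete).
SMALL_WORDS = {"a", "an", "and", "as", "at", "but", "by", "for", "in",
               "of", "on", "or", "the", "to", "via", "with"}

# Hypothetical stand-in for a real proper-noun list.
PROPER_NOUNS = {"Wikipedia", "Python", "Zotero"}


def is_title_case(title, threshold=0.8):
    """Guess whether a title is in title case.

    Ignores the first word and small words, then checks what fraction
    of the remaining words start with a capital letter.
    """
    words = WORD.findall(title)
    significant = [w for w in words[1:] if w.lower() not in SMALL_WORDS]
    if not significant:
        return False
    capitalized = sum(1 for w in significant if w[0].isupper())
    return capitalized / len(significant) >= threshold


def to_sentence_case(title, proper_nouns=PROPER_NOUNS):
    """Convert a title-case title to sentence case; leave others untouched."""
    if not is_title_case(title):
        return title

    def fix(match):
        word = match.group(0)
        before = title[:match.start()].rstrip()
        # Keep the first word and the first word of a subtitle capitalized.
        if not before or before.endswith((":", "?", "!", ".")):
            return word
        # Keep known proper nouns and words with internal capitals ("DOI").
        if word in proper_nouns or word[1:] != word[1:].lower():
            return word
        return word.lower()

    return WORD.sub(fix, title)
```

The hard part in practice is the proper-noun list: a fixed set like the one here will miss most names, so anything not listed gets lowercased.
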