BusySponge and AlchemyAPI

When doing Web ethnography I capture many Web pages (blogs, conversations, etc.) for inclusion in my mindmap and bibliography. BusySponge's custom heuristics for scraping metadata (author, title, date, organization) from specific sites (e.g., Wikipedia or MARC email lists) are spot on, and I'm fairly happy with the default heuristics for all other Web sites as well. When the defaults do fail, though, they fail on the page's author, which might appear at the top of the page, at the bottom, or not at all. I've often wondered if there was a named entity recognition technique for identifying Web page authors. Hence, when I encountered AlchemyAPI, I was able to easily make use of its URLGetAuthor in BusySponge. I still prefer my own heuristics (regular expressions that capture strings after "by" or strings near "posted on" dates), since many authors use nonsensical pseudonyms that a name recognizer is unlikely to catch. But when my techniques fail, AlchemyAPI is quite good at capturing conventional (i.e., first and last) European names.
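The "heuristics first, service second" approach can be sketched roughly as follows. This is a minimal illustration, not BusySponge's actual code: the two patterns, the `guess_author` function, and the `fallback` callable are all hypothetical, and the fallback merely stands in for a call to a service such as AlchemyAPI's URLGetAuthor, whose request format is not shown here.

```python
import re
from typing import Callable, Optional

# Hypothetical patterns for illustration; not BusySponge's real expressions.
# A "by" cue followed by one token (which may be a pseudonym) and up to two
# further capitalized words.
BY_PATTERN = re.compile(
    r"\b(?i:by)[:\s]+(\S+(?:[ \t]+[A-Z][\w'-]*){0,2})"
)
# Up to three capitalized words sitting directly before a "posted on" cue.
POSTED_ON_PATTERN = re.compile(
    r"([A-Z][\w'-]*(?:[ \t]+[A-Z][\w'-]*){0,2})[ \t]+(?i:posted on)\b"
)


def guess_author(
    text: str,
    url: str,
    fallback: Optional[Callable[[str], Optional[str]]] = None,
) -> Optional[str]:
    """Try the regex heuristics first; consult the fallback only if they fail."""
    for pattern in (BY_PATTERN, POSTED_ON_PATTERN):
        match = pattern.search(text)
        if match:
            # Trim punctuation that the loose first token may have swallowed.
            return match.group(1).strip(" .,;:")
    if fallback is not None:
        # Assumed to wrap a remote author-extraction service keyed by URL.
        return fallback(url)
    return None


if __name__ == "__main__":
    page = "Posted by xX_wiki_fan_Xx on 2010-05-01"
    print(guess_author(page, "http://example.org/post"))
```

Because the first token after "by" may be any run of non-space characters, a pseudonym survives intact, which is the case where a name recognizer tends to struggle. The cost is the occasional false positive on ordinary prose containing "by", so a real implementation would likely restrict the search to byline-like regions of the page.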
