When doing Web ethnography I capture many Web pages (blogs,
conversations, etc.) for inclusion in my mindmap and bibliography. BusySponge’s custom
heuristics for scraping metadata (author, title, date, organization)
from specific sites (e.g., Wikipedia or MARC email lists) are spot on.
I’m fairly happy with the default heuristics for all other Web sites as
well. Yet, when they fail, they fail on the Web page’s author, which might
appear at the page’s top, bottom, or not at all. I’ve often wondered if
there was a named entity recognition technique for identifying Web page
authors.
Hence, when I encountered AlchemyAPI, I was able
to easily make use of URLGetAuthor in BusySponge. I still
prefer my own heuristics (regular expressions that capture strings
after “by” or strings near “posted on” dates), since many authors use
nonsensical pseudonyms. But if my techniques fail, AlchemyAPI
is quite good at capturing traditional (i.e., first and last)
European names.
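
For the curious, here is a minimal sketch of the "heuristics first, service second" idea. It is not BusySponge's actual code: the patterns, the `guess_author` name, and the `fallback` parameter are all made up for illustration, and `fallback` merely stands in for a remote lookup such as AlchemyAPI's URLGetAuthor, which I don't reproduce here.

```python
import re
from typing import Callable, Optional

# A capitalized name of one to three words (illustrative, not exhaustive).
NAME = r"[A-Z][\w'-]+(?:\s+[A-Z][\w'-]+){0,2}"

# Hypothetical patterns, tried in order.
AUTHOR_PATTERNS = [
    # "by darkwolf42" or "By Jane Doe": the first word may be a pseudonym,
    # so it is not required to be capitalized.
    re.compile(r"\b[Bb]y:?\s+([\w'-]+(?:\s+[A-Z][\w'-]+){0,2})"),
    # "Jane Doe posted on March 3, 2010": a name just before a "posted on" date.
    re.compile(r"(" + NAME + r")\s+[Pp]osted\s+on\b"),
]


def guess_author(
    text: str, fallback: Optional[Callable[[], Optional[str]]] = None
) -> Optional[str]:
    """Return the first regex hit; otherwise ask the fallback, if one is given."""
    for pattern in AUTHOR_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(1)
    # Only reached when every local heuristic has failed.
    return fallback() if fallback else None
```

Passing the fallback as a callable means the (slower, remote) lookup only happens when the regular expressions come up empty, which matches my preference for the local heuristics.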