In 2002 I began thinking about how best to capture and share the many web pages and small tasks of the day. I thought of it as a “busy sponge”: logging bookmarks and tasks to my team page with a minimum of typing. I also wanted to tag each entry with a keyword that could then be used in queries. I posted an implementation in 2003. It was complemented by the fact that the tasks on my team page were syndicated (via RSS) and used to generate my “two minute reports” at the weekly staff meeting at MIT. This was a number of years before the notion of micro-blogging became popular.
Two interesting features have since matured: I wanted it to fetch the title of a URL – typing HTML was a hassle – and I wanted to tell it which of my pages to log the entry to: my personal weblog or my work team page. For example:
urd:/home/reagle > b http://pesto.redgecko.org/dispatch.html j python Noted ^
is a sponge of a URL to my “j” (work) page where “^” becomes the hypertextual page title, resulting in:
<li class="event" id="e090313-f7fd">090313: python] <a href="http://pesto.redgecko.org/dispatch.html">Noted URL dispatch — Pesto: a library for WSGI applications</a></li>
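The substitution of “^” by the page title can be sketched roughly as follows. This is an illustrative reconstruction, not BusySponge's actual code: the names `TitleParser`, `extract_title`, and `format_entry` are hypothetical, the page HTML is passed in as a string rather than fetched, and the four-character id suffix is derived from an MD5 hash of the URL purely as an assumption (the real scheme is not described here).

```python
import hashlib
from html import escape
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text inside the first <title> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)


def extract_title(html_text):
    """Return the page title with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html_text)
    return " ".join("".join(parser.parts).split())


def format_entry(date, url, keyword, comment, page_title):
    """Build a log line, replacing '^' in the comment with the page title."""
    text = comment.replace("^", page_title)
    # Assumption: a short hash of the URL stands in for the real id suffix.
    suffix = hashlib.md5(url.encode("utf-8")).hexdigest()[:4]
    return (
        f'<li class="event" id="e{date}-{suffix}">{date}: {keyword}] '
        f'<a href="{escape(url)}">{escape(text)}</a></li>'
    )
```

Calling `format_entry("090313", url, "python", "Noted ^", extract_title(page_html))` would yield a list item of the same shape as the one above, though the id suffix will generally differ.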
With BusySponge 0.5 (now distributed as part of Thunderdell), this has matured into a set of classes for webpage screen scraping and a set of logger functions. So, for example, I might sponge a comment about a URL and indicate that it should be logged to my bibliographic mindmap (Thunderdell), and it will do its best to fetch the page author, title, date, publisher, permanent link, excerpt of the first substantive paragraph, etc. The default heuristics do a surprisingly decent job – certainly better than typing it from scratch – and the specific scrapers (e.g., Wikipedia, MARC email archives) are quite good.
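The split between default heuristics and site-specific scrapers might look something like the sketch below. Everything here is hypothetical and is not Thunderdell's actual API: the class names `DefaultScraper` and `WikipediaScraper`, the `SCRAPERS` table, and `get_scraper` are invented for illustration, only two fields are handled, and the Wikipedia rule assumes page titles end in " - Wikipedia".

```python
import re
from urllib.parse import urlparse


class DefaultScraper:
    """Generic heuristics for any page (hypothetical class name)."""

    def __init__(self, url, html_text):
        self.url = url
        self.html_text = html_text

    def get_title(self):
        match = re.search(r"<title[^>]*>(.*?)</title>", self.html_text, re.I | re.S)
        return " ".join(match.group(1).split()) if match else "UNKNOWN"

    def get_publisher(self):
        # Fall back to the host name when nothing better is known.
        return urlparse(self.url).netloc

    def get_biblio(self):
        return {
            "title": self.get_title(),
            "publisher": self.get_publisher(),
            "url": self.url,
        }


class WikipediaScraper(DefaultScraper):
    """Site-specific overrides (assumes titles end with ' - Wikipedia')."""

    def get_title(self):
        title = super().get_title()
        suffix = " - Wikipedia"
        return title[: -len(suffix)] if title.endswith(suffix) else title

    def get_publisher(self):
        return "Wikipedia"


# Host pattern -> scraper class; anything unmatched gets the default heuristics.
SCRAPERS = [(r"(^|\.)wikipedia\.org$", WikipediaScraper)]


def get_scraper(url, html_text):
    """Pick the most specific scraper available for this URL's host."""
    host = urlparse(url).netloc
    for pattern, scraper_class in SCRAPERS:
        if re.search(pattern, host):
            return scraper_class(url, html_text)
    return DefaultScraper(url, html_text)
```

The appeal of this arrangement is that a new site only needs a subclass that overrides the fields the defaults get wrong, plus one entry in the table.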