ZotZero and BusySponge

I have been reading of ZotZero in Josh's blog and am hopeful that it will help bridge the gap between the dynamic and informal life of the Web (e.g., reading, blogging, bookmarks, RSS, etc.) and the seemingly lifeless task of bibliography. Wouldn't it be nice if citing something were as easy as bookmarking it? Or if you could read what your colleagues were reading via an RSS feed?

While I haven't played with ZotZero yet -- and I use the Konqueror browser, not Firefox -- I share this vision and hope to see it become a reality. And since I recently posted about my Freemind Extract tool (for transforming a mindmap into a bibliography), I realize I haven't spoken of the flipside in a couple of years: absorbing information. But first, a historical digression.

The way I make note of and annotate resources and tasks evolved out of two practices at the W3C. The first was a decree by Timbl, which I objected to strongly at the time: the great datespace shift of 1999. Because the W3C's root file/name space was getting too crowded, Tim's new policy forbade new top-level spaces like www.w3.org/Signature or www.w3.org/Encryption. There were too many already, and who were we to lay claim to such spaces for all time? There might be a new digital signature activity 10 years from now, so where would it live? (Consequently, the subsequent key management working group received www.w3.org/2001/XKMS.) I appreciated this concern at the root level, but cringed at only being able to organize other files by date of creation. Try finding a document you wrote a couple of years ago in a space that is no more structured than /2001/{01,..,12} and is shared by 50+ other people. It's not easy. I realized the only way I could keep track of things I had worked on was to keep a log of the events and documents I cared about. (This shift also affected how we collaborated in our shared space given issues of ownership, access controls, and version management -- but perhaps more on that another time.)

The second W3C practice was that each of its hosts (worksites) had a weekly meeting at which we shared the important events of the past week and raised agenda issues for common discussion. To make it easier for the minute takers we e-mailed two minutes to an e-mail list and a bot would collect them into draft minutes which would be augmented with the IRC log.

Preparing my two minutes before 10 a.m. Tuesday morning always seemed more frantic than it needed to be. But once I started keeping a log of what I had done, as a result of the datespace shift, it became trivial. (In fact, I wrote a script to grab the past week automatically, and even generated an RSS feed from the work log so that one could "subscribe" to my work log by keyword/task -- anticipating RSS feeds of tagged bookmarks.)

By 2002 I had tired of manually logging events, via an HTML editor, to my personal blog and work log, so I wrote a specification for a dream tool: Busy Sponge. It would soak up everything I touched of importance and send it to the right place. I opted for a commandline tool I named b.py.

Returning to today, a challenge I'm sure I share with the ZotZero folks is how to automatically scrape as much metadata as possible from a Web resource. Busy Sponge continues to be the primary way I input data into my work log and mind maps. Because metadata is no more common or standard on the Web than it was five years ago, I am dependent on screen-scraping heuristics. For example, the following code allows me to easily capture and cite messages from Wikipedia mailing lists -- and that is why it was such a hassle when the archives broke:

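elif url.startswith("http://marc.theaimsgroup.com/"):
    # Assumes `re` is imported and `html` holds the fetched page.
    try:
        # Try the non-greedy pattern first, then the greedy one.
        author = re.search('''From: *(.*?)''', html).group(1)
    except AttributeError:
        author = re.search('''From: *(.*)''', html).group(1)
    # Undo the archive's address munging and decode the angle brackets.
    author = author.replace(' () ','@').replace(' ! ','.')\
        .replace('&lt;', '<').replace('&gt;', '>')
    # Keep only the display name, without the address or quotes.
    author = author.split(' <')[0]
    author = author.replace('"','')

    mlist = re.search('''List: *(.*?)''', html).group(1)

    mdate = re.search('''Date: *(.*?)''', html).group(1)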
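
For anyone who wants to play with the idea, here is a minimal, self-contained sketch of the same kind of heuristic. It is not b.py itself: the function name scrape_headers, the sample page, and the rest-of-line patterns are assumptions made purely for illustration, and a real archive page would need patterns fitted to its actual markup.

```python
import html
import re


def scrape_headers(page):
    """Pull author, list, and date out of a plain-text rendering of a message.

    Hypothetical helper: assumes each header sits on its own line as
    'Name: value', which a real archive page may not guarantee.
    """
    def header(name):
        # Capture everything after 'Name:' up to the end of that line.
        match = re.search(r'^%s: *(.*)$' % re.escape(name), page, re.MULTILINE)
        return match.group(1).strip() if match else None

    author = header('From')
    if author is not None:
        # Undo the address munging (' () ' for '@', ' ! ' for '.').
        author = author.replace(' () ', '@').replace(' ! ', '.')
        # Decode entities such as &lt; and &gt;.
        author = html.unescape(author)
        # Keep only the display name, without the address or quotes.
        author = author.split(' <')[0].replace('"', '')

    return {'author': author, 'list': header('List'), 'date': header('Date')}


# Made-up sample page, not taken from any real archive.
SAMPLE = '''List: example-l
From: "Jane Doe" &lt;jane () example ! org&gt;
Date: 2006-10-12 14:03:27
'''

print(scrape_headers(SAMPLE))
```

Returning None for a missing header, rather than raising, is a design choice of this sketch: it lets the caller decide which fields still need manual editing.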

Unfortunately, beyond a couple of mailing list archives and wikis -- which, fortunately, are the majority of what I grab -- I have to manually edit my sponges to add proper meta/bibliographic data. And curses upon those bloggers who make it difficult to determine the author of an article or even the whole blog -- even a pseudonym will do! Beyond the usage of my own tool, I can imagine much value in a social tool that allows users to share annotations, or even screen-scraping "plug-ins." One can hope!
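
To make the "plug-in" idea a little more concrete, here is one way shareable scrapers might be organized: each plug-in registers itself for a URL prefix, and a dispatcher picks the first match. This is only a sketch under my own assumptions; the names SCRAPERS, scraper, scrape, and scrape_example_list, as well as the lists.example.org address, are hypothetical and do not come from Busy Sponge or any existing tool.

```python
SCRAPERS = []  # (url_prefix, function) pairs, in registration order


def scraper(url_prefix):
    """Decorator that registers a scraping function for URLs with this prefix."""
    def register(func):
        SCRAPERS.append((url_prefix, func))
        return func
    return register


@scraper('http://lists.example.org/')
def scrape_example_list(url, page):
    # Hypothetical plug-in: a real one would parse the page here.
    return {'author': 'unknown', 'source': 'example list archive'}


def scrape(url, page):
    """Run the first registered scraper whose prefix matches the URL."""
    for prefix, func in SCRAPERS:
        if url.startswith(prefix):
            return func(url, page)
    # No plug-in matched: the caller falls back to manual editing.
    return None


print(scrape('http://lists.example.org/msg/1', ''))
print(scrape('http://blog.example.com/post', ''))
```

With something like this, sharing a scraper would amount to sharing one decorated function per site.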
