<?xml version="1.0" encoding="iso-8859-1"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<title type="text">Joseph Reagle</title>
<subtitle type="html"><![CDATA[
Open Communities, Media, Source, and Standards
]]></subtitle>
<id>http://reagle.org/joseph/blog/technology/python/zotzero-busy-sponge</id>
<link rel="alternate" type="text/html" href="http://reagle.org/joseph/blog" />
<link rel="self" type="application/atom+xml" href="http://reagle.org/joseph/blog/technology/python/zotzero-busy-sponge?flav=atom" />


<author>
<name>Joseph Reagle</name>
<uri>http://reagle.org/joseph/blog/technology/python/zotzero-busy-sponge</uri>
<email></email>
</author>
<rights>Copyright 2003-2010 Joseph Reagle</rights>
<generator uri="http://pyblosxom.sourceforge.net/" version="1.4.3 01/10/2008">
PyBlosxom http://pyblosxom.sourceforge.net/ 1.4.3 01/10/2008
</generator>

<updated>2007-02-08T14:43:52Z</updated>
<!-- icon?  logo?  -->

<entry>
<title type="html">Zotero and BusySponge</title>
<category term="" />
<id>http://reagle.org/joseph/blog/2007/02/08/zotzero-busy-sponge</id>
<updated>2007-02-08T14:43:52Z</updated>
<published>2007-02-08T14:43:52Z</published>
<link rel="alternate" type="text/html" href="http://reagle.org/joseph/blog/technology/python/zotzero-busy-sponge.html" />
<content type="html">

&lt;p&gt;I have been reading about &lt;a href=&quot;http://www.zotero.org/&quot;&gt;Zotero&lt;/a&gt; in &lt;a href=&quot;http://www.epistemographer.com/&quot;&gt;Josh&apos;s&lt;/a&gt;
blog and am hopeful that it will help bridge the gap between the
dynamic and informal life of the Web (e.g., reading, blogging,
bookmarks, RSS, etc.) and the seemingly lifeless task of bibliography.
Wouldn&apos;t it be nice if citing something was as easy as bookmarking it?
Or, if you could read what your colleagues were reading via an RSS feed?&lt;/p&gt;

&lt;p&gt;While I haven&apos;t played with &lt;a href=&quot;http://www.zotero.org/&quot;&gt;Zotero&lt;/a&gt;
yet -- and I use the Konqueror browser, not Firefox -- I share this
vision and hope to see it become a reality. And since I recently posted
about my Freemind Extract tool (for transforming a mindmap into a
bibliography), I realize I haven&apos;t spoken of the flipside in a couple of
years: absorbing information. But first, a historical digression.&lt;/p&gt;

&lt;p&gt;The way I make note of and annotate resources and tasks evolved out
of two practices at the W3C. The first was a decree by &lt;a href=&quot;http://en.wikipedia.org/wiki/Berners-Lee&quot;&gt;Timbl&lt;/a&gt;,
which I objected to strongly at the time: the great datespace shift of
1999. Because the W3C&apos;s root file/name space was getting too crowded,
Tim&apos;s new policy forbade new top-level spaces like&amp;nbsp;&lt;code&gt;www.w3.org/Signature&lt;/code&gt; or&amp;nbsp;&lt;code&gt;www.w3.org/Encryption&lt;/code&gt;.
There were too many already and who were we to lay claim to such spaces
for all time? There might be a new digital signature activity 10 years
from now, so where would they live? (Consequently, the subsequent key
management working group received&amp;nbsp;&lt;code&gt;www.w3.org/2001/XKMS&lt;/code&gt;.)
I appreciated this concern at the root level, but cringed at only being
able to organize other files by date of creation. Try finding a
document you wrote a couple of years ago in a space that is no more structured
than&amp;nbsp;&lt;code&gt;/2001/{01,..,12}&lt;/code&gt; and is shared by 50+ other
people. It&apos;s not easy. I realized the only way I could keep track of
things I had worked on was to have a log of events and documents I
cared about. (This shift also affected how we collaborated in our
shared space given issues of ownership, access controls, and version
management -- but perhaps more on that another time.)&lt;/p&gt;

&lt;p&gt;The second W3C practice was that each of its hosts (worksites) had a
weekly meeting at which we shared the important events of the past week
and raised agenda issues for common discussion. To make it easier for
the minute takers, we e-mailed our two minutes to an e-mail list, and a bot
would collect them into draft minutes, which would then be augmented with the
IRC log.&lt;/p&gt;

&lt;p&gt;Preparing my two minutes before 10 a.m. Tuesday morning always
seemed more frantic than it needed to be. But once I started keeping a log
of what I had done as a result of the datespace shift, it became
trivial. (In fact, I wrote a script to grab the past week
automatically, and even generated an RSS feed from the work log so that
one could &quot;subscribe&quot; to my work log by keyword/task -- anticipating
RSS feeds of tagged bookmarks.)&lt;/p&gt;

&lt;p&gt;By 2002 I had tired of manually logging events, via an HTML editor,
to my personal blog and work log, so I wrote a specification for a
dream tool: &lt;a href=&quot;http://www.w3.org/2002/08/busy-spunge.html&quot;&gt;Busy Sponge&lt;/a&gt;. It would soak up everything I touched of importance and send it to the right place. I opted for a command-line tool I named &lt;code&gt;b.py&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Returning to today, a challenge I&apos;m sure I share with
the&amp;nbsp;Zotero folks is how to automatically scrape as much metadata
as possible from a Web resource. Busy Sponge continues to be the primary
way I input data into my work log and mind maps. Because metadata is no
more common or standard on the Web than it was five years ago, I am
dependent on screen-scraping heuristics. For example, the following
code allows me to easily capture and cite messages of Wikipedia mailing
lists -- and that is why it was such a hassle when the archives &lt;a href=&quot;http://reagle.org/joseph/blog/method/broken-lists?showcomments=yes&quot;&gt;broke&lt;/a&gt;:&lt;/p&gt;

&lt;pre style=&quot;font-size:9pt&quot;&gt;
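# Excerpt from a longer if/elif chain; assumes `re` is imported and that
# `url` and `html` hold the page address and its source.
elif url.startswith(&quot;http://marc.theaimsgroup.com/&quot;):
	# The sender is usually a link; otherwise take the rest of the line.
	try:
		author = re.search(&apos;&apos;&apos;From: *&amp;lt;a href=&quot;.*?&quot;&amp;gt;(.*?)&amp;lt;/a&amp;gt;&apos;&apos;&apos;, html).group(1)
	except AttributeError:
		author = re.search(&apos;&apos;&apos;From: *(.*)&apos;&apos;&apos;, html).group(1)
	# Undo the archive&apos;s address obfuscation and entity escaping.
	author = author.replace(&apos; () &apos;,&apos;@&apos;).replace(&apos; ! &apos;,&apos;.&apos;)\
		.replace(&apos;&amp;amp;lt;&apos;, &apos;&amp;lt;&apos;).replace(&apos;&amp;amp;gt;&apos;, &apos;&amp;gt;&apos;)
	# Keep only the display name, without the address or quotes.
	author = author.split(&apos; &amp;lt;&apos;)[0]
	author = author.replace(&apos;&quot;&apos;,&apos;&apos;)

	# The list name and the date are both rendered as links.
	mlist = re.search(&apos;&apos;&apos;List: *&amp;lt;a href=&quot;.*?&quot;&amp;gt;(.*?)&amp;lt;/a&amp;gt;&apos;&apos;&apos;, html).group(1)

	mdate = re.search(&apos;&apos;&apos;Date: *&amp;lt;a href=&quot;.*?&quot;&amp;gt;(.*?)&amp;lt;/a&amp;gt;&apos;&apos;&apos;, html).group(1)
	...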
&lt;/pre&gt;

&lt;p&gt;Unfortunately, beyond a couple of mailing list archives and wikis --
which, fortunately, are the majority of what I grab -- I have to
manually edit my sponges with proper meta/bibliographic data. And
curses upon those bloggers who make it difficult to determine the
author of an article or even the whole blog -- even a pseudonym will
do! Beyond the usage of my tool, I can imagine much value in a social
tool that allows users to share annotations, or even screen-scraping
&quot;plug-ins.&quot; One can hope!&lt;/p&gt;</content>
</entry>
</feed>
