<?xml version="1.0" encoding="iso-8859-1"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<title type="text">Joseph Reagle</title>
<subtitle type="html"><![CDATA[
Open Communities, Media, Source, and Standards
]]></subtitle>
<id>http://reagle.org/joseph/blog/technology/python/busysponge-0.5</id>
<link rel="alternate" type="text/html" href="http://reagle.org/joseph/blog" />
<link rel="self" type="application/atom+xml" href="http://reagle.org/joseph/blog/technology/python/busysponge-0.5?flav=atom" />


<author>
<name>Joseph Reagle</name>
<uri>http://reagle.org/joseph/blog/technology/python/busysponge-0.5</uri>
<email></email>
</author>
<rights>Copyright 2003-2010 Joseph Reagle</rights>
<generator uri="http://pyblosxom.sourceforge.net/" version="1.4.3 01/10/2008">
PyBlosxom http://pyblosxom.sourceforge.net/ 1.4.3 01/10/2008
</generator>

<updated>2009-03-13T17:15:02Z</updated>

<entry>
<title type="html">BusySponge 0.5</title>
<category term="" />
<id>http://reagle.org/joseph/blog/2009/03/13/busysponge-0.5</id>
<updated>2009-03-13T17:15:02Z</updated>
<published>2009-03-13T17:15:02Z</published>
<link rel="alternate" type="text/html" href="http://reagle.org/joseph/blog/technology/python/busysponge-0.5.html" />
<content type="html">

&lt;p&gt;In &lt;a href=&quot;http://www.w3.org/2002/08/busy-spunge.html&quot;&gt;2002 I began
thinking&lt;/a&gt; about how best to capture and share the many web pages and small
tasks of the day. I thought of it as a &quot;busy sponge&quot;: logging bookmarks and
tasks to my team page with a minimum of typing. Furthermore, I wanted to tag
each entry with a keyword that could then be used in queries. I posted an &lt;a
href=&quot;http://lists.w3.org/Archives/Public/www-rdf-interest/2003Jan/0126.html&quot;&gt;implementation
in 2003&lt;/a&gt;. It was complemented by the fact that the tasks on my team page
were syndicated (via RSS) and used to generate my &quot;two minute reports&quot; at
the weekly staff meeting at MIT. This was a number of years before the notion
of micro-blogging became popular.&lt;/p&gt;

&lt;p&gt;Two features have matured since then. I wanted &lt;em&gt;it&lt;/em&gt; to fetch
the title of a URL, since typing HTML was a hassle, and I wanted to tell it which
of my pages should receive the entry: my personal weblog or my work team page. For
example:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;urd:/home/reagle &amp;gt; b http://pesto.redgecko.org/dispatch.html j
python Noted ^ &lt;/code&gt; &lt;/p&gt;

&lt;p&gt;sponges a URL to my &quot;j&quot; (work) page with the keyword &quot;python&quot;; the
&quot;^&quot; is replaced by the page's title in the link text, resulting in:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;&amp;lt;li class=&quot;event&quot; id=&quot;e090313-f7fd&quot;&amp;gt;090313: python] &amp;lt;a
href=&quot;http://pesto.redgecko.org/dispatch.html&quot;&amp;gt;Noted URL dispatch
&amp;amp;mdash; Pesto: a library for WSGI applications&amp;lt;/a&amp;gt;&amp;lt;/li&amp;gt; &lt;/code&gt;
&lt;/p&gt;
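
&lt;p&gt;The expansion above can be sketched in a few lines of Python. This is a
minimal illustration rather than BusySponge's actual code: the names
&lt;code&gt;Sponge&lt;/code&gt;, &lt;code&gt;parse_sponge&lt;/code&gt;, &lt;code&gt;expand_title&lt;/code&gt;,
and &lt;code&gt;entry_parts&lt;/code&gt; are hypothetical, and the page title is assumed
to have been fetched already.&lt;/p&gt;

&lt;pre&gt;
```python
from datetime import date
from typing import NamedTuple


class Sponge(NamedTuple):
    """One command line: URL, destination page, keyword, comment."""
    url: str
    page: str      # hypothetical one-letter key, e.g. "j" for the work page
    keyword: str
    comment: str


def parse_sponge(args):
    """Split ['URL', 'page', 'keyword', 'word', ...] into a Sponge."""
    # Unpacking raises ValueError if fewer than three arguments are given.
    url, page, keyword, *words = args
    return Sponge(url, page, keyword, " ".join(words))


def expand_title(comment, title):
    """Replace the first '^' in the comment with the fetched page title."""
    return comment.replace("^", title, 1)


def entry_parts(sponge, title, today):
    """Return (date stamp, keyword, link text) for one log entry."""
    stamp = today.strftime("%y%m%d")  # e.g. 2009-03-13 becomes "090313"
    return stamp, sponge.keyword, expand_title(sponge.comment, title)


# Usage with the command shown above; the title string is a stand-in.
sponge = parse_sponge("http://pesto.redgecko.org/dispatch.html j python Noted ^".split())
print(entry_parts(sponge, "URL dispatch", date(2009, 3, 13)))
```
&lt;/pre&gt;

&lt;p&gt;Replacing only the first &quot;^&quot; is a choice made for this sketch; how
repeated markers should be treated is left open here.&lt;/p&gt;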

&lt;p&gt;With BusySponge 0.5 (now &lt;a href=&quot;/joseph/2009/01/thunderdell.html&quot;&gt;distributed&lt;/a&gt; as part of Thunderdell), this has
matured into a set of classes for web page screen scraping and a set of logger
functions. So, for example, I might sponge a comment about a URL and tell it
to log the entry to my bibliographic mindmap (&lt;a
href=&quot;http://reagle.org/joseph/2009/01/thunderdell.html&quot;&gt;Thunderdell&lt;/a&gt;);
it will then do its best to fetch the page author, title, date, publisher, permanent
link, an excerpt of the first substantive paragraph, etc. The default heuristics do a
surprisingly decent job, certainly better than typing it from scratch, and
the specific scrapers (e.g., Wikipedia, MARC email archives) are quite good.&lt;/p&gt;
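
&lt;p&gt;One way to arrange default heuristics alongside site-specific scrapers is a
base class with overridable methods, chosen by host name. The sketch below
assumes that design; it is not taken from the BusySponge source. The class names,
the &lt;code&gt;SCRAPERS&lt;/code&gt; table, and the &lt;code&gt;fields&lt;/code&gt; dictionary of
already-extracted metadata are all hypothetical.&lt;/p&gt;

&lt;pre&gt;
```python
from urllib.parse import urlparse


class DefaultScraper:
    """Fallback scraper: generic guesses intended to work on most pages."""

    def __init__(self, url, fields):
        self.url = url
        self.fields = fields  # hypothetical dict of already-extracted metadata

    def publisher(self):
        # Generic heuristic: fall back to the host name as the publisher.
        return urlparse(self.url).hostname or "unknown"

    def record(self):
        """Collect the values that a logger function would receive."""
        return {
            "url": self.url,
            "title": self.fields.get("title", "untitled"),
            "publisher": self.publisher(),
        }


class WikipediaScraper(DefaultScraper):
    """Site-specific scraper: overrides only what the site makes certain."""

    def publisher(self):
        return "Wikipedia"


# Host-name suffixes mapped to specific scrapers (illustrative entry only).
SCRAPERS = {"wikipedia.org": WikipediaScraper}


def scraper_for(url, fields):
    """Pick a specific scraper by host name, else the default one."""
    host = urlparse(url).hostname or ""
    for suffix, scraper_class in SCRAPERS.items():
        if host == suffix or host.endswith("." + suffix):
            return scraper_class(url, fields)
    return DefaultScraper(url, fields)
```
&lt;/pre&gt;

&lt;p&gt;In this arrangement, supporting a new site means adding one subclass and one
table entry, while every other page still gets the default guesses.&lt;/p&gt;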
</content>
</entry>
</feed>
