Open Codex technology :: python

2010 May 17 | Sentence Case in Bibliography

As a PhD student, one of the first bibliographic annoyances I encountered was having to format a paper using the APA system, which requires titles to be in sentence case. This means only the first word of a phrase and proper nouns are capitalized. Previously, I had kept the titles of my citations in title case. The consequence of using the APA format was that I had to go in and manually lowercase every word in my bibliographic database that was not a proper noun. However, once this work was done, I realized keeping my data in sentence case was preferable, as title case essentially loses information. Yet this still requires me to manually lowercase some words for automatically captured sources. I am not aware of any bibliographic software that handles this issue well, and the good folks at Zotero have an interesting bug ticket open on the issue.

On Friday, while I was doing my weekly fixes to the automatically captured sources in my field notes/mindmap/bibliography, I thought to myself that there are plenty of word lists around, such as those used by spellcheckers, so couldn't I finally automate this menial task? However, I knew that I use lots of proper nouns that probably do not appear in common dictionaries. Therefore, I applied Python's Natural Language Toolkit tokenizer and part-of-speech tagger to the text of my dissertation to create a custom word list of the proper nouns that I use. These are used with the dictionary found on my system at /usr/share/dict/american-english to transform a title-cased sentence into a sentence-cased sentence. Basically, if a word is in my custom list, is in the word list only as a capitalized word, or is not in the word list at all, it merits capitalization; otherwise it is lowercased. The code is available as a module to the Busy Sponge component of the Thunderdell bibliographic tools. It works fairly well and will certainly make that end-of-the-week menial task all the easier.
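The capitalization rule can be sketched roughly as follows. This is a minimal illustration, not the Thunderdell module itself: the function name `sentence_case`, the tiny `dictionary` set (standing in for /usr/share/dict/american-english), and the `proper_nouns` set (standing in for the custom list) are all assumptions made for the example.

```python
import string


def sentence_case(title, dictionary, proper_nouns):
    """Lowercase the words of a title-cased string unless they merit a capital."""
    result = []
    for position, word in enumerate(title.split()):
        bare = word.strip(string.punctuation)  # look words up without punctuation
        keep_capital = (
            position == 0                      # first word of the title
            or bare in proper_nouns            # in the custom proper-noun list
            or bare.lower() not in dictionary  # capitalized-only or unknown word
        )
        result.append(word if keep_capital else word.lower())
    return " ".join(result)


# Hypothetical sample data; a real run would load a full word list instead.
dictionary = {"a", "history", "of", "the", "in", "encyclopedia", "English"}
proper_nouns = {"Wikipedia"}

print(sentence_case("A History Of The Encyclopedia In English Wikipedia",
                    dictionary, proper_nouns))
```

Because "English" appears in the sample word list only in capitalized form, it keeps its capital, while "History" and "Encyclopedia" are lowercased.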

this entry posted to technology/python;
comments (0)

2010 Apr 06 | Indexing a Book

One of the last significant steps for an author is to compile an index, unless they opt to have someone else do it. Publishers often recommend authors create the index as they know the material best. However, while professionals have sophisticated -- but proprietary -- tools to help them, authors are offered only the techniques of using index cards or spreadsheets. Neither of these options is appealing to me.

I thought it would be nice to simply compile a list of entries in the form of topic (page#|see) (sub)topic and let a script do the rest. It's a bit of a hack, but it does the job and can collapse all subentries below a particular threshold. I like having specified subentries, even if there are only one or two of them for a particular entry:

Apology
    and leadership, 124
    "Sorry but...," 54

But if the publisher says they want those collapsed, it is easy enough:

Apology, 54, 124
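The collapsing step might look something like the sketch below. It is not the posted script: the input here is assumed to be already parsed into (topic, subtopic, page) tuples, and the function name `format_index` and the `threshold` parameter are hypothetical.

```python
from collections import defaultdict


def format_index(entries, threshold=0):
    """Return index lines; topics with fewer than `threshold` subentries collapse."""
    topics = defaultdict(lambda: defaultdict(set))
    for topic, subtopic, page in entries:
        topics[topic][subtopic].add(page)

    lines = []
    for topic in sorted(topics, key=str.lower):
        subentries = topics[topic]
        # Sort named subentries, ignoring a leading quotation mark.
        named = sorted((s for s in subentries if s is not None),
                       key=lambda s: s.lstrip('"').lower())
        if len(named) < threshold:
            # Collapse: merge every page number onto the topic line.
            pages = sorted(set().union(*subentries.values()))
            lines.append(", ".join([topic] + [str(p) for p in pages]))
            continue
        own_pages = sorted(subentries.get(None, ()))
        lines.append(", ".join([topic] + [str(p) for p in own_pages]))
        for sub in named:
            pages = ", ".join(str(p) for p in sorted(subentries[sub]))
            lines.append(f"    {sub}, {pages}")
    return lines


entries = [
    ("Apology", "and leadership", 124),
    ("Apology", '"Sorry but..."', 54),
]
print("\n".join(format_index(entries)))               # subentries kept
print("\n".join(format_index(entries, threshold=3)))  # subentries collapsed
```

Keeping the subentries in the data and collapsing only at output time means the publisher's preference can change without touching the entry list.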


2010 Feb 16 | Diffing Word Files

For the most part, I wrote my dissertation and book manuscript using a simplified version of markdown complemented with biblatex citations. Because it was a simple text file, managing the edits to the manuscript was very easy. I could do global textual replacements trivially. Also, obviously, it was trivial to generate PDFs, HTML, etc. Using Mercurial, I could take advantage of some nice features like the "attic" extension, which allows me to keep change sets on the side to be applied only when appropriate. So, for example, the changes necessary to generate HTML were kept in the attic and would only be applied when I wanted that.

Unfortunately, once the manuscript went into the MIT Press system, I had to use Microsoft Word. As much as the Word document format annoys me, I understand it is widely used, and I can't think of an easy alternative that also provides the capability for editorial annotations. Nonetheless, I had a difficult time seeing changes in Microsoft Word, and I wanted to backport the changes into my source files. And there does not appear to be a nice textual difference tool for Word documents.

I have posted a small Python script that makes use of antiword and dwdiff but also gives me context on either side of the change. It, of course, doesn't work well with formatting, but is useful and will generate output like the following:

   reflects {-the-} [+a+] stabilization
   a {-number of pragmatic questions: it-} 
     [+project was conceived. It+] would
   there {-will-} [+would+] be
   article {-will-} [+would+] be
   linked {-to from-} [+via+] a
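For readers without antiword and dwdiff at hand, output of this general shape can be approximated with the standard library. The sketch below swaps in Python's difflib for dwdiff and assumes the text has already been extracted from the Word files; the function name `word_diff` and the `context` parameter are hypothetical, and it is not the posted script.

```python
import difflib


def word_diff(old_text, new_text, context=1):
    """Return one line per change, with `context` words on either side."""
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(None, old, new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        parts = old[max(i1 - context, 0):i1]      # words before the change
        if i2 > i1:                               # deleted words
            parts.append("{-" + " ".join(old[i1:i2]) + "-}")
        if j2 > j1:                               # inserted words
            parts.append("[+" + " ".join(new[j1:j2]) + "+]")
        parts.extend(old[i2:i2 + context])        # words after the change
        changes.append(" ".join(parts))
    return changes


for line in word_diff("this reflects the stabilization",
                      "this reflects a stabilization"):
    print(line)
```

Like the original, this compares words only and knows nothing about formatting.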


2010 Jan 12 | Thunderdell v1.2

I have tagged a new release of Thunderdell, the Freemind mindmap to biblatex utilities. Improvements include:


2009 Aug 21 | Shared Clipboard and Chicago Page Ranges

I use VirtualBox to run a Windows guest, and unfortunately the shared clipboard between the two is sometimes buggy. I recently posted a script for a very robust clipboard using a network shared file. Also, I'm doing the final checks on the book manuscript, and the Chicago Manual of Style has odd and confusing rules for specifying ranges of page numbers. The CMS page range validator looks through my source files and prints out any page ranges likely to be counter to Chicago style.
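To give a sense of what such a check involves, here is a rough sketch of the abbreviation rules as I understand them from the Manual; it is not the posted validator and should be checked against the Manual itself. The function name `chicago_range` is hypothetical, and a plain hyphen stands in for the en dash.

```python
def chicago_range(first, last):
    """Format an inclusive page range with an abbreviated second number."""
    if last <= first:
        raise ValueError("last page must follow first page")
    start, end = str(first), str(last)
    # Below 100, on a multiple of 100, or when the second number has more
    # digits: write the second number in full.
    if first < 100 or first % 100 == 0 or len(start) != len(end):
        return f"{start}-{end}"
    # Otherwise keep only the digits that changed...
    shared = 0
    while start[shared] == end[shared]:
        shared += 1
    changed = end[shared:]
    # ...but use at least two digits unless the first number ends in 01-09.
    if first % 100 >= 10 and len(changed) < 2:
        changed = end[-2:]
    return f"{start}-{changed}"


for pages in [(71, 72), (100, 104), (101, 108), (321, 328), (1496, 1500)]:
    print(chicago_range(*pages))
```

A validator could then compare each range found in the source files against this canonical form and print the ones that differ.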


2009 Mar 13 | BusySponge 0.5

In 2002 I began thinking about how to best capture and share the many web-pages and small tasks of the day. I thought of it as a "busy sponge": logging bookmarks and tasks to my team page with a minimum of typing. Furthermore, I wanted to tag each entry with a keyword which could then be used in queries. I posted an implementation in 2003 which was complemented by the fact that the tasks on my team page were syndicated (via RSS) -- and used to generate my "two minute reports" at the weekly staff meeting at MIT. This was a number of years before the notion of micro-blogging became popular.

Two interesting features have further matured: I wanted it to fetch the title of a URL -- typing HTML was a hassle -- and I wanted to tell it which of my pages to log the entry to: my personal weblog or work team page. For example:

urd:/home/reagle > b http://pesto.redgecko.org/dispatch.html j python Noted ^

is a sponge of a URL to my "j" (work) page where "^" becomes the hypertextual page title, resulting in:

<li class="event" id="e090313-f7fd">090313: python] <a href="http://pesto.redgecko.org/dispatch.html">Noted URL dispatch &mdash; Pesto: a library for WSGI applications</a></li>

With BusySponge 0.5 (now distributed as part of Thunderdell), this has matured into a set of classes for webpage screen scraping and a set of logger functions. So, for example, I might sponge a comment about a URL and indicate it should log it to my bibliographic mindmap (Thunderdell) and it will do its best to fetch the page author, title, date, publisher, permanent link, excerpt of first substantive paragraph, etc. The default heuristics do a surprisingly decent job -- certainly better than typing it from scratch -- and the specific scrapers (e.g., Wikipedia, MARC email archives) are quite good.
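The title substitution can be illustrated with a small sketch. This is not BusySponge's scraper: the names `TitleParser`, `page_title`, and `expand_comment` are hypothetical, and the HTML is passed in as a string rather than fetched from the network.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collect the text found inside a page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title_parts.append(data)


def page_title(html):
    """Return the page title with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.title_parts).split())


def expand_comment(comment, html):
    """Replace "^" in a sponge comment with the page's title."""
    return comment.replace("^", page_title(html))


sample = "<html><head><title> Example\n page </title></head><body></body></html>"
print(expand_comment("Noted ^", sample))
```

Site-specific scrapers would replace `page_title` with logic tuned to that site's markup, falling back to a generic heuristic like this one.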


2009 Jan 22 | Thunderdell 1.0 (was: 'FreeMind Extract')

I'm releasing the latest set of Freemind Bibliographic Extraction scripts. I'm calling it a "1.0" release because:

  1. I decided to give it a funny name.
  2. I now address an unlikely but long-time screw case.
  3. This and other cases are now tested by doctests.
  4. I updated the generation of bibliographic keys to remove 'and' from the author portion and always include a title suffix -- instead of just when there is a collision. Keys are now a bit more terse and more stable.
  5. I now emit biblatex, a much more complete and powerful bibliographic format.

It also now has its own webpage, from which you can download it.


2008 Sep 27 | Python Class Tools

So much time in teaching is spent on trivial but time-consuming tasks. Having taught for a couple of years now, some things have gotten easier, but many chores remain. I've tried various free tools, and there are many proprietary solutions, but none satisfy. The following are personal -- and poorly documented -- hacks that are useful to me.

Grade sheet

I used to use a sophisticated Excel spreadsheet I found on the Web, but the author has since taken it proprietary/commercial, plus it didn't support a simple point system (e.g., 100 points allocated throughout the semester). Also, my sheet now accommodates some of my idiosyncratic policies for class responses and attendance (e.g., dropping the lowest X grades, Y freebie absences).
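The two policies mentioned can be expressed in a few lines. This is only an illustration of the arithmetic, not the sheet's formulas; the function names and the `points_per_absence` parameter are assumptions.

```python
def semester_points(scores, drop_lowest=0):
    """Sum point scores after discarding the `drop_lowest` smallest ones."""
    kept = sorted(scores)[drop_lowest:]
    return sum(kept)


def attendance_penalty(absences, freebies=0, points_per_absence=1):
    """Charge points only for absences beyond the free allowance."""
    charged = max(absences - freebies, 0)
    return charged * points_per_absence


print(semester_points([8, 10, 0, 9], drop_lowest=1))
print(attendance_penalty(absences=3, freebies=2))
```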

Grade Reports

This Python script makes use of the xlrd library to read an instance of the above and generate a report for the whole class or a single student, to the console or as messages in my drafts mailbox.

Class Calendar

Given the duration of the semester, the days on which a class meets, university holidays, and the class on which an assignment is due, this generates a calendar for the syllabus. (A surprisingly time-consuming manual task.) I hope to incorporate Jewish holidays into this if I can find such a library.
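The core of such a calendar can be sketched with only the standard library. This is not the script itself: the function name `class_meetings`, its parameters, and the sample dates are hypothetical, and weekdays follow Python's convention of Monday being 0.

```python
from datetime import date, timedelta


def class_meetings(start, end, weekdays, holidays=()):
    """List the dates from start to end that fall on a meeting day, skipping holidays."""
    skipped = set(holidays)
    meetings = []
    day = start
    while day <= end:
        if day.weekday() in weekdays and day not in skipped:
            meetings.append(day)
        day += timedelta(days=1)
    return meetings


# A Monday/Wednesday class with one holiday on the first Monday.
meetings = class_meetings(date(2007, 9, 3), date(2007, 12, 12),
                          weekdays={0, 2}, holidays=[date(2007, 9, 3)])
for number, meeting in enumerate(meetings, start=1):
    print(number, meeting.strftime("%b %d %a"))
```

Numbering the meetings makes it easy to attach an assignment to "class 12" and have its date follow automatically when the holidays change.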

Mailbox Prettyprint

Instead of using the inaccessible and proprietary BlackBoard product, I ask students to e-mail their responses which are automatically sorted into a mailbox. Before class, this script will prettyprint all responses sent since the last class to an HTML page, including (now) MS Word attachments.


2007 Jun 20 | Creating a Semester's Class Schedule

I just discovered Python's awesome dateutil package which implements much of the iCalendar standard, including recurrences! Consequently, it's trivial to generate a calendar for the days classes meet. I assume with a little work one could even handle the holidays. In any case, here's an example:

#!/usr/bin/env python3

from dateutil.parser import parse
from dateutil.rrule import MO, SU, WE, WEEKLY, rrule

sem_start = '20070903T140000'
sem_end = '20071212T140000'
days = (MO, WE)  # weekdays on which the class meets

# One datetime per class meeting between the semester's start and end.
meetings = list(rrule(WEEKLY, wkst=SU, byweekday=days,
    dtstart=parse(sem_start), until=parse(sem_end)))
for meeting in meetings:
    print(meeting.strftime("%b %d %a"))


2007 Feb 08 | ZotZero and BusySponge

I have been reading of ZotZero in Josh's blog and am hopeful that it will help bridge the gap between the dynamic and informal life of the Web (e.g., reading, blogging, bookmarks, RSS, etc.) and the seemingly lifeless task of bibliography. Wouldn't it be nice if citing something was as easy as bookmarking it? Or, if you could read what your colleagues were reading via an RSS feed?

While I haven't played with ZotZero yet -- and I use the Konqueror browser, not Firefox -- I share this vision and hope to see it become a reality. And since I recently posted about my Freemind Extract tool (for transforming a mindmap into a bibliography), I realize I haven't spoken of the flipside in a couple of years: absorbing information. But first, a historical digression.

The way I make note of and annotate resources and tasks evolved out of two practices at the W3C. The first was a decree by Timbl which I objected to strongly at the time: the great datespace shift of 1999. Because the W3C's root file/name space was getting too crowded, Tim's new policy forbade new top-level spaces like www.w3.org/Signature or www.w3.org/Encryption. There were too many already, and who were we to lay claim to such spaces for all time? There might be a new digital signature activity 10 years from now, so where would they live? (Consequently, the subsequent key management working group received www.w3.org/2001/XKMS.) I appreciated this concern at the root level, but cringed at only being able to organize other files by date of creation. Try finding a document you wrote a couple of years ago in a space that is no more structured than /2001/{01,..,12} and is shared by 50+ other people. It's not easy. I realized the only way I could keep track of things I had worked on was to keep a log of the events and documents I cared about. (This shift also affected how we collaborated in our shared space given issues of ownership, access controls, and version management -- but perhaps more on that another time.)

The second W3C practice was that each of its hosts (worksites) had a weekly meeting at which we shared the important events of the past week and raised agenda issues for common discussion. To make it easier for the minute takers we e-mailed two minutes to an e-mail list and a bot would collect them into draft minutes which would be augmented with the IRC log.

Preparing my two minutes before 10 a.m. Tuesday morning always seemed more frantic than it needed to be. But once I started keeping a log of what I had done as a result of the datespace shift, it became trivial. (In fact, I wrote a script to grab the past week automatically, and even generated an RSS feed from the work log so that one could "subscribe" to my work log by keyword/task -- anticipating RSS feeds of tagged bookmarks.)

By 2002 I had tired of manually logging events, via an HTML editor, to my personal blog and work log, so I wrote a specification for a dream tool: Busy Sponge. It would soak up everything of importance that I touched and send it to the right place. I opted for a command-line tool I named b.py.

Returning to today, a challenge I'm sure I share with the ZotZero folks is how to automatically scrape as much metadata as possible from a Web resource. Busy Sponge continues to be the primary way I input data into my work log and mind maps. Because metadata is no more common or standard on the Web than it was five years ago, I am dependent on screen-scraping heuristics. For example, the following code allows me to easily capture and cite messages of Wikipedia mailing lists -- and that is why it was such a hassle when the archives broke:

elif url.startswith("http://marc.theaimsgroup.com/"):
    try:
        author = re.search('''From: *(.*?)''', html).group(1)
    except AttributeError:
        author = re.search('''From: *(.*)''', html).group(1)
    author = author.replace(' () ','@').replace(' ! ','.')\
        .replace('&lt;', '<').replace('&gt;', '>')
    author = author.split(' <')[0]
    author = author.replace('"','')

    mlist = re.search('''List: *(.*?)''', html).group(1)

    mdate = re.search('''Date: *(.*?)''', html).group(1)
    ...
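A self-contained variant of the same heuristic, runnable outside of b.py, might look like the sketch below. The function name `scrape_marc`, the line-anchored patterns, and the sample page text are assumptions made for illustration; the real archive pages and the original patterns may well differ.

```python
import re


def scrape_marc(html):
    """Pull author, list name, and date from the header lines of an archived message."""

    def field(name):
        # Capture the rest of the line following "Name:".
        match = re.search(rf'^{name}: *(.*)$', html, re.MULTILINE)
        return match.group(1).strip() if match else None

    author = field('From')
    if author is not None:
        # Undo the archive's address obfuscation and HTML escaping,
        # then keep only the display name.
        author = (author.replace(' () ', '@').replace(' ! ', '.')
                  .replace('&lt;', '<').replace('&gt;', '>'))
        author = author.split(' <')[0].replace('"', '')
    return {'author': author, 'list': field('List'), 'date': field('Date')}


# Hypothetical sample of the header portion of an archived message.
sample = '''List:       wikien-l
Subject:    Example thread
From:       "Jane Doe" &lt;jane () example ! org&gt;
Date:       2007-02-08 14:00:00
'''
print(scrape_marc(sample))
```

Returning `None` for a missing field, rather than raising, lets the caller fall back to manual entry for just that field.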

Unfortunately, beyond a couple of mailing list archives and wikis -- which, fortunately, are the majority of what I grab -- I have to manually edit my sponges with proper meta/bibliographic data. And curses upon those bloggers who make it difficult to determine the author of an article or even the whole blog -- even a pseudonym will do! Beyond the usage of my tool, I can imagine much value in a social tool that allows users to share annotations, or even screen-scraping "plug-ins." One can hope!


Open Communities, Media, Source, and Standards XML

by Joseph Reagle
