2010 Feb 16 | Diffing Word Files
For the most part, I wrote my dissertation and book manuscript using a simplified version of markdown complemented with biblatex citations. Because the source was a simple text file, managing edits to the manuscript was very easy: I could do global textual replacements trivially, and it was just as trivial to generate PDFs, HTML, etc. Using Mercurial, I could take advantage of some nice features like the "attic" extension, which allows me to keep change sets on the side to be applied only when appropriate. So, for example, the changes necessary to generate HTML were kept in the attic and applied only when I wanted that output.
Unfortunately, once the manuscript went into the MIT Press system, I had to use Microsoft Word. As much as the Word document format annoys me, I understand it is widely used, and I can't think of an easy alternative that also provides the capability for editorial annotations. Nonetheless, I had a difficult time seeing changes in Microsoft Word, and I wanted to backport the changes into my source files. And there does not appear to be a nice textual difference tool for Word documents.
I have posted a small Python script that makes use of antiword and dwdiff but also gives me context on either side of the change. It, of course, doesn't work well with formatting, but is useful and will generate output like the following:
reflects {-the-} [+a+] stabilization
a {-number of pragmatic questions: it-} [+project was conceived. It+] would
there {-will-} [+would+] be
article {-will-} [+would+] be
linked {-to from-} [+via+] a
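The idea behind that output can be sketched in a few lines. This is not the posted script: it assumes the two documents have already been converted to plain text (for example by antiword), and it swaps in Python's difflib for dwdiff. The function name `word_diff` and the `context` parameter are made up for illustration.

```python
import difflib

def word_diff(old_text, new_text, context=1):
    """Return each change as 'before {-old-} [+new+] after'.

    `context` is the number of unchanged words shown on either side.
    """
    old, new = old_text.split(), new_text.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        parts = old[max(0, i1 - context):i1]           # leading context
        if i2 > i1:                                     # words removed
            parts.append("{-" + " ".join(old[i1:i2]) + "-}")
        if j2 > j1:                                     # words added
            parts.append("[+" + " ".join(new[j1:j2]) + "+]")
        parts.extend(old[i2:i2 + context])              # trailing context
        changes.append(" ".join(parts))
    return changes

for change in word_diff("it is linked to from a page",
                        "it is linked via a page"):
    print(change)
```

Like the script described above, this ignores formatting entirely; it only compares the sequence of words.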
this entry posted to
technology/python;
comments (0)
2010 Jan 12 | Thunderdell v1.2
I have tagged a new release of Thunderdell, the Freemind mindmap to biblatex utilities. Improvements include:
- The ability to emit biblatex or BibTeX.
- More consistent/improved Unicode support, particularly in bibliographic keys.
- New feature for requesting either the long or short URL for Wikipedia sources.
- New feature for requesting that only URLs for exclusively online sources be emitted.
- New test for online sources.
- More biblatex fields, such as original publisher and year.
- Faster processing by using lxml, but can still fall back to etree.
- Use of optparse.
- The "essential node" function was removed.
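The lxml-with-fallback item above usually comes down to a guarded import. The following is a sketch of that common idiom, not Thunderdell's actual code; the sample map and the `node_texts` helper are hypothetical, and the only assumption about FreeMind's format is that each node's label lives in its TEXT attribute.

```python
try:
    from lxml import etree                    # faster, if installed
except ImportError:
    import xml.etree.ElementTree as etree     # standard-library fallback

# Hypothetical two-node map for illustration.
SAMPLE_MAP = """<map version="0.9.0">
  <node TEXT="Readings">
    <node TEXT="Giddens"/>
  </node>
</map>"""

def node_texts(xml_source):
    """Return the TEXT label of every node, in document order."""
    root = etree.fromstring(xml_source)
    return [node.get("TEXT") for node in root.iter("node")]

print(node_texts(SAMPLE_MAP))
```

Because both libraries share the ElementTree API for these calls, the rest of the code does not need to know which one was imported.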
2009 Aug 21 | Shared Clipboard and Chicago Page Ranges
I use VirtualBox to run a Windows guest, and unfortunately the shared clipboard between host and guest is sometimes buggy. I recently posted a script for a very robust clipboard that uses a file on a network share. Also, I'm doing the final checks on the book manuscript, and the Chicago Manual of Style has odd and confusing rules for specifying ranges of page numbers. The CMS page range validator looks through my source files and prints out any ranges likely to run counter to Chicago style.
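To give a flavor of those rules, here is a rough sketch of a range checker. It is not the posted validator; it encodes my reading of Chicago's abbreviation table for inclusive numbers (all digits when the first number is below 100 or a multiple of 100; only the changed part for 101 through 109 and the like; otherwise at least two digits), so check it against the manual before relying on it. The names `chicago_last` and `suspect_ranges` are made up.

```python
import re

def chicago_last(first, last):
    """Return how `last` should be written in the range first-last."""
    if last <= first:
        raise ValueError("a range must end after it starts")
    f, l = str(first), str(last)
    if first < 100 or first % 100 == 0 or len(l) > len(f):
        return l                      # use all digits
    i = 0
    while f[i] == l[i]:               # skip the shared leading digits
        i += 1
    changed = l[i:]
    if first % 100 < 10:              # 101-109, 201-209, ...
        return changed                # changed part only
    # 110-199, 210-299, ...: two digits unless more have changed
    return changed if len(changed) >= 2 else l[-2:]

# digits, then a hyphen, double hyphen, or en dash, then digits
RANGE = re.compile(r"\b(\d+)\s?(?:--?|\u2013)\s?(\d+)\b")

def suspect_ranges(text):
    """Return the ranges in `text` not written in the abbreviated style."""
    suspects = []
    for m in RANGE.finditer(text):
        first, written = m.group(1), m.group(2)
        if len(written) < len(first):         # expand 321-28 to 328
            full = first[:len(first) - len(written)] + written
        else:
            full = written
        if (int(full) <= int(first)
                or chicago_last(int(first), int(full)) != written):
            suspects.append(m.group(0))
    return suspects

print(suspect_ranges("pp. 321-328, 101-8 and 71-2"))
```

A pattern this loose will also flag things that are not page ranges (dates, for instance), which is why its output is better treated as a list of "likely" problems than as errors.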
2009 Mar 13 | BusySponge 0.5
In 2002 I began
thinking about how to best capture and share the many web-pages and small
tasks of the day. I thought of it as a "busy sponge": logging bookmarks and
tasks to my team page with a minimum of typing. Furthermore, I wanted to tag
each entry with a keyword which could then be used in queries. I posted an implementation
in 2003 which was complemented by the fact that the tasks on my team page
were syndicated (via RSS) -- and used to generate my "two minute reports" at
the weekly staff meeting at MIT. This was a number of years before the notion
of micro-blogging became popular.
Two interesting features have further matured: I wanted it to fetch
the title of a URL -- typing HTML was a hassle -- and I wanted to tell it which
of my pages to log the entry to: my personal weblog or work team page. For
example:
urd:/home/reagle > b http://pesto.redgecko.org/dispatch.html j
python Noted ^
is a sponge of a URL to my "j" (work) page where "^" becomes the
hypertextual page title, resulting in:
<li class="event" id="e090313-f7fd">090313: python] <a
href="http://pesto.redgecko.org/dispatch.html">Noted URL dispatch
— Pesto: a library for WSGI applications</a></li>
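The title substitution can be sketched roughly as follows. This is an illustration rather than BusySponge's code: it takes the page source as a string instead of fetching the URL, it leaves out the `id` attribute, and the names `TitleParser`, `page_title`, and `sponge` are hypothetical.

```python
from datetime import date
from html import escape
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

def page_title(page):
    """Return the page title with runs of whitespace collapsed."""
    parser = TitleParser()
    parser.feed(page)
    parser.close()
    return " ".join(parser.title.split())

def sponge(url, page, keyword, comment, when):
    """Build a log entry; '^' in the comment becomes the page title."""
    text = comment.replace("^", page_title(page))
    return '<li class="event">%s: %s] <a href="%s">%s</a></li>' % (
        when.strftime("%y%m%d"), escape(keyword), escape(url), escape(text))

sample = "<html><head><title>URL dispatch</title></head><body></body></html>"
print(sponge("http://pesto.redgecko.org/dispatch.html", sample,
             "python", "Noted ^", date(2009, 3, 13)))
```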
With BusySponge 0.5 (now distributed as part of Thunderdell), this has
matured into a set of classes for webpage screen scraping and a set of logger
functions. So, for example, I might sponge a comment about a URL and indicate
it should log it to my bibliographic mindmap (Thunderdell) and
it will do its best to fetch the page author, title, date, publisher, permanent
link, excerpt of first substantive paragraph, etc. The default heuristics do a
surprisingly decent job -- certainly better than typing it from scratch -- and
the specific scrapers (e.g., Wikipedia, MARC email archives) are quite good.
2009 Jan 22 | Thunderdell 1.0 (was: 'FreeMind Extract')
I'm releasing the latest set of Freemind Bibliographic Extraction scripts. I'm calling it a "1.0" release because:
- I decided to give it a funny name.
- I now address an unlikely but long-time screw case.
- This and other cases are now tested by doc_tests.
- I updated the generation of bibliographic keys to remove 'and' from the author portion and always include a title suffix -- instead of just when there is a collision. Keys are now a bit more terse and more stable.
- I now emit biblatex, a much more complete and powerful bibliographic format.
It also now has its own webpage, from which you can download it.
2008 Sep 27 | Python Class Tools
So much time in teaching is spent on trivial but time-consuming tasks. Having taught for a couple of years now, some things have gotten easier, but many chores remain. I've tried various free tools, and there are many proprietary solutions, but none satisfy. The following are personal -- and poorly documented -- hacks that are useful to me.
- Grade sheet: I used to use a sophisticated Excel spreadsheet I found on the Web, but the author has since taken it proprietary/commercial, plus it didn't support a simple point system (e.g., 100 points allocated throughout the semester). Also, my sheet now accommodates some of my idiosyncratic policies for class responses and attendance (e.g., dropping the lowest X grades, Y freebie absences).
- Grade Reports: This Python script makes use of the XLRD library to read an instance of the above, and generate a report for the whole class or a single student, to the console or as messages in my drafts mailbox.
- Class Calendar: Given the duration of the semester, the days on which a class meets, university holidays, and the class on which an assignment is due, this generates a calendar for the syllabus. (A surprisingly time-consuming manual task.) I hope to incorporate Jewish holidays into this if I can find such a library.
- Mailbox Prettyprint: Instead of using the inaccessible and proprietary BlackBoard product, I ask students to e-mail their responses which are automatically sorted into a mailbox. Before class, this script will prettyprint all responses sent since the last class to an HTML page, including (now) MS Word attachments.
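The grading policies mentioned for the grade sheet (dropping the lowest X grades, Y freebie absences) are simple to express in code. This is only a sketch of that arithmetic, not the spreadsheet's formulas; the function names, default values, and the one-point-per-absence penalty are assumptions.

```python
def response_points(scores, drop_lowest=1):
    """Sum the response scores after discarding the lowest ones."""
    kept = sorted(scores)[drop_lowest:]
    return sum(kept)

def absence_penalty(absences, freebies=2, per_absence=1):
    """Points lost to absences beyond the free ones (never negative)."""
    return max(0, absences - freebies) * per_absence

# A student with one missed response and four absences
print(response_points([3, 5, 4, 0]), absence_penalty(4))
```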
2007 Jun 20 | Creating a Semester's Class Schedule
I just discovered Python's awesome dateutil package which
implements much of the iCalendar standard, including recurrences!
Consequently, it's trivial to generate a calendar for the days classes
meet. I assume with a little work one could even handle the holidays.
In any case, here's an example:
    #!/usr/bin/env python3
    from dateutil.parser import parse
    from dateutil.rrule import rrule, WEEKLY, MO, WE, SU

    sem_start = '20070903T140000'
    sem_end = '20071212T140000'
    days = (MO, WE)  # weekdays on which the class meets
    # one datetime per class meeting, from the first day to the last
    meetings = list(rrule(WEEKLY, wkst=SU, byweekday=days,
                          dtstart=parse(sem_start), until=parse(sem_end)))
    for meeting in meetings:
        print(meeting.strftime("%b %d %a"))
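As for the holidays: dateutil's rruleset can exclude individual dates through its exdate method. The sketch below takes a different route and uses only the standard library, so it runs without extra packages. The `class_meetings` name is made up, and treating Labor Day (September 3, 2007) as a day off is an assumption for the sake of the example.

```python
from datetime import date, timedelta

def class_meetings(start, end, weekdays, holidays=()):
    """Yield each date from start to end (inclusive) that falls on one
    of `weekdays` (0 = Monday) and is not listed in `holidays`."""
    skip = set(holidays)
    day = start
    while day <= end:
        if day.weekday() in weekdays and day not in skip:
            yield day
        day += timedelta(days=1)

holidays = [date(2007, 9, 3)]   # assumed university holiday (Labor Day)
for meeting in class_meetings(date(2007, 9, 3), date(2007, 12, 12),
                              {0, 2}, holidays):   # Mondays and Wednesdays
    print(meeting.strftime("%b %d %a"))
```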
2007 Feb 08 | ZotZero and BusySponge
I have been reading of ZotZero in Josh's
blog and am hopeful that it will help bridge the gap between the
dynamic and informal life of the Web (e.g., reading, blogging,
bookmarks, RSS, etc.) and the seemingly lifeless task of bibliography.
Wouldn't it be nice if citing something was as easy as bookmarking it?
Or, if you could read what your colleagues were reading via an RSS feed?
While I haven't played with ZotZero
yet -- and I use the Konqueror browser, not Firefox -- I share this
vision and hope to see it become a reality. And since I recently posted
about my Freemind Extract tool (for transforming a mindmap into a
bibliography), I realize I haven't spoken of the flipside in a couple of
years: absorbing information. But first, a historical digression.
The way I make note of and annotate resources and tasks evolved out
of two practices at the W3C. The first was a decree by Timbl
which I objected to strongly at the time: the great datespace shift of
1999. Because the W3C's root file/name space was getting too crowded,
Tim's new policy forbade new top-level spaces like www.w3.org/Signature or www.w3.org/Encryption.
There were too many already and who were we to lay claim to such spaces
for all time? There might be a new digital signature activity 10 years
from now, so where would they live? (Consequently, the subsequent key
management working group received www.w3.org/2001/XKMS.)
I appreciated this concern at the root level, but cringed at only being
able to organize other files by date of creation. Try finding a
document you wrote a couple of years ago in a space that is no more
structured than /2001/{01,..,12} and is shared by 50+ other
people. It's not easy. I realized the only way I could keep track of
things I had worked on was to have a log of events and documents I
cared about. (This shift also affected how we collaborated in our
shared space given issues of ownership, access controls, and version
management -- but perhaps more on that another time.)
The second W3C practice was that each of its hosts (worksites) had a
weekly meeting at which we shared the important events of the past week
and raised agenda issues for common discussion. To make it easier for
the minute takers we e-mailed two minutes to an e-mail list and a bot
would collect them into draft minutes which would be augmented with the
IRC log.
Preparing my two minutes before 10 a.m. Tuesday morning always
seemed more frantic than it need be. But, once I started keeping a log
of what I had done as a result of the datespace shift, it became
trivial. (In fact, I wrote a script to grab the past week
automatically, and even generated an RSS feed from the work log so that
one could "subscribe" to my work log by keyword/task -- anticipating
RSS feeds of tagged bookmarks.)
By 2002 I had tired of manually logging events, via an HTML editor,
to my personal blog and work log, so I wrote a specification for a
dream tool: Busy Sponge. It would soak up everything I touched of importance and send it to the right place. I opted for a commandline tool I named b.py.
Returning to today, a challenge I'm sure I share with
the ZotZero folks is how to automatically scrape as much metadata as
possible from a Web resource. Busy Sponge continues to be the primary
way I input data into my work log and mind maps. Because metadata is no
more common or standard on the Web than it was five years ago, I am
dependent on screen scraping heuristics. For example, the following
code allows me to easily capture and cite messages of Wikipedia mailing
lists -- and that is why it was such a hassle when the archives broke:
    elif url.startswith("http://marc.theaimsgroup.com/"):
        try:
            author = re.search('''From: *(.*?)''', html).group(1)
        except AttributeError:
            author = re.search('''From: *(.*)''', html).group(1)
        # undo the archive's address obfuscation and HTML escaping
        author = author.replace(' () ','@').replace(' ! ','.')\
            .replace('&lt;', '<').replace('&gt;', '>')
        author = author.split(' <')[0]   # keep the name, drop the address
        author = author.replace('"','')
        mlist = re.search('''List: *(.*?)''', html).group(1)
        mdate = re.search('''Date: *(.*?)''', html).group(1)
        ...
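For readers who want to try the idea, here is a self-contained variant of the author cleanup. It is a sketch under assumptions, not the code from b.py: the sample page layout (a "From:" line holding a linked, HTML-escaped sender) is hypothetical, as is the `marc_author` name. The " () " and " ! " substitutions are the ones used in the fragment above.

```python
import re
from html import unescape

def marc_author(page):
    """Return (name, email) from the 'From:' line of an archive page,
    or None if there is no such line."""
    m = re.search(r"^From: *(.*)$", page, re.MULTILINE)
    if m is None:
        return None
    sender = re.sub(r"<[^>]+>", "", m.group(1))   # drop tags such as <a ...>
    sender = unescape(sender)                     # &lt; and &gt; become < and >
    sender = sender.replace(" () ", "@").replace(" ! ", ".")  # de-obfuscate
    parts = re.match(r'\s*"?(.*?)"?\s*<(.*?)>\s*$', sender)
    if parts:
        return parts.group(1), parts.group(2)
    return sender.strip(), None                   # no address found

# Hypothetical page excerpt
sample = ('List: wikipedia-l\n'
          'From: <a href="?a=1">"Jane Doe" &lt;jane () example ! org&gt;</a>\n'
          'Date: 2007-02-08\n')
print(marc_author(sample))
```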
Unfortunately, beyond a couple of mailing list archives and wikis --
which, fortunately, are the majority of what I grab -- I have to
manually edit my sponges with proper meta/bibliographic data. And
curses upon those bloggers who make it difficult to determine the
author of an article or even the whole blog -- even a pseudonym will
do! Beyond the usage of my tool, I can imagine much value in a social
tool that allows users to share annotations, or even screen-scraping
"plug-ins." One can hope!
[This entry is now deprecated, please see Thunderdell (Freemind Extract).]
I am releasing version 0.6 of the fe mindmapping bibliographic tools. As explained in Extracting Bibliographies from Freemind, these are Python scripts that convert between Freemind mindmaps (using a few simple conventions) and bibliographic formats (i.e., OO.org CSV and bibtex). They also make it very easy for me to search my notes and quote authors (e.g., "Giddens").
There are no massive changes, just the usual tweaks and bug fixes. One notable change is that the regular expressions in pe.py are much improved: they are quite uncanny at extracting bibliographic keys of the form 'Snide and Smith (2003)' or '(Snide, Smith and Smittie 2004)' from natural language text.
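To illustrate the kind of pattern involved, here is a simplified sketch that handles just the two citation forms quoted above. These are not the expressions in pe.py; the names `NAME`, `AUTHORS`, `CITATION`, and `find_citations` are made up, and real prose needs many more cases (initials, "et al.", page numbers, and so on).

```python
import re

NAME = r"[A-Z][A-Za-z'-]+"                        # a capitalized surname
AUTHORS = rf"{NAME}(?:, {NAME})*(?:,? and {NAME})?"
CITATION = re.compile(
    rf"({AUTHORS}) \((\d{{4}})\)"                 # Snide and Smith (2003)
    rf"|\(({AUTHORS}),? (\d{{4}})\)"              # (Snide, Smith and Smittie 2004)
)

def find_citations(text):
    """Return (authors, year) pairs for each citation found in text."""
    found = []
    for m in CITATION.finditer(text):
        authors = m.group(1) or m.group(3)
        year = m.group(2) or m.group(4)
        found.append((authors, year))
    return found

text = ("As Snide and Smith (2003) argue, trust matters "
        "(Snide, Smith and Smittie 2004).")
print(find_citations(text))
```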
[This entry is now deprecated, please see Thunderdell (Freemind Extract).]
I am releasing a new zipfile of the fe mindmapping bibliographic tools. As explained in Extracting Bibliographies from Freemind, these are Python scripts that convert between Freemind mindmaps (using a few simple conventions) and bibliographic formats (i.e., OO.org CSV and bibtex). This approach is preferable to other bibliographic tools with limited/constrained forms for text entry. With fe one has a complete outline/map of texts, with figures, images, tables, links to sites, etc.; one can easily organize texts by topic or in separate mindmap files; and one can generate queries where each matching line has its appropriate citation with year and page number (e.g., "Giddens").
Unlike many bibliographic tools, it does not query on-line databases, but one can use such tools (e.g., tellico or refworks) to query and generate bibtex bibliographies and then use be.py to convert them to a mindmap.
- fe.py: extract bibliographic data from bibliographic MM (dependent on XML ElementTree and optionally bibtex2html)
  - this version is faster since it uses XML ElementTree instead of XML Tramp.
  - given a list of authors cited (*.rl, such as that generated by pe.py or pyblink) bibtex2html will generate a bibliography of only those authors.
  - bibliographic maps are searchable from the command-line or via the Web (e.g., search results for "Giddens" in my mindmap [java|flash]).
  - a Web of mindmaps can be searched for essential entries (the title is bold) and placed in a new mindmap for studying.

    fe.py -h (help)
          -v (output csv)
          -c (chase links between MMs)
          -w (output bibtex & html file) -a (include abstracts)
          -s (use bibtex style)
          -q (query)
          -e (create new MM of essential works)

- be.py: extract a MM from a bibtex file (dependent on bibstuff)
- de.py: extract a MM from a dictated text file
- ff.py: fix the case of titles of a bibliographic MM
- pe.py: extract the bibliographic keys of the form 'Snide and Smith (2003)' or '(Snide, Smith and Smittie 2004)' from natural language text
- te.py: parse inconsistently formatted textual bibliographies into bibliographic MM (e.g., from syllabi, cb2Bib is cool too)