2012 May 01 | DOI metadata and change_case
With Zotero and Mendeley providing excellent web-centric bibliographic tools, I don’t imagine I’ll ever get converts to Thunderdell. Still, it and BusySponge are integral to how I work and I enjoy the occasional Python programming. Some disjointed thoughts from the bibliographic screen-scraping front.
- I’m envious of Zotero’s many user-contributed translators for scraping a particular webpage for bibliographic data. I wish there were a command-line tool or API I could call to take advantage of them.
- CrossRef now offers a DOI-to-bibliography Web service; BusySponge now uses this (via the JSON format) -- see the sketch after this list.
- Even if one finds bibliographic data, it is often dirty. For instance, is the item’s title in title or sentence case? One can easily convert sentence to title case, but not vice versa. Hence, I’m quite pleased with change_case’s ability to detect title case and convert it to sentence case (i.e., detecting proper nouns that should remain capitalized).
- Because the New York Times redirects with a cookie and httplib2 can't easily handle that, I now make use of the awesome Requests ("HTTP for Humans") -- and web_little is now little indeed!
- I’m not yet satisfied with my generic scraper’s ability to identify an article’s author; I wonder if there are any tricks out there for this?
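As a rough illustration of that DOI-to-bibliography lookup, here is a minimal sketch using Requests and DOI content negotiation; the endpoint, the CSL JSON media type, and the placeholder DOI are my assumptions rather than BusySponge's exact code.

    import requests

    def doi_to_metadata(doi):
        """Ask the DOI resolver for citation metadata in CSL JSON."""
        response = requests.get(
            "https://doi.org/" + doi,
            headers={"Accept": "application/vnd.citationstyles.csl+json"},
            timeout=10,
        )
        response.raise_for_status()
        return response.json()

    metadata = doi_to_metadata("10.1000/xyz123")  # placeholder DOI
    print(metadata.get("title"))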
(I’ve been meaning to post an update on Thunderdell, and an interesting conversation with Nathan Matias about his work prompted me to do so.)
2011 Feb 10 | MW: a command line interface to MediaWiki
The MediaWiki experience can be as frustrating to a hacker as it is for a newbie. Editing in the Web interface is annoying and the syntax is atrocious. For myself, I prefer using Markdown syntax, a good text editor, pandoc, and a distributed version control system (VCS). (That's how I wrote my book.)
I, and others, crave a similar toolset for editing MediaWikis. I tried WikipediaFS once, but looking at a versioned wiki as a simple filesystem didn't do the trick and the project is unmaintained. The mvs MediaWiki client comes with Ubuntu, but I could never get it to work. wikish is OK but doesn't do all that I would like.
Recently, I stumbled upon Ian Weller's mw, "VCS-like nonsense for MediaWiki websites". He provides a great foundation and the basic pull, diff, commit and status commands. Since it's written in Python, I could actually grok it, extend pull so it can pull new updates and warn of conflicts, and provide simple merge functionality.
You can see examples of the pull, conflict, and merge functionality in a short MW tutorial I drafted.
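For the curious, the core of a pull is just a MediaWiki API query. The sketch below is not mw's code, only an illustration of the underlying call, assuming the legacy JSON response format; the wiki URL and page title are placeholders.

    import requests

    API = "https://en.wikipedia.org/w/api.php"  # placeholder wiki

    def pull_page(title):
        """Fetch a page's current wikitext and timestamp (useful for conflict checks)."""
        params = {
            "action": "query",
            "prop": "revisions",
            "rvprop": "content|timestamp",
            "titles": title,
            "format": "json",
        }
        data = requests.get(API, params=params, timeout=10).json()
        page = next(iter(data["query"]["pages"].values()))
        revision = page["revisions"][0]
        return revision["*"], revision["timestamp"]

    text, timestamp = pull_page("Sandbox")
    print(timestamp, text[:80])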
2010 Nov 24 | Smoother recording on VirtualBox
For a number of years now I've been using NaturallySpeaking speech recognition via a virtualized Windows environment on Ubuntu. Speech recognition gurus will recommend using an external USB sound card so as to avoid any electrical interference common with soundcards within the computer case. This is fine for the most part, except that VirtualBox's performance with USB sound devices can be a little choppy unless one uses a low-latency (i.e., real-time) kernel. However, Ubuntu only seems to release such kernels every other release.
Luckily, I recently discovered that one can use USB microphones within VirtualBox as emulated hardware devices -- by making use of the VBOX_ALSA_DAC_DEV and VBOX_ALSA_ADC_DEV environment variables. That is, instead of letting VirtualBox see the microphone as a USB device, you let Linux keep hold of it as a USB device, and it offers an emulated sound card to VirtualBox. This seems to avoid any audio problems in recording/playback, and means I am no longer dependent upon non-generic kernels.
I have posted a small Python script that sets these variables given a particular USB microphone that you want to use, and also sets the microphone volume level, since that is often set at zero in the Linux context.
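The posted script is the authoritative version; the following is just a sketch of the idea, where the ALSA device string, card number, mixer control name, and VM name are placeholders that depend on your setup.

    import os
    import subprocess

    MIC_DEVICE = "plughw:1,0"  # placeholder: the USB microphone's ALSA device
    MIC_CARD = "1"             # placeholder: that microphone's ALSA card number

    # Point VirtualBox's ALSA backend at the devices to use for playback (DAC)
    # and capture (ADC).
    os.environ["VBOX_ALSA_DAC_DEV"] = "default"
    os.environ["VBOX_ALSA_ADC_DEV"] = MIC_DEVICE

    # The capture level is often zeroed on the Linux side, so raise it first.
    subprocess.call(["amixer", "-c", MIC_CARD, "sset", "Mic", "80%", "unmute"])

    # Launch the VM with the adjusted environment.
    subprocess.call(["VirtualBox", "--startvm", "Windows"])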
2010 May 17 | Sentence Case in Bibliography
As a PhD student, one of the first bibliographic annoyances I encountered was when I had to format a paper using the APA system, which requires titles to be in sentence case. This means only the first word of a phrase and proper nouns are capitalized. Previously, I had kept the titles of my citations in title case. The consequence of having to use the APA format was the need to go in and manually lowercase all words that were not proper nouns in my bibliographic database. However, once this work was done, I realized keeping my data in sentence case was preferable, as title case essentially loses information. Yet, this still requires me to manually lowercase some words for automatically captured sources. I am not aware of any bibliographic software that handles this issue well, and the good folks at Zotero have an interesting bug ticket open on the issue.
On Friday, while I was doing my weekly fixes to the automatically captured sources in my field notes/mindmap/bibliography, I thought to myself that there are plenty of word lists around, such as those used by spellcheckers: couldn't I finally automate this menial task? However, I knew that I use lots of proper nouns that probably do not appear in common dictionaries. Therefore, I applied Python's Natural Language Toolkit tokenizer and part-of-speech tagger to the text of my dissertation to create a custom word list of proper nouns that I use. These are used with the dictionary found on my system at /usr/share/dict/american-english to transform a title-cased sentence into a sentence-cased one. Basically, if the word is in my custom list, is in the word list only as a capitalized word, or is not in the word list at all, it merits capitalization; otherwise, lowercase it. The code is available as a module in the BusySponge component of the Thunderdell bibliographic tools. It works fairly well and will certainly make that end-of-the-week menial task all the easier.
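The module itself is the real thing; below is only a toy sketch of the heuristic just described, with a stand-in PROPER_NOUNS set where the NLTK-derived custom list would go.

    PROPER_NOUNS = {"Zotero", "Wikipedia"}  # stand-in for the NLTK-derived list

    # Collect the lowercase words the system dictionary knows about.
    with open("/usr/share/dict/american-english") as dictionary:
        known_lowercase = {w.strip() for w in dictionary if w[:1].islower()}

    def sentence_case(title):
        result = []
        for i, word in enumerate(title.split()):
            # "Capitalized-only in the dictionary" and "not in the dictionary at all"
            # both reduce to: its lowercase form is not a known common word.
            keep_capital = word in PROPER_NOUNS or word.lower() not in known_lowercase
            result.append(word if i == 0 or keep_capital else word.lower())
        return " ".join(result)

    print(sentence_case("The Consequence Of Title Case In America"))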
2010 Apr 06 | Indexing a Book
One of the last significant steps for an author is to compile an index, unless they opt to have someone else do it. Publishers often recommend authors create the index as they know the material best. However, while professionals have sophisticated -- but proprietary -- tools to help them, authors are offered only the techniques of using index cards or spreadsheets. Neither of these options is appealing to me.
I thought it would be nice to simply compile a list of entries in the form of topic (page#|see) (sub)topic and let a script do the rest. It's a bit of a hack but it does the job and can collapse all subentries below a particular threshold. I like having specified subentries, even if there are only one or two of them for a particular entry:
    Apology
        and leadership, 124
        "Sorry but...," 54
But if the publisher says they want those collapsed, it is easy enough:
    Apology, 54, 124
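The posted script is the real implementation; here is a hypothetical sketch of just the collapsing step, assuming entries have already been parsed into (topic, page, subtopic) tuples and using a made-up threshold.

    from collections import defaultdict

    entries = [
        ("Apology", 124, "and leadership"),
        ("Apology", 54, '"Sorry but...,"'),
    ]
    THRESHOLD = 3  # topics with fewer subentries than this are collapsed

    index = defaultdict(list)
    for topic, page, subtopic in entries:
        index[topic].append((page, subtopic))

    for topic in sorted(index):
        items = index[topic]
        if len(items) < THRESHOLD:
            pages = ", ".join(str(page) for page, _ in sorted(items))
            print(topic + ", " + pages)
        else:
            print(topic)
            for page, subtopic in sorted(items, key=lambda item: item[1].lower()):
                print("    " + subtopic + ", " + str(page))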
2010 Feb 16 | Diffing Word Files
For the most part, I wrote my dissertation and book manuscript using a simplified version of Markdown complemented with biblatex citations. Because it was a simple text file, it made managing the edits to the manuscript very easy: I could do global textual replacements trivially and, obviously, generate PDFs, HTML, etc. Using Mercurial, I could take advantage of some nice features like the "attic" extension, which allows me to keep change sets on the side to be applied only when appropriate. So, for example, the changes necessary to generate HTML were kept in the attic and would only be applied when I wanted that.
Unfortunately, once the manuscript went into the MIT Press system, I had to use Microsoft Word. As much as the Word document format annoys me, I understand it is widely used, and I can't think of an easy alternative that also provides the capability for editorial annotations. Nonetheless, I had a difficult time seeing changes in Microsoft Word, and wanted to backport the changes into my source files. And there does not appear to be a nice textual difference tool for Word documents.
I have posted a small Python script that makes use of antiword and dwdiff but also gives me context on either side of the change. It, of course, doesn't work well with formatting, but is useful and will generate output like the following:
    reflects {-the-} [+a+] stabilization
    a {-number of pragmatic questions: it-}
    [+project was conceived. It+] would
    there {-will-} [+would+] be
    article {-will-} [+would+] be
    linked {-to from-} [+via+] a
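The posted script is the real one; this is just a rough sketch of the approach, converting both .doc files to text with antiword and handing the results to dwdiff (the context-trimming step is omitted).

    import subprocess
    import sys
    import tempfile

    def doc_to_text(doc_path):
        """Extract plain text from a Word .doc file using antiword."""
        text = subprocess.check_output(["antiword", doc_path]).decode("utf-8", "replace")
        tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
        tmp.write(text)
        tmp.close()
        return tmp.name

    old_text = doc_to_text(sys.argv[1])
    new_text = doc_to_text(sys.argv[2])

    # dwdiff marks deleted and inserted words; the posted script additionally
    # trims this down to a little context on either side of each change.
    subprocess.call(["dwdiff", old_text, new_text])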
2010 Jan 12 | Thunderdell v1.2
I have tagged a new release of Thunderdell, the FreeMind mindmap to biblatex utilities. Improvements include:
- The ability to emit biblatex or BibTeX.
- More consistent/improved Unicode support, particularly in bibliographic keys.
- New feature for requesting either the long or short URL for Wikipedia sources.
- New feature for requesting that only URLs for exclusively online sources be emitted.
- New test for online sources.
- More biblatex fields, such as original publisher and year.
- Faster processing by using lxml, but can still fall back to etree (see the sketch after this list).
- Use of optparse.
- The "essential node" function was removed.
2009 Aug 21 | Shared Clipboard and Chicago Page Ranges
I use VirtualBox to run a Windows guest, and unfortunately the shared clipboard between the two is sometimes buggy. I recently posted a script for a very robust clipboard using a network-shared file (a sketch of the idea follows below). Also, I'm doing the final checks on the book manuscript, and the Chicago Manual of Style has odd and confusing rules for specifying ranges of page numbers. The CMS page range validator looks through my source files and prints out any likely to be counter to Chicago style.
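The posted script is the canonical one; this hypothetical sketch only illustrates the shared-file idea on the Linux side, and the share path and the use of xclip are my assumptions.

    import subprocess
    import sys

    CLIP_FILE = "/mnt/share/clipboard.txt"  # placeholder path visible to host and guest

    def push():
        """Write the host clipboard to the shared file."""
        text = subprocess.check_output(["xclip", "-selection", "clipboard", "-o"])
        with open(CLIP_FILE, "wb") as f:
            f.write(text)

    def pull():
        """Load the shared file back into the host clipboard."""
        with open(CLIP_FILE, "rb") as f:
            subprocess.run(["xclip", "-selection", "clipboard", "-i"], input=f.read())

    if __name__ == "__main__":
        push() if sys.argv[1:] == ["push"] else pull()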
2009 Mar 13 | BusySponge 0.5
In 2002 I began thinking about how to best capture and share the many web pages and small tasks of the day. I thought of it as a "busy sponge": logging bookmarks and tasks to my team page with a minimum of typing. Furthermore, I wanted to tag each entry with a keyword which could then be used in queries. I posted an implementation in 2003, which was complemented by the fact that the tasks on my team page were syndicated (via RSS) -- and used to generate my "two minute reports" at the weekly staff meeting at MIT. This was a number of years before the notion of micro-blogging became popular.
Two interesting features have further matured: I wanted it to fetch the title of a URL -- typing HTML was a hassle -- and I wanted to tell it which of my pages to log the entry to: my personal weblog or my work team page. For example:
    urd:/home/reagle > b http://pesto.redgecko.org/dispatch.html j python Noted ^
sponges a URL to my "j" (work) page, where "^" becomes the hypertextual page title, resulting in:
    <li class="event" id="e090313-f7fd">090313: python] <a
    href="http://pesto.redgecko.org/dispatch.html">Noted URL dispatch
    — Pesto: a library for WSGI applications</a></li>
With BusySponge 0.5 (now distributed as part of Thunderdell), this has matured into a set of classes for webpage screen scraping and a set of logger functions. So, for example, I might sponge a comment about a URL and indicate that it should be logged to my bibliographic mindmap (Thunderdell), and it will do its best to fetch the page author, title, date, publisher, permanent link, excerpt of the first substantive paragraph, etc. The default heuristics do a surprisingly decent job -- certainly better than typing it from scratch -- and the specific scrapers (e.g., Wikipedia, MARC email archives) are quite good.
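As a toy illustration of what the generic heuristics start from (not BusySponge's actual classes), here is a sketch that fetches a page and pulls out its title; the URL is just the example from above.

    import re
    import requests

    def scrape_title(url):
        """Fetch a page and fall back to the URL if no title can be found."""
        html = requests.get(url, timeout=10).text
        match = re.search(r"<title[^>]*>(.*?)</title>", html, re.DOTALL | re.IGNORECASE)
        return match.group(1).strip() if match else url

    print(scrape_title("http://pesto.redgecko.org/dispatch.html"))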
2009 Jan 22 | Thunderdell 1.0 (was: 'FreeMind Extract')
I'm releasing the latest set of FreeMind Bibliographic Extraction scripts. I'm calling it a "1.0" release because:
- I decided to give it a funny name.
- I now address an unlikely but long-time screw case.
- This and other cases are now tested by doc_tests.
- I updated the generation of bibliographic keys to remove 'and' from the author portion and always include a title suffix -- instead of just when there is a collision. Keys are now a bit more terse and more stable (see the sketch after this list).
- I now emit biblatex, a much more complete and powerful bibliographic format.
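To illustrate the flavor of that key change (a purely hypothetical sketch, not Thunderdell's actual key format): drop 'and' from the author portion and always append a short title suffix.

    def bib_key(authors, year, title):
        """Hypothetical key builder: authors without 'and', year, title suffix."""
        author_part = "".join(w for w in authors.split() if w.lower() != "and")
        skip = {"a", "an", "the", "of"}
        suffix = "".join(w[0].lower() for w in title.split() if w.lower() not in skip)[:3]
        return author_part + str(year) + suffix

    print(bib_key("Lessig and Zittrain", 2008, "The Future of the Internet"))
    # prints "LessigZittrain2008fi" -- illustrative only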
It also now has its own webpage, from which you can download it.