Open Codex method

2008 Feb 06 | Digital Posterity

I have over 1000 primary sources in my Wikipedia research mindmaps. In accumulating some of those sources, I have already been confronted with their ephemerality. (And these are public sources only; I know of lots of e-mails by the likes of Wales, Sanger, and Stallman that I would've liked to have access to but that apparently no longer exist.) So, doing a quick check-link analysis of the largest mindmap I find the following: 941 of those resources are "OK"; 21 are "404" (no longer there); and 10 "Timeout". So, just within a few years ~2% are gone outright, and ~3% aren't readily available if the timeouts are counted. For example, the link to Sanger's 2005 information about his (then) new Digital Universe project is already broken; but I must say news sites are the worst. Then, there are the URLs that don't have what they used to, those that are now password protected, and those that have new URLs because of a site reorganization -- blogs seem to be the worst on this front. Of course, I don't know if this rate is a linear trend and I would be interested in any research that shows longitudinal decrepitude rates of an existing corpus of links.
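The percentages implied by those counts can be checked with a few lines of Python. This is only an illustrative sketch: the function name `summarize_link_check` is hypothetical, and the list of statuses is rebuilt from the three counts reported above rather than from an actual link-checker run.

```python
from collections import Counter


def summarize_link_check(statuses):
    """Tally link-check outcomes; return {status: (count, percent of total)}."""
    counts = Counter(statuses)
    total = sum(counts.values())
    return {
        status: (n, round(100 * n / total, 1))
        for status, n in counts.items()
    }


# Statuses reconstructed from the counts reported for the largest mindmap.
statuses = ["OK"] * 941 + ["404"] * 21 + ["Timeout"] * 10
summary = summarize_link_check(statuses)

for status, (n, pct) in summary.items():
    print(f"{status}: {n} ({pct}%)")
```

Re-running such a tally on the same set of links every year or so would be one simple way to see whether the rate of loss is linear.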

In any case, I expect my own modest historical inquiries are only the beginning; I think people will be writing histories of Wikipedia and the larger free culture movement decades in the future, though I am not sure how much of what we have today will still be there for them. I was surprised, and happy, to find that someone else is already making use of my Nupedia-l archive, so I thought I would do something similar for my other sources. I don't think this would be of much use to anyone today, and it is somewhat "tainted" in that it is my own analytical take and selection of sources -- absent summaries, annotations and excerpts -- but it might be of use in the future.

This archive includes the HTML versions of two mindmaps and copies of the online resources to which they link. If you do make use of it, you can continue to refer to it as part of the "Reagle Wikipedia Archive."

This collection wp-sources.tar.bz2 was made by placing the HTML version of the mindmaps (wikip-primary.html and field-notes-cat.html) on a Web server and then issuing:

wget --restrict-file-names=windows -c --recursive --level=1 --span-hosts --convert-links --execute robots=off -t 4 http://reagle.org/joseph/2008/02/wp-srcs/field-note-cats.html

this entry posted to method;
comments (2)

2008 Jan 24 | Too magnificent

I recently read Andrew Ross' "No Collar: The Humane Workplace and Its Hidden Costs: Behind the Myth of the New Office Utopia" in remembrance of my own brief time as a consultant in New York's "Silicon Alley" during the booming 90s. I even had a few meetings at Ross' study site: the ever-cool design/strategic/Web firm RazorFish. I like Ross' portrayals of American culture, including the "Celebration Chronicles: Life, Liberty, and the Pursuit of Property Value in Disney's New Town," but I encountered him first in the Sokal Hoax affair -- and was disappointed with his defense of accepting a Po-Mo gobbledygook hoax submission to the prestigious Social Text journal. I expect I am sympathetic to Alan Sokal in this affair because as a former computer scientist I've been acculturated with the maxim of K.I.S.S.: Keep It Simple Stupid. This sensibility persists into my engagement with the humanities and social sciences -- though it sometimes causes me to feel alienated and distressed. Similarly, I've always been fond of physicist Richard Feynman's freshman principle regardless of the discipline: if something can't be explained in a freshman lecture, it is not yet well understood. (He was quite a character, and is also alleged to have said that the philosophy of science is as useful to scientists as ornithology is to birds.)

So reading Ross again prompted me to peruse his Wikipedia article, which then led to the Sokal hoax article, and then to a wonderful list of similar hoaxes in other disciplines, including the fascinating tale of the Bogdanov brothers:

The Bogdanov Affair is an academic dispute regarding the legitimacy of a series of theoretical physics papers written by French twin brothers Igor and Grichka Bogdanov (alternately spelt Bogdanoff). These papers were published in reputable scientific journals, and were alleged by their authors to culminate in a proposed theory for describing what occurred at the Big Bang. The controversy started in 2002 when rumors spread on Usenet newsgroups that the work was a deliberate hoax intended to target weaknesses in the peer review system employed by the physics community to select papers for publication in academic journals. While the Bogdanov brothers continue to defend the veracity of their work, many physicists have alleged that the papers are nonsense, considering this evidence of the fallibility inherent within the peer review system. The debate over whether the work represented a contribution to physics, or instead was meaningless, spread from Usenet to many other Internet forums, including the blogs of notable physicists and both the French and English Wikipedia encyclopedia projects.

While perhaps not as common, the natural sciences too can suffer from incomprehensibility masquerading as erudition. In fact, some of the worst excesses in the humanities go hand in hand with speculative takes on cosmology and quantum physics. And, I am putting aside the interesting issues of the efficacy of peer-review and the extent to which a discipline can trust its members not to flat out lie -- such as the case of disgraced Korean stem cell researcher Hwang Woo-suk. My main point here is that we should strive for, and hold others accountable to, a standard of simplicity, or as Einstein said "as simple as possible, but not simpler."

For my own purposes I've come to view that which is incomprehensible to me as perhaps like medieval Scholasticism -- famously parodied with the question of "how many angels can dance on the head of a pin?". Thomas Aquinas and Peter Lombard were no doubt far smarter than me, but if one starts with a particular set of assumptions (e.g., textual inerrancy), fetishizes a logic over a broader rationality (e.g., dialectics, be they Christian or Marxist), and lacks an understanding of how we fool ourselves (e.g., confirmation bias) I feel we can end up with brilliant nonsense. (Frederick Crews is famous for his criticism of Freudianism along these lines and his latest book is aptly titled "Follies of the Wise.")

And I now have a new term for describing those works that are manifestly learned but for which I'm confused as to whether I'm too dumb to understand or they are simply incomprehensible. Herman Kogan (1958), in "The Great EB: the Story of the Encyclopaedia Britannica," writes of the Britannica editors' difficulty with the "Algebraic Forms" article, which was so complex that it was referred to different experts to assess whether it was sensible. In the final review, Simon Newcomb of Johns Hopkins University wrote, "It's magnificent, although I am not sure it is all clear to me but it's really magnificent." Consequently, the editor rejected the article as being "too magnificent" (p. 90).


2007 Aug 13 | Source-as-primary-character

I recently finished Peter Heather's (2006) The Fall of the Roman Empire. This popular, though no less rigorous, history is widely praised. The narrative is engaging and I appreciate the glossary, dramatis personae, and timeline; these help given that the book spans 150 years, dozens of emperors (East and West), generals, and barbarian kings. What most impressed me was Heather's treatment of sources. Many histories, particularly of ancient societies, are written in the third person objective. Yet, as I learned in my historical methods course, the practice of history is more than a recounting of events; it is a substantiated argument about people and events in time. Heather presents his arguments as such: identifying when he agrees or disagrees with others or scholarly consensus, and addressing the circumstances of his sources. Rather than being simply a footnote, sources come to the foreground and become part of the story. The histories of sources, such as Pullodius' commentary on Ambrose written in the margins of De Fide, or the listing of fourth century military and civilian offices, the Notitia Dignitatum, are interesting in themselves and contribute to a much deeper understanding of the ground on which Heather's arguments rest. While a popular history might present a more accessible or exciting version of an old tale, it is rare for it to communicate the challenges and excitement within the discipline -- because popular history often obscures its scholarship. But Heather brings it forward and what I thought might be a rather staid field -- don't we already know all we can about the ancients? -- is shown as alive with new archaeological finds, textual fragments, analysis, and argument.

I know this will influence the next revision of one of my historical chapters with respect to how I speak about some of the primary sources I found.


2007 Mar 15 | Reuse vs. self plagiarism

Yesterday's New York Times reported on another case of high profile plagiarism: a relatively young professor who had copied parts of her dissertation from another. Even though she had previously acknowledged as much in private and has now resigned -- so there's no question of ambiguous boundaries -- a few things struck me as salient:

Interestingly, this article came along at the same time I have been following a discussion of Turnitin, a student plagiarism detection service, and struggling with issues of "reusing" my own work.

Unlike my previous issue of how to deal with priority in relation to self-published grey literature, my present concern arises out of published work. I am presently working on Chapter 4 of my dissertation which specifies criteria for an "open content community" as well as some interesting boundary cases on openness. A dissertation is, understandably, supposed to be an original work; I read this as "new work since matriculation" but I have heard it said this could mean only unpublished work: this would be horrible. Perhaps my strong view is partly because I'm a "mid-career" Ph.D. student who has already presented papers, and it strikes me as contrary to stop a professional activity that is essential for getting feedback. I also appreciate that new Ph.D.s reuse their dissertations in subsequent articles and/or books, which I also plan to do. But to sit on all that material and labor on in solitude -- aside from one's dissertation committee members -- until the dissertation is complete seems counterproductive. Consider the genealogy of parts of the present chapter:

In no case did I assign any copyright -- though they of course are published under various copyright licenses -- and so I am not legally precluded from using them in compilations or derivative works. Making them available has provided me with feedback and opportunities for publication which yields more feedback and builds relationships within my scholarly community. This is great! But what of "self-plagiarism"? (So, perhaps this is like my earlier post but questions of priority and public but "unpublished" work are exchanged for questions of "published" works and self plagiarism.)

I've been reading up on the topic and found Green (2005) interesting, and Hexham (1999) useful:

Self-plagiarism must be distinguished from the recycling of one's work that to a greater or lesser extent everyone does legitimately. Although self-plagiarism in academic publications is a gray area many universities implicitly recognize the practice as fraudulent. Thus most universities have rules preventing students from submitting essentially the same essay for credit in different courses. There are also rules against someone submitting the same thesis to different universities. Among established academics self-plagiarism is a problem when essentially the same article or book is submitted on more than one occasion to gain additional salary increments or for purpose of promotion.

Like all plagiarism, self-plagiarism occurs when the author attempts to deceive the reader. This happens when no indication is given that the work is being recycled or when an effort is made to disguise the original text. The issue once again is one of deception. Disguising a text occurs when an author makes cosmetic changes that make the same book or paper look different when it actually remains unchanged in its central argument. Changing such things as paragraph breaks, capitalization, or the substitution of technical terms in different languages, causes readers to believe they are reading something completely new. If these are the only changes an author has made then they may be legitimately described as self-plagiarism and fraudulent.

The extent of re-cycling is also an indication of self-plagiarism. Academics are expected to republish revised versions of their Ph.D. thesis. They also often develop different aspects of an argument in several papers that require the repetition of certain key passages. This is not self-plagiarism if the complete work develops new insights. It is self-plagiarism if the argument, examples, evidence, and conclusion remain the same in two works that only differ in their appearance.

Which brings me, finally, to my simple and mundane question for my dissertation. Is a citation to my own published works sufficient if I am reusing text -- though continuing to rework and integrate it -- or should I also give an acknowledgment (often seen in scholarly books) that "portions of this text are republished from or based on..."?

(BTW: a possible irony is I expect this and earlier entries could be turned into a decent paper on "scholarship in the open" should the opportunity ever present itself!)


2007 Feb 13 | Grey literature, stigmergy and priority

Last week I read a provocative paper by Helen Nissenbaum (2002) where she considers the norms, values, and ends previously served by the convention of scholarly priority, and, now that the contextual landscape is changing because of electronic media, whether intellectual property (patents) can serve just as well in their stead. Helen recommended it to me while we were discussing my dissertation chapter on encyclopedic production, including questions of copyrights and plagiarism. This chapter is partly based on a draft I wrote in 2005 in which I argued the concept of stigmergy is helpful in understanding the sort of sociality involved in the cumulative production of knowledge in reference works.

An irony is that Nissenbaum's paper speaks to the question of scholarly priority in the age of the Internet, which bears on my adoption of the term stigmergy. (She doesn't mention blogs or wikis, but instead refers to "wildcat publishers," "grey literature," and whether there is any scholarly obligation to search these realms for the purposes of citation.)

I think I first wrote of stigmergy in the spring of 2005, in a draft I made available on this blog on September 30. Roughly a year later, I read Mark Elliott's piece Stigmergic Collaboration: The Evolution of Group Work in the May issue of the online M/C Journal. Elliott explores the idea much more thoroughly than I did or will, and that is good. But how do I deal with the question of priority and citation? I definitely want to -- and do -- cite Elliott in my present version of the chapter, but what to do with my earlier version? I don't know Elliott and assume he knows nothing of me. And I don't feel that proprietary about saying Wikipedia might be stigmergic. And for all I know we read the same thing about wasps -- though I was also inspired by early reference work compilers likening their copying of others' work to a useful "busy bee." But I don't want it to appear I am simply borrowing the idea from elsewhere and I prefer not to cite earlier "unpublished" drafts. This concern with priority is in the face of the biggest irony of all: an argument of this chapter is that knowledge is inherently interdependent and cumulative!

Presently, the text in question reads:

Stigmergy is a term coined by Pierre-Paul Grassé to describe how wasps and termites collectively build complex structures; as Karsai (2004:101) writes, it "describes the situation in which the product of previous work, rather than direct communication among builders, induces [and directs how] the wasps perform additional labor." In addition to my proposal that this notion might be helpful in understanding Wikipedia collaboration (Reagle 2005fss), Mark Elliott (2006) has also, more thoroughly, argued the same: "As stigmergy is a method of communication in which individuals communicate with one another by modifying their local environment… [t]he concept of stigmergy therefore provides an intuitive and easy-to-grasp theory for helping understand how disparate, distributed, ad hoc contributions could lead to the emergence of the largest collaborative enterprises the world has seen" (Elliott 2006:4). However, we need not apply this notion only to new media. For example, stigmergy might also be applicable to Newton’s seemingly generous sentiment of acknowledging the contributions of his predecessors: "If I have seen further [than you and Descartes] it is by standing upon ye shoulders of giants." (As cited in a 1676 letter from Newton to Hooke, by Merton (1993), who details a long history of this aphorism and Newton's probably less than magnanimous intention (Hawking 2002) of insulting Robert Hooke, his short and hunchbacked rival.)

Is this appropriate?


2007 Jan 23 | Broken lists

I'm presently cursing whoever changed the configuration/names of Wikipedia lists. Identifying emails in archives is, sadly, a difficult problem, though it really need not be. Fortunately, the good folks at the aimsgroup MARC also archive the lists and associate the unique identifier of every message with a persistent and unique URL, as I wrote about previously. But when Wikipedia moved its lists from "foo@wikimedia.org" to "foo@lists.wikimedia.org" it not only broke email filters across the land, it evidently broke the MARC archives too. No message has been available in the MARC archive since the change on January 6. Now, Wikipedians are realizing that many of the links from the Wikis to email messages (e.g., referencing a message on the Wikimedia Foundation list) are broken.

My backlog of email messages to scrutinize is growing, and I hope Hank Leininger and the other volunteers at MARC find the time and means to address the problem. What would be great is if Wikipedia and other users of archive software (e.g., Mailman) pressed for stable references to messages as a priority feature!
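To illustrate what a stable reference could look like, here is a minimal Python sketch that derives a URL from a message's Message-ID header alone, so that a change in the list's address does not change the link. The base URL `ARCHIVE_BASE` and the function name `stable_reference` are hypothetical; this is not how MARC or Mailman actually build their URLs.

```python
from email import message_from_string
from urllib.parse import quote

# Hypothetical archive location, used only for illustration.
ARCHIVE_BASE = "https://archive.example.org/msgid/"


def stable_reference(raw_message):
    """Build a URL from the Message-ID header, ignoring the list address."""
    msg = message_from_string(raw_message)
    message_id = msg["Message-ID"]
    if message_id is None:
        raise ValueError("message has no Message-ID header")
    # Drop the surrounding angle brackets and percent-encode the rest.
    return ARCHIVE_BASE + quote(message_id.strip().strip("<>"), safe="")


# The same message as delivered before and after a list address change.
before = "To: foo@wikimedia.org\nMessage-ID: <abc123@example.org>\n\nHello"
after = "To: foo@lists.wikimedia.org\nMessage-ID: <abc123@example.org>\n\nHello"

print(stable_reference(before))
print(stable_reference(before) == stable_reference(after))
```

Because the reference depends only on the Message-ID, links made before a list is renamed would keep resolving afterwards, provided the archive indexes messages by that header.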


2006 Oct 20 | A note on bibliography

I'm sharing this note from the beginning of my dissertation so others working with online resources might comment.

The type and number of bibliographic sources of this work merit a few comments.

First, most of the primary sources are online, and have only been online. Quotations from e-mail and most exclusively online resources have no page numbers associated with them.

Second, many of the printed sources (primary and secondary) are now online. This is common in recent works where authors place versions of a print publication online, or where older works are now in the public domain and have been republished online. In such cases I use the publication date of the version I used. If necessary, I include the original publication date in prose adjacent to the reference, and I include it in the title of the work in the bibliography. For example the bibliographic entry for Project Gutenberg's 2004 republication of H. G. Wells' "A Modern Utopia" would be:

Wells, H. (2004). A modern utopia (1905). (6424). Retrieved on September 20, 2006 from < http://www.gutenberg.org/dirs/etext04/mdntp10h.htm >.

The page numbers associated with print-only sources obviously correspond to the printed page. For those sources that are also online, the page number might be associated with the pagination of the printed online resource from which I first took my notes, or the printed material, for which I later found an online copy. I believe it will be clear to the reader which is the case.

Third, for some recent sources, there are many publications by the same author in the same year. After a couple of years of experimentation with the software I use to manage this material I have settled upon the convention of identifying such a source by appending a token to the publication year that is composed of the first three substantive words of the title. So, instead of using the letters [a-z], which some bibliographic systems use, my reference for Wikipedia's "Neutral Point of View" article is: (Wikipedia 2006npv). This provides stability across additions/subtractions to the bibliography and across chapters, and is comprehensible to the author and hopefully the reader.
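A rough Python sketch of this keying convention follows. It assumes, based only on the "npv" example, that the token is built from the initial letters of the first three substantive words; the stopword list and the function name `citation_key` are hypothetical, not what my bibliographic software actually does.

```python
import re

# Assumed list of non-substantive words; the real criterion is a judgment call.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "to", "for"}


def citation_key(author, year, title):
    """Return 'Author YEARtoken', where token is the initials of the
    first three substantive words of the title."""
    words = re.findall(r"[A-Za-z0-9]+", title.lower())
    substantive = [w for w in words if w not in STOPWORDS]
    token = "".join(w[0] for w in substantive[:3])
    return f"{author} {year}{token}"


print(citation_key("Wikipedia", 2006, "Neutral Point of View"))
```

Since the token depends only on the title, adding or removing other works by the same author in the same year leaves existing keys unchanged, which is the stability that the [a-z] suffixes lack.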

Finally, Web sources do change, particularly Wiki pages! Wherever possible I include the date of the version of the resource to which I am referring. Wikimedia resources are also identified by their versioned, "stable" or "permanent," URL. It is possible that I will reference different versions of the same Wiki page.

All of this may sound confusing, and it was no easy task coming to this understanding, but in the end I hope it is useful. If the intention of bibliography is to permit the reader to follow the author's journey through the sources, the ready accessibility of online resources is a boon to all.


2006 Sep 04 | Outsider Contributions

When I make a substantive contribution to Wikipedia, I tend to edit "off-line" until I'm satisfied with the text, and then post it in a single chunk. While I am only a WikiGnome in any case, the typical Wikipedia metric of "edit counts" would underestimate the contribution made by people who edit in a similar fashion. My own simple Python script exhibits this problem. To get some sense of the substance of any given edit, one would have to go beyond screen-scraping and perform analysis on the Wikipedia database -- something beyond my desktop computer. Fortunately, Aaron Swartz purchased "some time on a computer cluster" and came up with the following novel result:

When you put it all together, the story become clear: an outsider makes one edit to add a chunk of information, then insiders make several edits tweaking and reformatting it. In addition, insiders rack up thousands of edits doing things like changing the name of a category across the entire site -- the kind of thing only insiders deeply care about. As a result, insiders account for the vast majority of the edits. But it's the outsiders who provide nearly all of the content.

I'm looking forward to seeing these findings replicated.
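The difference between the two metrics can be seen in a toy Python sketch. The edit history below is made up, the function name `contribution_summary` is hypothetical, and counting characters added per edit is a far cruder measure than an analysis of the full Wikipedia database; it only illustrates why edit counts and content contributed can diverge.

```python
from collections import defaultdict


def contribution_summary(edits):
    """Return {user: (edit_count, chars_added)} from (user, size_change) pairs."""
    counts = defaultdict(int)
    added = defaultdict(int)
    for user, size_change in edits:
        counts[user] += 1
        # Count only text added; deletions do not reduce the total.
        added[user] += max(size_change, 0)
    return {user: (counts[user], added[user]) for user in counts}


# Made-up history: one large off-line edit, then several small tweaks.
edits = [
    ("outsider", 2400),
    ("insider", 12),
    ("insider", -5),
    ("insider", 30),
    ("insider", 8),
]
summary = contribution_summary(edits)

for user, (n_edits, chars) in summary.items():
    print(f"{user}: {n_edits} edits, {chars} characters added")
```

By edit count the insider dominates this made-up history, while by characters added nearly all of the text comes from the single outsider edit.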


2006 Jun 12 | Nupedia-l Archives

I recently completed my review and analysis of the Nupedia e-mail list archives. Since they are no longer easily accessible, I thought I would share the raw archives: nupedia-l.tar.bz2. This HTML version of the e-mail archives was extracted from the Internet Archive via the following command:

wget --exclude-domains gizmology.net -e robots=off -nH --cut-dirs=3 --base=http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/ -r -l 4 -N -k -p -R js -Gbase http://web.archive.org/web/20030822044803/http://www.nupedia.com/pipermail/nupedia-l/

I believe this archive contains additional textual processing subsequent to the `wget` to make it more useful to me.

If you wish to access the messages from this archive, turn off your JavaScript. Otherwise, you will be taken to the online version when you click on a link, which can be slow. However, accessing the online Web version can be useful if I failed to gather a copy from the "20030822044803" date space of the archive and you want to try other periods. (One can also find an mbox-like file of the messages, though it would require a lot of work to make it compliant with the mbox format.) This is a tar archive compressed with bzip2. If you are inclined to cite this collection you can note it is part of the "Reagle Wikipedia Archive."


2006 Jun 09 | The method of haiku

A Zen-inspired aesthetic of haiku is sabi: an insightful appreciation of the "suchness" of ordinary objects and daily events. Hass (1994:xiv) writes of this as a "quality of actuality, of the moment seized on and rendered purely." This pureness of vision led Barthes (1983:60) to claim that haiku's "brevity would guarantee their perfection," their "simplicity would attest to their profundity."

I am foolish enough to aspire towards this quality in my own work. Of course, in my dissertation proposal I cloak my poetic inspiration with sympathetic methodological scholarship:

Yet, there is a goal that I aspire to: my research "should be empirical enough to be credible and analytical enough to be interesting" (van Maanen 1988:29). I hope to make a convincing contribution (Golden-Biddle and Locke 1993) by providing an account that has authenticity, "the ability of the text to convey the vitality of everyday life encountered by the researcher in the field setting" (p. 599), plausibility, "the ability of the text to connect two worlds [of the writer and reader] that are put in play in the reading of the written account" (p. 600), and criticality, "the ability of the text to actively probe readers to reconsider their taken-for-granted ideas and beliefs" (p. 600).

I recognize this aspiration is foolish because it is not the norm, as I understand academia. I have long characterized my own stance as a "reflective practitioner," a seemingly rare and unsupported breed. I do not claim a perfectly impartial, objective, outsider perspective; I reach for analytical, reflective distance while appreciating that those most familiar with a phenomenon also understand its faults the best, however much they are attached to it. This posture opens me up to criticisms of losing impartiality, for having "gone native." (But, of course, I was already partially native and "critical" should not always mean pejorative.) Or, some will ask "what is the contribution to theory?" This question is important but incomplete to my mind; its companion should be: "and what is the contribution to practice?" For what is the point of a field that follows the world so as to only argue about how we should argue about it? In his study of Quaker decision-making Sheeran (1996:xiv) wrote in his preface:

Social scientists and political philosophers are invited to discover in Quakers what may be the only modern Western community in which decision-making achieved the group-centered decisions of traditional societies. In the Conclusion, the author discusses Friends as a possible answer to the common contemporary wish for enhancement beyond the fragmented individuation of "liberal" man.

Finally, the author hopes Quakers themselves will find in these pages a helpful mirroring of Friends decision-making. Newcomers to Quakerism and those who find themselves in roles of leadership within the community may find in this study an outsider's understanding of the possibilities and pitfalls of the Quaker method of going beyond majority rule.

This strikes me as a worthwhile balance, one I hope to achieve as well.


Open Communities, Media, Source, and Standards

by Joseph Reagle
