Open Codex

2007 Mar 29 | Brandt and Wikipedia plagiarism

Since I am a student of plagiarism in reference work production, I read with interest Daniel Brandt's (2006) analysis of Plagiarism by Wikipedia editors. He claims that he found 145 instances of Wikipedia plagiarizing others (roughly 1% of his sample), and projects that the "plagiarism rate on Wikipedia is at least 2%."

I have a few thoughts, the first of which concern statistical nuances. For example, the size of the sample that was used to calculate the 1% rate is not clear. His description of the winnowing down of his original corpus to the plagiarized cases is somewhat confusing, but this is how I understood it:

Because his 1% figure is rounded, it's not clear if his divisor is the 16,750 articles he started with (yielding 0.0087) or the 12,750 articles in which he found at least one clean sentence (yielding 0.012); I think it should be the latter: 1.2%.

Also, it's difficult to follow the details of the winnowing from the 5,867 articles for which Google found duplication, to the 1,682 articles that weren't "rogue," to the 145 "confirmed" cases of plagiarism, but clearly the bulk of the duplications were pages containing material copied from Wikipedia. This invites the question: how much is Wikipedia itself plagiarized?

And, remember, these are descriptive statistics of his sample only. Before making any inferential claim about the whole of Wikipedia one has to ask about the sampling methodology: why biographical articles, and are they more or less likely to have plagiarism than others? (In my experience I find biographical plagiarism to be common.) Also, we have no parameters for the confidence of the inference. For example, to be very confident (i.e., 99%) that the inferred estimate is off by no more than 1%, one would need a proper random sample of 16,453 ("clean") articles.
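To give a feel for where a sample size of that magnitude comes from, here is a rough sketch using the standard formula for estimating a proportion, n = z^2 * p(1-p) / e^2, with an optional finite population correction. The function name, the worst-case p = 0.5, the z-score of 2.576 for 99% confidence, and the population of 2,000,000 articles are all my assumptions for illustration; I am not claiming this is the exact calculation behind the figure above.

```python
import math
from typing import Optional


def required_sample_size(z: float, margin: float, p: float = 0.5,
                         population: Optional[int] = None) -> int:
    """Sample size needed to estimate a proportion within +/- margin.

    z          -- z-score for the confidence level (about 2.576 for 99%)
    margin     -- acceptable error as a fraction (0.01 = one percentage point)
    p          -- assumed true proportion; 0.5 is the worst case
    population -- if given, apply the finite population correction
    """
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if population is not None:
        # Finite population correction: sampling a noticeable share of a
        # finite population needs somewhat fewer observations.
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# 99% confidence, +/- 1 point, no population correction
print(required_sample_size(2.576, 0.01))
# Same target, corrected for a hypothetical population of 2,000,000 articles
print(required_sample_size(2.576, 0.01, population=2_000_000))
```

Without the correction the answer is about 16,590; with the hypothetical two-million-article population it drops to something very close to the 16,453 mentioned above. Either way, the point stands: the required sample is about the size of Brandt's whole starting corpus, and it would have to be drawn at random.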

Finally, and more substantively, Brandt conflates plagiarism with verbatim copying. As Posner (2007:12) writes: "not all plagiarism is copyright infringement and not all copyright infringement is plagiarism." I'm not sure if Brandt excludes all public domain copying (which still might be considered plagiarism), or only unsourced public domain material. Also, elsewhere, I wrote how I found a verbatim copy of text in a biographical article, raising copyright infringement concerns. After placing my report on the copyright infringement page (there can be dozens of reports today), the infringing text was rewritten, perhaps removing it from the scope of copyright -- and Brandt's method -- but perhaps not from the cloud of plagiarism.

Consequently, my understanding of Brandt is that his conclusion should be read as: 1.2% of a sample of Wikipedia biography articles appear to contain infringing verbatim copying from other online sources. Brandt's approach was conservative, but it doesn't really address the whole of the much larger, and murkier, domain of (non-verbatim) plagiarism.

Open Communities, Media, Source, and Standards

by Joseph Reagle
