Brandt and Wikipedia plagiarism

Since I am a student of plagiarism in reference work production, I read with interest Daniel Brandt's (2006) analysis of Plagiarism by Wikipedia editors. He claims that he found 145 instances of Wikipedia plagiarizing others (roughly 1% of his sample), and projects that the "plagiarism rate on Wikipedia is at least 2%."

I have a few thoughts, the first of which concern statistical nuances. For example, the size of the sample used to calculate the 1% rate is not clear. His description of how he winnowed his original corpus down to the plagiarized cases is somewhat confusing, but this is how I understood it:

  • 16,750: the original biographical articles 
  • 12,095: those articles containing one or more "clean" sentences not obfuscated by wiki markup 
  • 5,867: those articles for which Google returns a verbatim copy of a clean sentence from another online source 
  • 1,682: articles remaining after removing "rogue" sites 
  • 145: the final plagiarizing articles

Because his 1% figure is rounded, it's not clear whether his divisor is the 16,750 articles he started with (yielding 0.0087, or about 0.9%) or the 12,095 articles in which he found at least one clean sentence (yielding 0.012). I think it should be the latter: 1.2%.
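To make the two candidate rates concrete, here is a minimal sketch of the arithmetic, using the counts from the list above (16,750 original articles, 12,095 with a clean sentence, 145 plagiarizing). The variable and function names are my own, chosen for illustration.

```python
# Counts taken from the winnowing list above.
ORIGINAL_ARTICLES = 16_750
CLEAN_ARTICLES = 12_095
PLAGIARIZING_ARTICLES = 145


def rate(hits: int, total: int) -> float:
    """Return hits as a fraction of total."""
    return hits / total


# Compare the rate under each candidate divisor.
for label, total in [("original", ORIGINAL_ARTICLES), ("clean", CLEAN_ARTICLES)]:
    print(f"{label}: {rate(PLAGIARIZING_ARTICLES, total):.2%}")
```

Both figures round to "roughly 1%", which is why the rounded number alone doesn't reveal the divisor.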

Also, it's difficult to follow the details of the winnowing from the 5,867 articles for which Google found duplication, to the 1,682 articles that weren't "rogue," to the 145 "confirmed" cases of plagiarism, but clearly the bulk of the duplications were on sites containing material from Wikipedia. This invites the question: how much is Wikipedia itself plagiarized?

And, remember, these are descriptive statistics of his sample only. Before making any inferential claim about the whole of Wikipedia, one has to ask about the sampling methodology: why biographical articles, and are they more or less likely to contain plagiarism than others? (In my experience I find biographical plagiarism to be common.) Also, we are given no confidence level or margin of error for the inference. For example, to be very confident (i.e., 99%) that the inferred estimate is off by no more than 1 percentage point, one would need a proper random sample of 16,453 ("clean") articles.
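For readers who want to see where a figure of that size comes from, here is a rough sketch of the standard sample-size calculation for a proportion, with an optional finite population correction. It assumes the worst-case proportion of 0.5, and the population size is a hypothetical placeholder of my own; the exact result shifts with the population and z-value assumed, so it lands in the neighborhood of the number above rather than reproducing it.

```python
import math
from statistics import NormalDist


def required_sample_size(confidence, margin, proportion=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin.

    proportion=0.5 is the worst case (largest variance).
    population, if given, applies the finite population correction.
    """
    # Two-sided critical value, e.g. about 2.576 for 99% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n0 = z**2 * proportion * (1 - proportion) / margin**2
    if population is not None:
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)


# Hypothetical population size, assumed only for illustration.
HYPOTHETICAL_POPULATION = 1_500_000

print(required_sample_size(0.99, 0.01))
print(required_sample_size(0.99, 0.01, population=HYPOTHETICAL_POPULATION))
```

The correction matters little here because the required sample is small relative to any plausible population; the confidence level and margin drive the result.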

Finally, and more substantively, Brandt conflates plagiarism with verbatim copying. As Posner (2007:12) writes: "not all plagiarism is copyright infringement and not all copyright infringement is plagiarism." I'm not sure whether Brandt excludes all public domain copying (which still might be considered plagiarism), or only unsourced public domain material. Elsewhere, I wrote about finding a verbatim copy of text in a biographical article, which raised copyright infringement concerns. After I placed my report on the copyright infringement page (there can be dozens of reports today), the infringing text was rewritten, perhaps removing it from the scope of copyright -- and Brandt's method -- but perhaps not from the cloud of plagiarism.

Consequently, my understanding is that Brandt's conclusion should be read as: 1.2% of a sample of Wikipedia biography articles appear to contain infringing verbatim copying from other online sources. Brandt's approach was conservative, but it doesn't really address the much larger, and murkier, domain of (non-verbatim) plagiarism.
