<?xml version="1.0" encoding="iso-8859-1"?>

<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
<title type="text">Joseph Reagle</title>
<subtitle type="html"><![CDATA[
Open Communities, Media, Source, and Standards
]]></subtitle>
<id>http://reagle.org/joseph/blog/social/wikipedia/brandt-plagiarism</id>
<link rel="alternate" type="text/html" href="http://reagle.org/joseph/blog" />
<link rel="self" type="application/atom+xml" href="http://reagle.org/joseph/blog/social/wikipedia/brandt-plagiarism?flav=atom" />


<author>
<name>Joseph Reagle</name>
<uri>http://reagle.org/joseph/blog/social/wikipedia/brandt-plagiarism</uri>
<email></email>
</author>
<rights>Copyright 2003-2010 Joseph Reagle</rights>
<generator uri="http://pyblosxom.sourceforge.net/" version="1.4.3 01/10/2008">
PyBlosxom http://pyblosxom.sourceforge.net/ 1.4.3 01/10/2008
</generator>

<updated>2007-03-29T22:55:48Z</updated>
<!-- icon?  logo?  -->

<entry>
<title type="html">Brandt and Wikipedia plagiarism</title>
<category term="" />
<id>http://reagle.org/joseph/blog/2007/03/29/brandt-plagiarism</id>
<updated>2007-03-29T22:55:48Z</updated>
<published>2007-03-29T22:55:48Z</published>
<link rel="alternate" type="text/html" href="http://reagle.org/joseph/blog/social/wikipedia/brandt-plagiarism.html" />
<content type="html">

&lt;p&gt;Since I am a student of plagiarism in reference work production, I read with interest Daniel Brandt&apos;s (2006) &lt;a href=&quot;http://www.wikipedia-watch.org/psamples.html&quot;&gt;analysis&lt;/a&gt;  of &lt;em&gt;Plagiarism by Wikipedia editors&lt;/em&gt;.
He claims that he found 145 instances of Wikipedia plagiarizing others
(roughly 1% of his sample), and projects that the &quot;plagiarism rate on
Wikipedia is at least 2%.&quot;&lt;/p&gt;

&lt;p&gt;I have a few thoughts, the first of which concern statistical
nuances. For example, the size of the sample that was used to calculate
the 1% rate is not clear. His description of the winnowing down of his
original corpus to the plagiarized cases is somewhat confusing, but
this is how I understood it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;16,750: the original biographical articles&lt;/li&gt;
&lt;li&gt;12,095: those articles containing one or more &quot;clean&quot; sentences not obfuscated by wiki markup&lt;/li&gt;
&lt;li&gt;5,867: those articles for which Google returns a verbatim copy of a clean sentence from another online source&lt;/li&gt;
&lt;li&gt;1,682: articles remaining after removing &quot;rogue&quot; sites&lt;/li&gt;
&lt;li&gt;145: the final plagiarizing articles&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because his 1% figure is rounded, it&apos;s not clear if his divisor is
the 16,750 articles he started with (yielding 0.0087) or the 12,095
articles in which he found at least one clean sentence (yielding
0.012); I think it should be the latter: 1.2%.&lt;/p&gt;
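
&lt;p&gt;As a quick check of that arithmetic, here is a minimal sketch that divides the 145 final articles by each candidate divisor, taking the counts from the list above. The names &lt;code&gt;PLAGIARIZING&lt;/code&gt;, &lt;code&gt;DIVISORS&lt;/code&gt;, and &lt;code&gt;rate&lt;/code&gt; are mine, chosen for illustration; nothing here comes from Brandt&apos;s own tooling.&lt;/p&gt;

&lt;pre&gt;
```python
# Counts taken from the winnowing list above (Brandt's figures as I read them).
PLAGIARIZING = 145
DIVISORS = {
    "all biographical articles": 16750,
    "articles with a clean sentence": 12095,
}


def rate(hits, total):
    """Return hits as a fraction of total."""
    return hits / total


# Print the rate under each candidate divisor, as a fraction and a percentage.
for label, total in DIVISORS.items():
    r = rate(PLAGIARIZING, total)
    print(f"{label}: {PLAGIARIZING}/{total} = {r:.4f} ({r:.2%})")
```
&lt;/pre&gt;

&lt;p&gt;Either divisor rounds to &quot;roughly 1%,&quot; which is why the rounded figure alone can&apos;t tell us which one was used.&lt;/p&gt;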

&lt;p&gt;Also, it&apos;s difficult to follow the details of the winnowing from the
5,867 articles for which Google found duplication, to the 1,682
articles that weren&apos;t &quot;rogue,&quot; to the 145 &quot;confirmed&quot; cases of plagiarism, but
clearly the bulk of the duplications were on sites containing material copied from
Wikipedia. This invites the question of how much Wikipedia is itself
plagiarized.&lt;/p&gt;

&lt;p&gt;And, remember, these are descriptive statistics of his sample only.
Before making any inferential claim about the whole of Wikipedia, one
has to ask about the sampling methodology: why biographical articles,
and are they more or less likely to contain plagiarism than others? (In my
experience, biographical plagiarism is common.) Also, we have
no parameters for the &lt;a href=&quot;http://en.wikipedia.org/wiki/Confidence_interval&quot;&gt;confidence&lt;/a&gt;
of the inference. For example, to be very confident (i.e., 99%) that
the inferred estimate is off by no more than 1%, one would need a proper random
sample of 16,453 (&quot;clean&quot;) articles.&lt;/p&gt;
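
&lt;p&gt;For readers who want to see where a number of that size comes from, here is a rough sketch of the textbook sample-size formula for a proportion, n = z&lt;sup&gt;2&lt;/sup&gt;p(1-p)/e&lt;sup&gt;2&lt;/sup&gt;, using the worst case p = 0.5. The function name &lt;code&gt;sample_size&lt;/code&gt; and its &lt;code&gt;population&lt;/code&gt; parameter are my own illustration. Without a finite-population correction it gives a figure of about 16,600; the slightly lower figure quoted above may reflect such a correction or a different calculator, but that is only an assumption.&lt;/p&gt;

&lt;pre&gt;
```python
import math
from statistics import NormalDist


def sample_size(confidence, margin, p=0.5, population=None):
    """Sample size needed to estimate a proportion within +/- margin.

    p=0.5 is the most conservative assumption about the true proportion.
    population is optional; when given, a finite-population correction
    is applied (an assumption about how such a figure might be refined).
    """
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# 99% confidence, margin of error of one percentage point.
print(sample_size(0.99, 0.01))
```
&lt;/pre&gt;

&lt;p&gt;The point is less the exact figure than the requirement behind it: the sample has to be drawn at random from the population one wants to generalize to.&lt;/p&gt;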

&lt;p&gt;Finally, and more substantively, Brandt conflates plagiarism
with verbatim copying. As Posner (2007:12) writes: &quot;not all
plagiarism is copyright infringement and not all copyright infringement is
plagiarism.&quot; I&apos;m not sure if Brandt excludes all public domain copying
(which still might be considered plagiarism), or only unsourced public
domain material. Also, elsewhere, I wrote about how I found a verbatim copy
of text in a biographical article, raising copyright infringement
concerns. After I placed my report on the &lt;a href=&quot;http://en.wikipedia.org/wiki/Wikipedia:Suspected_copyright_violations&quot;&gt;copyright infringement page&lt;/a&gt;
(there can be dozens of reports today), the infringing text was
rewritten, perhaps removing it from the scope of copyright -- and
Brandt&apos;s method -- but perhaps not from the cloud of plagiarism.&lt;/p&gt;

&lt;p&gt;Consequently, my understanding of Brandt is that his conclusion
should be read as: 1.2% of a sample of Wikipedia biography articles
appear to contain infringing verbatim copying from other online
sources. Brandt&apos;s approach was conservative, but it doesn&apos;t really address the
whole of the much larger, but murkier, domain of (non-verbatim)
plagiarism.&lt;/p&gt;</content>
</entry>
</feed>
