Wikipedia 10K redux

Sadly, for those of us interested in Wikipedia’s history, much material has been lost. For example, I found that it was common for people involved in Wikipedia’s founding to no longer have their e-mails from that period. (Public archives fared a little better, but not perfectly so, as seen in my discussion of recovering the Nupedia archives.) Even worse, the earliest state of Wikipedia’s pages did not survive upgrades to the platform’s software. Yet, Fortuna has smiled upon Wikipedia in anticipation of its 10th birthday. Tim Starling, a Wikimedia developer and about as ancient as Wikipedia old-timers can get, writes:

I was looking through some old files in our SourceForge project. I opened a file called wiki.tar.gz, and inside were three complete backups of the text of Wikipedia, from February, March and August 2001!

However, these logs are not easily read, as they are a collection of changes to a page. So understanding what a page looked like at a given time is difficult, and requires one to iteratively apply all previous changes to that page. I have written a little script to do this, and provide a listing of Wikipedia’s first pages.
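To make "iteratively apply all previous changes" concrete, here is a minimal sketch, not the script itself. It assumes a hypothetical change format (a list of `(start, end, replacement_lines)` edits built with Python's `difflib`); the real dump's format is different and is not shown here. The revision text is toy data.

```python
from difflib import SequenceMatcher


def make_change(old, new):
    """Record how to turn `old` into `new` as (start, end, replacement) edits.

    This format is a stand-in chosen for illustration only.
    """
    ops = SequenceMatcher(a=old, b=new, autojunk=False).get_opcodes()
    return [(i1, i2, new[j1:j2]) for tag, i1, i2, j1, j2 in ops if tag != "equal"]


def apply_change(lines, change):
    """Apply one change, last edit first so earlier indices stay valid."""
    result = list(lines)
    for start, end, replacement in reversed(change):
        result[start:end] = replacement
    return result


def page_at(changes, n):
    """Rebuild the page as it stood after the first `n` changes."""
    lines = []
    for change in changes[:n]:
        lines = apply_change(lines, change)
    return lines


# Toy revisions of one page, stored only as successive changes.
revisions = [
    ["first line"],
    ["first line", "second line"],
    ["first line, edited", "second line", "third line"],
]
changes = []
previous = []
for revision in revisions:
    changes.append(make_change(previous, revision))
    previous = revision

print(page_at(changes, 2))
```

The point of the sketch is that no single change is readable on its own: the state of a page at change `n` only exists once changes 1 through `n` have been replayed in order.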

Wikipedia 10K Redux

This page is a reconstruction of the first 10,000 Wikipedia contributions, roughly Wikipedia’s first six weeks, based on data provided by Starling. It is likely buggy, and I hope/expect it will soon be superseded by a proper importation into a wiki by Starling and/or others. (For example, text is not formatted nor are there links.)

The reconstruction is not perfect: there are patches that won’t apply, manually moved articles, and text encodings that I don’t manage to guess at. But it does permit some preliminary browsing, which leaves the following initial impressions:

  • There is a lot of silly stuff in there.
  • Tim Shell contributed a fair amount of content.
  • Popular topics seemingly include philosophy, geography, the Dewey Decimal System, Ernest Hemingway, the United States (and its Constitution), Isaac Asimov, the Japan Constitution, Metallica, statistics, and – my goodness, true to the Objectivist conspiracy theories – a huge collection of articles on Atlas Shrugged.

Ported/Archived Responses

Joseph Reagle on 2010-12-18

The initial article on Altruism is also Randian:

“Altruism. Altruism doesn’t really exist. Even if you are doing something that is helpful or charitable for others, you will still reap the good of that action. The idea that one should do things for altruistic purposes is wrong and can lead to the downfall of society, as Ayn Rand wrote about in Atlas Shrugged.”

Joseph Reagle on 2010-12-17

I fixed some of the encoding issues. The DB dump contained different encodings. So, the encoding of each diff in the dump is now independently guessed using Python’s CharDet (Universal Encoding Detector) library.
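A minimal sketch of guessing each diff's encoding independently, not the actual code used. It assumes the third-party `chardet` package and its `chardet.detect()` call; the UTF-8-then-Latin-1 fallback and the helper names `guess_encoding` and `decode_diff` are illustrative additions so the sketch still runs when `chardet` is not installed.

```python
try:
    import chardet  # third-party Universal Encoding Detector
except ImportError:  # keep the sketch runnable without it
    chardet = None


def guess_encoding(raw: bytes) -> str:
    """Guess an encoding for one diff; fall back when the detector is unsure."""
    if chardet is not None:
        guess = chardet.detect(raw)["encoding"]
        if guess:
            return guess
    # Assumed fallback: valid UTF-8 is taken as UTF-8, anything else as Latin-1.
    try:
        raw.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "latin-1"


def decode_diff(raw: bytes) -> str:
    """Decode one diff using its own guessed encoding."""
    try:
        return raw.decode(guess_encoding(raw), errors="replace")
    except LookupError:  # the detector named a codec Python does not know
        return raw.decode("latin-1")


# Each diff is guessed on its own, since the dump mixes encodings.
diffs = [
    b"Atlas Shrugged",
    "K\u00f8benhavn".encode("utf-8"),
    "G\u00f6teborg".encode("latin-1"),
]
decoded = [decode_diff(d) for d in diffs]
```

Guesses on very short byte strings are unreliable, so a detector can still pick the wrong encoding for a one-word diff.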

So now you can read up on the few “accented” topics in the early Wikipedia, including Göteborg, Köpenhamn, and København. (Nothing very exciting.) But it means articles such as ASCII are much improved as well. Interestingly, the ASCII page isn’t so much about ASCII itself as about how to type non-ASCII characters in the early Wikipedia.

lior on 2011-04-04

This is such a great resource! I’m writing on the early days of wikipedia, and this is simply invaluable.
Thanks for making this available.

Benjamin Mako Hill on 2010-12-19

Great news indeed! Thanks for helping with the encoding issues. This is great to have and very interesting to look through.

Joseph Reagle on 2011-05-04

Lior, I’ll email you off blog.

Peter Damian on 2010-12-17

How interesting. My first recollection of Wikipedia was via the article on proper names, which struck me as quite good (Sanger is not a bad philosopher, although the articles were all obvious copies of lecture notes without any Wikification). I started editing in 2003, mostly on the philosophy. The story of philosophy in Wikipedia since then has been a troubled one, however.

Steven Walling on 2010-12-20

Interesting. For comparison, the earliest list of names from the WikiPedians page (culled from people who had signed their edits recently) lists:

> * [[Josh Grosse]]
> * [[Phil Bordelon]]
> * [[Larry Sanger]], cofounder of [[wikipedia]] and editor-in-chief of [[nupedia]]
> * [[Malcolm Farmer]]
> * RoseParks
> * SoniC
> * [[Mathijs]]
> * [[Lee Daniel Crocker]]
> * [[Jimbo Wales]], the other cofounder of this great project
> * WojPob
> * CliffordAdams
> * [[Bomfog]]
> * [[Jmlynch]]
> * TimShell, to whom we owe the 13% of wikipedia devoted to [[Atlas Shrugged]]
> * [[Cdani]]
> * AstroNomer
> * [[Suzanne Elsasser]]
> * [[Pguiral]]
> * [[Dick Beldin]], to whom we owe most of the [[Statistics]] pages
> * [[Invictus]]
> * [[John Abbe]]

I’m trying to figure out if any of these early volunteers (so not Jimmy, Larry, or Bomis employees) are still around and editing.

llywrch on 2010-12-21

A few notes on this list of early contributors:
* Tim Shell was Wales’ partner at Bomis. Some of the anonymous edits from the domain are likely his.
* John Abbe – I know John, but didn’t know he had been associated with Wikipedia at such an early date. I saw him at the 2006 Wikimania in Boston, & remember him telling me later that he had twice failed to become an Admin because people didn’t think he had enough experience.
* As for the other mysterious usernames, we may never know who they were. Apparently the habit of joining Wikipedia to make a few edits – or in some cases, a few hundred – then vanishing is not a new development.


Peter Damian on 2010-12-17

I wrote a short piece about the first philosophy article here. Very interesting, thanks for doing this, Joseph.

Peter Damian on 2010-12-19

Interesting how many elementary philosophical mistakes can be packed into a few brief sentences. E.g. the statements (a) that altruism doesn’t exist and (b) that it is wrong. (Sanger almost certainly would not have written that.)

I traced the history of the ‘proper name’ article from the one linked to above to the present version. Hasn’t really changed that much, but it got carved up and the pieces put in different places.

lior on 2011-04-28

I’ve used the archive as far as it goes, but now I’m trying to use the diff_log that Tim Starling posted to read further edits, and I have a problem I hope you can help me with: I can’t find the time of publication for each edit. I know there’s a timestamp for every edit, but I don’t know how to translate it into date and time. How did you extract it for your database?
