Wikipedia History Scraping

To check whether Wikipedia edits follow a power law (many editors doing a little, a few doing much), this regular expression and Python code parse a Wikipedia history page fairly well:

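import operator
import re
import sys
from collections import defaultdict

# Captures oldid, timestamp, author and (optionally) the edit comment of one
# history row; the `.*?` before the comment group is lazy so it can be captured.
history_regex = r""".*?oldid=(\d+).*(\d\d:\d\d.*?\d\d\d\d)</a>.*<span class='history-user'>.*?>(.*?)</a>.*?(?:<span class='comment'>(.*?)</span>)?</li>"""
regex_obj = re.compile(history_regex)

url = sys.argv[1]
html = getHTML(url)  # getHTML: helper (not shown) that returns the page's HTML as a string
lines = html.split('\n')
counter = 0                # number of history rows seen
edits = defaultdict(list)  # author -> list of revision ids
for line in lines:
    if line.startswith("<li>(<a"):
        counter += 1
        match_obj = regex_obj.match(line)
        if match_obj:
            oldid, date, author, comment = match_obj.groups()
            edits[author].append(oldid)
# (author, number of edits) pairs, most active authors first
counts = [(author, len(edits[author])) for author in edits.keys()]
counts_s = sorted(counts, reverse=True, key=operator.itemgetter(1))
print(counter)
for author, number in counts_s:
    print(author, ";", number)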
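
Once the per-author counts are available, the "few doing much" pattern can be summarized in a couple of numbers. The following is only a rough sketch, not part of the original script: the function names `top_share` and `edit_frequencies` and the sample data are made up for illustration, and the input is assumed to be a list of (author, count) pairs like `counts_s` above.

```python
from collections import Counter


def top_share(counts, fraction=0.1):
    """Share of all edits made by the most active `fraction` of editors.

    `counts` is a list of (author, number_of_edits) pairs.
    The number of top editors is truncated, but is always at least one.
    """
    numbers = sorted((n for _, n in counts), reverse=True)
    if not numbers:
        return 0.0
    top_n = max(1, int(len(numbers) * fraction))
    return sum(numbers[:top_n]) / sum(numbers)


def edit_frequencies(counts):
    """Map each edit count n to the number of editors with exactly n edits."""
    return Counter(n for _, n in counts)


# Made-up sample data, for illustration only.
sample = [("A", 6), ("B", 1), ("C", 1), ("D", 1), ("E", 1)]
print(top_share(sample, 0.2))
print(sorted(edit_frequencies(sample).items()))
```

Plotting the output of `edit_frequencies` on log-log axes is one common way to eyeball whether the distribution looks like a power law; a roughly straight line is suggestive, though not proof.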

Ported/Archived Responses

Joseph Reagle on 2005-12-16

"For example, a brief analysis shows that on the Harry Potter page, of the 295 editors who made the last 500 edits, the top 10% of editors made 29% of the edits. A similar pattern is found on other Harry Potter Project pages."

Jakob on 2005-12-16

And what are your results? In my master's thesis I found that Lotka's law applies very well for authors with small numbers of edits, but there is a deviation for the most active editors. More precise results will be published soon.
