Open Codex HISTORICAL entry

2005 Dec 13 | Wikipedia History Scraping

To confirm the power law in Wikipedia edits (many doing a little, a few doing much), this regular expression and Python code parse a Wikipedia history page fairly well:

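import operator
import re
import sys
import urllib.request

# Captures the revision id, timestamp, author and, when present, the edit comment.
# The optional comment group comes before the final greedy ".*" so that it can actually match.
history_regex = r""".*?oldid=(\d+).*(\d\d:\d\d.*?\d\d\d\d)</a>.*<span class='history-user'>.*?>(.*?)</a>(?:.*?<span class='comment'>(.*?)</span>)?.*</li>"""
regex_obj = re.compile(history_regex)

def getHTML(url):
    # Assumed stand-in for the getHTML helper, which is not shown here:
    # fetch a URL and return its text. Some servers reject the default User-Agent.
    request = urllib.request.Request(url, headers={'User-Agent': 'history-scraper'})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8', 'replace')

url = sys.argv[1]
html = getHTML(url)
counter = 0  # number of history lines seen
edits = {}   # author -> list of (oldid, date, author, comment)
for line in html.split('\n'):
    if line.startswith("<li>(<a"):
        counter += 1
        match_obj = regex_obj.search(line)
        if match_obj:
            oldid, date, author, comment = match_obj.groups()
            edits.setdefault(author, []).append((oldid, date, author, comment))
# Edits per author, most active first.
counts = [(author, len(edits[author])) for author in edits]
counts_s = sorted(counts, reverse=True, key=operator.itemgetter(1))
print(counter)
for author, number in counts_s:
    print(author, ";", number)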
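
A rough way to check the power-law claim from these counts is to tabulate how many authors made exactly n edits and fit a straight line to that distribution on log-log axes. The sketch below is only an illustration and not part of the script: `loglog_slope` is a hypothetical helper, the sample data is made up, and a least-squares fit on log-log values is a crude estimate of the exponent rather than a rigorous test.

```python
import math
from collections import Counter

def loglog_slope(counts):
    """Estimate the slope of log(number of authors) against log(edits per author).

    counts is a list of (author, number_of_edits) pairs, the same shape as
    the `counts` list built by the script. Hypothetical helper, for illustration.
    """
    # How many authors made exactly n edits, for each n that occurs.
    freq = Counter(number for _author, number in counts)
    xs = [math.log(n) for n in freq]
    ys = [math.log(freq[n]) for n in freq]
    if len(xs) < 2:
        raise ValueError("need at least two distinct edit counts to fit a line")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Ordinary least-squares slope of y on x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up sample: 8 authors with 1 edit each, 2 authors with 2 edits each.
sample = [("a%d" % i, 1) for i in range(8)] + [("b1", 2), ("b2", 2)]
slope = loglog_slope(sample)
print(slope)
```

For the made-up sample the slope comes out at about -2, meaning the number of authors falls off roughly as 1/n^2. A clearly negative slope on real data would be consistent with many editors doing a little and a few doing much.
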
Posted by Jakob at Fri Dec 16 06:37:48 2005
And what are your results? In my master's thesis I found that Lotka's law applies very well for authors with small numbers of edits, but there is a deviation for the most active editors. More precise results will be published soon.

Posted by Joseph Reagle at Fri Dec 16 08:55:26 2005
"For example, a brief analysis shows that on the Harry Potter page, of the 295 editors who made the last 500 edits, the top 10% of editors made 29% of the edits. A similar pattern is found on other Harry Potter Project pages."

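A figure like the one quoted above can be computed directly from per-author edit counts. This is a minimal sketch, assuming `counts` is a list of (author, number) pairs like the one the script in the post builds. `top_share` is a hypothetical helper, and the sample numbers are made up rather than taken from any real page.

```python
def top_share(counts, fraction=0.10):
    """Return the share of all edits made by the most active `fraction` of editors."""
    numbers = sorted((number for _author, number in counts), reverse=True)
    if not numbers:
        raise ValueError("no editors to summarize")
    # Size of the top group, rounded down, but always at least one editor.
    top_n = max(1, int(len(numbers) * fraction))
    return sum(numbers[:top_n]) / sum(numbers)

# Made-up sample: 10 editors and 20 edits in total.
sample = [("a", 6), ("b", 3), ("c", 2), ("d", 2), ("e", 2),
          ("f", 1), ("g", 1), ("h", 1), ("i", 1), ("j", 1)]
share = top_share(sample)
print("top 10%% of editors made %.0f%% of the edits" % (share * 100))
```

In the sample, the single most active editor (the top 10% of ten editors) accounts for 6 of the 20 edits.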

Open Communities, Media, Source, and Standards

by Joseph Reagle


reagle.org