Open Codex method :: wikipedia-history-scraping

2005 Dec 13 | Wikipedia History Scraping

To confirm the power law in Wikipedia edits (many people doing a little, a few doing much), this regular expression and Python code parse a Wikipedia page history fairly well:

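import operator
import re
import sys
from urllib.request import urlopen

# The optional comment group is matched lazily; with a greedy ".*" in front of
# it, the ".*" swallows the comment span and the group always comes back None.
history_regex = r""".*?oldid=(\d+).*(\d\d:\d\d.*?\d\d\d\d)</a>.*<span class='history-user'>.*?>(.*?)</a>(?:.*?<span class='comment'>(.*?)</span>)?.*?</li>"""
regex_obj = re.compile(history_regex)

def getHTML(url):
    # getHTML was not shown in the original; this is an assumed minimal version.
    return urlopen(url).read().decode('utf-8', 'replace')

url = sys.argv[1]
html = getHTML(url)
lines = html.split('\n')
counter = 0   # number of history lines seen
edits = {}    # author -> list of (oldid, date, author, comment)
for line in lines:
    if line.startswith("<li>(<a"):
        counter = counter + 1
        match_obj = regex_obj.search(line)
        if match_obj:
            oldid, date, author, comment = match_obj.groups()
            edits.setdefault(author, []).append((oldid, date, author, comment))
# Rank authors by number of edits, most active first.
counts = [(author, len(edits[author])) for author in edits]
counts_s = sorted(counts, reverse=True, key=operator.itemgetter(1))
print(counter)
for author, number in counts_s:
    print("%s ; %s" % (author, number))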
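
To turn per-author counts like the `counts` list above into a quick check of the "few doing much" pattern, something like the following could work. It is only a sketch in Python 3: `top_share` and `edit_frequencies` are names made up here, the sample numbers are invented, and a real power-law test would need more than this (for instance, looking at the frequencies on a log-log scale).

```python
from collections import Counter


def top_share(counts, fraction=0.10):
    """Share of all edits made by the most active `fraction` of authors.

    `counts` is a list of (author, number_of_edits) pairs, in any order.
    """
    numbers = sorted((n for _, n in counts), reverse=True)
    total = sum(numbers)
    if total == 0:
        return 0.0
    # Always include at least one author, even for tiny samples.
    top_n = max(1, int(round(len(numbers) * fraction)))
    return sum(numbers[:top_n]) / total


def edit_frequencies(counts):
    """Map k -> number of authors who made exactly k edits."""
    return Counter(n for _, n in counts)


# Made-up sample data, not results from any real page.
sample = [("A", 40), ("B", 20), ("C", 10), ("D", 5), ("E", 5),
          ("F", 5), ("G", 5), ("H", 4), ("I", 3), ("J", 3)]
print(top_share(sample))
print(sorted(edit_frequencies(sample).items()))
```

If a few authors dominate, `top_share` comes out well above `fraction`, and `edit_frequencies` shows many authors at one or two edits with a thin tail of heavy editors.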

this entry posted to method;
comments (2)




Posted by Jakob at Fri Dec 16 06:37:48 2005
And what are your results? In my master's thesis I found that Lotka's law applies very well to authors with small numbers of edits, but there is a deviation for the most active editors. More precise results will be published soon.

Posted by Joseph Reagle at Fri Dec 16 08:55:26 2005
"For example, a brief analysis shows that on the Harry Potter page, of the 295 editors who made the last 500 edits, the top 10% of editors made 29% of the edits. A similar pattern is found on other Harry Potter Project pages."




Open Communities, Media, Source, and Standards

by Joseph Reagle
