Posts Tagged: wikipedia


13
Sep 07

Truth in Metadata – WikiDashboard

A few days ago, I came across WikiDashboard, a tool that lets you visualize temporal change patterns in Wikipedia. I wrote about it at techPresident, but I wanted to expand a little on why I think the tool is so interesting. First, a little about the tool: developed by the Augmented Social Cognition team at Xerox PARC, WikiDashboard is an overlay on Wikipedia. When you browse to a page, you are presented with a graph exploring the dynamics of recent edits to the page. Here’s a quick glance at an example:

In this particular example, we see the WikiDashboard for Facebook’s Wikipedia entry. In the dashboard you can see the intensity of edits over time, the edits contributed by each user inside the time interval, the intensity of their edits (more edits = darker red), as well as some statistics that display the user’s total edits and percentage of edits to the entry over time. All of this information is browsable, so, for example, you can click on a user and get their WikiDashboard:

Via the user dashboard, you can see the user’s edits over time, periods of editing intensity, and links to their contributions. This is very useful for getting a quick glance of the user’s editing interests over time.

Compared to Virgil’s awesome WikiScanner, there’s less of a “Wow” factor for WikiDashboard, but I actually think a tool like WikiDashboard presents significantly more utility, and is the beginning of an interesting trend of repurposing metadata to create a trust heuristic. The inventors of WikiDashboard, Ed Chi and Bongwon Suh, describe this as “Social Transparency.”

The tool’s creators are fighting a good fight. There is great information in Wikipedia, and to discount the quality of the information solely on the editorial policy is to throw the baby out with the bathwater. That said, Wikipedia is susceptible to the introduction of systematic bias, and lesser-patrolled entries are sometimes left vandalized for long periods of time. WikiDashboard doesn’t really solve any of the inherent problems of Wikipedia, but it does provide users with a graphical approximation of the type of editing going on for any entry – and this is valuable.

Let’s imagine an example. You visit an entry, and notice that a number of editors have a fairly consistent editing pattern over time. However, recently, a new editor has taken up the page, and is heavily editing it. You’d be able to ascertain this quickly with the graph, and these visual cues might indicate that you should cast a more critical gaze at the quality of the article. In essence, this cue information provides a quick, graphical reference to inform our critical reading skills in Wikipedia.

What is interesting in both WikiScanner and WikiDashboard’s case is that this information is mined directly (and without “bias”) from the metadata information available in Wikipedia. The data, always publicly available, was simply transformed into a more human-digestible format, and we can use this data to inform our understanding of the quality of the article. What’s interesting to me is where this metadata approach might lead us; there’s a lot that can go into our quality estimate of the article…it’s quite fascinating.

The authors have asked for feedback on WikiDashboard, so here are my ideas:

  1. I’d like to be able to sort my WikiDashboard by number of edits, and recency of edits; i.e. let me visualize who is currently editing the article, and their editing histories.
  2. I like the bias-free approach (just listing edits w/o algorithmic intervention), but there are a good number of cues the WikiDashboard people could add by processing the editing histories with an algorithm. I’d like to get cues about how often an editor’s changes are reverted, how often an editor’s changes elicit discussion or administrative intervention – these data points could provide information about the quality of the edits a user contributes.
  3. Right now WikiDashboard is somewhat slow; the authors could implement a diff-based caching system that would speed things up a bit. If WikiDashboard became really snappy I think I’d actually use it for all my Wikipedia browsing.

If you’d like to check out WikiDashboard, here’s a link to the project homepage, and a direct link to the Wikipedia installation. The team has blogged about the product, and written an additional blog post about how to understand the graphs. Very interesting stuff…check it out.


5
Mar 07

A Closer Look at Candidate Wikipedia Entries

Following up on my previous post on Wikipedia’s influence in candidate search results, I thought I’d take a look at each candidate’s Wikipedia entry. What follows is a table that shows how frequently each candidate’s entry has been edited since 1/1/2007 (1), how many times it has been reverted, how many times vandalism has been claimed, and who the entry’s most frequent editor is. Some interesting findings follow.

Candidate         Edits Since 1/1 (1)   Reverts (2)   Vandal (3)   Top Editor (4)
Barack Obama      1397                  191           20           HailFire (167)
Hillary Clinton   534                   63            10           Wasted_Time_R (91)
Joe Biden         282                   13            5            Andyvphil (28)
John Edwards      248                   38            5            Jersyko (18)
Bill Richardson   204                   15            4            Diluvial (12)
Dennis Kucinich   188                   12            2            Amonk (20)
Christopher Dodd  167                   5             1            Haus42 (45)
Mike Gravel       119                   1             0            DavidYork71 (35)
Dem. Averages     392.4                 42.3          5.9
Rudy Giuliani     583                   46            4            Wasted_Time_R (125)
John McCain       482                   62            15           204.193.6.90 (17)
Mitt Romney       413                   39            14           Yellowdesk (37)
Sam Brownback     363                   27            4            Getaway (106)
Ron Paul          215                   17            1            SlamDiego (15)
Mike Huckabee     182                   18            3            A.J.A. (26)
Tom Tancredo      152                   12            4            SirAndrew1 (36)
Duncan Hunter     84                    4             0            Victoria2007 (10)
Tommy Thompson    38                    2             1            Ultimatecoolguy (7)
Rep. Averages     279.1                 25.2          5.1

Note: Candidates with “locked” entries are bolded.

What can we learn from this table? By far, Barack Obama has the most frequently updated Wikipedia entry. Of course, the entries of Edwards and Clinton are both locked, which contributes to their lower update counts.

Here’s what I found interesting. First, the percentage of reverts: about 10 percent of changes appear to be reverted. A significantly smaller percentage of changes are flagged as outright vandalism.

What is also interesting is the average number of changes per day. In the 64 days since 1/1, Democratic Wikipedia pages have been changed an average of about 6.1 times a day (roughly once every 4 hours), and Republican Wikipedia pages an average of 4.36 times a day (roughly once every 5.5 hours).
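As a rough check on those per-day numbers, here is a minimal sketch of the arithmetic. It is illustrative only and not the code used for the survey; the averages and the 64-day window come from the post, while the function and variable names are my own assumptions.

```python
# Average edits per candidate page since 1/1, taken from the table above.
AVG_EDITS = {"Democratic": 392.4, "Republican": 279.1}
DAYS = 64  # 1/1/2007 through 3/5/2007, inclusive


def edits_per_day(total_edits: float, days: int = DAYS) -> float:
    """Average number of edits per day over the window."""
    return total_edits / days


def hours_between_edits(total_edits: float, days: int = DAYS) -> float:
    """Average gap between consecutive edits, in hours."""
    return 24 / edits_per_day(total_edits, days)


for party, total in AVG_EDITS.items():
    print(f"{party}: {edits_per_day(total):.2f} edits/day, "
          f"one edit every {hours_between_edits(total):.1f} hours")
```

The Democratic average works out to a little over 6 edits a day, or a gap of just under 4 hours; the Republican average is about 4.4 edits a day, or a gap of roughly 5.5 hours.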

Also interesting are the top “editors” of the pages. For one candidate, this top editor is responsible for 29% of all changes to the page since 1/1. You can browse the top editors of the candidate pages and learn a little bit more about them by clicking the links above.

Footnotes and Methodology:
(1) 1/1/2007 was an arbitrary choice of a date, designed to give all candidates an equal baseline for analysis.
(2) These are the total claimed reverts on the history page.
(3) These are the total times vandalism is claimed on the history page.
(4) Top editor of the page since 1/1, (Total edits)

This survey represents a one-time analysis of the change history of candidate Wikipedia pages. It was run on 3/5/2007, and the data was analyzed with simple custom-written software. The “reverts” and “vandalism” numbers are based on editors’ self-reports in the page history; there was no content analysis.
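To make the method concrete, here is a minimal sketch of the kind of tally described above. It is not the original software. The `(editor, summary)` input format, the keyword patterns, and the function name are all assumptions for illustration, and a real run would first need the page history pulled from Wikipedia.

```python
import re
from collections import Counter

# Hypothetical keyword patterns; real edit summaries vary widely
# (e.g. "rv", "revert", "rvv", "reverted vandalism").
REVERT_RE = re.compile(r"\b(rvv?|revert\w*)\b", re.IGNORECASE)
VANDAL_RE = re.compile(r"\b(rvv|vandal\w*)\b", re.IGNORECASE)


def summarize_history(history):
    """Tally self-reported reverts, vandalism claims, and the top editor.

    `history` is an iterable of (editor, edit_summary) pairs, one per edit.
    """
    edits = reverts = vandal = 0
    per_editor = Counter()
    for editor, summary in history:
        edits += 1
        per_editor[editor] += 1
        if REVERT_RE.search(summary):
            reverts += 1
        if VANDAL_RE.search(summary):
            vandal += 1
    # most_common(1) returns [(editor, count)] for the busiest editor.
    top_editor = per_editor.most_common(1)[0] if per_editor else None
    return {"edits": edits, "reverts": reverts,
            "vandal": vandal, "top_editor": top_editor}


# Made-up sample data, for illustration only.
sample = [
    ("EditorA", "added 2006 voting record"),
    ("EditorB", "rv unsourced claim"),
    ("EditorA", "Reverted vandalism by 10.0.0.1"),
    ("EditorC", "typo"),
]
print(summarize_history(sample))
```

Because this only matches words in the edit summary, it inherits the limitation noted above: an editor who reverts without saying so is missed, and a false claim of vandalism is counted.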


26
Feb 07

Wikipedia's Expansive Influence in Candidate Search Results

In a recent survey, I found that Wikipedia has an expansive influence in organic Google search results for 2008 presidential candidates. For each candidate, their Wikipedia entry is ranked no lower than 5th place by Google. In addition, the Wikipedia entry ranks higher than the election web presence of that particular candidate for 25% of Democrats and 60% of Republicans. There is no other entity on the web that plays such a systematically influential role in candidate information positioning as Wikipedia, pointing to its increased importance as a messaging tool in the 2008 cycle. A full breakdown of candidate search result positions follows:

Candidate                Main Site Rank (1)   Election Site Rank (2)   Wikipedia Rank (3)   Outrank? (4)
John Edwards             1                    1                        3                    N
Joe Biden                1                    3                        5                    N
Christopher Dodd         1                    4                        3                    Y
Mike Gravel              1                    1                        3                    N
Dennis Kucinich          3                    1                        5                    N
Barack Obama             3                    1                        2                    N
Bill Richardson          2                    4                        1                    Y
Hillary Rodham Clinton   1                    2                        3                    N
Sam Brownback            1                    3                        4                    N
Rudy Giuliani            2                    2                        1                    Y
Duncan Hunter            1                    2                        3                    N
Mitt Romney              1                    1                        2                    N
Jim Gilmore              x (5)                x (5)                    1                    Y
Mike Huckabee            2                    2                        1                    Y
John McCain              1                    x (5)                    3                    Y
Ron Paul                 1                    5                        3                    Y
Tom Tancredo             1                    3                        4                    N
Tommy Thompson           2                    4                        1                    Y

This is truly eye-opening data. Wikipedia’s influence is systematic and pervasive, perhaps to the point of overreaching. Should Wikipedia outrank a candidate’s electoral site? Clearly, this shows that monitoring Wikipedia is a must for every campaign – thankfully Wikipedia makes this easy with RSS-based monitoring.

Wikipedia’s role in the 2008 cycle will be interesting to follow. Over the next few months, I’ll be looking at candidate Wikipedia presence and attempting to make some sense of the possibilities.

Caveats about this data and methodology: This represents a one-time analysis of Google search results. These results may and will change over time. The queries were directed to Google.com, from a US-based location. Other Google national sites may provide dissimilar results. Queries were constructed exactly as transcribed – i.e. no quotes around names, or special techniques.

Footnotes:
(1) – This is the search rank of the candidate’s main site, if the candidate has a main site different from their electoral web presence; for example, the Congressional web presence of John McCain or Dennis Kucinich.
(2) – This is the search rank of the candidate’s electoral web presence, the home of their presidential campaign or their exploratory committee.
(3) – This is the search rank of the candidate’s main Wikipedia entry.
(4) – An “Outrank” is declared if the Wikipedia page outranks the candidate’s electoral web presence.
(5) – A result was not found in the top ten search results.
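The “Outrank” rule in footnote (4) and the 25% and 60% figures can be reproduced with a short sketch. The ranks are copied from the table; the party grouping, the names, and the choice to treat a missing election site (“x”) as outranked are my reading of the table rather than anything stated in the methodology.

```python
# (election site rank, Wikipedia rank) from the table above.
# None stands for "x": not found in the top ten results (footnote 5).
DEMOCRATS = {
    "John Edwards": (1, 3), "Joe Biden": (3, 5),
    "Christopher Dodd": (4, 3), "Mike Gravel": (1, 3),
    "Dennis Kucinich": (1, 5), "Barack Obama": (1, 2),
    "Bill Richardson": (4, 1), "Hillary Rodham Clinton": (2, 3),
}
REPUBLICANS = {
    "Sam Brownback": (3, 4), "Rudy Giuliani": (2, 1),
    "Duncan Hunter": (2, 3), "Mitt Romney": (1, 2),
    "Jim Gilmore": (None, 1), "Mike Huckabee": (2, 1),
    "John McCain": (None, 3), "Ron Paul": (5, 3),
    "Tom Tancredo": (3, 4), "Tommy Thompson": (4, 1),
}


def outranks(election_rank, wikipedia_rank):
    """True if the Wikipedia entry ranks above the election site."""
    if election_rank is None:  # election site missing from the top ten
        return True
    return wikipedia_rank < election_rank  # lower number = higher position


def outrank_share(candidates):
    """Fraction of candidates whose Wikipedia entry outranks their election site."""
    hits = sum(outranks(e, w) for e, w in candidates.values())
    return hits / len(candidates)


print(f"Democrats:   {outrank_share(DEMOCRATS):.0%}")
print(f"Republicans: {outrank_share(REPUBLICANS):.0%}")
```

This gives 2 of 8 Democrats and 6 of 10 Republicans, matching the percentages quoted at the top of the post.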


21
Sep 06

Wikipedia’s test for academic blogging

In January, I published a post that contained the findings of a study I did on Facebook. That post, entitled “Student Life on the Facebook” was widely read – it may be the reason you follow my blog today. Since then, I’ve posted other findings from my studies, in blog form. I’ve presented these findings in conferences and journals, and I’m currently writing them up into a comprehensive journal article.

Shortly after Student Life on the Facebook was posted, someone added it to Wikipedia’s entry about Facebook. A few times a day, someone clicked through from Wikipedia to my blog, and found my research on Facebook.

On August 16, the registered Wikipedia editor L1AM, leaving the comments “tweaking”, “cleanup” and “thining (sic) out links”, edited the Facebook Wikipedia entry, removing the link to my research. On that day, L1AM removed over 25 links to a variety of sources, from personal weblogs to mainstream media sources. L1AM also removed a number of paragraphs from the entry, contributing to a one-day edit in which 15% of the article was removed [1].

As Wikipedia was a very small fraction of my traffic, I didn’t notice that Wikipedia wasn’t linking to me until a few days ago. When I checked the entry and saw the edits, I was frankly surprised. While a number of blog entries remained, including a humorous piece on Facebook Etiquette by CollegeHumor.com, my work was deleted. No longer would people researching the Facebook via Wikipedia stumble on to my research (and publications).

While L1AM did not cite a particular reason for the deletion, I wanted to explore the potential rationale behind his decision. My particular case may be somewhat unique, but the notion of impartial academics posting real research data to blogs is hardly novel. Consider how many times a day a blog post passes through your newsreader with empirical statistical data – this shows that blogs are becoming a method of research dissemination.

Wikipedia presents a clear set of guidelines as to what is considered a reliable source. As you might imagine, this covers a wide variety of source areas; since we’re concentrating on blogs, here is the blog (self-published source) policy:

A self-published source is a published source that has not been subject to any form of independent fact-checking, or where no one stands between the writer and the act of publication. It includes personal websites, and books published by vanity presses. Anyone can create a website or pay to have a book published, and then claim to be an expert in a certain field. For that reason, self-published books, personal websites, and blogs are largely not acceptable as sources.

Exceptions to this may occur when well-known professional researchers self-publish within their fields of expertise or when well-known professional journalists publish their own material. In some cases, these may be acceptable as sources, so long as their work has been previously printed in credible third-party publications and they are writing under their own names and not under pseudonyms.

However, editors should exercise caution for two reasons: first, if the information on the professional researcher’s blog (or self-published equivalent) is really worth reporting, someone else will have done so; secondly, the information has been self-published, which means it has not been subject to any independent form of fact-checking.

In general it is preferable to wait until other sources have had time to review or comment on self-published sources.

Reports by anonymous individuals, or those without a track record of publication to judge their reliability, do not warrant citation at all, until such time as it is clear that the report has gained cachet, in which case it can be noted as a POV.

I’ll parse this a little. Blogs are generally not considered an acceptable source for Wikipedia entries, unless:

  • The blog is written by a well-known, professional researcher writing within his or her field of expertise, as long as their work (though not the work in question) has been previously published by credible, third-party publications.
  • The blog is written under the researcher’s own name, and not a pseudonym
  • Other sources have reviewed or commented on the blog topic

Wikipedia’s policy is broad and general, as it should be – but the generality presents difficult confounds, especially for early-stage academics. The initial test of well-known – what does this mean in academia? And who is the judge of this? Furthermore, what is the notion of peer-review in the blogosphere? This is a critical question. Every day, hundreds of people receive my posts in their feedreaders, and it is only once in a while someone calls me out as a fool. Since I am not being called out, have I passed peer review? Or is peer-review operationalized only when a blogger with a higher Technorati rank links to me?

I feel it is critical to at least ponder these issues because the fact stands that young academics will be blogging research in the future. They will do it to share early-stage findings, to find new colleagues, to expand their networks – at the same time, sharing valuable, usable data. Will Wikipedia turn a blind eye to this growing corpus of valuable research?

Of course, my article wasn’t deleted because it was blog research. The editor L1AM left other blogs, which suggests his purge was oriented towards stuff he didn’t like, get, or feel was worthwhile. Looking at L1AM’s edits, it is remarkable how much of Facebook’s history was edited out that day. If you’ll allow that my research was valuable to the Wikipedia entry, we can clearly see a flaw in this editing process.

The flaws I see are twofold. First, and foremost, it is poor editorship. L1AM decided to hack up the Facebook article one day, and he didn’t leave much in terms of justification – he just did it. According to Wikipedia policies and philosophies, the magnitude of change he made should have been accompanied by a good deal of discussion, commenting, and documentation. None of this occurred.

The second flaw is a little more nuanced, and it deals with how the community regulates the editing process. Wikipedia articles grow over time. I believe that for a majority of articles, the majority of content is added up front, with new content adds and deletes spread out over time, creating the logarithmic long-tail we’re used to seeing. One can think of this process as a honing in, or “getting right” – edits should become smaller and smaller as time progresses.

When L1AM removed 15% of the Facebook article, he made a massive-scale edit. The edits clustered around L1AM’s edits were much, much smaller – single word or line changes. If we are to assume that the 15% of the article that L1AM removed had been vetted by the community, how would the community respond to this significant loss? As it happens, the community simply went on, and L1AM’s changes were not challenged. This leads to my question – is there a quantitative metric we can use to detect failures in Wikipedia’s process? If the editorial process allows 15% deletions on strong, established articles, does this mean the editorial process has failed? Shouldn’t edits be getting smaller over time?
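One crude way to put a number on this would be to flag any edit that changes an article’s size by more than some fraction of the previous revision. The sketch below is only an illustration of that idea: the 10% threshold is arbitrary, the revision sizes are made up, and the function name is hypothetical.

```python
def flag_large_edits(sizes, threshold=0.10):
    """Return (index, relative_change) for edits that change the article's
    size by more than `threshold`, relative to the previous revision.

    `sizes` is a chronological list of article sizes (lines, words, or bytes).
    """
    flagged = []
    for i in range(1, len(sizes)):
        prev, curr = sizes[i - 1], sizes[i]
        if prev == 0:
            continue  # cannot compute a relative change from an empty page
        change = (curr - prev) / prev
        if abs(change) > threshold:
            flagged.append((i, change))
    return flagged


# Made-up revision sizes: small tweaks, then one 15% cut.
sizes = [400, 402, 401, 403, 400, 340, 341]
for index, change in flag_large_edits(sizes):
    print(f"revision {index}: size changed by {change:+.0%}")
```

A size-based measure like this misses edits that swap one block of text for another of similar length, so it would only be a first-pass signal that an edit deserves a closer look.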

Of course, this is interesting to think about in the wake of Citizendium. While I’m spending good wage-earning years of my life being poor to get a degree that says I’m an expert, I’m definitely one who believes in the validity of crowd work. Terrell Russell has a good middle ground on the expertise in Wikipedia issue – though nothing short of the Citizendium solution would deal with the breakdown in the editing process observed on the Facebook article.

Of course, there are a couple things to note. The Facebook article might be an outlier on Wikipedia due to its popularity. The fluctuations and lack of editorial control may simply be from the fact that popular articles break Wikipedia’s model – as opposed to the model being inherently flawed. The communities around less popular articles may demonstrate more of the classic, long-tail characteristics. I’d love to see some data on this.

Getting back to the original topic at hand, I’d also love to see how many blog citations there are on Wikipedia. I can’t think of any good way to extract a rough ballpark. I would assume that blog appearance on Wikipedia clusters around emergent phenomena – the Facebook being a prime example.

As we go forward, many more of us will be sharing valuable, good research on a variety of topics via our blogs. Sure, that research will eventually go to print and die a lonely death inside a non-accessible digital library – but should Wikipedia keep its head in the sand to the research until that point? While traditional scholars may argue one side, the realities of information needs and flows, especially about emergent phenomena, may create an imperative for posting early research to blogs. Will Wikipedia change its policies to adapt accordingly?

[1] Based on pre- and post-edit line counts of content in Wikipedia Facebook article. I’m not sure this is a great metric, so I am open to better suggestions.
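For anyone who wants to try the comparison, here is a minimal sketch of the line-count metric next to one possible alternative based on a line-level diff. The function names and sample text are hypothetical, and the diff-based version is just one suggestion, not a claim about how the 15% figure was produced.

```python
import difflib


def fraction_removed_by_count(before: str, after: str) -> float:
    """Line-count metric: net drop in line count as a share of the original."""
    n_before = len(before.splitlines())
    n_after = len(after.splitlines())
    if n_before == 0:
        return 0.0
    return (n_before - n_after) / n_before


def fraction_removed_by_diff(before: str, after: str) -> float:
    """Alternative: share of original lines with no match in the new text."""
    old, new = before.splitlines(), after.splitlines()
    if not old:
        return 0.0
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    kept = sum(block.size for block in matcher.get_matching_blocks())
    return (len(old) - kept) / len(old)


# Made-up example: three lines deleted and two unrelated lines added.
before_lines = [f"line {i}" for i in range(20)]
after_lines = [l for i, l in enumerate(before_lines) if i not in (5, 6, 7)]
after_lines += ["new line a", "new line b"]
before, after = "\n".join(before_lines), "\n".join(after_lines)

print(f"by line count: {fraction_removed_by_count(before, after):.0%}")
print(f"by diff:       {fraction_removed_by_diff(before, after):.0%}")
```

In this example the line-count metric reports a 5% loss while the diff reports 15%, because the net count hides deletions that are offset by additions in the same edit.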