Wikipedia:Wikipedia Signpost/2012-06-25/Recent research

{{Wikipedia:Signpost/Template:Signpost-header|||}}

{{Wikipedia:Signpost/Template:Signpost-article-start|{{{1|Edit war patterns, deleters vs. the 1%, never used cleanup tags, authorship inequality, higher quality from central users, and mapping the wikimediasphere}}}|By Tilman Bayer, Piotr Konieczny, Evan and Daniel Mietchen| 25 June 2012}}

=Dynamics of edit wars=

[[File:Time evolution of the controversy measure of Michael Jackson.png|thumb|Time evolution of the controversy measure of Michael Jackson, as quantified on the basis of reverted edits to his Wikipedia article. A: Jackson is acquitted on all counts after a five-month trial. B: Jackson makes his first public appearance since the trial to accept eight records from the Guinness World Records in London, including Most Successful Entertainer of All Time. C: Jackson issues Thriller 25. D: Jackson dies in LA.]]

"Dynamics of Conflicts in Wikipedia"{{Cite journal | last1 = Yasseri | first1 = Taha | last2 = Sumi | first2 = Robert| last3 = Rung | first3 = András| last4 = Kornai | first4 = András| last5 = Kertész | first5 = János| editor1-last = Szolnoki | editor1-first = Attila | title = Dynamics of Conflicts in Wikipedia | doi = 10.1371/journal.pone.0038869 | journal = PLOS ONE | volume = 7 | issue = 6 | pages = e38869 | year = 2012 | pmid = 22745683| pmc = 3380063| arxiv = 1202.3643 | bibcode = 2012PLoSO...738869Y | doi-access = free }} {{Open access}} develops an interesting "measure of controversiality", something that might be of interest to editors at large if it were a more widely popularized and dynamically updated statistic. The paper analyzes patterns of edit warring over Wikipedia articles. The authors conclude that edit warriors are usually willing to reach consensus, and that the rare cases of never-ending warring are those that continually attract new editors who have not yet joined the consensus.

The authors' decision to exclude from the study articles with under 100 edits because they are "evidently conflict-free" is questionable. Articles with fewer than 100 edits have been subject to clear, if not overly long, edit warring. A recent example is Concerns and controversies related to UEFA Euro 2012. It is also unfortunate that "memory effects" – a term mentioned only in the abstract and lead, and which the authors suggest is significant in understanding the conflict dynamic – is not explained in the article. The term "memory", by itself, appears four times in the body, but is not operationalized anywhere.

A press release accompanied the paper, entitled "[http://phys.org/news/2012-06-wikipedia-wars-dynamics-conflict-emergence.html Wikipedia 'edit wars' show dynamics of conflict emergence and resolution]". An MSNBC tech news headline misleadingly, but sensationally, summarized it as "[http://www.technolog.msnbc.msn.com/technology/technolog/wikipedia-editorial-warzone-says-study-838793 Wikipedia is editorial warzone, says study]".
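The summary above does not spell out how the paper's controversy measure is computed, so the following is only a minimal sketch of the raw ingredient it builds on: reverted edits. It assumes a common heuristic (a revision whose text is identical to an earlier, non-adjacent revision is a revert) and then lists the editor pairs who have reverted each other. The function names and the toy history are hypothetical, and this is not the authors' measure.

```python
import hashlib
from collections import Counter


def find_reverts(revisions):
    """Return (reverter, reverted) editor pairs.

    `revisions` is a chronological list of (editor, text) tuples. A revision
    is treated as a revert when its text is byte-identical to an earlier,
    non-adjacent revision; every other editor who saved a version in between
    counts as having been reverted. (Heuristic assumed for illustration.)
    """
    last_seen = {}  # text digest -> index of the latest revision with that text
    pairs = []
    for i, (editor, text) in enumerate(revisions):
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        j = last_seen.get(digest)
        if j is not None and j < i - 1:
            for undone_editor, _ in revisions[j + 1:i]:
                if undone_editor != editor:
                    pairs.append((editor, undone_editor))
        last_seen[digest] = i
    return pairs


def mutual_revert_pairs(pairs):
    """Return the unordered editor pairs who have each reverted the other."""
    counts = Counter(pairs)
    return {frozenset(p) for p in counts if (p[1], p[0]) in counts}


# Hypothetical toy history: A and B keep restoring their own versions.
history = [("A", "v1"), ("B", "v2"), ("A", "v1"), ("B", "v2"), ("C", "v3")]
reverts = find_reverts(history)
warring = mutual_revert_pairs(reverts)
```

In the toy history, A and B each undo the other once, so they form the only mutually reverting pair; C's single edit reverts nobody. A real analysis would have to decide how to weight such pairs, which is where a measure like the paper's comes in.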

=Who deletes Wikipedia?=

In a recent blog post by Wibidata, an analytics startup based in San Francisco, the authors set out to shed light on the often-quoted claim that most of Wikipedia was written by a small number of editors, noting other editorial patterns along the way.<ref>[http://www.wibidata.com/2012/06/06/who-deletes-wikipedia Who Deletes Wikipedia?], June 6, 2012.</ref> Using the entire revision history of English Wikipedia (they wanted to show that their platform can scale), the authors looked at the distribution of edits across editor cohorts, grouped by number of total edits. They found that, from a pure count perspective, the most active 1% of editors had contributed over 50% of the total edits ([http://www.wibidata.com/wp-content/uploads/2012/06/PercentOfRevisionsByEdGroup.png original plot]).

In response to the suggestion that the strongly skewed distribution of edits might just be due to a core set of editors who primarily make only minor formatting modifications, they looked at the net number of characters contributed by each editor. Grouping editors by total number of edits as before, they showed an even more strongly skewed distribution, with the top 1% contributing well over 100% of the total number of characters on Wikipedia (i.e. an amount of text that is larger than the current Wikipedia) and the bottom 95% of editors deleting more on average than they contributed ([http://www.wibidata.com/wp-content/uploads/2012/04/PercentOfDeltaByEditingActivity.png original plot]). Next, the authors separated logged-in users from non-logged-in "users" (identified only by IP addresses) and recomputed the distribution of net character contributions. By edit-count cohort, logged-in users tended to contribute significantly more than their anonymous counterparts, and non-logged-in users tended to delete significantly more ([http://www.wibidata.com/wp-content/uploads/2012/04/AveDeltaByEditActivity.png original plot]).

In summary, low-activity and new editors, along with anonymous users, tend to delete more than they contribute; this reinforces the notion that Wikipedia is largely the product of a small number of core editors.
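A share of "well over 100%" can look paradoxical at first. The sketch below, using made-up numbers rather than Wibidata's data or platform, shows how it arises: editors are ranked by edit count, and the top cohort's net characters are divided by the net total, which is smaller than the cohort's own contribution whenever the remaining editors delete more than they add. The function name, the toy edit log and the 25% cohort size are assumptions chosen to keep the example small.

```python
from collections import defaultdict


def top_cohort_shares(edits, fraction=0.01):
    """Return (share of edits, share of net characters) for the most active editors.

    `edits` is an iterable of (editor, char_delta) tuples, where char_delta is
    the number of characters added (positive) or removed (negative) by one edit.
    Editors are ranked by edit count; the top `fraction` forms the cohort.
    Assumes the overall net character total is non-zero.
    """
    n_edits = defaultdict(int)
    net_chars = defaultdict(int)
    for editor, delta in edits:
        n_edits[editor] += 1
        net_chars[editor] += delta

    ranked = sorted(n_edits, key=n_edits.get, reverse=True)
    k = max(1, int(len(ranked) * fraction))  # at least one editor in the cohort
    top = ranked[:k]

    edit_share = sum(n_edits[e] for e in top) / sum(n_edits.values())
    char_share = sum(net_chars[e] for e in top) / sum(net_chars.values())
    return edit_share, char_share


# Hypothetical toy log: one prolific editor adds text, three casual ones trim it.
toy_edits = (
    [("core", 100)] * 6
    + [("x", -50), ("y", -30), ("z", 10), ("z", -30)]
)
edit_share, char_share = top_cohort_shares(toy_edits, fraction=0.25)
```

Here "core" makes 6 of the 10 edits (a 60% share) and adds 600 characters, while the article only grows by 500 net characters, so the cohort's share of net text is 120%.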

="Wikipedia Academy" preview=

Various conference papers and posters from the upcoming "Wikipedia Academy" (hosted by the German Wikimedia chapter from June 29 to July 1 in Berlin) are already [http://wikipedia-academy.de/2012/wiki/Accepted_Submissions available online]. A brief overview of those which are presenting new research about Wikipedia:

  • {{Citation needed}} more effective than {{unreferenced}}: "On the Evolution of Quality Flaws and the Effectiveness of Cleanup Tags in the English Wikipedia"<ref>Maik Anderka, Benno Stein and Matthias Busse: [http://wikipedia-academy.de/2012/w/images/f/f0/13_Paper_Maik_Anderka_Benno_Stein_Matthias_Busse.pdf On the Evolution of Quality Flaws and the Effectiveness of Cleanup Tags in the English Wikipedia (PDF)] {{Open access}}</ref> shows "that inline tags are more effective than tag boxes" in tagging article flaws so that they get remedied. The researchers also "reveal five cleanup tags that have not been used at all, and 15 cleanup tags that have been used less than once per year", recommending their deletion, and "ten cleanup tags that have been used, but the tagged flaws have never been fixed." Similar to a paper reviewed in the April issue of this report ("One in four of articles tagged as flawed, most often for verifiability issues"), they find that "the majority (71.62%) of the tagged articles have been tagged with a flaw that belongs to the flaw type Verifiability".
  • A paper titled "The Power of Wikipedia: Legitimacy and Territorial Control"<ref>Iolanda Pensa: [http://wikipedia-academy.de/2012/w/images/8/8f/17_Paper_Iolanda_Pensa.pdf The Power of Wikipedia: Legitimacy and Territorial Control (PDF)] {{Open access}}</ref> is "based on the experience of the projects WikiAfrica (2006-2012) and Share Your Knowledge (2011-2012)", and looks at various aspects of Wikipedia, Wikimedia chapters and the Foundation through the lens of "anthropological, african and post-colonial studies."
  • "Individual and Cultural Memories on Wikipedia and Wikia, Comparative Analysis"<ref>Simeona Petkova: [http://wikipedia-academy.de/2012/w/images/2/2d/28_Paper_Simeona_Petkova.pdf Individual and Cultural Memories on Wikipedia and Wikia, Comparative Analysis (PDF)] {{Open access}}</ref> looks at the coverage of the late British DJ John Peel on Wikipedia and Wikia, respectively, as well as the Wikipedia article about the 1980s.
  • An "Extended Abstract", "Latent Barriers in Wiki-based Collaborative Writing"<ref>Alexander Mehler, Christian Stegbauer and Rüdiger Gleim: [http://wikipedia-academy.de/2012/w/images/8/8f/12_Paper_Alexander_Mehler_Christian_Stegbauer_R%C3%BCdiger_Gleim.pdf Latent Barriers in Wiki-based Collaborative Writing (PDF)] {{Open access}}</ref> compares the collaborative process of "25 special-purpose wikis" (most of them hosted by Wikia) with that of the German Wikipedia. One observation of the work in progress is a "strong divide between extracts of Wikipedia (even if being reduced to single articles and their one-link neighborhoods) on the one hand and special purpose wikis on the other."
  • Two Brazilian authors will examine "the climate change controversy through 15 articles of Portuguese Wikipedia".<ref>Bernardo Esteves and Henrique Cukierman: [http://wikipedia-academy.de/2012/w/images/c/c6/5_Paper_Bernardo_Esteves_Henrique_Cukierman.pdf The climate change controversy through 15 articles of Portuguese Wikipedia (PDF)] {{Open access}}</ref> The paper contains various quantitative results about the edit history of these articles, some of them unsurprising ("A very strong positive correlation (0.994) was found between the number of edits and the number of editors of an article"). Using the framework of actor–network theory, the authors conclude that "the collaborative encyclopedia is enrolled as an ally for the mainstream science and becomes one of its spokespersons."
  • Historical infobox data: An article by four authors from Google Switzerland and the Spanish National University of Distance Education (UNED) observes<ref>Guillermo Garrido, Enrique Alfonseca, Jean-Yves Delort and Anselmo Peñas: "[http://wikipedia-academy.de/2012/w/images/7/7c/31_Paper_Guillermo_Garrido_Enrique_Alfonesca_Jean-Yves_Delort_Anselmo_Penas.pdf Extracting Wikipedia Historical Attributes Data]" (PDF) {{Open access}}</ref> that "much research has been devoted to automatically building lexical resources, taxonomies, parallel corpora and structured knowledge from [Wikipedia]", often using the structured data present in infoboxes (which they say are present in "roughly half" of English Wikipedia articles). However, this research has so far used only snapshots representing the state of articles at a particular point in time, whereas their project set out to extract "a wealth of historical information about the last decade ... encoded in its revision history." The resulting 5.5GB dataset, called "Wikipedia Historical Attributes Data (WHAD)", will be made freely available for download.
  • Better authorship detection, and measuring inequality: Two researchers from the University of Karlsruhe will present an algorithm<ref>Fabian Flöck and Andriy Rodchenko: [http://wikipedia-academy.de/2012/w/images/2/24/23_Paper_Fabian_Fl%C3%B6ck_Andriy_Rodchenko.pdf Whose article is it anyway? – Detecting authorship distribution in Wikipedia articles over time with WIKIGINI (PDF)] {{Open access}}</ref> to detect which user wrote which part of a Wikipedia article. Similar to a new revert-detection algorithm presented in a recent paper co-authored by one of the present authors (see last month's issue: "New algorithm provides better revert detection"), one crucial part of the algorithm is to split the article's wikitext into paragraphs, analyzing them separately under the assumption "that most edits (if they are not vandalistic) change only a very minor part of an article’s content". Another part is calculating the cosine similarity of sentences that are not exactly identical. In the authors' own test, the new algorithm performed significantly better than the widely used WikiTrust/WikiPraise tool. Having determined the list of authors for an article revision and the size of each author's contribution, they then define a Gini coefficient "as an inequality measure of authorship" (roughly, an article written by a single author will have coefficient 1, while one with equal contributions by a multitude of editors will have coefficient 0). They implement a tool called "WIKIGINI" to plot this coefficient over an article's history, and show a few examples to demonstrate that it "may help to spot crucial events in the past evolution of an article". The paper starts out from the assumption "that the concentration of words to just a few authors can be an indicator for a lack of quality and/or neutrality in an article", but it does not (yet) contain a systematic attempt to correlate the Gini coefficient and existing measures of article quality.
  • Troll research compared: A paper by a German Wikipedian titled "Here be Trolls: Motives, mechanisms and mythology of othering in the German Wikipedia community"<ref>Moritz Braun: [http://wikipedia-academy.de/2012/w/images/3/37/20_Paper_Moritz_Braun.pdf Here be Trolls: Motives, mechanisms and mythology of othering in the German Wikipedia community (PDF)] {{Open access}}</ref> examines four academic texts about online trolls (only one of them in the context of Wikipedia), which "were compared regarding their scope, their theoretical approach, their methods and their findings concerning trolls and trolling."
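The two building blocks named in the authorship-detection item above, cosine similarity between near-identical sentences and a Gini coefficient over per-author contributions, can be illustrated with a short sketch. It is not the WIKIGINI implementation: the whitespace tokenization, the word-count vectors and the textbook discrete Gini formula are assumptions made here, and the paper may define both quantities differently.

```python
import math
from collections import Counter


def cosine_similarity(sentence_a, sentence_b):
    """Cosine similarity of two sentences represented as word-count vectors.

    Tokenization is a plain lowercase whitespace split (an assumption).
    """
    a = Counter(sentence_a.lower().split())
    b = Counter(sentence_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def gini(word_counts):
    """Gini coefficient of the number of words attributed to each author.

    Uses the standard discrete formula on the sorted values:
    G = 2 * sum(i * x_i) / (n * sum(x_i)) - (n + 1) / n, with i = 1..n.
    """
    xs = sorted(word_counts)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical authorship distributions for one article revision.
equal_authorship = gini([250, 250, 250, 250])  # four authors, equal shares
single_authorship = gini([0, 0, 0, 1000])      # one of four authors wrote everything
```

With this formula, equal shares give 0, and one author holding all the text among n contributors gives (n − 1)/n, which approaches 1 as n grows. That matches the "roughly 1" reading in the summary above.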

;Posters

  • "Self-organization and emergence in peer production: editing 'Biographies of living persons' in Portuguese Wikipedia"<ref>Carlos D'Andréa: [http://wikipedia-academy.de/2012/w/images/9/9d/10_Poster_Carlos_D%27Andrea.pdf Self-organization and emergence in peer production: editing “Biographies of living persons” in Portuguese Wikipedia (PDF)] {{Open access}}</ref>
  • "Biographical articles on Serbian Wikipedia and application of the extraction information on them"<ref>Djordje Stakic: [http://wikipedia-academy.de/2012/w/images/8/8d/14_Poster_Djordje_Stakic.pdf Biographical articles on Serbian Wikipedia and application of the extraction information on them (PDF)] {{Open access}}</ref>
  • "Wikipedia article namespace – user interface now and a rhizomatic alternative"<ref>Stephan Ligl: [http://wikipedia-academy.de/2012/w/images/3/37/19_Poster_Stephan_Ligl.pdf Wikipedia article namespace – user interface now and a rhizomatic alternative (PDF)]</ref>
  • "Extensive Survey to Readers and Writers of Catalan Wikipedia: Use, Promotion, Perception and Motivation"<ref>Marc Miquel-Ribé, David Morera-Ruíz and Joan Gomà-Ayats: [http://wikipedia-academy.de/2012/w/images/1/10/29_Poster_David_Morera-Ruiz_Marc_Miquel-Rib%C3%A9_Joan_Gom%C3%A0-Ayats.pdf Extensive Survey to Readers and Writers of Catalan Wikipedia: Use, Promotion, Perception and Motivation (PDF)] {{Open access}}</ref>

Researcher Felipe Ortega blogged<ref>Ortega, Felipe: "[http://libresoft.es/node/564 Improving the extraction of Wikipedia data]" libresoft.es, 2012-06-03</ref> about a new parser for Wikipedia dumps, to be integrated into "WikiDAT (Wikipedia Data Analysis Toolkit) ... a new integrated framework to facilitate the analysis of Wikipedia data using Python, MySQL and R. Following the pragmatic paradigm 'avoid reinventing the wheel', WikiDAT integrates some of the most efficient approaches for Wikipedia data analysis found in libre software code up to now", which will be featured in a workshop at the conference.

=Special issue of "Digithum" on Wikipedia research=

The open-access journal "Digithum" (subtitled "The Humanities in the Digital Era") has published a [http://digithum.uoc.edu/ojs/index.php/digithum/user/setLocale/en_US?source=%2Fojs%2Findex.php%2Fdigithum%2Fissue%2Fview%2Fn14 special issue] containing five papers about Wikipedia from various disciplines, with a multilingual emphasis (including research about non-English Wikipedias, and Catalan and Spanish versions of the papers alongside the English versions):

  • Are articles about companies too negative?: A paper titled "Wikipedia’s Role in Reputation Management: An Analysis of the Best and Worst Companies in the United States"<ref>Marcia W. DiStaso, Marcus Messner: "[http://digithum.uoc.edu/ojs/index.php/digithum/article/view/n14-distaso-messner Wikipedia’s Role in Reputation Management: An Analysis of the Best and Worst Companies in the United States]" DIGITHUM, NO 14 (2012) {{Open access}}</ref> looked at the English Wikipedia articles about the ten companies with the best and worst reputations according to the "Harris Reputation Quotient", a 2010 online survey about "perceptions for 60 of the most visible companies in America". Those 20 articles were coded, sentence by sentence, as positive, negative or neutral, and according to other "reputation attributes". Among the findings was that "the companies with the worst reputations had more negative content; they had, in fact, almost double the amount of negative content, although only slightly less positive content. Both types of companies had more negative than positive content. This indicates that even if a company is considered to have a good reputation, it is still very vulnerable to having its dirty laundry aired on Wikipedia." Another observation was that "emotional appeal is an attribute where both types of companies lacked content. It was rare for companies to have content about trust or feeling good, which only existed for the best companies" (an interesting question may be whether this is related to Wikipedia guidelines such as WP:PEACOCK). The paper appears at a time when many PR industry professionals in the US and UK argue that Wikipedia should allow them more control over the articles about their clients, and ends by highlighting the "importance of public relations professionals monitoring and requesting updates to Wikipedia articles about their companies". This conclusion resembles that of another recent study by one of the authors (DiStaso), which likewise concerned company articles and drew a somewhat controversial conclusion about their accuracy (see the April issue of this research report: Wikipedia in the eyes of PR professionals).
  • WordNets from Wikipedia: The second paper<ref>Antoni Oliver, Salvador Climent: [http://digithum.uoc.edu/ojs/index.php/digithum/article/view/n14-oliver-climent Using Wikipedia to develop language resources: WordNet 3.0 in Catalan and Spanish] {{Open access}}</ref> describes "the state of the art in the use of Wikipedia for natural language processing tasks", including the researchers' own application of Wikipedia to build WordNet databases in Catalan and Spanish.
  • The Wikimedia movement as "wikimediasphere": The article "Panorama of the wikimediasphere"<ref>David Gómez Fontanills: "[http://digithum.uoc.edu/ojs/index.php/digithum/article/view/n14-gomez Panorama of the wikimediasphere]" {{Open access}}</ref> gives an overview of the Wikimedia movement, proposing the term "wikimediasphere" to describe it, and explaining "the role of the communities of editors of each project and their autonomy with respect to each other and to the Wikimedia Foundation", which is seen as "the principal supplier of the technological infrastructure and also the principal instrument for obtaining economic and organisational resources". Its vision statement is presented as a summary of the aim that is "the ideological glue that binds all the players involved". The section about "the social and institutional dimension" of the sphere briefly covers the Foundation's governance and funding models, Wikimedia chapters and other recognized supporting organizations, and the various wikis and other online platforms that structure "the organisational activity": The Foundation wiki, Meta-wiki, Strategy wiki, Outreach wiki, the Wikimedia blog and the blogs of community members aggregated on Planet Wikimedia, mailing lists etc. Authored by a Wikimedian who is a member of both the Spanish chapter and the Catalan "Friends of Wikipedia" association, the paper is remarkably well-informed and up-to-date, e.g. incorporating the Board resolution on "Recognized Models of Affiliations" from the beginning of April, and various other recent events such as the English Wikipedia's SOPA/PIPA blackout. The abstract uses the term "WikiProjects" in a different sense from that common among English-speaking Wikimedians, possibly a translation error.
  • Truth and NPOV: The fourth article<ref>Nathaniel Tkacz: "[http://digithum.uoc.edu/ojs/index.php/digithum/article/view/n14-tkacz The Truth of Wikipedia]" {{Open access}}</ref> by Nathaniel Tkacz (one of the organizers of the "Critical Point of View"/CPOV initiative that organized three conferences about Wikipedia in 2010, see Signpost interview) sets out to "show that Wikipedia has in fact two distinct relations to truth: one which is well known and forms the basis of existing popular and scholarly commentaries, and another which refers to equally well-known aspects of Wikipedia, but has not been understood in terms of truth. I demonstrate Wikipedia's dual relation to truth through a close analysis of the Neutral Point of View core content policy (and one of the project's 'Five Pillars')."
    File:Wiki Loves Monuments 2011 uploads by country.png
  • Wiki Loves Monuments: A paper titled "Wiki Loves Monuments 2011: the experience in Spain and reflections regarding the diffusion of cultural heritage",<ref>Emilio José Rodríguez Posada, Ángel González Berdasco, Jorge A. Sierra Canduela, Santiago Navarro Sanz, Tomás Saorín: [http://digithum.uoc.edu/ojs/index.php/digithum/article/view/n14-rodriguez-gonzalez-sierra-navarro-saorin Wiki Loves Monuments 2011: the experience in Spain and reflections regarding the diffusion of cultural heritage]. Digithum, no. 14 (May, 2012), p. 94. {{Open access}}</ref> written by five Spanish Wikimedians, gives a concise overview of the photo contest as it played out in Spain last year.

=Briefly=

[[File:Juvenile bonobo.png|thumb|The bonobo (here a juvenile) is amongst the species that the Flora and Fauna finder [http://linserv1.cims.nyu.edu:48866/cgi-bin/classResults.cgi?place=Congo finds] for Congo.]]

  • Who was notable in London in the 1960s?: A Master's thesis in Computer Science<ref>Morton-Owens, E. G. (2012). A tool for extracting and indexing spatio-temporal information from biographical articles in Wikipedia. New York University. [http://www.cs.nyu.edu/web/Research/MsTheses/owens_emily.pdf PDF]</ref> describes "A tool for extracting and indexing spatio-temporal information from biographical articles in Wikipedia". The tool, named "Kivrin" after a time-travelling character from a science fiction novel, is [http://linserv1.cims.nyu.edu:48866/cgi-bin/index.cgi available online], and grew out of an earlier, simpler one that searches for articles about plants and animals living at a particular geographical place ("[http://linserv1.cims.nyu.edu:48866/cgi-bin/classSearch.cgi Flora & Fauna Finder]"). The author remarks that "the data is skewed, like Wikipedia itself, towards the U.S. and Western Europe and relatively recent history". A search for [http://linserv1.cims.nyu.edu:48866/cgi-bin/results.cgi?toponym=London&startdate=1960&enddate=1969&latitude=51.50853&longitude=-0.12574 the 1960s in London] brings up several Beatles-related biographies near the top. While the tool does seem to cover languages other than English (e.g. text from the Hungarian entry on Gottlob Frege appears in the [http://linserv1.cims.nyu.edu:48866/cgi-bin/results.cgi?toponym=Jena&startdate=0&enddate=2012&latitude=50.93333&longitude=11.58333 search results] for Jena, the German town), searches for Hungarian or other non-English place names (e.g. [http://linserv1.cims.nyu.edu:48866/cgi-bin/toponym.cgi?startdate=0&enddate=2012&toponym=Moszkva&nameSearch=Submit+name Moszkva] and [http://linserv1.cims.nyu.edu:48866/cgi-bin/toponym.cgi?startdate=0&enddate=2012&toponym=%26%231052%3B%26%231086%3B%26%231089%3B%26%231082%3B%26%231074%3B%26%231072%3B&nameSearch=Submit+name Москва], the Hungarian and Russian names of Moscow) yielded no results. Disambiguation is attempted by way of geocodes but is far from robust: the [http://linserv1.cims.nyu.edu:48866/cgi-bin/results.cgi?toponym=Halle&startdate=0&enddate=2012&latitude=51.5&longitude=12.0 search results] for Halle, Saxony-Anhalt actually contain multiple entries referring to Halle, North Rhine-Westphalia.
  • How did people in Europe feel in the 1940s?: As described in a post in the New York Times' "Bits" blog,<ref>{{cite web | url=http://bits.blogs.nytimes.com/2012/06/14/how-big-data-sees-wikipedia/ | title=How Big Data Sees Wikipedia | date=14 June 2012 }}</ref> Kalev Leetaru from the University of Illinois conducted a sentiment analysis of statements on Wikipedia connected to a particular space and time, and made the result into a video: "[http://www.youtube.com/watch?v=KmCQVIVpzWg The Sentiment of the World Throughout History Through Wikipedia]".
  • One third of the average Wikipedia consists of interwiki links: According to an analysis<ref>Vrandečić, D. (2012). "[http://simia.net/languagelinks/ Ratio of language links to full text in Wikipedias]", simia.net, June 2012</ref> by Denny Vrandečić, head of the Wikidata development team, "[https://twitter.com/vrandezo/status/217192948482322432 on average, 33% of a Wikipedia is language links. In total, there are 240 Mio of them, 5GB]" (making up 5.3% of the overall text across all languages). The ratio tends to be higher on smaller Wikipedias.
  • "Central" users produce higher quality: A preprint by two Dublin-based researchers attempts "Assessing the Quality of Wikipedia Pages Using Edit Longevity and Contributor Centrality".<ref>{{cite arXiv | eprint=1206.2517 | last1=Qin | first1=Xiangju | last2=Cunningham | first2=Pádraig | title=Assessing the Quality of Wikipedia Pages Using Edit Longevity and Contributor Centrality | date=2012 | class=cs.SI }}</ref> The first approach uses the assumption that contributions which survive many subsequent edits tend to have a higher quality, and "measures the quality of an article by aggregating the edit longevity of all its author contributions". The second approach considers either the coauthorship network (the bipartite graph of users and the articles they have edited, used in many recent papers to grasp Wikipedia's collaboration processes) or the user talk page (UTP) network, where two Wikipedians are connected if one has edited the other's talk page. It is assumed that a user's "centrality" in one of these networks is a measure for the "contributor authoritativeness". These quality measures are then evaluated on 9290 history-related Wikipedia articles against the manual quality rating from WikiProject History. "The results suggest that it is useful to take into account the contributor authoritativeness (i.e., the centrality metrics of the contributors in the Wikipedia networks) when assessing the information quality of Wikipedia content. The implication for this is that articles with significant contributions from authoritative contributors are likely to be of high quality, and that high-quality articles generally involve more communication and interaction between contributors."
  • Familiarity breeds trust: A bachelor thesis<ref>Hensel, T. (2012, March 11). Impact of duration of the search on the trust judgment of Wikipedia articles. Retrieved from http://essay.utwente.nl/61602/1/Hensel%2C_T.N.C.H._%2D_s0170860_(verslag).pdf</ref> at Twente University had 40 college students assess the trustworthiness of articles from the English Wikipedia, after a search for a piece of information that was located either at the top or near the bottom of the article. The hypothesis that the longer search in the second case might affect the trustworthiness rating was rejected by the results, but it was found (consistent with other research) that "Trust was higher in articles with a familiar topic, rather than with unfamiliar topics".
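To make the user talk page (UTP) network from the centrality item above more concrete, here is a minimal sketch that builds such a network and scores each user. The summary does not say which centrality metrics the preprint uses, so plain degree centrality (the fraction of other users one is connected to) is an assumption, as are the function name and the toy data.

```python
from collections import defaultdict


def utp_degree_centrality(talk_page_edits):
    """Degree centrality of each user in a user talk page network.

    `talk_page_edits` is an iterable of (editor, talk_page_owner) tuples.
    Two users are connected if either has edited the other's talk page;
    edits to one's own talk page are ignored. Centrality is the number of
    distinct neighbours divided by (number of users - 1).
    """
    neighbours = defaultdict(set)
    for editor, owner in talk_page_edits:
        if editor != owner:
            neighbours[editor].add(owner)
            neighbours[owner].add(editor)

    n = len(neighbours)
    if n < 2:
        return {user: 0.0 for user in neighbours}
    return {user: len(linked) / (n - 1) for user, linked in neighbours.items()}


# Hypothetical toy data: A is in contact with everyone, the others only with A.
toy_talk_edits = [("A", "B"), ("A", "C"), ("B", "A"), ("D", "A"), ("C", "C")]
centrality = utp_degree_centrality(toy_talk_edits)
```

In this toy network A is linked to all three other users and scores 1.0, while B, C and D each score 1/3. An article-level score would then have to aggregate the centrality of an article's contributors, a step the summary does not detail.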

=References=

{{reflist}}

{{Wikipedia:Signpost/Template:Signpost-article-comments-end||2012-05-28|2012-07-30}}
