Wikipedia:Wikipedia Signpost/2013-02-04/Special report#For anti-vandalism.2Fdamage

{{Wikipedia:Wikipedia Signpost/Templates/RSS description|1=Examining the popularity of Wikipedia articles: On February 12, 2012, news of Whitney Houston's death brought 425 hits per second to her Wikipedia article, the highest peak traffic on any article since at least January 2010. It is broadly known that Wikipedia is the sixth most popular website on the Internet, but the English Wikipedia now has over 4 million articles and 29 million total pages. Much less attention has been given to traffic patterns and trends in content viewed.}}{{Wikipedia:Signpost/Template:Signpost-header|||}}

{{Wikipedia:Signpost/Template:Signpost-article-start|{{{1|Examining the popularity of Wikipedia articles: catalysts, trends, and applications}}}|By West.andrew.g and Milowent| 4 February 2013}}

[[File:Flickr Whitney Houston performing on GMA 2009 4.jpg|thumb|Whitney Houston's page received 425 hits per second at its peak as news of her death spread.]] On February 12, 2012, news of Whitney Houston's death brought 425 hits per second to her Wikipedia article, the highest peak traffic on any article since at least January 2010.

It is broadly known that Wikipedia is the [http://www.alexa.com/topsites sixth most popular website] on the Internet, but the English Wikipedia now has over 4 million articles and 29 million total pages. Much less attention has been given to traffic patterns and trends in content viewed. The Wikimedia Foundation makes available aggregate [http://dumps.wikimedia.org/other/pagecounts-raw/ raw article view data] for all of its projects.

This article attempts to convey some of the fascinating phenomena that underlie extremely popular articles, and perhaps more importantly to editors, discusses how this information can be used to improve the project moving forward. While some dismiss view spikes as the manifestation of shallow pop culture interests (e.g., Justin Bieber is the 6th most popular article over the past 3 years, see Tab. 2), these are valuable opportunities to study reader behavior and to shape the public perception of our projects.

= The origins of heightened popularity =

Articles that are "extremely popular" on Wikipedia fall into one of two categories: (1) occasional or isolated popularity, or (2) consistent popularity.

The prime sources of occasional or isolated popularity include:

{| class="wikitable" style="float:right; margin:1em auto 1em 3em;"
|+ style="text-align:right; caption-side:top;"| Tab. 1. The most viewed pages on Wikipedia in a one hour period, since January 1, 2010 (excluding duplicate entries and DOS attacks)
|-
! Rank
! Article
! Date (UTC)
! Views/hr
! Views/sec
! Notes
|-
| 1
| Whitney Houston
| 12 Feb 2012
| 1532302
| 425.6
| Death of subject
|-
| 2
| Amy Winehouse
| 23 Jul 2011
| 1359091
| 377.5
| Death of subject
|-
| 3
| Steve Jobs
| 6 Oct 2011
| 1063665
| 295.5
| Death of subject
|-
| 4
| Madonna (entertainer)
| 6 Feb 2012
| 993062
| 275.9
| Super Bowl halftime
|-
| 5
| Osama bin Laden
| 2 May 2011
| 862169
| 239.5
| Death of subject
|-
| 6
| The Who
| 7 Feb 2010
| 567905
| 157.8
| Super Bowl halftime
|-
| 7
| Ryan Dunn
| 20 Jun 2011
| 522301
| 145.1
| Death of subject
|-
| 8
| Jodie Foster
| 14 Jan 2013
| 451270
| 125.4
| Golden Globes speech
|}

  • Cultural events and deaths: The best way to reach the highest levels of Wikipedia popularity is to be a celebrity who (a) dies, or (b) plays the Super Bowl halftime show (see Tab. 1). This year's Super Bowl entertainment, Beyoncé Knowles, just missed the chart with 100–110 views/second. Generally, prominent deaths dominate the top-100 traffic events and beyond. However, less morbid events are occasionally on the same scale, such as Jodie Foster following her recent coming out at the 2013 Golden Globes, Bubba Watson upon winning the 2012 Masters Tournament, and Ice hockey at the 2010 Winter Olympics during the final match between the U.S. and Canada (all drew over 250,000 views in a single hour).
  • Google Doodles: Google often replaces its logo to commemorate anniversaries and other events, and clicking on the logo will usually produce the search results for that topic. With Wikipedia appearing first for many search engine queries, this can be a tremendous source of traffic. When the 110th birthday of Dennis Gabor was celebrated in this fashion on June 5, 2010, his article peaked at over 55 views per second (this for an article that currently sees only about 140 views per day). There are many other examples, including Winsor McCay on October 15, 2012, Gideon Sundback on April 24, 2012, and the London Underground last month.
  • Non-human views and DOS attacks: Page access data cannot distinguish between human readers and automated requests. The most dramatic example occurred on March 9, 2010, when the Jyllands-Posten Muhammad cartoons controversy article saw 5.3 million views in a single hour (likely the densest view-hour at any point in Wikipedia's history). Due to the religious controversy and sensitivity surrounding the topic, this is believed to have been an attack designed to prevent others from viewing the page and its associated imagery. Ironically, the Denial of Service article also appears to be a frequent target. Often, it can be hard to distinguish between malicious attacks, accidental misconfiguration (e.g., bot testing), and undiscovered catalysts of human traffic. In compiling the WP:5000/Top25Report, some discretion is applied to remove odd anomalies. For example, Cat anatomy has been a popular article in raw page views for a few months (and not only on Caturdays), after previously being much less popular.
  • Second screen effect: Though not nearly on the scale of the above spikes, we find that television programs and their content are reflected in page view data. This can be as broad as spikes on The Big Bang Theory article when the program airs on popular networks, but it is even seen in small traffic bumps when a quiz show like Jeopardy! or Who Wants to be a Millionaire? asks about a particular topic. This phenomenon has recently been more thoroughly investigated on the German Wikipedia.<ref>(17 November 2012). [http://www.martinrycak.de/wikipedia-zugriffszahlen-bestatigen-second-screen-trend/ Wikipedia-Zugriffszahlen bestätigen Second-Screen-Trend], martinrycak.de (in German, article investigates how Wikipedia traffic matches German television shows during broadcast times) ([http://translate.google.com/translate?sl=auto&tl=en&js=n&prev=_t&hl=en&ie=UTF-8&eotf=1&u=http%3A%2F%2Fwww.martinrycak.de%2Fwikipedia-zugriffszahlen-bestatigen-second-screen-trend%2F&act=url English translation])</ref>

  • Slashdot effect: When extremely popular aggregation sites like Slashdot or Reddit prominently link to Wikipedia, traffic follows. Internally, Wikipedia's Main page can have much the same effect.
  • Temporal patterns: The Christmas article is popular in December, Easter peaks around that holiday, and Christianity-related articles tend to see unusual amounts of Sunday traffic. This is just the start of patterns which are reflected diurnally, annually, and at other pre-determined intervals.

{{-}}

{| class="wikitable" style="float:right; margin:1em auto 1em 3em;"
|+ style="text-align:right; caption-side:top;"| Tab. 2. The most popular articles on Wikipedia (2010–2012)
|-
! Rank
! Article
|-
| 1
| Wiki
|-
| 2
| Facebook
|-
| 3
| United States
|-
| 4
| YouTube
|-
| 5
| Google
|-
| 6
| Justin Bieber
|-
| 7
| Glee (TV series)
|-
| 8
| Sex
|-
| 9
| Wikipedia
|-
| 10
| Lady Gaga
|-
| 11
| Eminem
|-
| 12
| How I Met Your Mother
|-
| 13
| United Kingdom
|-
| 14
| The Big Bang Theory
|-
| 15
| India
|-
| 16
| World War II
|}

Meanwhile, reasons for long-term popularity are somewhat more intuitive. Tab. 2 shows the most popular articles over the last ~3 years. In addition to the broad underlying cultural and academic interests of Wikipedia's audience, we encourage the reader to consider:

  • English Wikipedia's readership is not representative of English-speaking populations. Previous studies have shown that Wikipedia's readership tends to be somewhat young, male, and educated, and their interests are likely to vary accordingly. Anecdotal evidence suggests significant traffic is driven by primary/secondary/university students in academic contexts, and we find that related topics are frequent vandal targets as well<ref>West, Andrew G., Sampath Kannan, and Insup Lee. Detecting Wikipedia Vandalism via Spatio-Temporal Analysis of Revision Metadata. In EUROSEC '10: Proceedings of the Third European Workshop on System Security, pp. 22–28. Paris, France. April 2010. [http://dl.acm.org/citation.cfm?id=1752046.1752050 (@ACM)] – [http://repository.upenn.edu/cis_papers/428/ (Author's version available for download)]</ref> (e.g., classic English literature, trigonometry concepts, etc.).
  • Notice that Google, YouTube, and Facebook are all consistently popular articles. We speculate this is due in part to people accidentally typing these site names/URLs into a Wikipedia search box (either in the MediaWiki interface or a web browser) when they actually intend to visit the sites themselves. This is related to, but not a case of, typosquatting.

= Applications and use-cases of the data =

== For anti-vandalism/damage ==

The impetus behind storing these statistics was to better understand damage response on Wikipedia (the dissertation topic of author User:West.andrew.g). By storing statistics for every article at the finest granularity available (hourly), it becomes possible to estimate the number of readers who saw any particular article version. While practical writings have often focused on the time-to-revert of damaging edits, we argue that the number of people who view the damage is the more relevant metric. Vandalism that survives for days on an obscure article is effectively harmless if no one visits that article.
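As a rough illustration of this estimate, the sketch below prorates hourly view counts over the interval during which a revision was live. The function name, the dictionary layout, and the assumption that views arrive uniformly within each hour are ours for illustration; they are not taken from the tooling behind this article.

```python
from datetime import datetime, timedelta, timezone

ONE_HOUR = timedelta(hours=1)


def estimate_views(hourly_views, start, end):
    """Estimate how many views a revision received while live in [start, end).

    hourly_views maps the datetime at the start of each UTC hour to that
    hour's view count for the article. Views are assumed (hypothetically)
    to be spread uniformly within an hour, so a partially covered hour
    contributes a proportional share of its count.
    """
    if end <= start:
        return 0.0
    total = 0.0
    hour = start.replace(minute=0, second=0, microsecond=0)
    while hour < end:
        next_hour = hour + ONE_HOUR
        # Portion of this hour during which the revision was visible.
        overlap = min(end, next_hour) - max(start, hour)
        total += hourly_views.get(hour, 0) * (overlap / ONE_HOUR)
        hour = next_hour
    return total


# Made-up numbers: a damaging edit live for 110 seconds across an hour boundary.
views = {
    datetime(2013, 2, 4, 12, tzinfo=timezone.utc): 3600,
    datetime(2013, 2, 4, 13, tzinfo=timezone.utc): 7200,
}
saved = datetime(2013, 2, 4, 12, 59, 0, tzinfo=timezone.utc)
reverted = datetime(2013, 2, 4, 13, 0, 50, tzinfo=timezone.utc)
exposure = estimate_views(views, saved, reverted)
# About 160 views: 60 from the last minute of the first hour, 100 from the second hour.
```

The uniform-within-hour assumption is the weakest part of the sketch: during a traffic spike, most of an hour's views may fall in a few minutes, so short-lived damage on a spiking article could be under- or over-counted.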

Fig. 1 plots the CDF of both the lifespan and view count of about 500,000 recent damaging edits. As the graph shows, the median damaging edit is seen by just 1 person. Such an impressive figure is a testament to the automated (e.g., ClueBot NG) and semi-automated (e.g., Huggle and STiki) mechanisms that have recently been brought to bear on the task. While these tools produce probabilistic measures of damage, only STiki has plans to integrate an article's popularity into its prioritization schema in the near future.

[[File:Wikipedia_damage_survival_(time_and_views).png|thumb|Fig. 1. CDF plot of survival times and view counts for a corpus of damaging revisions, e.g., about 50% of damage has a lifespan of < 100 seconds, and 90% of damage has < 100 views.]]

Fig. 1 also shows that ~10% of damaging edits are viewed by 100+ persons. Deeper analysis shows that many of the associated survival times are quite short, and these are often the result of damage to extremely popular articles. With the human latency already quite minimal (and a certain amount of latency being inherent), new solutions are needed. Consider that spammers could opportunistically target very popular pages to exploit these brief windows of opportunity.<ref>West, Andrew G., Jian Chang, Krishna Venkatasubramanian, Oleg Sokolsky, and Insup Lee. Link Spamming Wikipedia for Profit. In CEAS '11: Proceedings of the Eighth Annual Collaboration, Electronic Messaging, Anti-Abuse, and Spam Conference, pp. 152–161, Perth, Australia. September 2011. – [http://dl.acm.org/citation.cfm?id=2030376.2030394 (@ACM)] – [http://repository.upenn.edu/cis_papers/470/ (Author's version available for download)]</ref> Dynamically and autonomously moving articles in and out of "page protection" or "pending changes" based on their traffic patterns is another possible use-case for this data. As Fig. 2 demonstrates, the power-law distribution of views over articles would suggest relatively few articles need to be protected to have significant impact.
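To make the traffic-based protection idea concrete, here is a minimal sketch of one way such a policy could be expressed. Everything in it is hypothetical: the function name, the two thresholds, and the use of hysteresis (a high threshold to enter protection, a lower one to leave it) are illustrative choices of ours, not an existing MediaWiki feature or a proposal with tested parameters.

```python
def update_protection(protected, hourly_views, enter=50_000, leave=10_000):
    """Return the set of titles that should be protected after this hour.

    protected    -- set of titles currently under protection
    hourly_views -- dict mapping title -> views in the most recent hour
    enter, leave -- hypothetical thresholds; enter > leave gives hysteresis,
                    so an article hovering near one threshold does not flap
                    in and out of protection every hour.
    """
    updated = set(protected)
    for title, views in hourly_views.items():
        if views >= enter:
            updated.add(title)        # traffic spike: start protecting
        elif views < leave:
            updated.discard(title)    # traffic has subsided: release
        # Between the thresholds, the article keeps its current state.
    return updated


# Made-up traffic figures for three articles.
current = {"Article A"}
latest = {"Article A": 5_000, "Article B": 60_000, "Article C": 20_000}
new_state = update_protection(current, latest)   # {"Article B"}
```

Titles absent from the hourly data keep their current state in this sketch; a real policy would also need to weigh the cost of locking out good-faith editors during exactly the hours when an article attracts the most attention.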

Spam and vandalism are surface-level issues. Recent analysis of deleted revisions on English Wikipedia showed that copyright violations, which are much harder to detect in casual patrolling work, have significant lifespans and end-user exposure.<ref>West, Andrew G. and Insup Lee. What Wikipedia Deletes: Characterizing Dangerous Collaborative Content. In WikiSym '11: Proceedings of the Seventh International Symposium on Wikis and Open Collaboration, pp. 25–28, Mountain View, CA, USA. October 2011. – [http://dl.acm.org/citation.cfm?id=2038558.2038563 (@ACM)] – [http://repository.upenn.edu/cis_papers/478/ (Author's version available for download)]</ref> This finding has motivated research into autonomous means of copyright violation discovery (see WP:Turnitin).

{{-}}

== Improving article quality ==

[[File:Wikipedia_view_distribution_by_article_rank.png|thumb|Fig. 2. Distribution of page views by article rank. A Zipf distribution is also plotted for comparison.]]

Article popularity can also be a measure for deciding which articles to improve, a concept already familiar to WikiProjects that keep tabs on the popularity of articles within their project (e.g., Wikipedia:WikiProject Songs has a watchlist for the 1,500 most popular song articles). At the aggregate level, the distribution of page views follows a "power law distribution". Fig. 2 represents one month's views on Wikipedia graphed against a Zipf distribution (a distribution where the most frequent item will occur approximately twice as often as the second item, three times as often as the third item, and so forth).

The top 25 most viewed pages represent 4% of all views, and the top 5000 represent 19% of all views. Though the distribution has an extremely long tail, the top 5000 data provides an opportunity to locate popular but poorly written articles that need attention, as opposed to randomly selecting one of the 4.15 million remaining articles on the project. That is not to say that articles deep in the long tail are less important, but for editors interested in prioritizing article improvement based on popularity and effect on public perception, the WP:5000 data is an important tool.
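Shares like these are straightforward to compute from any list of per-article view counts. The sketch below shows the calculation alongside an idealized Zipf curve for comparison. The function names and the exponent parameter `s` are our own illustrative choices, and an ideal Zipf curve should not be expected to reproduce the observed percentages exactly.

```python
def top_k_share(view_counts, k):
    """Fraction of all views that go to the k most viewed articles."""
    ranked = sorted(view_counts, reverse=True)
    total = sum(ranked)
    return sum(ranked[:k]) / total if total else 0.0


def zipf_weights(n, s=1.0):
    """Idealized Zipf weights for ranks 1..n: weight(rank) = 1 / rank**s.

    With s = 1 the top item is twice as frequent as the second and three
    times as frequent as the third. The exponent s is a free parameter here,
    not a value fitted to the Wikipedia data.
    """
    return [1.0 / rank ** s for rank in range(1, n + 1)]


# Toy example with made-up counts for five articles.
toy_counts = [900, 50, 30, 15, 5]
toy_share = top_k_share(toy_counts, 1)   # the top article takes 90% of views

# The same helper applies to an ideal curve, e.g. the top 25 of 10,000 ranks.
ideal_share = top_k_share(zipf_weights(10_000), 25)
```

Comparing the observed share against the ideal one for several values of `k` is a quick way to see where the real distribution departs from the fitted line in a plot such as Fig. 2.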

= Data details and alternative perspectives =

All the statistics in this article were produced by aggregating [http://dumps.wikimedia.org/other/pagecounts-raw/ raw data] made available by the WMF. This data contains hourly hit data on a per article basis for all WMF language/project combinations. Since Jan. 1, 2010, User:West.andrew.g has been parsing these files nightly and storing the English Wikipedia (article namespace) portions to a database hosted at the University of Pennsylvania. This is a non-trivial undertaking, consuming 1TB+ yearly. In addition to using the data as the basis for several academic results (and motivated by earlier third-party work<ref>Priedhorsky, Reid, Jilin Chen, Shyong (Tony) K. Lam, Katherine Panciera, Loren Terveen, and John Riedl. Creating, Destroying, and Restoring Value in Wikipedia. In GROUP '07: Proceedings of the International ACM Conference on Supporting Group Work, pp. 259–268, Sanibel Island, Florida, USA. November 2007. – [http://dl.acm.org/citation.cfm?id=1316663 (@ACM)]</ref>), he has more recently begun publishing the aforementioned weekly reports of the top 5000 articles, made available [http://www.andrew-g-west.com/docs/wp_page_views_2012.zip monthly reports for 2012], and released the source code behind these computations.
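For readers who want to experiment with the raw files themselves, the following is a minimal parsing sketch, not the released source code mentioned above. It assumes each line of an hourly file holds four space-separated fields (project code, percent-encoded page title, view count, bytes transferred) and that English Wikipedia uses the project code `en`; both should be checked against the dump documentation. The function names are hypothetical, and no namespace filtering is attempted.

```python
import gzip
from collections import Counter
from urllib.parse import unquote


def parse_line(line):
    """Parse one line; return (project, title, views) or None if malformed.

    Assumed layout: "<project> <percent-encoded title> <views> <bytes>".
    """
    parts = line.rstrip("\n").split(" ")
    if len(parts) != 4:
        return None
    project, title, views, _size = parts
    try:
        return project, unquote(title), int(views)
    except ValueError:
        return None


def english_views(lines, project_code="en"):
    """Sum view counts per title for one project across an iterable of lines."""
    counts = Counter()
    for line in lines:
        record = parse_line(line)
        if record is not None and record[0] == project_code:
            counts[record[1]] += record[2]
    return counts


def read_hourly_file(path):
    """Aggregate one gzip-compressed hourly file (the path is caller-supplied)."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        return english_views(handle)


# Made-up sample lines in the assumed format.
sample = [
    "en Main_Page 10 1000",
    "en.b Main_Page 5 500",          # a different project code, skipped
    "en Beyonc%C3%A9_Knowles 3 300",
    "en Main_Page 2 200",
    "not a valid line",
]
totals = english_views(sample)
```

Summing within a file matters because, in practice, the same article can appear under several encodings or redirect titles; a production pipeline would also need to normalize those, which this sketch does not do.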

Others have used the same data for alternative purposes: User:Henrik has developed a [http://stats.grok.se/ tool] for looking up the traffic history of specific articles. The Wikitrends site concentrates on dramatic popularity increases/decreases. WMF analyst Erik Zachte produces [http://stats.wikimedia.org/ WikiStats], which provides a higher-level perspective on all WMF projects in numerous statistical dimensions. Mr. Zachte also has a fascinating [http://infodisiac.com/ portfolio of his WMF statistical work]. These Wikipedia/WMF-specific resources complement other Internet-scale observations regarding search and popularity; most famously the [http://www.google.com/zeitgeist/2012/#the-world Google Zeitgeist].

There are some caveats in interpreting this data. First, this is a raw presentation of traffic and popularity. It is known that [http://stats.wikimedia.org/EN/PlotPageviewsEN.png English Wikipedia traffic has generally been increasing over time] (per [http://stats.wikimedia.org/EN/SummaryEN.htm]). This fact, and the growing Internet connectivity that likely underlies it, lends some bias to more recent events. Second, it should be mentioned that logs may have [http://infodisiac.com/blog/2010/07/wikimedia-page-views-some-good-and-bad-news/ under-reported page view data in early 2010].

= References =

{{reflist|30em}}

{{Wikipedia:Signpost/Template:Signpost-article-comments-end||2013-01-28|2013-02-11}}
