User:Opabinia regalis/Article statistics#Recent mainspace changes survey

{{historical}}

Note

The datasets below are old (2006-7), tiny, and not useful except as a historical reference.

Random article survey

I was bored waiting for my very slow program to run, so I clicked "random article" 250 times and kept track of what kinds of articles popped up. 48 articles (19.2%) were stubs or had at least one cleanup tag. (I tried to count "citation needed" as a cleanup tag but may have missed a few.) The results as of 11 Nov 2006:

{| class="wikitable"
! Type of article
! Number
! Percent of sample
|-
| Biography
| 60
| 24%
|-
| Places/geographical locations
| 34
| 13.6%
|-
| TV shows/movies
| 17
| 6.8%
|-
| Disambiguation
| 15
| 6%
|-
| Music/bands/albums
| 14
| 5.6%
|-
| Company/product/service
| 13
| 5.2%
|-
| History/war
| 12
| 4.8%
|-
| Politics/government
| 9
| 3.6%
|-
| Sports
| 8
| 3.2%
|-
| Organisms
| 8
| 3.2%
|-
| Definitions/common phrases/common objects
| 7
| 2.8%
|-
| Architecture/buildings
| 7
| 2.8%
|-
| Mythology/religion
| 5
| 2%
|-
| Astronomy/physics/space science
| 5
| 2%
|-
| Software/computing
| 5
| 2%
|-
| Games (including video)
| 4
| 1.6%
|-
| Literature/publications
| 4
| 1.6%
|-
| Biology/medicine
| 3
| 1.2%
|-
| Food/drink
| 3
| 1.2%
|-
| Schools
| 3
| 1.2%
|-
| Math
| 2
| 0.8%
|-
| Nonsense/unclassifiable
| 2
| 0.8%
|-
| Visual arts
| 2
| 0.8%
|-
| Philosophy/ethics
| 2
| 0.8%
|-
| Linguistics/languages
| 2
| 0.8%
|-
| Charities/nonprofit organizations
| 2
| 0.8%
|-
| Economics/finance
| 1
| 0.4%
|-
| Deleted and protected
| 1
| 0.4%
|}

"Biography" is probably a bit overinflated because I classified everything about an individual real person as a biography, including historical figures. Articles about fictional characters went in the category of the corresponding fiction (TV, myth, etc.)

Obviously this is a lousy way to determine Wikipedia coverage - 250 articles is a tiny sample. But the advantage over, say, counting category populations is that this avoids duplicate-counting of articles in multiple categories and can find articles that are un- or miscategorized. Special:Random also (as far as I know) excludes recently created articles that haven't yet been indexed, which filters out lots of nonsense speedy candidates. I don't think Special:Random would exclude deletion candidates, but none of these had prod or AfD templates.
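To put a rough number on how tiny a 250-article sample is, here is a minimal sketch that computes an approximate 95% Wilson score interval for a few of the counts in the table. This is not part of the original survey: the choice of interval and the function name `wilson_interval` are assumptions for illustration, and the calculation also assumes Special:Random behaves like a uniform random draw from all articles.

```python
import math

def wilson_interval(hits, n, z=1.96):
    """Approximate Wilson score interval for a sample proportion.

    hits: number of sampled articles in the category
    n:    total sample size
    z:    normal quantile (1.96 corresponds to roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = hits / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)

SAMPLE_SIZE = 250  # number of random articles in the survey

# Counts for the first two rows come from the table; the third is a
# hypothetical category that never turned up in the sample.
for label, hits in [("Biography", 60), ("Schools", 3), ("no hits at all", 0)]:
    low, high = wilson_interval(hits, SAMPLE_SIZE)
    print(f"{label}: {hits}/{SAMPLE_SIZE}, roughly {low:.1%} to {high:.1%}")
```

Under these assumptions the biography share is only pinned down to somewhere between about 19% and 30%, and a topic with zero hits could still account for around 1.5% of all articles, so the absence of a category from the sample says little about whether it is covered.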

First-glance observations:

  • I didn't find a single chemistry article. Biology and medicine had one clinical feature, one cell biology article, and one disease, so not even any biochemistry showed up. Physics as such was also missing; the articles in that category were almost entirely about NASA missions and observations.
  • Similarly, nothing I'd classify as sociology or psychology.
  • The literature and publications category contains a comic book, a newspaper, and two contemporary novels. No classic/canon literature.
  • I admit I'm a bit surprised at the low volume of school articles, which judging from AfD are infesting the place like weeds.
  • Somehow I don't think that 14% of the sum of all human knowledge is TV, movies, games, and bands. I admit I was surprised at the low percentage of video game cruft. The music articles were almost exclusively contemporary popular bands or their albums (with reasonably diverse geographical coverage) - nothing about musical theory and nothing about classical music.

Recent mainspace changes survey

Inspired by Wikipedia:Wikipedia is failing and User:Worldtraveller/Wikipedia is failing (NB: leaving the redlink, in case further moves occur), I looked at a sample of 250 mainspace edits covering a time span of 04:43 to 04:46 UTC on 18 Feb 2007. (It would be interesting to gather these statistics again at a time when US schools are in session.) In this sample there were 159 edits by registered users, 89 edits by anonymous users, and 2 edits to a subsequently deleted image description page. Thus the percentages below take 248 edits as the total sample.

{| class="wikitable"
! Change type
! Percent of total sample (n = 248)
! Percent by registered editors (n = 248)
! Percent by anonymous editors (n = 248)
! Percent of all registered edits (n = 159)
! Percent of all anonymous edits (n = 89)
|-
| Substantial content changes
| 5.2%
| 4.0%
| 1.2%
| 6.3%
| 3.4%
|-
| Minor content changes
| 28.6%
| 17.3%
| 11.3%
| 27.0%
| 31.5%
|-
| Copyediting/formatting/wikilinking
| 40.7%
| 27.4%
| 13.3%
| 42.8%
| 37.1%
|-
| Tagging/maintenance
| 8.5%
| 6.5%
| 2.0%
| 10.1%
| 5.6%
|-
| Vandalism reversion
| 8.9%
| 7.3%
| 1.6%
| 11.3%
| 4.5%
|-
| Vandalism
| 8.1%
| 1.6%
| 6.5%
| 2.5%
| 18.0%
|}

Other than determining whether an edit was vandalism, I did not make any value judgments. Thus, 'minor content changes' contains considerable amounts of unsourced material and original research that will certainly be reverted.
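The table's columns use three different denominators (the whole sample, all registered edits, all anonymous edits), which is easy to misread. The sketch below recomputes the vandalism row to show how they relate. The registered count of 4 is stated in the observations on this page; the anonymous count of 16 is not given anywhere and is back-calculated here from the published percentages, so treat it as an assumption. The function name `row_percentages` is likewise just illustrative.

```python
TOTAL_EDITS = 248       # 250 sampled edits minus the 2 image-page edits
REGISTERED_EDITS = 159  # edits by registered users
ANONYMOUS_EDITS = 89    # edits by anonymous users

def row_percentages(registered, anonymous):
    """Return the five percentage columns for one change type."""
    combined = registered + anonymous
    return {
        "of_total": 100 * combined / TOTAL_EDITS,
        "registered_of_total": 100 * registered / TOTAL_EDITS,
        "anonymous_of_total": 100 * anonymous / TOTAL_EDITS,
        "of_registered_edits": 100 * registered / REGISTERED_EDITS,
        "of_anonymous_edits": 100 * anonymous / ANONYMOUS_EDITS,
    }

# Vandalism row: 4 registered edits (stated in the text),
# 16 anonymous edits (assumed, inferred from the percentages).
vandalism = row_percentages(registered=4, anonymous=16)
for column, value in vandalism.items():
    print(f"{column}: {value:.1f}%")
```

The second and third columns always add up to the first, while the last two columns answer a different question: what fraction of each group's own edits fell into this change type.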

Other observations:

  • I saw two ongoing edit wars and one addition of an inappropriate unfree image.
  • Of the ten examples of substantial content changes by registered users, five were new-page creations. The single largest content change was on a Digimon article.
  • One of the four examples of vandalism by registered users involved the creation of a nonsense page.
  • I excluded bot-flagged edits (the default). The registered-editor set contains two edits by a known-bot account without a bot flag.
  • The percentage of copyediting and formatting done by registered editors is probably inflated by AWB users.

General thoughts:

  • I suppose it's a good sign that the rate of vandalism and the rate of vandalism reversion are about the same. However, that could be a function of the time of day.
  • Substantial content addition occurs at quite a low rate. This may be an artifact of editing patterns: if an editor uses many 'progressive saves', no single change will appear substantial in this sort of survey, and if an editor makes a large change in a single save, that editor's edit rate will be low and the change is unlikely to appear in such a small sample. I didn't see much evidence of the first pattern, in that no series of edits to the same article by the same person occurred except to manipulate formatting; however, a series of content-creating edits will likely be separated by more than three minutes.

Category:Random pages tests