User:Matt Crypto/RandomArticles
While Wikipedia has a [http://en.wikipedia.org/wiki/Special:Randompage Random page] feature, it selects pages uniformly at random from the database. As an alternative, I wrote a script that chooses pages at random with probability proportional to their hit counts for a month; such a set might give a more representative picture of how Wikipedia looks to visitors. The hit data for, say, September 2004 can be found [http://wikimedia.org/stats/en.wikipedia.org/url_200409.html here] (warning: very large file). Below is an example from the hits so far this month (to 22nd September 2004). If you would like a set, just send me a message and tell me a Wikipedia page, and I'll run the script for you and paste in the output. -- Matt 15:06, 21 Sep 2004 (UTC)
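
To make the difference between uniform and hit-weighted selection concrete, here is a minimal sketch. The three hit counts are copied from the example list on this page; the helper name `pick_weighted` is hypothetical, and it relies on the standard library's `random.choices` (Python 3.6+) rather than on the cumulative scan used in the script.

```python
import random

# Hit counts copied from three entries in the example list on this page.
hits = {"Snake": 870, "Ian Stuart": 6, "Norway": 2809}

def pick_weighted(hit_counts, n, rng=random):
    """Draw n page names with replacement, each with probability hits/total."""
    names = list(hit_counts)
    weights = [hit_counts[name] for name in names]
    return rng.choices(names, weights=weights, k=n)

sample = pick_weighted(hits, 10)
print(sample)
```

With these weights Norway should turn up far more often than Ian Stuart, whereas a uniform pick would treat all three pages alike.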
==100 randomly-selected articles (weighted by popularity)==
- Embryophyte — (51 hits)
- CRAF — (36 hits)
- Congressional committee — (35 hits)
- Snake — (870 hits)
- Linux distribution — (687 hits)
- Plate Carrée Projection — (20 hits)
- Place Manner Time — (77 hits)
- Sestertius — (124 hits)
- Stargate SG-1 — (661 hits)
- Moorhead, Minnesota — (24 hits)
- MOHAA — (26 hits)
- Ian Stuart — (6 hits)
- Readelf — (42 hits)
- Sidney James — (38 hits)
- Jacques Derrida — (808 hits)
- Edgar Degas — (1668 hits)
- Strategic bombing — (270 hits)
- The Kingston Trio — (77 hits)
- Zoophilia — (11612 hits)
- United States Senate — (2472 hits)
- Women's Social and Political Union — (73 hits)
- Prostaglandin — (596 hits)
- Painters — (208 hits)
- Archeology of Algeria — (16 hits)
- Nyota — (10 hits)
- Nikkei Index — (17 hits)
- Norway — (2809 hits)
- Coefficient — (86 hits)
- Chinese mantis — (46 hits)
- Triple — (107 hits)
- Minor characters from The Hitchhiker's Guide to the Galaxy — (780 hits)
- History of Seattle — (79 hits)
- Dawes Rolls — (66 hits)
- John Stewart — (40 hits)
- Puberty — (1573 hits)
- Electrical resistance — (806 hits)
- Sophia — (225 hits)
- Hydroponic (album) — (5 hits)
- Biafran War — (218 hits)
- Halloween documents — (170 hits)
- Squad Automatic Weapon — (47 hits)
- Carl Wayne — (210 hits)
- British Forces Germany — (53 hits)
- Beslan hostage crisis — (14675 hits)
- Craigieburn — (12 hits)
- Spot (Star Trek) — (131 hits)
- Smart (automobile) — (941 hits)
- Microscope — (3498 hits)
- Time value of money — (117 hits)
- George Jackson — (50 hits)
- Clarence — (21 hits)
- Communication with submarines — (789 hits)
- Macaulay Culkin — (597 hits)
- Jade Emperor — (194 hits)
- Jimbo Wales — (514 hits)
- Round Table — (146 hits)
- Arizona State University — (606 hits)
- List of regions of the United States — (1173 hits)
- King's College, Cambridge — (165 hits)
- Rhythmic gesture — (17 hits)
- Longest word in English — (1405 hits)
- Condorcet method — (806 hits)
- Total Recall — (214 hits)
- Shawn Michaels — (334 hits)
- Conjunction fallacy — (142 hits)
- 2004 Summer Olympics medal count — (1747 hits)
- Pizza — (695 hits)
- Ambisonics — (4 hits)
- Paul Neil Milne Johnstone — (212 hits)
- HMS Albion (1802) — (4 hits)
- Contagious magic — (5 hits)
- Phase velocity — (124 hits)
- IWW — (120 hits)
- Vegetarian — (355 hits)
- Schlong — (26 hits)
- Auschwitz Album — (3970 hits)
- GameFAQs — (1317 hits)
- Meteorology — (554 hits)
- Connotation — (537 hits)
- Oral sex — (7430 hits)
- 1969 — (1749 hits)
- Nucleic acid — (452 hits)
- Alcohol — (1846 hits)
- Uluru — (376 hits)
- EMac — (136 hits)
- Montagu Island — (30 hits)
- Black Panther — (153 hits)
- Orlando Letelier — (192 hits)
- Godwin's law — (6776 hits)
- Tybee Bomb — (2609 hits)
- Spaced — (78 hits)
- BAC 1-11 — (61 hits)
- 1974 in film — (234 hits)
- Relational model — (609 hits)
- Property — (508 hits)
- Glasgow — (704 hits)
- Nicotine — (408 hits)
- Rear Window — (177 hits)
- Texas Air National Guard controversy — (166 hits)
- Football World Cup 1974 — (85 hits)
==Script==
import re
from random import random

logFile = "/tmp/url_200409.html"
maxEntries = None  # e.g. 10000 to read only the first entries
numberOfArticles = 100

# Matches a stats line: hit count, percentage, a second count, percentage, /wiki/<name>
r1 = re.compile(r'^(\d*)\s*([0-9.]*)%\s*([0-9]*)\s*([0-9.]*)%\s*/wiki/(\S*)$')


class ArticlePicker:
    def __init__(self, logFile, maxEntries=None):
        self.logFile = logFile
        self.hitList = []
        self.hitSum = 0
        self.count = 0
        self.maxEntries = maxEntries

    def readLogFile(self):
        """Collect (hits, name) pairs for every line that parses as an article."""
        # errors="replace" keeps undecodable bytes from aborting the run
        with open(self.logFile, errors="replace") as F:
            for line in F:
                if self.maxEntries and self.count >= self.maxEntries:
                    break
                try:
                    hits, name = self.parseLine(line)
                except ValueError:
                    continue  # Not an article line, or filtered out
                self.count += 1
                self.hitList.append((hits, name))
                self.hitSum += hits
        self.hitList.sort(reverse=True)  # Most popular first

    def parseLine(self, line):
        """Return (hits, name); raise ValueError if the line should be skipped."""
        m = r1.match(line.strip())
        if m is None:
            raise ValueError("No matches found")
        hits, t1, t2, t3, name = m.groups()
        self.filterOut(hits, name)
        return int(hits), name.replace('_', ' ')

    def filterOut(self, hits, name):
        """Raise ValueError for entries that are not ordinary articles."""
        if name == "":
            raise ValueError("blank name")  # Exclude blank
        if re.match(r'^\w*:', name):
            raise ValueError("namespace page")  # Exclude namespaces
        if re.match(r'Main_Page', name):
            raise ValueError("main page")  # Exclude main page
        # Exclude popular oddities
        if re.match(r'_vti_bin/owssvr\.dl|MSOffice/cltreq\.asp', name):
            raise ValueError("non-article request")

    def selectRandomly(self, N=1):
        """Pick N articles (with replacement), weighted by hit count."""
        # Each draw is a point in [0, hitSum); it selects the first article
        # whose cumulative hit total reaches that point.
        rHits = [random() * self.hitSum for i in range(N)]
        outputs = [None] * N
        numberOfOutputs = 0
        totalSoFar = 0
        for hits, name in self.hitList:
            totalSoFar += hits
            for index in range(N):
                if outputs[index] is None and totalSoFar >= rHits[index]:
                    outputs[index] = hits, name
                    numberOfOutputs += 1
            if numberOfOutputs == N:
                return outputs
        return outputs


# Dump the articles
H = ArticlePicker(logFile, maxEntries)
H.readLogFile()
randomArticles = H.selectRandomly(numberOfArticles)
print("==%d randomly-selected articles (weighted by popularity)==" % numberOfArticles)
for hits, name in randomArticles:
    # \u2014 is the long dash used between the name and the hit count
    print("* %s \u2014 (%d hits)" % (name, hits))
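
As a rough illustration of what the script's regular expression accepts and what the filter rejects, the sketch below runs the same pattern over a few lines. The sample lines are invented to fit the pattern, not copied from the real stats file, and `parse` is a hypothetical stand-in that covers only part of `parseLine` and `filterOut`.

```python
import re

# Same pattern as r1 in the script.
LINE = re.compile(r'^(\d*)\s*([0-9.]*)%\s*([0-9]*)\s*([0-9.]*)%\s*/wiki/(\S*)$')

def parse(line):
    """Return (hits, name) for an article line, or None if it should be skipped."""
    m = LINE.match(line.strip())
    if m is None:
        return None
    hits, name = m.group(1), m.group(5)
    # Skip missing values, namespaced pages (e.g. User:...), and the main page.
    if hits == "" or name == "":
        return None
    if re.match(r'^\w*:', name) or name.startswith("Main_Page"):
        return None
    return int(hits), name.replace("_", " ")

# Invented sample lines (assumed layout: hits, %, count, %, URL).
samples = [
    "870 0.01% 12345 0.02% /wiki/Snake",
    "24 0.00% 310 0.00% /wiki/Moorhead,_Minnesota",
    "99 0.00% 500 0.00% /wiki/User:Matt_Crypto",
    "<TR><TD>not a data line</TD></TR>",
]
for s in samples:
    print(parse(s))
```

The first two lines come back as `(870, 'Snake')` and `(24, 'Moorhead, Minnesota')`; the namespaced page and the non-data line both give `None`.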