User:Invitatious/intindex.py

This script generates an index of Wikipedia articles keyed by the initials of each title, that is, the first character of every word. Install Python 2 from www.python.org (the script uses Python 2 syntax such as the print statement and raw_input). Save this script as intindex.py in a new folder, and put an uncompressed title data dump (just the title list) in the same folder. Run the script and enter the filename of the data dump at the prompt. The index should be ready in about 15 minutes (timed on a 2.8 GHz computer running Windows XP). To look up an abbreviation, GWB for example, open the "G" folder, then open the "GW" file in Notepad or another text editor. Perform a case-sensitive search for the abbreviation in all capitals followed by two spaces. Repeat the search until all occurrences (titles) have been found.
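The manual search described above can also be done programmatically. The following is a minimal sketch, not part of the original script: the helper name `lookup`, the `index_root` parameter, and the UTF-8 encoding are all assumptions made for illustration, and it is written for Python 3 rather than the Python 2 the script itself uses. It expects the layout described above (a folder named after the first character, a file named after the first two characters, and one line per title beginning with the abbreviation).

```python
import os

def lookup(abbreviation, index_root="."):
    """Return the titles filed under `abbreviation` (hypothetical helper)."""
    key = abbreviation.upper()
    if len(key) < 2:
        return []  # too short to name an index file
    # Folder = first character, file = first two characters.
    path = os.path.join(index_root, key[0], key[:2])
    if not os.path.exists(path):
        return []
    titles = []
    # Assumption: the title dump, and therefore the index, is UTF-8 text.
    with open(path, "r", encoding="utf-8", errors="replace") as index_file:
        for line in index_file:
            # Each line is "<ABBREVIATION><whitespace><title>".
            parts = line.strip().split(None, 1)
            if len(parts) == 2 and parts[0] == key:
                titles.append(parts[1])
    return titles
```

Calling `lookup("GWB")` from the folder that contains the generated index would return the matching titles as a list. Comparing the whole first field, rather than searching for a substring, avoids matching longer abbreviations such as GWBB.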

I allow anyone to use this script for any purpose.

import sys, os, re

print "intindex.py."
print "This script makes an index of Wikipedia articles"
print "by initials from the title list file. The list is"
print "sorted based on the first two characters of the"
print "abbreviation to reduce file size."

split_regex = re.compile(r"[^A-Za-z0-9]") # matches a word-separating character
# filename_regex = re.compile(r"[^A-Za-z0-9]") # matches a character that should not be used in a filename

input_file = open(raw_input("Input filename: "), "r") # open the input file
last_filename = "" # no output file open yet
output_file = False # no output file open yet
i = 0 # make a page counter

for page_title in input_file: # for each page title in the file...
    page_title = page_title.replace("_", " ") # convert raw title to display title
    abbreviation = "" # get ready for a new abbreviation
    title_words = split_regex.split(page_title) # split into words
    for word in title_words: # for each word in the title...
        if len(word) > 0: # if the word is not blank...
            abbreviation += word[0].upper() # get the first character and capitalize it
    if len(abbreviation) > 2: # if the abbreviation is longer than 2 characters...
        # abbreviation = filename_regex.sub("_", abbreviation) # change unallowed characters
        output_dir = abbreviation[0:1] # build path
        if not last_filename == abbreviation[0:2]: # if this goes in a different file...
            if output_file: # if a different output file is open...
                output_file.close() # close it
            if not os.path.exists(output_dir): # if the output path doesn't exist...
                os.makedirs(output_dir) # create the directory
            output_file = open(os.path.join(output_dir, abbreviation[0:2]), "a") # open file for appending
        # page_title still ends with its newline; two spaces match the search instructions
        output_file.write(abbreviation + "  " + page_title) # write the title to the file
        last_filename = abbreviation[0:2]
    i = i + 1 # add to page counter
    if i % 5000 == 0: # if divisible by 5000...
        print "%04dK processed" % (i // 1000) # show status

if output_file: # if an output file is still open...
    output_file.close() # close it
input_file.close() # close the input file
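The abbreviation rule used in the loop above can be isolated into small functions, which makes it easier to check how a given title will be filed. This is only a sketch in Python 3; the names `WORD_SEPARATOR`, `make_abbreviation` and `index_path` are hypothetical and do not appear in the script.

```python
import re

# Same pattern as split_regex in the script: any character that is not an
# ASCII letter or digit separates words.
WORD_SEPARATOR = re.compile(r"[^A-Za-z0-9]")

def make_abbreviation(raw_title):
    """Build the initials of a raw title such as 'George_W._Bush'."""
    display_title = raw_title.replace("_", " ")
    words = WORD_SEPARATOR.split(display_title)
    # Skip the empty strings produced by consecutive separators.
    return "".join(word[0].upper() for word in words if word)

def index_path(abbreviation):
    """Return (folder, filename) for an abbreviation, or None if it is
    too short to be indexed (the script requires more than 2 characters)."""
    if len(abbreviation) <= 2:
        return None
    return abbreviation[0], abbreviation[:2]

print(make_abbreviation("George_W._Bush"))  # "GWB"
print(index_path("GWB"))                    # ('G', 'GW')
```

Because the pattern only recognizes ASCII letters and digits, every other character acts as a word separator. A title such as "Édith_Piaf" is therefore abbreviated as "DP" rather than "ÉP".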