Wikipedia:Reference desk/Archives/Computing/Early/ParseMediaWikiDump

{{Historical}}

Parse::MediaWikiDump is a Perl module created by Triddle that makes it easy to access the information in a MediaWiki dump file. Its successor, MediaWiki::DumpFile, is written by the same author and is also available on CPAN.
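
For orientation, iterating over the pages in a dump with the newer module looks roughly like this (a minimal sketch based on the MediaWiki::DumpFile documentation; the dump file name is a placeholder):

 #!/usr/bin/perl
 # Minimal sketch using MediaWiki::DumpFile; 'pages-articles.xml' is a
 # placeholder for whatever dump file you actually have.
 use strict;
 use warnings;
 use MediaWiki::DumpFile;
 
 my $mw    = MediaWiki::DumpFile->new;
 my $pages = $mw->pages('pages-articles.xml');
 
 while (defined(my $page = $pages->next)) {
     print $page->title, "\n";
 }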

Download

The latest versions of Parse::MediaWikiDump and MediaWiki::DumpFile are available at https://metacpan.org/pod/Parse::MediaWikiDump and https://metacpan.org/pod/MediaWiki::DumpFile respectively.
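
Both modules can be installed from CPAN in the usual way, for example with the cpan client:

 cpan Parse::MediaWikiDump
 cpan MediaWiki::DumpFile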

Examples

=Find uncategorized articles in the main namespace=

 #!/usr/bin/perl -w
 
 use strict;
 use Parse::MediaWikiDump;
 
 my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
 my $pages = Parse::MediaWikiDump::Pages->new($file);
 
 # Print the title of every page in the main namespace that has no categories.
 while (defined(my $page = $pages->next)) {
     # main namespace only
     next unless $page->namespace eq '';
     print $page->title, "\n" unless defined($page->categories);
 }

=Find double redirects in the main namespace=

This program does not follow the proper case sensitivity rules for matching article titles; see the [https://metacpan.org/pod/Parse::MediaWikiDump documentation that comes with the module] for a much more complete version of this program.
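
In particular, MediaWiki treats the first letter of a title as case-insensitive, so a more careful version would canonicalize titles before using them as hash keys. A minimal sketch of that normalization (the helper name is hypothetical; the module's own example handles more than this):

 # Hypothetical helper: fold the first character of a title to the
 # canonical form MediaWiki uses, so "foo" and "Foo" compare equal.
 sub canonical_title {
     my ($title) = @_;
     return ucfirst($title);
 }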

 #!/usr/bin/perl -w
 
 use strict;
 use Parse::MediaWikiDump;
 
 my $file = shift or die "must specify a MediaWiki dump file";
 my $pages = Parse::MediaWikiDump::Pages->new($file);
 
 # Record every redirect in the main namespace and the title it points to.
 my %redirs;
 while (defined(my $page = $pages->next)) {
     next unless $page->namespace eq '';
     next unless defined($page->redirect);
     $redirs{$page->title} = $page->redirect;
 }
 
 # A double redirect is a redirect whose target is itself a redirect.
 while (my ($title, $target) = each(%redirs)) {
     print "$title\n" if defined($redirs{$target});
 }

=Import only a certain category of pages=

 #!/usr/bin/perl
 
 use strict;
 use warnings;
 use Parse::MediaWikiDump;
 use DBI;
 use DBD::mysql;
 
 my $server   = "localhost";
 my $name     = "dbname";
 my $user     = "admin";
 my $password = "pass";
 my $dsn      = "DBI:mysql:database=$name;host=$server;";
 my $dbh      = DBI->connect($dsn, $user, $password);
 
 my $source = 'pages_articles.xml';
 my $pages  = Parse::MediaWikiDump::Pages->new($source);
 
 while (defined(my $page = $pages->next)) {
     my $c = $page->categories;
     next unless defined $c;   # uncategorized pages have no category list
     # Match all categories with the string "Mathematics" anywhere in their
     # text; for an exact match, use grep { $_ eq "Mathematics" } instead.
     if (grep { /Mathematics/ } @$c) {
         my $id    = $page->id;
         my $title = $page->title;
         my $text  = ${$page->text};   # text() returns a reference to the page text
         #$dbh->do("insert ..."); # details of the SQL depend on the database setup
         print "title '$title' id $id was inserted.\n";
     }
 }
 
 print "Done parsing.\n";
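
The commented-out insert depends entirely on the local database schema. As a purely hypothetical illustration, assuming a table named math_pages with id, title and text columns (all names invented here), a parameterized insert could look like:

 # Hypothetical table: CREATE TABLE math_pages (id INT, title TEXT, text LONGTEXT)
 my $sth = $dbh->prepare("INSERT INTO math_pages (id, title, text) VALUES (?, ?, ?)");
 $sth->execute($id, $title, $text);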

=Extract articles linked to important wikis but not to a specific one=

The script checks whether an article contains interwiki links to :de, :es, :it, :ja and :nl but not :fr. This is useful for linking "popular" articles to a specific wiki, and it may also give useful hints about which articles should be translated first.

 #!/usr/bin/perl -w
 # Code : Dake
 
 use strict;
 use Parse::MediaWikiDump;
 use utf8;
 
 my $file = shift(@ARGV) or die "must specify a MediaWiki dump file";
 my $pages = Parse::MediaWikiDump::Pages->new($file);
 
 binmode STDOUT, ":utf8";
 
 while (defined(my $page = $pages->next)) {
     # main namespace only
     next unless $page->namespace eq '';
 
     # text() returns a reference to the page text
     my $text = $page->text;
     if (($$text =~ /\[\[de:/i) && ($$text =~ /\[\[es:/i) &&
         ($$text =~ /\[\[nl:/i) && ($$text =~ /\[\[ja:/i) &&
         ($$text =~ /\[\[it:/i) && !($$text =~ /\[\[fr:/i))
     {
         print $page->title, "\n";
     }
 }

Related software

  • [http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep Wikipedia preprocessor (wikiprep.pl)] is a Perl script that preprocesses raw XML dumps and builds link tables, category hierarchies, collects anchor text for each article etc.
  • Wikipedia:WikiProject Interlanguage Links/Ideas from the Hebrew Wikipedia - a project in the Hebrew Wikipedia to add relevant interwiki (interlanguage) links to as many articles as possible. It uses Parse::MediaWikiDump to search for pages without links (see the sketch below) and is now being exported to other Wikipedias.
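
A minimal sketch of that kind of search (the pattern below assumes interlanguage links look like [[xx:...]] with a two- or three-letter language code, which is a simplification):

 #!/usr/bin/perl -w
 # Sketch: print main-namespace pages whose text contains no apparent
 # interlanguage link; the prefix regex is a deliberate simplification.
 use strict;
 use Parse::MediaWikiDump;
 
 my $pages = Parse::MediaWikiDump::Pages->new(shift @ARGV);
 while (defined(my $page = $pages->next)) {
     next unless $page->namespace eq '';
     my $text = $page->text;   # reference to the page text
     print $page->title, "\n" unless $$text =~ /\[\[[a-z]{2,3}:/;
 }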

Category:Wikipedia tools
