User:Wherebot/Source

Here is the latest code as of 5/5/2007. Unicode does not work with it as of writing.

Here is the source code. It has only been tested on UNIX-like systems, but it should in theory also work on Windows. The code was not intended for wide distribution, so it is not well commented. Sorry! The code requires wget, [http://pywikipediabot.sf.net pywikipediabot], [http://developer.yahoo.com Yahoo's Python search plugin], Perl, and the [http://search.cpan.org/dist/Bot-BasicBot/lib/Bot/BasicBot.pm Bot::BasicBot] and [http://search.cpan.org/~nwclark/perl-5.8.8/lib/IPC/Open2.pm IPC::Open2] Perl modules. You may use the code under the GNU General Public License.

If you want to modify Wherebot to run on a different wiki or language, some changes need to be made. I have marked the places where people may want to do so with comments containing the text "#CONFIG".
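To make the configurable filtering easier to follow before reading the Perl, here is a minimal Python 3 sketch of what the <code>said</code> handler in cv-watch.pl appears to do: derive the site from the channel name, pull the page URL out of the feed line, and keep only newly created pages outside the ignored namespaces. The function name <code>new_page_url</code>, the two tuples, and the exact layout of the feed line are assumptions made for illustration; unlike the Perl, the sketch excludes the trailing colour-code character in the pattern itself instead of chopping it off later.

```python
import re

# Assumption: these lists mirror the #CONFIG filters in cv-watch.pl.
IGNORED_PAGES = ("Talk:", "talk:", "Sandbox", "Articles for deletion",
                 "Wikipedia:Introduction")
IGNORED_NAMESPACES = ("User:", "Wikipedia:", "Portal:", "Help:",
                      "Template:", "Category:", "Image:")


def new_page_url(channel, raw_message):
    """Return the URL of a newly created page worth checking, or None."""
    # "#en.wikipedia" -> "en.wikipedia"
    site = channel.replace("#", "", 1)
    # The URL follows the IRC colour code \x03 + "02" in the feed line.
    pattern = r"\x0302(http://" + re.escape(site) + r"\.org[^ \x03]+)"
    match = re.search(pattern, raw_message)
    if match is None:
        return None
    url = match.group(1)
    # "N" followed by colour code \x03 + "10" marks a new page.
    if "N\x0310" not in raw_message:
        return None
    if any(marker in url for marker in IGNORED_PAGES + IGNORED_NAMESPACES):
        return None
    return url
```

Adapting the bot to another wiki would then mostly mean changing the channel name and the two tuples, which correspond to the #CONFIG lines in the Perl source.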

Please go into edit mode to see the source of the program with proper linebreaks.

Here is the main file, cv-watch.pl. Place it where you wish:

#!/usr/bin/perl

use strict;

#some of the IRC parts of this bot are based on the Bot::BasicBot sample code

Wherebot->new(channels => ["#en.wikipedia", "#en.wikiversity"], nick=>"Wherebot4", server => "irc.wikimedia.org")->run(); #CONFIG: change Wherebot4 to something unique

package Wherebot;

use base qw/Bot::BasicBot/;

use IPC::Open2;

sub said {

shift(); #don't care about the first parameter

our %hash = %{shift()};

our $rawMessage = $hash{"body"};

our $channel = $hash{"channel"};

our $site = $channel;

$site =~ s&#&&;

$rawMessage =~ m#\x{03}02(http://\Q$site\E\.org[^ ]+)# or return; #the URL follows the IRC color code \x03 02; skip lines without one

our $url = $1;

#CONFIG: the next four lines ignore certain pages. Customize if you like.

if ($url =~ /[Tt]alk:/) {return;}

if ($url =~ /Sandbox/) {return;}

if ($url =~ /Articles for deletion/) {return;}

if ($url =~ /Wikipedia:Introduction/) {return;}

chop $rawMessage;

if ($rawMessage =~ /N\x{03}10/) {

#CONFIG: the next seven lines ignore certain namespaces. Customize if you like.

if ($url =~ /User:/) {return;}

if ($url =~ /Wikipedia:/) {return;}

if ($url =~ /Portal:/) {return;}

if ($url =~ /Help:/) {return;}

if ($url =~ /Template:/) {return;}

if ($url =~ /Category:/) {return;}

if ($url =~ /Image:/) {return;}

&act($channel, $url);

}

return; #return nothing so that the bot does not reply in the channel

}

sub URLDecode { #From http://glennf.com/writing/hexadecimal.url.encoding.html

my $theURL = $_[0];

$theURL =~ tr/+/ /;

$theURL =~ s/%([a-fA-F0-9]{2,2})/chr(hex($1))/eg;

$theURL =~ s/<!--(.|\n)*-->//g;

return $theURL;

}

sub act {

our $misc = "/home/where/misc";

our $channel = shift;

our $url = shift;

$url =~ s#'##g; #just in case, although this would never be necessary

chop $url;

our $term = `wget '$url?action=raw' -q -O - | head -n 1`;

chomp $term;

our $origUrl = $url;

$url =~ m#/wiki/(.*)#;

our $page = $1;

$url .= "?action=raw";

$url =~ s#'##g; #shouldn't be a problem, but hey, I'm paranoid

chomp $term;

$term = &trim($term); #get it to <100 words so yahoo doesn't go crazy

if ($term =~ /#redirect/i) {

return;

}

if ($term =~ /^\{/) {

return;

}

if ($term =~ /^</) { #pattern reconstructed (assumption): skip pages that start with an HTML tag or comment

return;

}

$term =~ s#'''##g;

$term =~ s#''##g;

$term =~ s#\[\[##g;

$term =~ s#\]\]##g;

$term =~ s#\*##g;

$term =~ s#"##g; #Yahoo chokes on quotes; yes, this will probably return false matches, but it is better than the alternative

$term =~ s#\(##g;

$term =~ s#\)##g;

if (m#([^]+)[]#) { #same thing with parenthesis
$term = $1;
}

if (length($term) < 75) {

return;

}

our $firstLine;

our $n=0;

while (1) {

our $pid = open2(*Reader, *Writer, "python", "$misc/search.py", "-t", "web", '"' . $term . '"'); #CONFIG: change $misc/search.py to the path to search.py from the Yahoo search API

$firstLine = <Reader>;

# print "($url): FL: $firstLine\n";

if ($firstLine =~ /Internal WebService error, temporarily unavailable/ || $firstLine =~ /^Got an error/) {

warn "Search failed; retrying\n";

sleep 60;

waitpid $pid, 0;

++$n;

if ($n < 3) {

next;

}

else {

last;

}

}

else {

waitpid $pid, 0;

last;

}

}

if (!($firstLine =~ /^No results\s*/)) {

<Reader>;<Reader>; #skip some lines

our $from = <Reader>;

$from =~ s#\s##g;

if ($from =~ m#^http://en\.wikipedia\.org# || $from =~ m#\.gov# || $from =~ m#^http://en.wikibooks#) {

return;

}

#Get the page in the proper format

$page = &URLDecode($page);

$page =~ s#_# #g;

our $strippedUrl = $from;

$strippedUrl =~ s#^http://##;

#print "($page) copyvio from $from\n";

if ($channel eq "#en.wikipedia") { #CONFIG: change this line according to your language and version

chdir "$misc/pywikipedia"; #CONFIG: change this line according to where your pywikipedia directory is

}

print "Writing\n";

open APPEND_PY, "|nice -n 10 python append.py";

print APPEND_PY "* $page -- [$from $strippedUrl]. Reported at ~~~~~";

close APPEND_PY;

}

}

sub trim { #cut parameter to <100 words

our $in = shift;

our @in = split / /, $in;

our $out = "";

our $i = 1;

for (@in) {

$out .= $_ . " ";

++$i;

if ($i == 99) {

last;

}

}

chop $out; #get rid of last space

return $out;

}
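As a reading aid for the <code>act</code> and <code>trim</code> subroutines above, the following Python 3 sketch shows roughly how the first line of a new page is turned into a search phrase: it is capped at 98 words, redirects and template-led pages are skipped, common wiki markup is removed, and anything shorter than 75 characters is dropped. The name <code>search_phrase</code> and the two constants are hypothetical, and the sketch covers only the redirect, template, markup and length steps, not the whole subroutine.

```python
import re

MAX_WORDS = 98    # trim() in cv-watch.pl stops after 98 words
MIN_LENGTH = 75   # shorter phrases are not sent to the search engine


def search_phrase(first_line):
    """Return a cleaned search phrase, or None if the line should be skipped."""
    # Cap the line at MAX_WORDS words, splitting on single spaces as trim() does.
    words = first_line.rstrip("\n").split(" ")
    term = " ".join(words[:MAX_WORDS])
    # Skip redirects and pages that open with a template.
    if re.search(r"#redirect", term, re.IGNORECASE):
        return None
    if term.startswith("{"):
        return None
    # Remove bold and italic quotes, link brackets, bullets,
    # double quotes and parentheses, in the same order as the Perl.
    for token in ("'''", "''", "[[", "]]", "*", '"', "(", ")"):
        term = term.replace(token, "")
    if len(term) < MIN_LENGTH:
        return None
    return term
```

Removing double quotes and parentheses loses some precision, as the comments in the Perl acknowledge, but the phrase is wrapped in quotes before being handed to the search script, so stray quote characters inside it would break the query.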

The following file, append.py, should go in the pywikipediabot directory.

#!/usr/bin/python

import wikipedia

import sys

site = wikipedia.getSite()

page = wikipedia.Page(site, "User:Where/Sandbox") #CONFIG: Change page

text = page.get()

text = unicode(text + "\n") + unicode(raw_input(), 'utf8')

wikipedia.setAction("Adding a suspected copyright violation") #CONFIG: change edit summary

page.put(text,minorEdit=False)
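The text that append.py reads from standard input is a single wiki list item built by cv-watch.pl. The Python 3 sketch below shows how that item appears to be assembled from the page path and the URL of the matching site; <code>report_line</code> is a hypothetical name, and <code>unquote_plus</code> decodes percent escapes as UTF-8, which differs from the byte-wise decoding in the Perl <code>URLDecode</code> subroutine.

```python
from urllib.parse import unquote_plus


def report_line(page_path, source_url):
    """Build the wiki list item that cv-watch.pl pipes to append.py."""
    # Decode the title and turn underscores into spaces:
    # "Foo_%28bar%29" -> "Foo (bar)"
    title = unquote_plus(page_path).replace("_", " ")
    # Show the link without its scheme, as the Perl does with $strippedUrl.
    stripped = source_url
    if stripped.startswith("http://"):
        stripped = stripped[len("http://"):]
    # "~~~~~" is MediaWiki markup that expands to a timestamp on save.
    return "* %s -- [%s %s]. Reported at ~~~~~" % (title, source_url, stripped)
```

For example, <code>report_line("Foo_%28bar%29", "http://example.com/page")</code> yields a bullet for "Foo (bar)" with an external link labelled "example.com/page".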

You need a user-config.py file in the pywikipediabot dir. Here's mine:

mylang='en' #CONFIG: change for your wiki language

usernames['wikipedia']['en']='Wherebot' #CONFIG: change for your wiki, wiki language and username

maxthrottle=2

put_throttle=3

Now run login.py in the pywikipediabot dir.

Finally, run cv-watch.pl.