User:Gwern/Archive-bot.hs

{- Module : Main.hs

License : public domain

Maintainer : Gwern Branwen

Stability : unstable

Portability : portable

Functionality: retrieve specified articles from Wikipedia and request WebCite to archive all URLs found.

USE: Print to stdin a succession of Wikipedia article names (whitespace in names should be escaped as '_').

A valid invocation might be, say: '$echo Fujiwara_no_Teika Fujiwara_no_Shunzei | archive-bot'

All URLs in Fujiwara no Teika and Fujiwara no Shunzei would then be backed up.

If you wanted to run this on all of Wikipedia, you could take the current 'all-titles-in-ns0'

gzipped file from WP:DUMP, gunzip it, and then pipe it into archive-bot.

TODO: send an equivalent request to the Internet Archive.

Not in any way rate-limited.

BUGS: Issues redundant archive requests.

Currently uses Data.ByteString.Lazy.Char8. If I'm understanding the documentation right, this barfs

on the full UTF-8 character set, but Wikipedia definitely exercises the full UTF-8 set. I *would* use

Data.ByteString.Lazy, but that doesn't have 'lines', 'unlines', and 'words'. Need to ask #haskell/Dons

what's up. -}

module Main where

import Monad (liftM)

import Control.Concurrent (forkIO)

import Text.HTML.TagSoup (parseTags, Tag(TagOpen))

import Text.HTML.Download (openURL)

import qualified Data.ByteString.Lazy.Char8 as B (ByteString(), getContents, lines, unlines, pack, unpack, putStrLn, words)

import Data.List (isPrefixOf)

import Data.Set (toList, fromList)

main :: IO ()

main = mapM_ (forkIO . archiveURL) =<< (liftM sortNub $ mapM fetchArticleText =<< (liftM B.words $ B.getContents))

where sortNub :: B.ByteString -> [B.ByteString]

sortNub = toList . fromList . concat

fetchArticleText :: B.ByteString -> IO [B.ByteString]

fetchArticleText article = liftM (B.lines . extractURLs) (openURL(wikipedia ++ B.unpack article))

where wikipedia = "http://en.wikipedia.org/wiki/"

extractURLs :: String -> B.ByteString

extractURLs arg = B.unlines [B.pack x | TagOpen "a" atts <- (parseTags arg), (_,x) <- atts, "http://" `isPrefixOf` x]

archiveURL :: B.ByteString -> IO ()

archiveURL url = do B.putStrLn url -- Note that the use of forkIO means only some URLs will print

openURL("www.webcitation.org/archive?url=" ++ (B.unpack url) ++ emailAddress)

return ()

where emailAddress = "&email=foo@bar.com"