:Wikipedia:Bots/Requests for approval/DumZiBoT

DumZiBoT 1

[[User:DumZiBoT|DumZiBoT]]

{{Newbot|DumZiBoT}}

Operator: NicDumZ ~

Automatic or Manually Assisted: Both. I choose at every run.

Programming Language(s): Python, using the pywikipedia framework.

Function Summary: Idea from Wikipedia:Bot requests#Incorrect Ref Syntax

Edit period(s) (e.g. Continuous, daily, one time run): Every time that a new dump is available.

Edit rate requested: I don't know; standard pywikipedia throttle settings. Edit: from my test runs on fr:, 5 to 10 edits per minute.

Already has a bot flag (Y/N): N.

Function Details: Read User:DumZiBoT/refLinks. Please feel free to correct English mistakes if you find any; it is intended to be a runtime FAQ ;)

The script has been manually tested on fr, where I already have a bot flag for :fr:Utilisateur:DumZiBoT and ~40k automated edits. From the ~20 edits that I made over there, I found several exceptions, which are now fixed.
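
(Below is a minimal sketch of the kind of rewrite involved, using only the standard library; the pattern and helper names are made up for illustration and this is not the bot's actual code. The real script also deals with encoding, HTTP errors and disabled text, as discussed further down.)

<source lang="python">
import re
import urllib2

# Matches a bare bracketed link inside <ref> tags, e.g. <ref>[http://example.com]</ref>
BARE_REF = re.compile(r'<ref>\s*\[(?P<url>https?://\S+?)\s*\]\s*</ref>', re.IGNORECASE)
TITLE = re.compile(r'<title>(.*?)</title>', re.IGNORECASE | re.DOTALL)

def add_titles(wikitext):
    """Return wikitext with bare bracketed refs turned into titled links."""
    def repl(match):
        url = match.group('url')
        try:
            html = urllib2.urlopen(url).read()
        except IOError:
            return match.group(0)        # leave the ref untouched on HTTP errors
        found = TITLE.search(html)
        if not found:
            return match.group(0)
        title = ' '.join(found.group(1).split())    # collapse whitespace
        return '<ref>[%s %s]</ref>' % (url, title)
    return BARE_REF.sub(repl, wikitext)
</source>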

Sidenotes:

  • Why am I requesting approval here instead of fully testing it on fr?
  • There is another concern on fr: most of the HTML titles I retrieve are in English, and I'm pretty sure that sooner or later someone will have something against that. I must fully test it somewhere else, where English titles are fine, to be sure that my script is technically correct before trying to solve any ideological problems on fr ;)
  • Besides, the bot owner community on fr is *very* small, and very unresponsive during these holidays. Development and test runs will be far more effective here.
  • Update: I eventually made a full run on fr.
  • I may use this account occasionally to perform edits using the basic pywikipedia scripts. For now, I'm thinking about running interwiki scripts on Warning files generated from the French wiki.

= Discussion =

Do you have an estimate of the total number of pages to be edited in the first full run on enwiki? Does your parser for the dumps match things with extra spaces, such as [http://google.com ]? — Carl (CBM · talk) 21:50, 29 December 2007 (UTC)

:I can easily count this. I am currently having trouble getting the latest en: dump, so it will have to wait... tomorrow, I'd say. But as an estimate, the number of pages to alter on fr is ~5500, not that much. For your second question, the answer is yes. NicDumZ ~ 23:05, 29 December 2007 (UTC)

::Count from enwiki-latest-pages-articles.xml (October 23rd; pages-meta is newer, but I am still downloading it): ~62,300. Quite a lot, in fact. NicDumZ ~ 14:52, 30 December 2007 (UTC)
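
(As an illustration of the extra-space case raised above: a pattern along the following lines would catch both forms. This is only an assumption about how such a check could be written, not the bot's actual regex.)

<source lang="python">
import re

# Allows optional whitespace around the URL, so that "[http://google.com ]"
# is caught as well as "[http://google.com]" (illustrative pattern only).
bare_link = re.compile(r'\[\s*(?P<url>https?://[^\s\]]+)\s*\]')

for sample in ('[http://google.com]', '[http://google.com ]'):
    m = bare_link.search(sample)
    print sample, '->', m.group('url') if m else None
</source>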

I did another run on fr, longer this time: [http://fr.wikipedia.org/w/index.php?title=Special:Contributions&offset=20071229233742&limit=70&target=DumZiBoT]. I had some rare encoding problems ([http://fr.wikipedia.org/w/index.php?title=%C3%89thanol&diff=prev&oldid=24492907]) which I need to fix, but everything seems fine to me. NicDumZ ~ 00:07, 30 December 2007 (UTC)
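
(A rough sketch of one way such encoding problems can be handled, assuming the charset is taken from the server's Content-Type header with a Latin-1 fallback; the actual fix in the script may well differ.)

<source lang="python">
import urllib2

def fetch_decoded(url):
    """Return the page body decoded with the charset announced by the server,
    falling back to ISO-8859-1 when none is given (illustration only)."""
    response = urllib2.urlopen(url)
    content_type = response.info().get('Content-Type', '')
    charset = 'iso-8859-1'                      # conservative default
    if 'charset=' in content_type:
        charset = content_type.split('charset=')[-1].strip().strip('"\'')
    return response.read().decode(charset, 'replace')
</source>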

I had been thinking about doing something similar to this for a while, except my bot would have been tailored to placing {{tl|cite news}} with most parameters (date, title, author) filled in from the website. Any similar ideas? And will you be placing a comment to notify editors that the title was automatically added (so they aren't left wondering why some of the links have strange titles)? I look forward to seeing your source code. —Dispenser (talk) 08:47, 30 December 2007 (UTC)

  • At first I thought about using {{tl|Cite web}}, but some French users objected that using a rather complex template was unnecessary when the simple wiki syntax would do... The sole advantage would be that the accessdate parameter could be stated. (How would you retrieve an author parameter?) Is that worth it?
  • Your comment idea was nice. I just added the functionality, thanks =)
  • : Are you checking that the link isn't inside a comment block? Some editors will comment out links. —Dispenser (talk) 18:09, 30 December 2007 (UTC)
  • ::I'm wondering if I should do it. I mean... a commented-out "bad" reference might be converted into a commented-out reference with an automated title. And...? Is it a problem? Does anyone have a problem with that? NicDumZ ~ 19:09, 30 December 2007 (UTC)
  • :::I was merely asking if it would do something like turning <!-- [http://www.example.com] --> into <!-- [http://www.example.com title] -->, i.e. whether a commented-out link would also get a title. —Dispenser (talk) 20:02, 30 December 2007 (UTC)
  • ::::Now I see what you mean! I forgot that case :( To avoid that problem, the page text now goes through wikipedia.removeDisabledParts(), which removes text inside <nowiki>, <pre> and <source> markup and inside comments (see the sketch after this list). Thanks! NicDumZ ~ 22:17, 30 December 2007 (UTC)
  • :::::I had a serious bug, caused by a wrong implementation of that functionality (all the text inside these markups was being removed!!!!). It has now been fixed. NicDumZ ~ 13:29, 31 December 2007 (UTC)
  • I will post my code on a subpage before the final approval, but after a successful test run.
  • My script now also logs HTTP errors to a file, so that I can later check the affected pages with weblinkchecker.py. NicDumZ ~ 12:21, 30 December 2007 (UTC)
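
(The sketch referred to above: a rough illustration of how commented-out links can be excluded using wikipedia.removeDisabledParts(), assuming the pywikipedia core module is imported as wikipedia. The regex and helper are made up; this is not the bot's actual code.)

<source lang="python">
import re
import wikipedia   # pywikipedia's core module

BARE_REF = re.compile(r'<ref>\s*\[https?://\S+?\s*\]\s*</ref>')

def refs_to_fix(page):
    """Collect bare refs from a copy of the text with comments, <nowiki>,
    <pre>, etc. stripped, so that commented-out links are never candidates.
    The untouched page.get() text is what later gets edited and saved;
    stripping it directly is the bug described above."""
    enabled = wikipedia.removeDisabledParts(page.get())
    return set(BARE_REF.findall(enabled))
</source>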

I would definitely be interested to read through the code. I think that using {{tl|cite news}} isn't a good idea, since in most cases you won't be able to fill in the details. You could add the string "Accessed YYYY-MM-DD" after the link, inside the ref tags, without much trouble. — Carl (CBM · talk) 15:37, 30 December 2007 (UTC)

:The {{tl|cite news}} template would have to be coded for each specific site. The ideal form would be to store this in a dictionary, which would allow new sites to be added easily. Having proper citations would immensely help with the dead-link problem in citations.

::Ah! No offense, but you don't seem to know what you're talking about. My bot will have to handle thousands of different websites: would you write a different handler for each one? xD NicDumZ ~ 19:09, 30 December 2007 (UTC)

:::No, but I would write it for the 20 biggest sites that regularly remove content after a few months, especially those that block access to the Wayback Machine via robots.txt. Some I'd like to see covered are the New York Times, Yahoo News, Reuters, The Times, and the Los Angeles Times. Most of these will probably use the same regex anyway. —Dispenser (talk) 20:02, 30 December 2007 (UTC)

::::I checked yesterday for NYTimes links. It happens that the (rare) links which look like http://www.nytimes.com/* can easily be parsed to retrieve an author name and a publication date. But the huge majority of other links (http://select.nytimes.com/*, for example) don't appear to have a common format, nor to give the author on every article... NicDumZ ~ 08:11, 3 January 2008 (UTC)

:::::I will have to concede this point, because of the aforementioned problem. I did come up with an idea for a user-based system, but it is currently unimplementable. —Dispenser (talk) 06:56, 7 January 2008 (UTC)
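
(For what it's worth, the per-site dictionary idea discussed above could look roughly like this. The site keys and patterns are hypothetical, and nothing of the sort is implemented in the bot.)

<source lang="python">
import re

# Hypothetical per-site handlers: each entry maps a domain to a regex that
# pulls citation fields out of the article URL (patterns are illustrative
# and would have to be written per site).
SITE_HANDLERS = {
    'www.nytimes.com': re.compile(r'/(?P<year>\d{4})/(?P<month>\d{2})/(?P<day>\d{2})/'),
}

def guess_date(url):
    """Return (year, month, day) when a known site encodes the date in its
    URLs, else None."""
    for domain, pattern in SITE_HANDLERS.items():
        if domain in url:
            m = pattern.search(url)
            if m:
                return m.group('year'), m.group('month'), m.group('day')
    return None
</source>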

An [http://fr.wikipedia.org/w/index.php?title=H%C3%A9ro%C3%AFne&diff=prev&oldid=24540345 issue] with JavaScript in the HTML. Will application/xml mime types be accepted? What about servers that don't send out any types? —Dispenser (talk) 10:03, 31 December 2007 (UTC)

: Thanks for this one. I saw it, but thought that it was some strange title. About the problem, I'm a bit... stuck. (By the way, the source of the [http://membres.lycos.fr/afghanainfo/act_03.08.2001.1.htm page] is... so wrong!) I don't think that I should ignore text inside