:Wikipedia:Bots/Requests for approval/DumZiBoT
{{Newbot|DumZiBoT}}

Automatic or Manually Assisted: Both. I choose at every run.

Programming Language(s): Python, pywikipedia framework.

Function Summary: Idea from Wikipedia:Bot requests#Incorrect Ref Syntax

Edit period(s) (e.g. Continuous, daily, one time run): Every time that a new dump is available.

Edit rate requested: I don't know. Standard pywikipedia throttle settings. Edit: from my test runs on fr:, 5 to 10 edits per minute.

Already has a bot flag (Y/N): N.

Function Details: Read User:DumZiBoT/refLinks. Please feel free to correct English mistakes if you find any; it is intended to be a runtime FAQ ;)

The script has been manually tested on fr, where I already have a bot flag for :fr:Utilisateur:DumZiBoT and ~40k automated edits. From the ~20 edits that I've made over there, I've found several exceptions, which are fixed by now.

Sidenotes:

Do you have an estimate of the total number of pages to be edited in the first full run on enwiki? Does your parser for the dumps match things with extra spaces such as

:I can easily count this. I'm currently having trouble getting the latest en: dump, so it will have to wait... until tomorrow, I'd say. As an estimate, the number of pages to alter on fr is ~5500, which is not that many. For your second question, the answer is yes. NicDumZ ~ 23:05, 29 December 2007 (UTC)

::Count from

I did another run on fr, longer this time: [http://fr.wikipedia.org/w/index.php?title=Special:Contributions&offset=20071229233742&limit=70&target=DumZiBoT]. I had some rare encoding problems ([http://fr.wikipedia.org/w/index.php?title=%C3%89thanol&diff=prev&oldid=24492907]) which I need to fix, but everything seems fine to me. NicDumZ ~ 00:07, 30 December 2007 (UTC)

I was thinking about doing something similar to this for a while, except my bot would have been tailored to placing {{tl|cite news}} with most parameters (date, title, author) filled from the website. Any similar ideas?
And will you be placing a comment to notify editors that the title was automatically added (so they aren't left wondering why some of the links have strange titles)? I look forward to seeing your source code. —Dispenser (talk) 08:47, 30 December 2007 (UTC)

I would definitely be interested to read through the code. I think that using {{tl|cite news}} isn't a good idea, since in most cases you won't be able to fill in the details. You could add the string "Accessed YYYY-MM-DD" after the link, inside the ref tags, without much trouble. — Carl (CBM · talk) 15:37, 30 December 2007 (UTC)

:The {{tl|cite news}} template would have to be coded for each specific site. The ideal form would be to store this in a dictionary, which would allow easy adding of new sites. Having proper citations would immensely help with the dead link problem in citations.

::Ah! No offense, but you don't seem to know what you're talking about. My bot will have to deal with thousands of different websites: would you write a different handler for each website? xD NicDumZ ~ 19:09, 30 December 2007 (UTC)

:::No, but I would write it for the 20 biggest that regularly remove content after a few months, especially those that block access to the Wayback Machine via robots.txt. Some which I'd like to see are New York Times, Yahoo News, Reuters, The Times, Los Angeles Times. Most of these will probably use the same regex anyway. —Dispenser (talk) 20:02, 30 December 2007 (UTC)

::::I checked yesterday for NYTimes links. It happens that the (rare) links which look like http://www.nytimes.com/* could be easily parsed to retrieve an author name and a publication date. But the huge majority of other links, http://select.nytimes.com/* for example, don't appear to have a common format, nor to give the author on every article... NicDumZ ~ 08:11, 3 January 2008 (UTC)

:::::I will have to concede on this point, because of the aforementioned problem. I did come up with an idea of a user-based system, but as of now it is unimplementable.
—Dispenser (talk) 06:56, 7 January 2008 (UTC)

An [http://fr.wikipedia.org/w/index.php?title=H%C3%A9ro%C3%AFne&diff=prev&oldid=24540345 issue] with JavaScript in the HTML. Will

: Thanks for this one. I saw it, but thought that it was some strange title. About the problem, I'm a bit... stuck. (By the way, the source of the [http://membres.lycos.fr/afghanainfo/act_03.08.2001.1.htm page] is... so wrong!) I don't think that I should ignore text inside
[[User:DumZiBoT|DumZiBoT]]
= Discussion =
?
— Carl (CBM · talk) 21:50, 29 December 2007 (UTC)
enwiki-latest-pages-articles.xml
(October 23rd; pages-meta is newer, but I am still downloading it): ~62,300. Quite a lot, in fact. NicDumZ ~ 14:52, 30 December 2007 (UTC)
accessdate
parameter could be stated. (how would you retrieve an author
parameter?) Is it worth it?
into
, which yields title] -->. —Dispenser (talk) 20:02, 30 December 2007 (UTC)
markup and inside comments. Thanks! NicDumZ ~ 22:17, 30 December 2007 (UTC)
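For readers following the JavaScript issue above, here is a minimal sketch of one way to pick up a page title while skipping script content and HTML comments. It is not DumZiBoT's actual code: the class `TitleExtractor` and the function `extract_title` are hypothetical names, and the sketch assumes Python's standard `html.parser` module rather than whatever the bot really uses.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of the first <title> element only (hypothetical helper)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._in_title = False
        self._done = False
        self._parts = []

    def handle_starttag(self, tag, attrs):
        # Script bodies are raw text for the parser, so a "<title>" written
        # inside JavaScript never reaches this method.
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        # Comments go to handle_comment (not overridden), so they are skipped.
        if self._in_title:
            self._parts.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None when no title text was found.
        text = " ".join("".join(self._parts).split())
        return text or None


def extract_title(html):
    """Return the cleaned title text of an HTML string, or None."""
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return parser.title
```

A badly formed page can still defeat this approach, so a bot would presumably keep a fallback for links whose title cannot be determined.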
application/xml
MIME types be accepted? What about servers that don't send out any types? —Dispenser (talk) 10:03, 31 December 2007 (UTC)
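To make the question concrete, here is a small sketch of how such a policy could be written down. It does not describe what DumZiBoT does: the function `may_contain_title`, the set `ACCEPTED_TYPES`, and both flags are hypothetical, and whether XML types or missing headers should be accepted is exactly the open point, so both are left as switches that default to off.

```python
# Media types that are treated as HTML in this sketch (an assumption).
ACCEPTED_TYPES = {"text/html", "application/xhtml+xml"}
# Generic XML types, accepted only on request.
XML_TYPES = {"application/xml", "text/xml"}


def may_contain_title(content_type, accept_xml=False, accept_missing=False):
    """Decide from a Content-Type header value whether to look for a title.

    content_type is the raw header value, or None when the server sent none.
    """
    if content_type is None or not content_type.strip():
        return accept_missing
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type in ACCEPTED_TYPES:
        return True
    if media_type in XML_TYPES:
        return accept_xml
    return False
```

For example, `may_contain_title("application/pdf")` is rejected, while `may_contain_title(None, accept_missing=True)` lets a server that sends no type through.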