Wikipedia:Bots/Requests for approval/Bender the Bot 2

Bender the Bot 2

[[User:Bender the Bot|Bender the Bot 2]]

{{Newbot|Bender the Bot|2}}

Operator: {{botop|Bender235}}

Time filed: 19:48, Saturday, August 20, 2016 (UTC)

Automatic, Supervised, or Manual: Automatic

Programming language(s): AutoWikiBrowser

Source code available:

Function overview: HTTP → HTTPS conversion for Google News and Google Books links

Links to relevant discussions (where appropriate): Wikipedia:Village pump (proposals)/Archive 127#RfC: Should we convert existing Google and Internet Archive links to HTTPS?

Edit period(s): one time run

Estimated number of pages affected: conservatively estimated at 100k (but possibly 300k or more)

Exclusion compliant (Yes/No): Yes

Already has a bot flag (Yes/No): Yes

Function details: Since the transition of Internet Archive links to HTTPS is finished and WaybackMedic will take care of the Wayback Machine, I now want to fix links to Google services, starting with Google News and Google Books. The bot should find strings matching

:<s>(http[s]?:\/\/)?news\.google\.[^\/]+ and (http[s]?:\/\/)?books\.google\.[^\/]+</s> (superseded; see below)

:http[s]?:\/\/news\.google\.[^\/]+ and http[s]?:\/\/books\.google\.[^\/]+

replaced with

:https://news.google.com and https://books.google.com, respectively
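The find-and-replace rule above can be sketched in Python for illustration. This is only a sketch, not the bot's actual AutoWikiBrowser configuration: the function name <code>convert</code> and the sample link are hypothetical, and Python's <code>re</code> module stands in for AWB's regex engine.

```python
import re

# Patterns copied from the request; the scheme (http:// or https://) is required.
NEWS = re.compile(r"http[s]?:\/\/news\.google\.[^\/]+")
BOOKS = re.compile(r"http[s]?:\/\/books\.google\.[^\/]+")


def convert(wikitext: str) -> str:
    """Rewrite Google News/Books links to HTTPS on the .com domain (hypothetical helper)."""
    wikitext = NEWS.sub("https://news.google.com", wikitext)
    return BOOKS.sub("https://books.google.com", wikitext)


# Hypothetical sample link, not taken from any article.
sample = "[http://books.google.co.uk/books?id=abc123 A book]"
print(convert(sample))
# [https://books.google.com/books?id=abc123 A book]
```

Because <code>[^\/]+</code> consumes everything up to the next slash, the scheme and the whole host (including any country-specific TLD) are replaced in one step, while the path and query string are left untouched.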

The reasons for the change to HTTPS in general have already been elaborated in the RfC. In this particular case, note that http://books.google.com/ automatically redirects to HTTPS (ever since 2012 or so). That means links from Wikipedia (which is HTTPS by default) go HTTPS→HTTP→HTTPS, which is not only slower than HTTPS→HTTPS but also drops the HTTP Referer header (per [https://www.w3.org/Protocols/rfc2616/rfc2616-sec15.html#sec15.1.3 RFC 2616 §15.1.3]).

Furthermore, I want to combine the HTTPS move with a change of the TLD to .com, especially for those international TLDs considered "sensitive" in certain regions (like .co.il in Arab countries, or .com.tw in China).

=Discussion=

Isn't (http[s]?:\/\/)?news.google\.[^\/]+ ([https://regex101.com/r/wD6cB4/1 editor]) the regex that should get replaced with https://news.google.com?--Joel Amos (talk) 18:34, 22 August 2016 (UTC)

:Yes it is. Sorry, I had that wrong. Fixed above. Thanks. --bender235 (talk) 19:01, 22 August 2016 (UTC)

::That's fine. Also, the brackets aren't needed around the "s", and a backslash should precede the first "." (my bad). Also, you'll want to remove the trailing slash from the replacement string so that it doesn't change news.google.com/hello to news.google.com//hello. Edit: beat me to it :D --Joel Amos (talk) 19:39, 22 August 2016 (UTC)

:::Fixed the backslash (although it worked fine when I tested it). --bender235 (talk) 19:53, 22 August 2016 (UTC)

::::An un-escaped dot means "any character," so the old regex would've matched false positives (e.g. news@google.com).--Joel Amos (talk) 02:09, 23 August 2016 (UTC)

:::::Fair enough. --bender235 (talk) 14:35, 23 August 2016 (UTC)
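The two points raised in this exchange (the unescaped dot and the trailing slash in the replacement string) can be reproduced with a short Python sketch. The variable names are illustrative only, and Python's <code>re</code> module is assumed to behave like AWB's regex engine for these simple patterns.

```python
import re

# The pattern as first quoted in the discussion: the dot after "news" is unescaped.
unescaped = re.compile(r"(http[s]?:\/\/)?news.google\.[^\/]+")
# The corrected pattern: every literal dot is escaped.
escaped = re.compile(r"(http[s]?:\/\/)?news\.google\.[^\/]+")

# An unescaped dot matches any character, so "@" is accepted in its place.
print(bool(unescaped.search("news@google.com")))  # True
print(bool(escaped.search("news@google.com")))    # False

# A trailing slash in the replacement doubles the slash that starts the path.
link = "http://news.google.com/hello"
with_slash = escaped.sub("https://news.google.com/", link)
without_slash = escaped.sub("https://news.google.com", link)
print(with_slash)     # https://news.google.com//hello
print(without_slash)  # https://news.google.com/hello
```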

:What now? Should I have a trial run of 100 articles like with the previous Internet Archive conversion? --bender235 (talk) 23:39, 26 August 2016 (UTC)

::This may require multiple rounds of trials (hopefully increasing in size). Please run a short trial and post the initial results below. Please include in all edit summaries either a link to this BRFA trial or some other way for concerned editors to easily see what is going on and reply. — xaosflux Talk 02:51, 27 August 2016 (UTC)

:{{BotTrial|edits=50}} — xaosflux Talk 02:51, 27 August 2016 (UTC)

::{{BotTrialComplete}} Results are in {{user|Bender the Bot}} edit history. Found one issue, on E. R. Cowell: the regex caught not only the URL but also the pseudo-URL in the |publisher= parameter, and crippled the rest of the citation template (ran manually, didn't save). The best solution would be to have things like |publisher=Books.google.ca replaced with |via=Google Books (obviously Google Books is not the publisher of the books). Or, and that is the easier option for now, make the http:// in the regex non-optional, so that it only replaces true URLs. Actually, I suggest the latter to keep this bot as simple as possible. --bender235 (talk) 22:53, 27 August 2016 (UTC)
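A rough Python sketch of the failure mode described above, and of the proposed fix. The citation is a made-up example, not the actual wikitext of the affected article, and the lowercase <code>books.google.ca</code> is an assumption so that a case-sensitive match applies.

```python
import re

# Trial pattern: the scheme is optional, so bare host names also match.
optional_scheme = re.compile(r"(http[s]?:\/\/)?books\.google\.[^\/]+")
# Proposed fix: the scheme is required, so only true URLs match.
required_scheme = re.compile(r"http[s]?:\/\/books\.google\.[^\/]+")

# Hypothetical citation with a pseudo-URL in |publisher=.
cite = ("{{cite book |title=Example "
        "|url=http://books.google.ca/books?id=abc123 "
        "|publisher=books.google.ca |year=1900}}")

broken = optional_scheme.sub("https://books.google.com", cite)
fixed = required_scheme.sub("https://books.google.com", cite)

# With the optional scheme, the pseudo-URL matches too, and [^\/]+ swallows
# the rest of the template because no further slash follows it.
print(broken.endswith("}}"))  # False
# With the required scheme, only the |url= value changes.
print(fixed)
```

In this sketch the second match runs to the end of the string, taking <code>|year=</code> and the closing braces with it, which is consistent with the "crippled" citation template reported in the trial.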

:::{{t1|BAG assistance needed}} So, any further requests or can this bot go live? --bender235 (talk) 20:56, 6 September 2016 (UTC)

:{{u|Bender235}} Due to the huge size of your bot run, I'd like you to run a longer trial to give more opportunity for any odd issues to come up and get caught by other editors. — xaosflux Talk 04:43, 15 September 2016 (UTC)

:{{BotExtendedTrial|edits=600}} — xaosflux Talk 04:43, 15 September 2016 (UTC)

::Fair enough. --bender235 (talk) 14:16, 15 September 2016 (UTC)

::{{BotTrialComplete}}. Didn't spot any unusual behavior. --bender235 (talk) 15:49, 15 September 2016 (UTC)

:{{BotApproved}} Due to your large run size, please ramp up in the following stages. This will allow brief periods for unknown issues to be brought to your attention.

:#3000 edits, 24 hour pause

:#4000 edits, 24 hour pause

:#5000 edits, 24 hour pause

:#10000 edits, 24 hour pause

:#50000 edits, 24 hour pause

:#Rest of run. — xaosflux Talk 01:19, 19 September 2016 (UTC)

:The above discussion is preserved as an archive of the debate. Please do not modify it. To request review of this BRFA, please start a new section at WT:BRFA.