User talk:GreenC bot
{{ombox
| type = content
| style = border:2px solid #B22222
| image = File:Crystal Clear action exit.svg
| text = You can stop the bot by pushing the stop button. The bot sees it and immediately stops running. Unless it is an emergency, please consider reporting problems first to my talk page.
}}
{{Archive box|search=yes|
}}
== Flagging non-dead link as dead ==
[https://en.wikipedia.org/w/index.php?title=Ojos_de_Mar&diff=1098932177&oldid=1077160870 This edit] flagged [https://www.academia.edu/download/30869670/Turismo_y_Territorio_en_Salta-_Caceres_et_al-_CONICET-UBA_2012.pdf this URL] as dead even though it isn't. Jo-Jo Eumerus (talk) 11:17, 18 July 2022 (UTC)
:Same with these edits:
:* https://en.wikipedia.org/w/index.php?title=Tiberius_Gracchus&oldid=1098930968
:* https://en.wikipedia.org/w/index.php?title=Caesar%27s_civil_war&oldid=1098935280
:I appreciate it probably has to do with some kind of automatic PDF link serving in JavaScript that Academia.edu uses, which wouldn't be readily captured by a bot. I don't know how fixable it is, but the links noted are not dead at all; I reverted both edits that the bot flagged. Ifly6 (talk) 14:35, 18 July 2022 (UTC)
:
:The url that Editor Jo-Jo Eumerus linked:
:*https://www.academia.edu/download/30869670/Turismo_y_Territorio_en_Salta-_Caceres_et_al-_CONICET-UBA_2012.pdf – dead for me
:Both of the urls that Editor Ifly6 links:
:*https://www.academia.edu/download/31557049/Peter_Russell_-_Babeuf_and_the_Gracchi_(MHJ_Vol._36_(2008)__pp._41-57).pdf – dead for me
:*https://www.academia.edu/download/51344857/Iris-_Fall_of_the_Roman_Republic.pdf – dead for me
:There was some discussion about these kinds of academia links at {{slink|Wikipedia:Link_rot/URL_change_requests|www.academia.edu/download/}}
:—Trappist the monk (talk) 14:43, 18 July 2022 (UTC) 14:46, 18 July 2022 (UTC)
Jo-Jo Eumerus & User:Ifly6: they are dead for me (USA). [https://ghostarchive.org/archive/UR1dl Example]. Are you getting a redirect to a cloudfront URL? I wonder if there is some kind of location-aware policy that determines when to serve the cloudfront URL vs. a 404. If the cloudfront URL were known, it would be possible to save it at the Wayback Machine, then use the Wayback-cloudfront URL on Wikipedia, treating the original as a dead link (due to its &Expires self-destruct mechanism; see WP:AWSURL). However, I wonder about copyright: if academia.edu is making them unavailable in the US and possibly elsewhere, why have that policy if not for a rights issue? -- GreenC 15:04, 18 July 2022 (UTC)
:I'm in the US and am getting the links promptly. The links I am getting are Cloudfront ones with an expiry; I used the Academia.edu links to avoid the known expiry. Ifly6 (talk) 15:41, 18 July 2022 (UTC)
::Ah, I see you use British English, so I assumed you were not in the US. What browser do you use? Do you have any plugins that might affect JavaScript? This is impacting archive providers as well, such as the Wayback Machine and Ghostarchive (US-based); they also get 404. Archive.today "works" (global IP pool) but is unable to correctly save the PDF. -- GreenC 16:00, 18 July 2022 (UTC)
:::I do get a "d1wqtxts1xzle7.cloudfront.net" sort of thing. Jo-Jo Eumerus (talk) 17:33, 18 July 2022 (UTC)
:::Language heuristics are always right 99pc of the time haha. I've confirmed on Edge (Windows 10) and Safari (macOS) that the Academia.edu links work. I don't have any plugins installed, other than ad blockers, that would affect something like this. The specific link that got generated for me with Rafferty was https://d1wqtxts1xzle7.cloudfront.net/51344857/Iris-_Fall_of_the_Roman_Republic-with-cover-page-v2.pdf. There was then a pile of GET parameters that I've excerpted – they change every time anyway – but are necessary to get the file served properly. Ifly6 (talk) 19:24, 18 July 2022 (UTC)
::::Jo-Jo Eumerus, do you use Edge or Safari? -- GreenC 19:38, 18 July 2022 (UTC)
::::Wikipedia:Village_pump_(technical)#academia.edu/download .. seeing if anything comes up here. -- GreenC 19:52, 18 July 2022 (UTC)
:::::Ifly6, in the above thread someone suggested perhaps you had signed up for an account on academia.edu at some point? Or some old cookies are giving permission. One way to test is to try accessing it from a private window. -- GreenC 20:46, 18 July 2022 (UTC)
::::::Yea, that's probably it. I opened it in a private window and got the 404. Ifly6 (talk) 20:57, 18 July 2022 (UTC)
:::::::Same for me (Firefox) Jo-Jo Eumerus (talk) 21:12, 18 July 2022 (UTC)
:::::::Cool, glad we figured out what is causing it. My thinking is to replace the academia.edu links with a Wayback version of the cloudfront URL so it's accessible for everyone. The second option is to use {{para|url-access|registration}}, but that 404 page is confusing and will result in bots marking it dead. -- GreenC 21:30, 18 July 2022 (UTC)
User:Jo-Jo Eumerus, User:Ifly6, User:Biogeographist: I would like to propose this solution: Special:Diff/1098978075/1099315632. It's only for academia.edu/download links, of which there are about 1,000 on enwiki.
academia.edu returns a 404 when a user is not registered and logged in, which is most users. It does not say "log in to access this paper"; rather it shows a misleading 404 dead-link page. This causes problems:
*Archive bots will determine the links are dead (404) and mark them with a {{tlx|dead link}}.
*Users will be confused, thinking the link is dead rather than behind a registration wall.
*Should the link ever actually die for real, there would be no archive available, since the Wayback Machine sees only a dead 404 page – the Wayback Machine is not a registered academia.edu user.
*While it is possible to use {{para|url-access|registration}}, this does not solve the misleading 404 problems.
*The cloudfront link is an AWS container with an &Expires self-destruct mechanism. It is where the paper is actually located (not on academia.edu, which redirects to cloudfront).
*The proposal is to determine the active cloudfront link via bot magic, immediately create a Wayback Machine save of the cloudfront URL, and change the citation to the Wayback-cloudfront link, e.g. Special:Diff/1098978075/1099315632.
This is what I can do somewhat easily right away. There are limits, due to bot design and coding effort, on what can be done. -- GreenC 04:15, 20 July 2022 (UTC)
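The save-and-cite step of the proposal above can be sketched as follows. This is a hedged illustration, not WaybackMedic's actual code; the only endpoints assumed are the Wayback Machine's Save Page Now entry point (https://web.archive.org/save/) and its standard /web/&lt;timestamp&gt;/ URL form, and the timestamp handling is simplified.

```python
from urllib import request

SPN = "https://web.archive.org/save/"

def wayback_citation_url(timestamp: str, url: str) -> str:
    """Build the archived URL that would replace the citation link."""
    return f"https://web.archive.org/web/{timestamp}/{url}"

def save_cloudfront(cloudfront_url: str) -> str:
    """Trigger a Save Page Now capture (network call; must happen while the
    cloudfront link is still live, i.e. before its &Expires deadline)."""
    with request.urlopen(SPN + cloudfront_url) as r:
        return r.geturl()  # Save Page Now redirects to the /web/<timestamp>/ capture
```

Since the cloudfront copy self-destructs, the archive URL is made the primary link in the citation rather than kept as a fallback.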
:Hmm. It seems a bit complex and I wonder if people will be deleting the "expires" part of the link. Jo-Jo Eumerus (talk) 10:22, 20 July 2022 (UTC)
::It's a complex situation. If they delete the &Expires, the URL will break (404). It will break anyway, due to the Expires; that is why the archive URL version is made the primary. The archive URL is accessible to everyone – an academia.edu account is not required. -- GreenC 15:30, 20 July 2022 (UTC)
Unfortunately there is something preventing cloudfront pages from being saved at Wayback. Not all pages, but most. So we have a bad situation with academia.edu/download links: ideally they should be converted to non-/download/ links, but that can't be done by bot; it requires manual searching. The /download/ links probably originate from Google Scholar copy-pasting. -- GreenC 15:56, 23 July 2022 (UTC)
== Backlinks report ==
User:Certes/Backlinks/Report seems to have stopped, but User:GoingBatty/Backlinks/Report is running normally. I've not added any new backlinks recently. Can you see anything else that I may have broken? Certes (talk) 11:17, 25 July 2022 (UTC)
:It aborted for unknown reasons. I increased the memory allocation by 10x in case that is the problem. The data may be messed up from the abort. I've restarted the process and will see over the next hour or so if it can recover. Worst case, I will just delete all the data and it will rebuild from scratch, but that will result in a missed day. -- GreenC 15:34, 25 July 2022 (UTC)
::Thanks. Let me know if I'm checking too many targets or if some produce exceptionally big reports, and I'll remove the less productive ones. Certes (talk) 15:45, 25 July 2022 (UTC)
:::It was crashing at "m", then after increasing memory it made it to "v". Odd, because it should not run out of memory, and there are no error messages, system or program, to suggest why it's silently halting, so it might be something different. I added debug statements; it takes a while to replicate, an hour or more. Thanks for your patience. -- GreenC 04:26, 26 July 2022 (UTC)
::::Odd: "m" and "v" are early in my list, and neither they nor anything earlier have many incoming links. If it's taking an hour then we may need to remove the entries with lowest benefit per second. A few entries have never triggered a fix and could probably be removed, but I've already removed the resource-heavy ones. Maybe I need to rate them all by fixes done per 1000 incoming links or similar and chop those scoring lowest. "v" is an oddity because it can indicate that the editor failed to press {{key|Ctrl}} when pasting: easy to spot, but hard to fix as you need to guess what was in their clipboard. Certes (talk) 12:39, 26 July 2022 (UTC)
:::::The memory problem appears to be cumulative: if I run m or v in isolation they do fine, but when running the whole bunch there is a massive spike in memory claim that occurs at the same spot, around v or x; the others also don't release their claims, so it builds up. It could be related to the Sun Grid Engine caching for performance reasons. I've checked the program for errant global vars and it's fine; there is nothing holding onto data. I might try separating the backlinks-retrieval portion into a different program so it exits between each item, clearing any memory claims. -- GreenC 16:48, 26 July 2022 (UTC)
::::::I think it is fixed. A combination of repetitive backlinks reported by the API and inefficiencies in the program magnifying those repetitions. It should never use more than about 25MB of RAM, but with "V" (and "v") it was as high as 1 gigabyte. Why V? I suspect it's due to WP:V, which is so commonly linked outside mainspace. V exposed the problem, but it was occurring at a smaller scale with everything else. (The API typically and erroneously reports 100s of the same backlink – I don't know why it's always done this.) "V" had 2.5 million non-unique occurrences. Add to this that the program was inefficient in how it dealt with the repetitions; it added up, and the Grid Engine said nope and dropped the job. Right now it's starting over rebuilding the database; it should be back to normal soon. -- GreenC 05:44, 27 July 2022 (UTC)
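The fix described above, in miniature (illustrative only, not the bot's actual code): deduplicate backlink titles as they stream in from the API's continuation batches, so millions of repeated entries collapse to the unique set instead of accumulating in memory.

```python
def unique_backlinks(batches):
    """batches: iterable of lists of page titles, as returned by
    successive API:Backlinks continuation requests. Yields each
    title once, keeping memory bounded by the unique count."""
    seen = set()
    for batch in batches:
        for title in batch:
            if title not in seen:
                seen.add(title)
                yield title

batches = [["WP:V", "Foo", "WP:V"], ["Foo", "Bar"]]
print(list(unique_backlinks(batches)))  # → ['WP:V', 'Foo', 'Bar']
```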
:::::::Thanks very much. The current version looks right, considering that it's for a few hours rather than the usual 24. Is it possible to add the namespace of the link target to the query? I'm not sure how you're extracting the data but, for example, Quarry would run its SQL much faster with "and pl_namespace=0". Certes (talk) 11:21, 27 July 2022 (UTC)
::::::::[https://www.mediawiki.org/wiki/API:Backlinks API:Backlinks]. When I first made this program (not your fork of it) around April 2015, Quarry was only about 6 months old, I think; anyway, I wasn't aware of it, and I wanted something that would run from anywhere, which left the API. Speed is not an issue when running daily, unless it takes > 24hrs. Your job completes in about 2 hours; it is exceptionally big. The API behavior of multiple results is weird but can be adjusted for. If it continues to be a problem I can look into Quarry; getting a JSON file would be nice. -- GreenC 15:41, 27 July 2022 (UTC)
:::::::::In that case, blnamespace is what I meant, but I'm not clear what it should be set to: the several namespaces in which relevant links appear, or ns 0 to which relevant links lead. If my job is taking two hours then I should be checking fewer targets; any clues as to which entries take the most time would help with that. Certes (talk) 18:27, 27 July 2022 (UTC)
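For reference, in API:Backlinks the blnamespace parameter filters the namespaces of the linking pages (where the links appear); the target page is fixed by bltitle. A minimal sketch of such a request (parameter choices here are illustrative assumptions, not what the bot actually sends):

```python
from urllib import parse

def backlinks_query(title, namespaces=(0,), limit=500):
    """Build an API:Backlinks URL restricted to the given namespaces
    of linking pages (0 = mainspace)."""
    params = {
        "action": "query",
        "list": "backlinks",
        "bltitle": title,
        "blnamespace": "|".join(str(n) for n in namespaces),
        "bllimit": str(limit),
        "format": "json",
    }
    return "https://en.wikipedia.org/w/api.php?" + parse.urlencode(params)

print(backlinks_query("V"))
```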
::::::::::Below is an 'ls' of the data files. The timestamps show how long each took to complete. The file size is misleading, as the program filters out namespaces. For example, "V" (and "v"; they are identical to the API) is not a very large file, but took almost 25 minutes to complete. It took about 85m to finish, not 120m; my mistake. V/v is about 50 minutes, U/u 20 minutes, N/n 10 minutes. Those are the big three and use 95% of the time (is that right?). Probably due to WP:V, WP:U and WP:N. -- GreenC 19:28, 27 July 2022 (UTC)
:::::::::::Thanks. I'll take V/v, U/u and N/n out then. U and N rarely get a hit. V gets more but I'm less confident about fixing them as most of them require me to guess what article the editor was thinking of. Certes (talk) 20:57, 27 July 2022 (UTC)
:::::::::::All working as normal today, and an hour faster than previously. Thanks again for your help. Certes (talk) 10:03, 28 July 2022 (UTC)
::::::::::::Yes, finished in 25 minutes. No single one took very long (or much memory!). You are welcome and thanks for reporting it because it uncovered a problem in the program that only became evident at scale. -- GreenC 15:52, 28 July 2022 (UTC)
{{collapse begin}}
22930 Jul 27 09:11 0.new
127027 Jul 27 09:11 1.new
16924 Jul 27 09:11 2.new
15575 Jul 27 09:11 3.new
15540 Jul 27 09:11 4.new
14709 Jul 27 09:12 5.new
12741 Jul 27 09:12 6.new
17054 Jul 27 09:12 7.new
15220 Jul 27 09:12 8.new
14745 Jul 27 09:12 9.new
7476 Jul 27 09:13 10.new
6315 Jul 27 09:13 100.new
15741 Jul 27 09:13 A.new
13776 Jul 27 09:13 B.new
16104 Jul 27 09:13 C.new
13410 Jul 27 09:13 D.new
13301 Jul 27 09:14 E.new
12605 Jul 27 09:14 F.new
13550 Jul 27 09:14 G.new
13518 Jul 27 09:14 H.new
14387 Jul 27 09:14 I.new
13005 Jul 27 09:14 J.new
12845 Jul 27 09:14 K.new
14099 Jul 27 09:14 L.new
13174 Jul 27 09:14 M.new
39805 Jul 27 09:18 N.new
13668 Jul 27 09:19 O.new
13088 Jul 27 09:19 P.new
11858 Jul 27 09:19 Q.new
14160 Jul 27 09:19 R.new
14529 Jul 27 09:19 S.new
13146 Jul 27 09:19 T.new
15718 Jul 27 09:21 U.new
96856 Jul 27 09:45 V.new
12403 Jul 27 09:45 W.new
12797 Jul 27 09:45 X.new
13659 Jul 27 09:45 Y.new
13403 Jul 27 09:45 Z.new
15741 Jul 27 09:45 a.new
13776 Jul 27 09:45 b.new
16104 Jul 27 09:45 c.new
13410 Jul 27 09:46 d.new
13301 Jul 27 09:46 e.new
12605 Jul 27 09:46 f.new
13550 Jul 27 09:46 g.new
13518 Jul 27 09:46 h.new
14387 Jul 27 09:46 i.new
13005 Jul 27 09:46 j.new
12845 Jul 27 09:46 k.new
14099 Jul 27 09:46 l.new
13174 Jul 27 09:46 m.new
39805 Jul 27 09:51 n.new
13668 Jul 27 09:51 o.new
13088 Jul 27 09:51 p.new
11858 Jul 27 09:51 q.new
14160 Jul 27 09:51 r.new
14529 Jul 27 09:51 s.new
13146 Jul 27 09:51 t.new
15718 Jul 27 09:53 u.new
96856 Jul 27 10:16 v.new
12403 Jul 27 10:16 w.new
12797 Jul 27 10:16 x.new
13659 Jul 27 10:16 y.new
13403 Jul 27 10:16 z.new
217699 Jul 27 10:17 ABC
5951 Jul 27 10:17 Accolade.new
118095 Jul 27 10:17 Acre.new
89027 Jul 27 10:17 Admiral.new
22088 Jul 27 10:17 Alphabet.new
29758 Jul 27 10:17 Amber.new
4295 Jul 27 10:17 Amen.new
31785 Jul 27 10:17 Aperture.new
2643 Jul 27 10:17 Ash.new
2643 Jul 27 10:17 ash.new
44238 Jul 27 10:17 Atlantic.new
1375 Jul 27 10:17 Back.new
1375 Jul 27 10:17 back.new
36337 Jul 27 10:17 Bay.new
36337 Jul 27 10:17 bay.new
53374 Jul 27 10:17 Bowling.new
53374 Jul 27 10:17 bowling.new
2048 Jul 27 10:17 Cabinet
36569 Jul 27 10:17 Captain.new
36569 Jul 27 10:17 captain.new
12368 Jul 27 10:17 Calvary.new
12368 Jul 27 10:17 calvary.new
26920 Jul 27 10:17 Caterpillar.new
28665 Jul 27 10:17 Chancellor.new
28665 Jul 27 10:17 chancellor.new
31754 Jul 27 10:17 Chestnut.new
31754 Jul 27 10:17 chestnut.new
4924 Jul 27 10:17 Chin.new
725 Jul 27 10:17 Clipboard.new
725 Jul 27 10:17 clipboard.new
44162 Jul 27 10:17 Colony.new
44162 Jul 27 10:18 colony.new
3070 Jul 27 10:18 Colonies.new
3070 Jul 27 10:18 colonies.new
55 Jul 27 10:18 Colors.new
55 Jul 27 10:18 colors.new
565 Jul 27 10:18 Colours.new
565 Jul 27 10:18 colours.new
138372 Jul 27 10:19 Company.new
138372 Jul 27 10:20 company.new
6611 Jul 27 10:20 Companies.new
6611 Jul 27 10:20 companies.new
14699 Jul 27 10:20 Consul.new
14699 Jul 27 10:20 consul.new
76725 Jul 27 10:20 Colorado
3180 Jul 27 10:21 Commonwealth.new
3180 Jul 27 10:21 commonwealth.new
30657 Jul 27 10:21 Conservative.new
1206 Jul 27 10:21 Conservatives.new
113900 Jul 27 10:21 Corvette.new
2005 Jul 27 10:21 Corvettes.new
28639 Jul 27 10:21 Delphi.new
48181 Jul 27 10:21 Family.new
48181 Jul 27 10:21 family.new
2257 Jul 27 10:21 Families.new
2257 Jul 27 10:21 families.new
61603 Jul 27 10:21 Icon.new
61603 Jul 27 10:21 icon.new
6665 Jul 27 10:21 Icons.new
6665 Jul 27 10:21 icons.new
5801 Jul 27 10:21 Interpreter.new
5801 Jul 27 10:21 interpreter.new
70977 Jul 27 10:21 Jupiter.new
12095 Jul 27 10:21 Knot.new
12095 Jul 27 10:21 knot.new
80891 Jul 27 10:21 Krishna.new
121459 Jul 27 10:21 Lead.new
121459 Jul 27 10:21 lead.new
127 Jul 27 10:21 Liberal
180 Jul 27 10:21 Libertarian
183969 Jul 27 10:22 Madonna.new
183969 Jul 27 10:22 madonna.new
65528 Jul 27 10:22 Mass.new
65528 Jul 27 10:22 mass.new
5378 Jul 27 10:22 Meta.new
770 Jul 27 10:22 Ministry
3160 Jul 27 10:22 Model.new
3160 Jul 27 10:22 model.new
176677 Jul 27 10:23 Moon.new
176677 Jul 27 10:23 moon.new
214735 Jul 27 10:23 National
199067 Jul 27 10:23 Oxygen.new
76332 Jul 27 10:23 Primate.new
76332 Jul 27 10:23 primate.new
5462 Jul 27 10:23 Roland.new
346 Jul 27 10:24 Ronaldo.new
68973 Jul 27 10:24 Salt.new
68973 Jul 27 10:24 salt.new
16813 Jul 27 10:24 Season.new
16813 Jul 27 10:24 season.new
44306 Jul 27 10:24 Shiraz.new
44306 Jul 27 10:24 shiraz.new
53287 Jul 27 10:24 Spire.new
53287 Jul 27 10:24 spire.new
153867 Jul 27 10:24 Stream.new
153867 Jul 27 10:24 stream.new
11482 Jul 27 10:24 Telegram.new
3845 Jul 27 10:24 Thermal.new
3845 Jul 27 10:24 thermal.new
88519 Jul 27 10:24 Tree.new
88519 Jul 27 10:24 tree.new
3102 Jul 27 10:24 Trojan
3102 Jul 27 10:24 trojan
167 Jul 27 10:24 U.S.
2334 Jul 27 10:24 Victory.new
26424 Jul 27 10:24 Ardennes.new
19159 Jul 27 10:24 Aspen.new
1884 Jul 27 10:24 Baler.new
105737 Jul 27 10:25 Batman.new
20662 Jul 27 10:25 Battle.new
53364 Jul 27 10:25 Bethlehem.new
439921 Jul 27 10:25 Birmingham.new
11530 Jul 27 10:25 Boulder.new
54094 Jul 27 10:25 Brampton.new
14995 Jul 27 10:25 Calvados.new
208354 Jul 27 10:25 Cambridge.new
71179 Jul 27 10:25 Canterbury.new
15715 Jul 27 10:25 Caracal.new
203571 Jul 27 10:26 Christchurch.new
78460 Jul 27 10:26 Cicero.new
43543 Jul 27 10:26 Durango.new
18943 Jul 27 10:26 East
296629 Jul 27 10:26 Edmonton.new
12304 Jul 27 10:26 Esplanade.new
25247 Jul 27 10:26 Eye.new
32977 Jul 27 10:26 Flint.new
151 Jul 27 10:26 Gladstone.new
81116 Jul 27 10:26 Gloucester.new
56266 Jul 27 10:26 Greenwich.new
780 Jul 27 10:26 Guna.new
21889 Jul 27 10:26 Horsham.new
199436 Jul 27 10:26 Hyderabad.new
89915 Jul 27 10:26 Ipswich.new
15229 Jul 27 10:26 Ithaca.new
132579 Jul 27 10:27 Lagos.new
68478 Jul 27 10:27 La
18993 Jul 27 10:27 Leek.new
439197 Jul 27 10:27 Liverpool.new
26324 Jul 27 10:27 Loire.new
54 Jul 27 10:27 Loni.new
8106 Jul 27 10:27 Malmesbury.new
35538 Jul 27 10:27 Mansfield.new
7545 Jul 27 10:27 March.new
16434 Jul 27 10:27 Mold.new
25849 Jul 27 10:27 Moselle.new
33698 Jul 27 10:27 New
270789 Jul 27 10:27 New
205009 Jul 27 10:28 Norfolk.new
112023 Jul 27 10:28 Norwich.new
28431 Jul 27 10:28 Ore.new
71930 Jul 27 10:28 Pali.new
83138 Jul 27 10:28 Panama
373705 Jul 27 10:28 Perth.new
99124 Jul 27 10:28 Piedmont.new
22133 Jul 27 10:28 Pueblo.new
73659 Jul 27 10:28 Punjab.new
30869 Jul 27 10:28 Reading.new
100419 Jul 27 10:29 Republic
19646 Jul 27 10:29 Rye.new
23084 Jul 27 10:29 Saga.new
6106 Jul 27 10:29 Saint
5866 Jul 27 10:29 St.
11630 Jul 27 10:29 Saint
5336 Jul 27 10:29 St.
97107 Jul 27 10:29 St.
22068 Jul 27 10:29 Stanford.new
255991 Jul 27 10:29 Surrey.new
93952 Jul 27 10:29 Tripoli.new
50366 Jul 27 10:29 Troy.new
38853 Jul 27 10:29 Van.new
18130 Jul 27 10:29 Vosges.new
21909 Jul 27 10:29 Warwick.new
15455 Jul 27 10:29 Angels.new
23662 Jul 27 10:29 Arsenal.new
38084 Jul 27 10:29 Avalanche.new
2391 Jul 27 10:29 Barbarians.new
1558 Jul 27 10:29 Bears.new
5145 Jul 27 10:29 Border
296 Jul 27 10:29 Broncos.new
463 Jul 27 10:29 Buccaneers.new
1063 Jul 27 10:29 Canadiens.new
15399 Jul 27 10:29 Cavaliers.new
751 Jul 27 10:29 Cheetahs.new
367 Jul 27 10:29 Corinthians.new
3529 Jul 27 10:29 Coyotes.new
9722 Jul 27 10:29 Crusaders.new
5268 Jul 27 10:29 Dolphins.new
3090 Jul 27 10:29 Dragons.new
4159 Jul 27 10:29 Ducks.new
160 Jul 27 10:29 Eagles.new
45 Jul 27 10:29 Flames.new
48481 Jul 27 10:29 Force.new
181 Jul 27 10:29 Griquas.new
2627 Jul 27 10:29 Hawks.new
27971 Jul 27 10:29 Heat.new
653 Jul 27 10:29 Hornets.new
5809 Jul 27 10:29 Hurricanes.new
949 Jul 27 10:29 Jaguars.new
223 Jul 27 10:29 Jays.new
1571 Jul 27 10:29 Leopards.new
43470 Jul 27 10:30 Lightning.new
2409 Jul 27 10:30 Lions.new
229 Jul 27 10:30 Ospreys.new
1981 Jul 27 10:30 Pelicans.new
2413 Jul 27 10:30 Penguins.new
9026 Jul 27 10:30 Pirates.new
4012 Jul 27 10:30 Predators.new
2731 Jul 27 10:30 Rockets.new
802 Jul 27 10:30 Rockies.new
7330 Jul 27 10:30 Saints.new
9918 Jul 27 10:30 Saracens.new
3954 Jul 27 10:30 Sharks.new
3306 Jul 27 10:30 Stars.new
6305 Jul 27 10:30 Thunder.new
2129 Jul 27 10:30 Tigers.new
26592 Jul 27 10:30 Titans.new
3808 Jul 27 10:30 Twins.new
98682 Jul 27 10:30 Vikings.new
663 Jul 27 10:30 Warriors.new
3396 Jul 27 10:30 Wasps.new
5597 Jul 27 10:30 Wolves.new
6 Jul 27 10:30 Zunz.new
795 Jul 27 10:30 Orsini.new
226 Jul 27 10:30 Rockefeller.new
32 Jul 27 10:30 Paintal.new
483 Jul 27 10:30 Rothschild.new
8 Jul 27 10:30 Pevsner.new
4861 Jul 27 10:30 O'Reilly.new
62 Jul 27 10:30 Primo
18 Jul 27 10:30 Cimarosa.new
53 Jul 27 10:30 Narasimha
505 Jul 27 10:30 Caracciolo.new
155 Jul 27 10:30 Bakunin.new
665 Jul 27 10:30 Weber.new
26 Jul 27 10:30 Malevich.new
57 Jul 27 10:30 Korotayev.new
18 Jul 27 10:30 Krauser.new
186 Jul 27 10:30 Ghazali.new
266 Jul 27 10:30 Touré.new
190 Jul 27 10:30 Sadat.new
288 Jul 27 10:30 Rajguru.new
289 Jul 27 10:30 Maitland.new
83 Jul 27 10:30 Strozzi.new
90 Jul 27 10:30 Delacroix.new
167 Jul 27 10:30 Reuter.new
185 Jul 27 10:30 Baden
31 Jul 27 10:30 Lessing.new
129 Jul 27 10:30 Boyle.new
96 Jul 27 10:30 Aelian.new
48 Jul 27 10:30 Zichy.new
64 Jul 27 10:30 Nomura.new
204 Jul 27 10:30 Takeda.new
21 Jul 27 10:30 Gilbert
265 Jul 27 10:30 Batista.new
939 Jul 27 10:30 Andrássy.new
544 Jul 27 10:30 Prabhu.new
165 Jul 27 10:30 Tyszkiewicz.new
22 Jul 27 10:30 Mommsen.new
251 Jul 27 10:30 Köppen.new
492 Jul 27 10:30 Della
168 Jul 27 10:30 Bernstein.new
32 Jul 27 10:30 Tippett.new
380 Jul 27 10:30 Sanseverino.new
51 Jul 27 10:30 Pucci.new
377 Jul 27 10:30 Hieronymus
113 Jul 27 10:30 Ghirlandaio.new
65 Jul 27 10:30 Beckett.new
711 Jul 27 10:30 O'Ryan.new
273 Jul 27 10:30 Neumann.new
10 Jul 27 10:30 Matsushita.new
1276 Jul 27 10:30 Ferrero.new
114 Jul 27 10:30 Dietz.new
59 Jul 27 10:30 Amorim.new
29 Jul 27 10:30 Wankel.new
594 Jul 27 10:30 Uexküll.new
20 Jul 27 10:30 Stirner.new
80 Jul 27 10:30 Sridhar.new
234 Jul 27 10:30 Rossetti.new
150 Jul 27 10:30 Nassar.new
115 Jul 27 10:30 Morandi.new
160 Jul 27 10:30 Bulgakov.new
25 Jul 27 10:30 Barks.new
136 Jul 27 10:30 Agnelli.new
350 Jul 27 10:30 Teleki.new
134 Jul 27 10:30 Tarnowski.new
574 Jul 27 10:30 Hamdan.new
93 Jul 27 10:30 Guicciardini.new
589 Jul 27 10:30 Clark.new
97 Jul 27 10:30 Borromeo.new
22 Jul 27 10:30 Bazzi.new
51 Jul 27 10:30 Wolf-Ferrari.new
357 Jul 27 10:30 Sylvester.new
26 Jul 27 10:30 Schichau.new
164 Jul 27 10:30 Scarlatti.new
67 Jul 27 10:30 Noriega.new
24 Jul 27 10:30 Bohlen.new
40 Jul 27 10:30 Boiardo.new
45 Jul 27 10:30 Bosman.new
446 Jul 27 10:30 Braun.new
9 Jul 27 10:30 Gabrielli.new
56 Jul 27 10:30 Haider.new
49 Jul 27 10:30 Jayachandran.new
72 Jul 27 10:30 Jellinek.new
332 Jul 27 10:30 Manning.new
28 Jul 27 10:30 Naryshkin.new
157 Jul 27 10:30 Sachs.new
118 Jul 27 10:30 Sacks.new
101 Jul 27 10:30 Saunders.new
159 Jul 27 10:30 Uccello.new
204 Jul 27 10:30 Velazquez.new
29 Jul 27 10:30 Wills.new
60 Jul 27 10:30 Bergman.new
759 Jul 27 10:30 Haim.new
18588 Jul 27 10:30 Agamemnon.new
3872 Jul 27 10:30 Antigone.new
33458 Jul 27 10:30 Bloomsbury.new
36678 Jul 27 10:30 Cabaret.new
494 Jul 27 10:30 Can-Can.new
23895 Jul 27 10:30 Carousel.new
7172 Jul 27 10:30 Cyrano
47072 Jul 27 10:30 Dune.new
13573 Jul 27 10:30 Euphoria.new
6460 Jul 27 10:30 Falstaff.new
13338 Jul 27 10:30 Faust.new
575 Jul 27 10:30 Fra
1650 Jul 27 10:30 Gidget.new
16873 Jul 27 10:31 Gladiator.new
85498 Jul 27 10:31 Julius
10409 Jul 27 10:31 Medea.new
7415 Jul 27 10:31 Mystic
536 Jul 27 10:31 Peaky
9674 Jul 27 10:31 Peer
16265 Jul 27 10:31 Pericles.new
60538 Jul 27 10:31 Quartz.new
9418 Jul 27 10:31 Salome.new
49778 Jul 27 10:31 St.
84 Jul 27 10:31 The
9885 Jul 27 10:31 Ansible.new
20259 Jul 27 10:31 Arrow.new
57727 Jul 27 10:31 Daily
672758 Jul 27 10:31 The
8853 Jul 27 10:32 Decanter.new
11944 Jul 27 10:32 Dissent.new
13559 Jul 27 10:32 Germania.new
7858 Jul 27 10:32 Guernica.new
29403 Jul 27 10:32 Life.new
6739 Jul 27 10:32 The
809 Jul 27 10:32 The
195831 Jul 27 10:32 The
13864 Jul 27 10:32 Referee.new
2987 Jul 27 10:32 Sunday
24360 Jul 27 10:32 Sunday
154416 Jul 27 10:32 The
5692 Jul 27 10:32 Cage.new
872 Jul 27 10:32 Carpenters.new
2853 Jul 27 10:32 Chrysalis.new
133 Jul 27 10:32 Doors.new
324 Jul 27 10:32 Fernando.new
62059 Jul 27 10:32 Grenade.new
38621 Jul 27 10:32 Guru.new
125 Jul 27 10:32 Happy.new
970 Jul 27 10:32 Hello.new
190 Jul 27 10:32 Jojo.new
13288 Jul 27 10:32 Pink.new
84108 Jul 27 10:33 Sugar.new
16057 Jul 27 10:33 anchorage.new
25 Jul 27 10:33 barks.new
105737 Jul 27 10:33 batman.new
109392 Jul 27 10:33 derby.new
166471 Jul 27 10:33 jersey.new
107237 Jul 27 10:33 limerick.new
121643 Jul 27 10:33 louvre.new
332 Jul 27 10:33 manning.new
7545 Jul 27 10:33 march.new
99124 Jul 27 10:34 piedmont.new
118 Jul 27 10:34 sacks.new
1443 Jul 27 10:34 sandbanks.new
26151 Jul 27 10:34 slough.new
255991 Jul 27 10:34 surrey.new
50366 Jul 27 10:34 troy.new
29 Jul 27 10:34 wills.new
523 Jul 27 10:34 The.new
523 Jul 27 10:34 the.new
48 Jul 27 10:34 Is.new
48 Jul 27 10:34 is.new
337 Jul 27 10:34 were.new
199 Jul 27 10:34 That.new
199 Jul 27 10:34 that.new
370 Jul 27 10:34 said.new
1155 Jul 27 10:34 One.new
1155 Jul 27 10:34 one.new
5430 Jul 27 10:34 goes.new
{{collapse end}}
== Bot updating Webarchive template is adding "url" same as existing "url2" ==
This bot made a group of WaybackMedic 2.5 edits in June where it "rescued" an archive link in the {{para|url}} parameter of {{tl|Webarchive}}, replacing it with [https://web.archive.org/web/20100105013709/http://canoeicf.com/site/canoeint/if/downloads/result/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf?MenuID=Results%2F1107%2F0%2CMedal%5Fwinners%5Fsince%5F1936%2F1510%2F0 this link], which was already in the {{para|url2}} parameter. Two examples of this are {{diff|Grant Bramwell|1093238567|957849833|Grant Bramwell: revised 1 June 2022}} and {{diff|List of ICF Canoe Sprint World Championships medalists in men's kayak|1095093520|1093813352|List of ICF Canoe Sprint World Championships medalists in men's kayak: revised 26 June 2022}}. Can the bot remove the duplicate url2/date2/title2 parameters and renumber any subsequent url3/date3/title3, etc.? I've fixed over 500 of these edits myself, but there are still [https://en.wikipedia.org/w/index.php?search=insource%3A%2F%5C%7B%5C%7BWebarchive+*%5C%7Curl%3Dhttps%3A%5C%2F%5C%2Fweb.archive.org%5C%2Fweb%5C%2F20100105013709%5C%2Fhttp%3A%5C%2F%5C%2Fcanoeicf.com%5C%2Fsite%5C%2Fcanoeint%5C%2Fif%5C%2Fdownloads%5C%2Fresult%5C%2FPages%25201-41%2520from%2520Medal%2520Winners%2520ICF%2520updated%25202007-2.pdf%5C%3FMenuID%3DResults%252F1107%252F0%252CMedal%255Fwinners%255Fsince%255F1936%252F1510%252F0%2F+insource%3A%2F%5C%7Curl2%3Dhttps%3A%5C%2F%5C%2Fweb.archive.org%5C%2Fweb%5C%2F20100105013709%5C%2Fhttp%3A%5C%2F%5C%2Fcanoeicf.com%5C%2Fsite%5C%2Fcanoeint%5C%2Fif%5C%2Fdownloads%5C%2Fresult%5C%2FPages%25201-41%2520from%2520Medal%2520Winners%2520ICF%2520updated%25202007-2.pdf%5C%3FMenuID%3DResults%252F1107%252F0%252CMedal%255Fwinners%255Fsince%255F1936%252F1510%252F0%2F&title=Special%3ASearch&go=Go&ns0=1 over 700 remaining to be fixed]. Thanks. -- Zyxw (talk) 03:54, 9 August 2022 (UTC)
:That was part of the deprecation of WebCite, which is a dead archive provider. It didn't account for dups. It's complicated here because even though {{para|url}} and {{para|url2}} are the same, {{para|title}} and {{para|title2}} are different – which do you choose? I think the best course is to keep the {{para|url}} set and remove the {{para|url2}} set, at least based on the two examples. In terms of renumbering, that is not required, as the webarchive template is designed to allow any numbers up to 10, so long as there is a {{para|url}} (aka {{para|url1}}), which is the only requirement. I'll start looking at this today. -- GreenC 15:35, 9 August 2022 (UTC)
:: {{Reply|GreenC}} I agree with keeping the {{para|url}} set and removing the {{para|url2}} set when there is a duplicate URL and that is what I did for the 500+ already fixed. I also thought {{tl|Webarchive}} might automatically handle the missing {{para|url2}} set and display the {{para|url3}} set, but as per these tests that is not the case:
:: archive with url/date/title, url2/date2/title2, and url3/date3/title3
::* {{Webarchive |url=https://web.archive.org/web/20100105013709/http://canoeicf.com/site/canoeint/if/downloads/result/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf?MenuID=Results%2F1107%2F0%2CMedal%5Fwinners%5Fsince%5F1936%2F1510%2F0 |date=5 January 2010 |title=Medal Winners – Olympic Games and World Championships (1936–2007) – Part 1: flatwater (now sprint). CanoeICF.com. International Canoe Federation. |url2=https://web.archive.org/web/20100105013709/http://canoeicf.com/site/canoeint/if/downloads/result/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf?MenuID=Results%2F1107%2F0%2CMedal%5Fwinners%5Fsince%5F1936%2F1510%2F0 |date2=5 January 2010 |title2=Wayback Machine |url3=https://web.archive.org/web/20160113142416/http://www.bcu.org.uk/files/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf |date3=13 January 2016 |title3=BCU.org.uk}}
:: url2/date2/title2 removed with url3/date3/title3 remaining
::* {{Webarchive |url=https://web.archive.org/web/20100105013709/http://canoeicf.com/site/canoeint/if/downloads/result/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf?MenuID=Results%2F1107%2F0%2CMedal%5Fwinners%5Fsince%5F1936%2F1510%2F0 |date=5 January 2010 |title=Medal Winners – Olympic Games and World Championships (1936–2007) – Part 1: flatwater (now sprint). CanoeICF.com. International Canoe Federation. |url3=https://web.archive.org/web/20160113142416/http://www.bcu.org.uk/files/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf |date3=13 January 2016 |title3=BCU.org.uk}}
:: url2/date2/title2 removed and url3/date3/title3 renumbered
::* {{Webarchive |url=https://web.archive.org/web/20100105013709/http://canoeicf.com/site/canoeint/if/downloads/result/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf?MenuID=Results%2F1107%2F0%2CMedal%5Fwinners%5Fsince%5F1936%2F1510%2F0 |date=5 January 2010 |title=Medal Winners – Olympic Games and World Championships (1936–2007) – Part 1: flatwater (now sprint). CanoeICF.com. International Canoe Federation. |url2=https://web.archive.org/web/20160113142416/http://www.bcu.org.uk/files/Pages%201-41%20from%20Medal%20Winners%20ICF%20updated%202007-2.pdf |date2=13 January 2016 |title2=BCU.org.uk}}
:: -- Zyxw (talk) 16:15, 9 August 2022 (UTC)
:::Reported at Template_talk:Webarchive#Gaps_in_argument_sequence. I wrote the template originally, but Trappist did a major rewrite, so I'm not sure if that is my bug or his. I processed the first 500 articles and there are only 3 with a {{para|url3}}, suggesting 40 or 50 at most in the whole bunch. Anyway, it won't be difficult to renumber them. -- GreenC 16:26, 9 August 2022 (UTC)
::::Ah, I miscalculated: it's 733, not 7,330 :) It's done; if you see anything more, let me know. -- GreenC 17:08, 9 August 2022 (UTC)
::::Fixed the webarchive bug. -- GreenC 18:06, 9 August 2022 (UTC)
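The dedupe-and-renumber step discussed above can be sketched as below (a hypothetical helper, not the bot's actual code): drop any urlN/dateN/titleN set whose urlN duplicates {{para|url}}, then renumber the survivors so the sequence stays contiguous.

```python
def dedupe_webarchive(params):
    """params: dict of Webarchive parameters, e.g. {'url': ..., 'url2': ...}.
    Returns a new dict with duplicate-of-url sets removed and the rest
    renumbered contiguously (url, url2, url3, ...)."""
    keep = [("url", "date", "title")]  # the primary set always stays
    n = 2
    while f"url{n}" in params:
        if params[f"url{n}"] != params["url"]:
            keep.append((f"url{n}", f"date{n}", f"title{n}"))
        n += 1
    out = {}
    for i, (u, d, t) in enumerate(keep):
        suffix = "" if i == 0 else str(i + 1)
        for old, new in ((u, "url"), (d, "date"), (t, "title")):
            if old in params:
                out[new + suffix] = params[old]
    return out

params = {"url": "A", "date": "d1", "title": "t1",
          "url2": "A", "date2": "d2", "title2": "t2",
          "url3": "B", "date3": "d3", "title3": "t3"}
print(dedupe_webarchive(params))
# → {'url': 'A', 'date': 'd1', 'title': 't1', 'url2': 'B', 'date2': 'd3', 'title2': 't3'}
```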
== Bad webcitation link replacement ==
So I've just found out that GreenC bot made [https://en.wikipedia.org/w/index.php?title=Empty_Saddles_(in_the_Old_Corral)&diff=1095013933&oldid=993148078 edits like this], replacing a dead archive link with [https://web.archive.org/web/20140813072851/http://americancowboy.com/culture/top-100-western-songs another dead archive link]. Would it be possible to replace that archive link with, say, [https://web.archive.org/web/20101019002745/http://americancowboy.com/culture/top-100-western-songs this one] that actually works? Thanks very much! Graham87 11:48, 26 August 2022 (UTC)
:Bots are not 100% perfect. The bot relies on the Wayback API to determine which links are live, and the API is not perfect, so those errors depend on human intervention to correct. The alternative is not to use bots at all, in which case most links never get fixed due to the scale; it's boring back-end work that people want bots to do, but there is no guarantee that bots, or for that matter people, will not make mistakes. The question is the scale of the mistakes. -- GreenC 15:08, 26 August 2022 (UTC)
::Yeah fair enough, soft 404's and all. On re-reading my message I spectacularly failed at phrasing it clearly ... there are [https://en.wikipedia.org/w/index.php?title=Special:LinkSearch/https://web.archive.org/web/20140813072851/http://americancowboy.com/culture/top-100-western-songs&limit=500&offset=0&target=https%3A%2F%2Fweb.archive.org%2Fweb%2F20140813072851%2Fhttp%3A%2F%2Famericancowboy.com%2Fculture%2Ftop-100-western-songs nearly a hundred more such links]; could you instruct the bot to replace them with a working archive (i.e. the one linked above)? I thought that would be the easiest way to fix this problem. I tried changing the archive link on InternetArchiveBot's side and asking it to fix the affected articles, but that didn't do what I intended. Graham87 13:34, 27 August 2022 (UTC)
:::OK, it's done. Yeah, there's no way to automate replacing one archive with another via IABot. That would be a good feature, though, when finding soft-404s. -- GreenC 16:16, 27 August 2022 (UTC)
::::Opened Phab {{phab|T316438}} .. no idea if or when. -- GreenC 16:34, 27 August 2022 (UTC)
Avoid editing inside HTML comments
GreenC bot now edits inside HTML comments, e.g. Special:Diff/1107954452, but I suggest that it not do so. Although the edit in this example happened to be harmless (even useful), in general comments can be used for a wide range of reasons, so there is a higher risk that automatic edits could break their intent. Wotheina (talk) 03:49, 2 September 2022 (UTC)
:That's true, but there is a positive trade-off, so for a couple of reasons I am OK with fixing certain (not all) link rot in comments, as I have been doing for 7 years. If someone wants to preserve a block of immutable wikitext they should use the talk page, a user page, or keep it offline; otherwise anyone can edit the comment or delete it entirely. Comments can be strangely formatted, so I take measures, automatic and manual, to check commented text before posting a live diff. -- GreenC 05:39, 2 September 2022 (UTC)
Stopping backlinks report during wikibreak
Hello, and thanks again for the useful Backlinks reports. I'm currently taking a Wikibreak and have attempted to exclude my list from the bot's tasks {{diff|User:Certes/stopbutton|prev|1108531344|thus}} but it {{diff|User:Certes/Backlinks/Report|prev|1108603582|still ran}} today. It's not a problem for me if the reports continue but, if you'd like to save some resources by stopping it properly, please go ahead. Certes (talk) 11:25, 5 September 2022 (UTC)
:Fixed, it was seeing
in the "#" comment. First time this code has been tested :) Have a good break. -- GreenC 05:14, 6 September 2022 (UTC)
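The stop-button bug above came down to comment handling: lines beginning with "#" in a job list should be stripped before the bot looks for entries. A minimal hypothetical sketch of comment-aware parsing (the bot's real parser is not published):

```python
def active_entries(page_text: str) -> list[str]:
    """Return non-empty lines from a job-list page, ignoring '#'
    comments (both whole-line comments and trailing ones)."""
    entries = []
    for line in page_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop the comment portion
        if line:
            entries.append(line)
    return entries
```

With this, a stop-button reference mentioned inside a "#" comment never reaches the matching logic.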
Please Update the monthly list of Top 10000 wikipedia users by Article Count
Exactly what purpose did this edit serve? Edit summary is misleading at best
https://en.wikipedia.org/w/index.php?title=Rodney_Marks&diff=1095741886&oldid=1091111369 108.246.204.20 (talk) 20:17, 3 October 2022 (UTC)
:Don't use {{tlx|dead link}} if the citation has a working {{para|archive-url}}. -- GreenC 20:46, 3 October 2022 (UTC)
::it doesn't. "this page is not available". 108.246.204.20 (talk) 04:15, 14 October 2022 (UTC)
:::Ah, soft-404. Removed. I also updated the IABot database. -- GreenC 04:24, 14 October 2022 (UTC)
A cookie for you!
{| style="background-color: #fdffe7; border: 1px solid #fceb92;"
| style="vertical-align: middle; padding: 5px;" | 120px
| style="vertical-align: middle; padding: 3px;" | Ulises12345678 (talk) 11:00, 9 October 2022 (UTC)
|}
:Thank you. For the Cookie. -- GreenC 14:12, 9 October 2022 (UTC)
RSSSF
Why is this bot changing "website=rsssf.com" to "website=RSSSF", where there is already "publisher=RSSSF" parameter, and then in many pages you get stupid outcome like [https://i.ibb.co/26Ztmps/rsssf.png this] with double RSSSF linking? Snowflake91 (talk) 10:27, 7 February 2023 (UTC)
:Yeah, it's not ideal; it's a work in progress. In any case, the problem is there should not be both {{para|work}} and {{para|publisher}}: use one or the other, not both. And citations should not use a domain name; using the name of the site is best practice on Wikipedia. There are so many RSSSF citations, and so many problems with them; I've done a lot of work to fix them, but there are still things that need more work. -- GreenC 15:22, 7 February 2023 (UTC)
::Prefer {{para|website}} over {{para|publisher}}. {{tlx|cite web}} does not include {{para|publisher}} in the citation's metadata.
::—Trappist the monk (talk) 16:18, 7 February 2023 (UTC)
:::Special:Diff/1038698982/1138241646 -- GreenC 21:44, 8 February 2023 (UTC)
I think all the doubles are cleared; if you see any more, or other problems, let me know. -- GreenC 21:45, 8 February 2023 (UTC)
WaybackMedic
{{ping|GreenC}} It seems that WaybackMedic 2.5 is run by GreenC bot 2. However, I can't find the source code of version 2.5 in the [https://github.com/greencardamom/WaybackMedic Github repo]. I need to read the latest code to learn its current behavior. Have you published it yet? -- NmWTfs85lXusaybq (talk) 14:04, 24 March 2023 (UTC)
:I can send snippets or functions if you want, for anything you are interested in. The entire codebase is not currently available to the public because it contains some proprietary information. It's written in Nim, plus some awk utils. -- GreenC 14:44, 24 March 2023 (UTC)
::The bot detection of businessweek.com you mentioned in Wikipedia:Village_pump_(technical)/Archive_203#businessweek.com_links may be bypassed by simply assigning a web browser's user agent in the header of HTTP requests, such as Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.50 Safari/537.36. As far as I know from version 2.1, WaybackMedic may execute external commands (via execCmdEx) to determine page status, and setting the user agent should be easy to implement via the available parameters. By the way, as of version 2.1, I can see the validate_robots function is implemented in medicapi.nim. -- NmWTfs85lXusaybq (talk) 16:55, 24 March 2023 (UTC)
:::Thank you for the suggestion to use a browser agent. I tried it; they appear to limit based on query rate, and it's pretty sensitive. I was able to trigger it by manually requesting 8 headers rapidly, after which it stopped working, sending a header with "HTTP/1.1 307 s2s_high_score" and a redirect to a JavaScript challenge ("press and hold button"). Maybe I could slow the bot down enough between queries, but it would be difficult and extremely slow, perhaps a month or longer for 10k articles, and I would need to verify every header is not a 307, otherwise abort and manually clear the challenge. GreenC 21:36, 24 March 2023 (UTC)
::::If they limit the query rate based on ip, you can find some web proxies to accelerate this procedure as your bot may behave like a web crawler. After you collect and validate some free proxies, you can just apply them alternately to your bot, although their stability is not guaranteed. -- NmWTfs85lXusaybq (talk) 03:47, 25 March 2023 (UTC)
:::::I have access to a web proxy that uses home-based IPs and it still didn't work. Maybe the solution is to pull every URL into a file and process them outside the bot with a simple script that waits x seconds between each header query, then feed the results back to the bot indicating which URLs are dead. It can run for however long; it wouldn't matter. Trying to do it inside the bot is too error-prone, too complicated, and ties up the bot too long. -- GreenC 04:11, 25 March 2023 (UTC)
::::::It's a good idea to run this job outside the bot. However, I'm not sure what you mean by {{talk quote inline|a web proxy that uses home based IPs}}. Have you tried high-anonymity proxies? Did you change proxy IP every time you made a new request? NmWTfs85lXusaybq (talk) 04:45, 25 March 2023 (UTC)
:::::::The IPs change with every request, and the IPs are sourced to home broadband users globally, so they are not detectable by CIDR block. I don't know how they got blocked, maybe Cloudflare is on this service and recorded all of the IPs. -- GreenC 14:46, 25 March 2023 (UTC)
::::::::Then I suppose your proxy strategy is OK. Please make sure your web proxy has high anonymity if all of your configuration works fine. -- NmWTfs85lXusaybq (talk) 15:20, 25 March 2023 (UTC)
:::::::::I ran this bot-block avoidance script and it took forever. What I discovered is that just about every link should be archived: either 404, soft-404 or better-off-dead. The latter because the links went to content that was behind a paywall or otherwise messed up in some way, so the archived version is better in nearly every case. -- GreenC 14:17, 3 April 2023 (UTC)
::::::::::I see you mentioned some awk scripts as a workaround at Wikipedia:Link_rot/URL_change_requests#businessweek.com. However, I can't find the meta directory businessweek.00000-10000 you referred to in the Github repo of InternetArchiveBot and WaybackMedic. NmWTfs85lXusaybq (talk) 07:15, 24 April 2023 (UTC)
:::::::::::Oh, that's a note to myself. If you want the awk script, let me know; it's nothing more than going through a list of URLs, pausing between each to avoid rate limiting, getting the headers and recording the results, and if it sees a bot-block header, notifying and aborting. It also shuffles the agent string. The site seemed to learn agent strings and block based on them, which could be avoided by retiring an agent and adding a new one. -- GreenC 13:47, 24 April 2023 (UTC)
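The external checker described in this thread (pause between requests, rotate user-agent strings, record header results, abort on a bot-block response) might look roughly like the following. This is a hypothetical Python sketch, not the actual awk script; only the "307 s2s_high_score" marker comes from the thread, and fetch_status is a stand-in for whatever HTTP client is used.

```python
import itertools
import time

BOT_BLOCK = "307 s2s_high_score"   # bot-block marker seen in the thread

def check_urls(urls, fetch_status, agents, delay=8.0, sleep=time.sleep):
    """Check each URL's HTTP status, pausing between requests and
    rotating the User-Agent string. fetch_status(url, agent) must
    return a status line, e.g. 'HTTP/1.1 200 OK'. Aborts on a
    bot-block response so the challenge can be cleared manually."""
    agent_cycle = itertools.cycle(agents)
    results = {}
    for i, url in enumerate(urls):
        if i:
            sleep(delay)                  # avoid tripping the rate limiter
        status = fetch_status(url, next(agent_cycle))
        if BOT_BLOCK in status:
            raise RuntimeError(f"bot block at {url}: {status}")
        results[url] = status
    return results
```

The results dict (URL to status line) could then be fed back to the bot as the list of dead URLs, keeping the slow network work entirely outside the main bot process.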
Backlinks report 2023
User:Certes/Backlinks/Report has stopped updating. The bot is running, as User:GoingBatty/Backlinks/Report still updates. I've not changed the job list in User:Certes/Backlinks since 8 May, nor pressed the stopbutton. Do you know how to restart the report please? Certes (talk) 12:17, 4 June 2023 (UTC)
:The process from June 2nd crashed for unknown reason and turned into a zombie preventing future runs. I can't kill it so I contacted Toolforge admins for help. -- GreenC 14:17, 4 June 2023 (UTC)
::Working again now – thanks! Certes (talk) 21:50, 4 June 2023 (UTC)
Archiving chapter urls
This is a bit of an edge case with GreenC bot's archive repair task, so I wanted to get your opinion. In several articles where I'm citing an archived book that has separate PDFs for each chapter, I use the |archive-url= parameter for the chapter url (since that's the most important one) and have a Wayback url for the book url in the |url= field. It's not ideal, but I'm not sure how else to handle it. My brief search also found this thread where you indicated that |archive-url= was okay to use for the chapter url. However, GreenC bot switches the |archive-url= field to be the archive of the |url= field (example here).
Is there a better way to format these citations? I'm not able to find any. Otherwise, is there any way I can mark the citations to be ignored by the bot? This seems like a relatively rare case; I imagine it's not worth modifying the bot to handle. Thanks, Pi.1415926535 (talk) 22:14, 14 August 2023 (UTC)
:Special:Diff/1170358971/1170410520. Another option:
::{{Cite book |last=Vanasse Hangen Brustlin, Inc |url=http://greenlineextension.eot.state.ma.us/docs_beyondLechmere.html |title=Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis |date=August 2005 |publisher=Massachusetts Bay Transportation Authority |archiveurl=https://web.archive.org/web/20160705134151/http://greenlineextension.eot.state.ma.us/docs_beyondLechmere.html |archivedate=July 5, 2016}} {{webarchive |url=https://web.archive.org/web/20160705151132/http://greenlineextension.eot.state.ma.us/documents/beyondLechmere/MIS8-05-Chapter4.pdf |date=2016-07-05 |title=Chapter 4: Identification and Evaluation of Alternatives – Tier 1}}
:I like this better because it doesn't hack the cite book template arguments. The downside is the display is a little messier. Another way with some duplication:
::{{Cite book |last=Vanasse Hangen Brustlin, Inc |chapter-url=http://greenlineextension.eot.state.ma.us/documents/beyondLechmere/MIS8-05-Chapter4.pdf |chapter=Chapter 4: Identification and Evaluation of Alternatives – Tier 1 |title=Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis |date=August 2005 |publisher=Massachusetts Bay Transportation Authority |archiveurl=https://web.archive.org/web/20160705134151/http://greenlineextension.eot.state.ma.us/docs_beyondLechmere.html |archivedate=July 5, 2016}} From {{webarchive |url=https://web.archive.org/web/20160705134151/http://greenlineextension.eot.state.ma.us/docs_beyondLechmere.html |date=2016-07-05 |title=Beyond Lechmere Northwest Corridor Study: Major Investment Study/Alternatives Analysis}}
:To keep the bot off the citation add {{tlx|cbignore}} template after the end of the cite book but inside the ref tags. -- GreenC 02:17, 15 August 2023 (UTC)
::Thanks, much appreciated. Pi.1415926535 (talk) 17:15, 15 August 2023 (UTC)
:::{{ping|GreenC}} Please take a look at Special:Diff/1171111146, where the bot edited several citations already tagged with {{tl|cbignore}}. Thanks, Pi.1415926535 (talk) 06:35, 21 August 2023 (UTC)
::::I found two problems. 1) The {{tld|cbignore}} should follow directly after the template it targets: Special:Diff/1171510462/1171514730 - I think the cbignore docs have this. 2) My bot has a known limitation: within any block of text between newlines (i.e. a paragraph of text), if there is more than one cbignore, the citations the cbignores follow all need to be unique. In this case the two citations are mirror copies. The bot ignored the cbignore for that reason (it has to do with disambiguation; it needs to know which citation to target). So, I modified one of the citations and they are now unique: Special:Diff/1171514730/1171514803 (changed the semi-colon to a colon in the publisher field of the first citation) -- a bit quirky, but tested and it works now. I do recommend using the alt suggestions above, though, because while my bot honors cbignore, most other bots do not, and eventually some other tool will probably try to "fix" what it detects as an error (an archive URL in the url field). -- GreenC 15:45, 21 August 2023 (UTC)
Incorrect dead flags and archive.today
Hello {{u|GreenC}}! Your bot recently made [https://en.wikipedia.org/w/index.php?title=Pok%C3%A9mon&diff=1171107668 this strange edit] to Pokémon. In it, the bot changed "archive.is" and "archive.ph" to "archive.today". I'm not sure what purpose this has. The task is not explained on User:GreenC bot.
Furthermore, the bot flagged these three sources as dead:
- https://www.theguardian.com/technology/gamesblog/2013/oct/11/pokemon-blockbuster-game-technology
- https://order.mandarake.co.jp/order/detailPage/item?itemCode=1052117728
- https://www.nytimes.com/1997/12/20/news/big-firms-failure-rattles-japan-asian-tremors-spread.html
But as you can see, the above links are not dead. So something must've gone wrong there. I've [https://en.wikipedia.org/w/index.php?title=Pok%C3%A9mon&diff=1171158544 remarked] these refs as live. Cheers, Manifestation (talk) 11:04, 19 August 2023 (UTC)
:Archive.today is what the owner of archive.today wants us to use; it's a redirector that sends traffic to other domains as they are available. The reason those three got marked dead is that there was an archive URL in the {{para|url}} field; the bot moved it to the {{para|archive-url}} field, and the bot assumes that if someone put an archive URL in the main {{para|url}} field, it was probably because the original URL was dead. -- GreenC 14:47, 19 August 2023 (UTC)
::{{re|GreenC}} Aaah! So that's why. I wrote the text, so I take full responsibility for the {{para|url}} / {{para|archive-url}} mixup. As for archive.today: I looked at our article, and it cites [https://twitter.com/archiveis/status/1081276424781287427 this tweet] from 4 January '19 in which the owner states that the .is domain might stop working soon. However, the domain is still active. In fact, the '@' handle used by the account to this day is still "@archiveis". I've used archive.today many times, including this year. It always gave me either a .is or a .ph link. Cheers, Manifestation (talk) 15:07, 19 August 2023 (UTC)
:::Yeah, it redirects to one of the 6 domains like .is or .ph, but if one of those domains gets shut down by the registrar, he can switch where it redirects easily, without having to change every link on Wikipedia. -- GreenC 15:24, 19 August 2023 (UTC)
::::Hmm ok. Well I guess we should honor his/her request then. For the sake of clarity, maybe the description of Job #2 / WaybackMedic 2.5 on User:GreenC bot could be expanded a little to include a mention of archive.today? archive.today is not part of the Internet Archive, so the term "WaybackMedic" is a bit misleading. - Manifestation (talk) 16:03, 19 August 2023 (UTC)
:::::Alright I updated fix #21 which also now links to Help:Using_archive.today#Archive.today_compared_to_.is,_.li,_.fo,_.ph,_.vn_and_.md. It started out as Wayback-specific then expanded to all archive providers but I kept the original name anyway. -- GreenC 16:41, 19 August 2023 (UTC)
::::::@GreenC Hi! I know that .today is the domain to be used, but every time I try to open a link with .today it returns a "This site cannot be reached" type of error, and the same goes for .ph links. The only links that work for me are the ones with .is. Astubudustu (talk) 10:55, 2 April 2024 (UTC)
:::::::This is because the DNS resolver you are using is hosted on Cloudflare, and that won't work (well) with archive.today domains; see Archive.today#Cloudflare_DNS_availability. -- GreenC 15:38, 2 April 2024 (UTC)
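The behavior described earlier in this thread, where an archive URL found in the {{para|url}} field is split apart so the snapshot can move to {{para|archive-url}}, relies on the structure of Wayback Machine URLs (a 14-digit timestamp followed by the original URL). A hypothetical sketch of that parsing step, not the bot's actual code:

```python
import re

# Wayback snapshot URL: /web/<14-digit timestamp><optional modifier>/<original URL>
WAYBACK = re.compile(r"^https?://web\.archive\.org/web/(\d{14})(?:[a-z_]*)/(.+)$")

def split_wayback(url: str):
    """If url is a Wayback Machine snapshot, return (timestamp,
    original_url); otherwise return None. A bot could then move the
    snapshot to |archive-url= and restore the original to |url=."""
    m = WAYBACK.match(url)
    if not m:
        return None
    return m.group(1), m.group(2)
```

The optional modifier part handles variants like `20140813072851if_/`, which the Wayback Machine uses for frame-free rendering.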
WaybackMedic 2.5 adding unnecessary URLs
I saw the bot's task run on Guardians of the Galaxy (film) [https://en.wikipedia.org/w/index.php?title=Guardians_of_the_Galaxy_%28film%29&diff=1172020348&oldid=1171941348 here] and it made edits to three references that used {{tl|Cite Metacritic}}, {{tl|Cite Box Office Mojo}}, and {{tl|Cite The Numbers}}, adding in unnecessary URLs and marking the links as dead. The citation templates construct the urls from the given parameters (as most follow a common format on those sites) and were not dead. Didn't know if this was a bot issue, or the templates themselves doing something that is flagging the citations to make the bot adjust them. I can look into the templates to see what the issues may be if that is ultimately the case (and to know what to look for for the error). - Favre1fan93 (talk) 14:16, 24 August 2023 (UTC)
:That is a bot error. It is in 9 articles. I rolled them back (you got 2). Thanks for the report. -- GreenC 15:00, 24 August 2023 (UTC)
::No problem, thank you! - Favre1fan93 (talk) 15:26, 24 August 2023 (UTC)
Timestamp mismatch
This bot is changing the archive-url as [https://en.wikipedia.org/w/index.php?title=Aahvaanam&diff=1173401818&oldid=1167493279 seen here], but it is not changing the archive-date as required, creating a timestamp mismatch error, as [https://en.wikipedia.org/w/index.php?title=Aahvaanam&oldid=1173401818#cite_note-4 seen here]. I just recently emptied this category and now it has over 80 articles (when I wrote this) in it again. Your help would be appreciated. Thanks. Isaidnoway (talk) 05:57, 2 September 2023 (UTC)
:I am aware. I did it in two steps; because of the way this particular job was programmed, it was easier that way. You saw it in that 30-minute gap between runs. -- GreenC 16:11, 2 September 2023 (UTC)
My bot can empty that category easily. It was 40,000 a week ago. I got it down to a few hundred edge cases, which I assume you fixed manually, thank you. I'd like to fully automate it, but right now it's all integrated into WP:WAYBACKMEDIC, which can't be fully automated, so I run it on request. -- GreenC 16:16, 2 September 2023 (UTC)
User:Isaidnoway, I'm running a bot job to convert archive.today URLs from short-form to long-form. Example. It is exposing old problems with date mismatches that are showing up in [[:Category:CS1 errors: archive-url]] -- after this bot job completes, I'll run another bot to fix the date mismatches, which will clear the tracking cat. No need to do anything manually. -- GreenC 04:57, 8 September 2023 (UTC)
:Hi GreenC! My bot is following yours today. There were several instances where your bot reformatted archive URLs in one edit, and mine fixed the archive dates in the following edit. My bot is running on [[:Category:CS1 errors: dates]], pulling the archive date from the archive URL. Any chance your bot could do it all in one edit? Thanks! GoingBatty (talk) 18:25, 8 September 2023 (UTC)
::I used to be able to fix archive.today problems and date mismatches in the same process, but it was semi-automated. Fixing archive.today problems can and should be full-auto, so I separated that out to its own process that uses EventStream to monitor real-time when a new short-form link shows up, log the article name, and once a month or so fix them - all full-auto. [https://uk.wikipedia.org/w/index.php?title=Judas_Priest&diff=prev&oldid=40360299 Across 100s of wikis]. The downside is this program can't fix date mismatch problems. I want to fix date mismatches automatically, and hope to do that eventually with its own process. Once I have that developed I can see about including it in the archive.today program, so it saves the extra edit, when the source of the date mismatch is archive.today short to long conversion.
::The tracking category will be cleared in the next few hours, it's currently generating diffs. This is a one-off event clearing out the backlog of archive.today problems which exposed a lot of problems. Going forward there will be much smaller numbers. We both currently have bots that can clear that category on request, do you know how to update the docs for the category page? -- GreenC 23:41, 8 September 2023 (UTC)
:::Not sure which category page you're referring to, but most of the text on these category pages comes from Help:CS1 errors, so if you updated the help page, it would also appear on the appropriate category page. GoingBatty (talk) 03:15, 9 September 2023 (UTC)
:::::[[:Category:CS1 errors: archive-url]]. Do you want me to include your bot in the doc as available to clear the cat on request? I'm going to mention WaybackMedic is available, but only if there are more than 500 entries. -- GreenC 14:25, 9 September 2023 (UTC)
:::::I don't have a bot to clear [[:Category:CS1 errors: archive-url]]. GoingBatty (talk) 18:21, 9 September 2023 (UTC)
::::::Oh, I see, I misinterpreted what you said above; I thought it was fixing mismatched dates, but it was actually fixing an incomplete date. -- GreenC 19:12, 9 September 2023 (UTC)
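Deriving {{para|archive-date}} from the snapshot URL, as discussed in this thread, is straightforward because the long-form URL embeds a 14-digit YYYYMMDDhhmmss timestamp. A hypothetical sketch (not either bot's actual code):

```python
import re
from datetime import datetime

def archive_date_from_url(archive_url: str) -> str:
    """Derive an |archive-date= value from the 14-digit snapshot
    timestamp embedded in a Wayback/archive.today long-form URL."""
    m = re.search(r"/(\d{14})(?:[a-z_]*)/", archive_url)
    if not m:
        raise ValueError("no snapshot timestamp in URL")
    dt = datetime.strptime(m.group(1), "%Y%m%d%H%M%S")
    return f"{dt.day} {dt:%B} {dt.year}"   # e.g. '5 July 2016'
```

Fixing a mismatch is then a matter of comparing this derived value with the existing {{para|archive-date}} and replacing it when they disagree.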
Economy of Zimbabwean
I need some help Mindthem (talk) 21:13, 25 September 2023 (UTC)
:@Mindthem: How would you like the bot to help with the Economy of Zimbabwe article? GoingBatty (talk) 19:20, 29 September 2023 (UTC) {{tps}}
Backlinks
Hi there! I see your bot delivered a new Backlinks report for Certes, but I didn't receive an update today. Could you please give the bot a nudge? Thanks! GoingBatty (talk) 19:21, 29 September 2023 (UTC)
:I saw some messages this morning that Toolforge was down due to NFS; likely your run didn't complete before the outage. I see it aborted around 09:32 GMT and Certes finished at 09:28, with minutes to spare. I'll run yours again now. -- GreenC 19:37, 29 September 2023 (UTC)
::Report received - thank you! GoingBatty (talk) 02:49, 30 September 2023 (UTC)
Bot put italics in strange places
I don't know what [https://en.wikipedia.org/w/index.php?title=Counsel%27s_Opinion&diff=1180484599&oldid=1108222641 happened here], but the bot appears to have put italics in place where they didn't belong, and then missed putting them in [https://en.wikipedia.org/w/index.php?title=Counsel%27s_Opinion&diff=1180927666&oldid=1180484599 where they did belong]. Given that the bot had to edit three times, I imagine this bot run was stressful for you. If this code is still active, it might need yet another debugging. – Jonesey95 (talk) 18:26, 19 October 2023 (UTC)
:Yeah, this was a pain; every time I thought it was done, some new issue came up. And getting those ticks right, in the right place, after the fact, wasn't easy. Anyway, this task is done for me (deletion of {{tlx|BFI}} from 1,200 articles). If you see any problems, they need manual adjustment. I don't think the number of problems is very large, from spot checking. -- GreenC 18:35, 19 October 2023 (UTC)
::I think you are correct, based on my perusal of the list of Linter errors. – Jonesey95 (talk) 18:54, 19 October 2023 (UTC)
Flagging non-dead link as dead (2)
Hello. Why did GreenC bot rewrite url-status=live to url-status=dead in Special:Diff/1186567077 for a live URL? The URL [https://www.nationalgeographic.com/science/article/101101-ibex-goats-dam-italy-bighorn-sheep-wyoming-science] is alive, at least from Japan as of 2023-11-24 04:50 UTC (checked with Firefox and Chrome on Windows 10). Wotheina (talk) 05:05, 24 November 2023 (UTC)
:It's freemium content. Open an incognito window and see if it gives a different result. I tried to archive premium content pages for NatGeo because they use a freemium wall. View the page source and search for "freemiumContentGatingEnabled". -- GreenC 05:42, 24 November 2023 (UTC)
::I see. I agree on switching from paywalls to archives, but for such unintuitive edits please write the intention somewhere, as in edit summary or embedded comment, or at least in User:GreenC/WaybackMedic 2.5. I think url-access= is the best way, but I guess you are not using that because there is no option "url-access=freemium" yet. Wotheina (talk) 06:46, 24 November 2023 (UTC)
:::{{para|url-access|freemium}} is a great idea. Until it appears, I think {{para|url-access|live}} is less bad, or for a bonus point {{para|url-access|live<!--freemium-->}} which can be converted in bulk later. I can see the goats too, but I block a lot of third-party scripts which might hide them in standard browsing. Certes (talk) 16:25, 8 December 2023 (UTC)
::::Regarding "{{para|url-access|live}} is less bad", did you mean "{{para|url-status|live}} is less bad"? Wotheina (talk) 17:24, 8 December 2023 (UTC)
:::::Yes, sorry, I was confusing the two parameters. {{para|url-access|live}} seems more accurate than {{para|url-access|dead}} here. The least bad value for status might be {{para|url-status|limited}}. I can't find a definition of limited to determine whether freemium falls within its scope. Certes (talk) 18:34, 8 December 2023 (UTC)
::::When I did NatGeo, I didn't have the ability to add archive URLs with {{para|url-status|live}} so unfortunately they were all set to dead. I have since added this ability after it was requested at Wikipedia:Link_rot/URL_change_requests#vh1.com by User:Alexis Jazz. I'm not sure about going back and resetting from dead to live the NatGeo links that are freemium, that would probably require some special one-off code and a lot of time to recheck all the links. But it's the kind of thing anyone could probably do pretty easily, if you have code to parse and edit CS1 templates. -- GreenC 17:34, 8 December 2023 (UTC)
Backlinks timing
Hi there! I noticed that the Backlinks report hasn't run yet today for {{U|Certes}} or me. Looking at the bot's contributions, I see the report is running later each day this week. Could you please check the bots to see what's going on? Thank you! GoingBatty (talk) 15:22, 8 December 2023 (UTC)
:I started monitoring Buenos Aires as an experiment, not because its new links are likely to be wrong but because socks of a certain puppetmaster love linking to it. I've just removed it from my list, in case this widely-linked page is causing problems. Certes (talk) 16:18, 8 December 2023 (UTC)
::They are forks of the same script; they run from different cron jobs and directories, so it should not be possible for them to affect each other. If both are not working, I dunno, I'll check. -- GreenC 17:40, 8 December 2023 (UTC)
GoingBatty & Certes, I found a bug that only shows up when running from cron. It wasn't apparent when the script was on Toolforge because there you signify the working directory with the jsub command, which masked the problem. The effect of the bug was to create duplicate entries in the list at /Backlinks, which is why it kept taking longer each run. For example, GoingBatty had 7 instances of "hamlet" (from the script's perspective): one for the original and 6 for each day the script ran. So I think the best solution is to wipe out the data files again and start over; the data files look kind of weird anyway. The usual: you'll see the message about new entries, then the next one should be good. -- GreenC 18:20, 8 December 2023 (UTC)
:On December 8, the bot started over and published a report, but didn't publish a report for December 9. Could you please check it again? Thanks! GoingBatty (talk) 04:34, 10 December 2023 (UTC)
GoingBatty, I don't know what happened. Nevertheless, it is working now. It looks system-level. Cron logs show the process ran, but it didn't. No apparent reason, and I can't replicate. Weird. Let me know if it doesn't run again, I enabled verbose logging. Also during testing I moved the job time to around 5:30 GMT .. or do you want the previous 8:30? Or some other time? -- GreenC 06:01, 10 December 2023 (UTC)
:Thank you! I'd prefer the previous 8:30, as I'm likely to see the 5:30 job right before I should be going to sleep, and then be tempted to stay up too late to address them immediately. Thanks! GoingBatty (talk) 07:04, 10 December 2023 (UTC)
User:Certes during testing your most recent report lost some data, seen below. -- GreenC 06:01, 10 December 2023 (UTC)
{{collapse begin |title=lost entries}}
{| class="wikitable"
|+ New backlinks for 2023-12-10
! Target !! Linker !! History
|-
| rowspan="2" style="vertical-align: top;" | 0
| Template:Vital article link/testcases
| [https://en.wikipedia.org/w/index.php?title=Template%3AVital%20article%20link%2Ftestcases&action=history history]
|-
| List of Mexican inventions and discoveries
| [https://en.wikipedia.org/w/index.php?title=List%20of%20Mexican%20inventions%20and%20discoveries&action=history history]
|-
| rowspan="1" style="vertical-align: top;" | 10
| Template:Vital article link/testcases
| [https://en.wikipedia.org/w/index.php?title=Template%3AVital%20article%20link%2Ftestcases&action=history history]
|-
| rowspan="6" style="vertical-align: top;" | ABC News
| 2024 in American television
| [https://en.wikipedia.org/w/index.php?title=2024%20in%20American%20television&action=history history]
|-
| January 14–17, 2022 North American winter storm
| [https://en.wikipedia.org/w/index.php?title=January%2014%E2%80%9317%2C%202022%20North%20American%20winter%20storm&action=history history]
|-
| Robert Kiyosaki
| [https://en.wikipedia.org/w/index.php?title=Robert%20Kiyosaki&action=history history]
|-
| World food crises (2022-present)
| [https://en.wikipedia.org/w/index.php?title=World%20food%20crises%20%282022%2Dpresent%29&action=history history]
|-
| Jaidynn Diore Fierce
| [https://en.wikipedia.org/w/index.php?title=Jaidynn%20Diore%20Fierce&action=history history]
|-
| Mike Johnson (Louisiana politician)
| [https://en.wikipedia.org/w/index.php?title=Mike%20Johnson%20%28Louisiana%20politician%29&action=history history]
|-
| rowspan="1" style="vertical-align: top;" | Aelian
| Aelian (Roman author)
| [https://en.wikipedia.org/w/index.php?title=Aelian%20%28Roman%20author%29&action=history history]
|-
| rowspan="1" style="vertical-align: top;" | Arsenal
| Citadel of Parma
| [https://en.wikipedia.org/w/index.php?title=Citadel%20of%20Parma&action=history history]
|}
{{collapse end}}
:Thanks; I'll take a look at those. I've a slight preference for 0830 over 0530, as I tend to look at the entries about 1000-1200 UTC and the fresher the better. Certes (talk) 16:07, 10 December 2023 (UTC)
It didn't run again. The logging helped. I'm narrowing in on the problem and made some changes. We'll see what happens next run. -- GreenC 21:18, 11 December 2023 (UTC)
At some point when this issue is resolved, are you willing to open Backlinks to other users? For example, see Wikipedia:Help desk#Notification for Links to Pages by Other Users. Thanks! GoingBatty (talk) 04:18, 12 December 2023 (UTC)
So, it does appear my IP is being rate limited by WMF. I moved all my tools off-site and it's generating a lot of traffic. The solution is to add a retry loop with pauses. Will try that next. -- GreenC 14:42, 12 December 2023 (UTC)
:Would moving the tools on-site be a solution? I know they just made that a whole lot more difficult by deprecating GridEngine. Certes (talk) 14:49, 12 December 2023 (UTC)
:::That will take time because I think it will require building a custom Kubernetes image, which is a learning curve. I have a ticket open asking them about this but no reply yet. I should have been using a retry loop anyway, so this will help either way; I have a function, but was apparently lazy and didn't call it. -- GreenC 15:26, 12 December 2023 (UTC)
:::A lot of people will be climbing the same learning curve. It would be nice if we had a page for giving each other a leg up. Sadly (or perhaps gratefully), I've never had to use Kubernetes and so can't be of much assistance. Certes (talk) 16:17, 12 December 2023 (UTC)
::::I hope to learn the system eventually, probably good thing to know. -- GreenC 18:02, 12 December 2023 (UTC)
Ran both manually with the new code. It will keep requesting when it gets a 429 ("Too many requests"). It tries 20 times with a 2 second delay. I have seen it make up to 5 requests, but it will depend on WMF server load. The jobs will run on the regular morning schedule tomorrow. -- GreenC 18:02, 12 December 2023 (UTC)
:If it's not too much work, escalating the delay might be good for both the program and the server, e.g. if the nth try fails, wait n seconds. (Exponential is recommended but seems extreme.) Certes (talk) 18:15, 12 December 2023 (UTC)
::There are so many tools making constant requests that it almost doesn't matter; they are going to saturate regardless. I'm concerned that if slowed down too much the work never gets done. Will keep on it. It will email if/when it reaches 20. -- GreenC 19:49, 12 December 2023 (UTC)
:::Hmmm. It sounds as if they need a bigger computer. They can afford it. Certes (talk) 22:33, 12 December 2023 (UTC)
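The retry behaviour discussed above can be sketched as follows. This is a minimal illustration only: `fetch` is a hypothetical stand-in for the real HTTP call (the bot is not written in Python), and it shows the linear escalation Certes suggests rather than the fixed 2-second delay.

```python
import time

def fetch_with_retry(fetch, max_tries=20, base_delay=2):
    """Retry while the server answers 429 ("Too many requests").

    The nth failed try waits base_delay * n seconds (linear escalation;
    exponential would be base_delay * 2 ** n). `fetch` is any callable
    returning an HTTP status code.
    """
    for attempt in range(1, max_tries + 1):
        status = fetch()
        if status != 429:
            return status
        if attempt < max_tries:
            time.sleep(base_delay * attempt)
    return 429  # gave up after max_tries
```

Returning the final 429 (rather than raising) lets the caller log or email when the limit is reached, as described above.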
Everything looks good today. Thank you. The only difference from before is that the output now appears alphabetically by target rather than sorted as in the parent page, but that's not a problem. Certes (talk) 10:13, 13 December 2023 (UTC)
:Because there were duplicates in the parent page I had to unique the list, which required a sort. I tried to unique it in a way that doesn't require a sort, but for some reason it dropped one of the entries. I didn't have time to investigate it, so went with the tried and true method of sorting first. You can try this yourself with the list of entries and see if the results differ in the number of entries on output compared to input. -- GreenC 15:43, 13 December 2023 (UTC)
::That sounds very reasonable. (<code>sort -u</code> may work on your system too.) Certes (talk) 16:33, 13 December 2023 (UTC)
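The two uniqueing strategies discussed above can be sketched as follows. These are Python stand-ins for illustration only (the bot itself is not written in Python; `!seen[$0]++` is the classic awk order-preserving idiom, referenced here as an analogy, not necessarily what the bot used):

```python
def unique_preserve_order(lines):
    """Order-preserving dedupe (the awk `!seen[$0]++` idea):
    keep the first occurrence of each line, in input order."""
    seen = set()
    out = []
    for line in lines:
        if line not in seen:
            seen.add(line)
            out.append(line)
    return out

def unique_sorted(lines):
    """`sort -u` equivalent: duplicates dropped, output sorted."""
    return sorted(set(lines))
```

Both should always yield the same number of entries; if the order-preserving version drops one, that points to a bug in its implementation, or to "duplicates" that differ invisibly (e.g. trailing whitespace).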
Buck Goldstein
Hi there! In this edit, your bot changed an incorrect {{para|url}} parameter, which added the article to :Category:CS1 errors: URL. Should the bot have done something different, or should it ignore the {{para|url}} parameter and only update the {{para|archiveurl}}/{{para|archive-url}} parameter? Thanks! GoingBatty (talk) 06:02, 18 December 2023 (UTC)
:You mean Special:Diff/1187499427/1190066019. The bot that runs this process is a global bot; it is not programmed to handle templates in different languages. It only operates on the URL itself, without template knowledge. The bot didn't introduce anything that wasn't already there; its only purpose is to normalize archive.today URLs wherever they happen to be. If that caused the pre-existing error to be exposed in the tracking cat, it's a step forward. -- GreenC 06:32, 18 December 2023 (UTC)
Preserving the correct archived version of archive.today links
In this edit, WaybackMedic 2.5 attempted to reformat a link to archive.today that had multiple different archives, but used the archive of the wrong date. The pre-existing link https://archive.is/2Ljk6 is an archive from 24 November 2023. The link should have been converted to http://archive.today/2023.11.24-014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s (the "long link" for the page), but was instead converted to https://archive.today/20231124014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s , which corresponds to the 6 December 2023 archive. This resulted in the new archive link leading to an archive of a 404 page instead of the successfully archived page, and the {{para|archive-date}} parameter not matching the timestamp on the page or in the long URL.
Ideally, the bot would notice when the new URL's archive date does not match the old URL's archive date, and not make the edit if it cannot resolve this. Also, ideally it would catch when the citation template's {{para|archive-date}} doesn't match the URL's archive date, and either adjust the template's {{para|archive-date}} or display some kind of warning. SnorlaxMonster 12:09, 1 January 2024 (UTC)
:Actually, there also appears to be an issue on archive.today's end. While the page https://archive.md/2Ljk6 does have a share option that says that http://archive.today/2023.11.24-014538/https://www.bloomberg.com/press-releases/1999-11-08/pokefans-can-now-eat-their-hearts-out-with-candy-planet-s is the correct long URL, as it turns out, that long URL redirects to the 404 archive as well. In cases like that, I think WaybackMedic 2.5 should not change the URL to the long version, until archive.today corrects their long URLs for URLs with multiple archives. --SnorlaxMonster 12:12, 1 January 2024 (UTC)
::That's strange. Looks like a one-off error at archive.today .. never seen it before. I can't verify every new long archive.today URL is the same, because the resource load on archive.today servers would double, as would the time it takes the bot to finish. Unless there is evidence of a widespread problem; but in 7 years and over half a million conversions this is the first time it's been reported. All I can do for now is add a static string to the code to skip processing when it sees 2Ljk6. Other tools might try to do the same conversion, like IABot or possibly Citation Bot. This is a tricky problem to solve long term. Notifying archive.today is the correct solution, ideally. -- GreenC 19:23, 1 January 2024 (UTC)
:::I notified archive.today about the specific issue with the long URL via their "report bug or abuse" button, but I have no idea how likely those reports are to get read. I think just manually excluding that specific case is the best option for now.
:::With regards to validating that the target page is the same, I think it should be as simple as checking the timestamp is the same (ignoring that bug I mentioned in my second message, where the long URL can redirect to the wrong version). I assume whatever API you're using to get the long URL from the short URL returns the archive date of the short URL in the request you are already making; the long URL has the archive date in the URL itself, so to me it seems like it should be possible to validate that the archive date hasn't changed by just comparing those two values, without needing any additional API requests to archive.today. But I also don't know what code your bot uses, so I can't verify my assumptions about how it works. (I tried taking a look at the [https://github.com/greencardamom/WaybackMedic GitHub page] linked on User:GreenC/WaybackMedic 2.5, but it appears that it is for Wayback Medic 2.1 and doesn't include the <code>fixarchiveis</code> function that's included in Wayback Medic 2.5.) --SnorlaxMonster 13:22, 2 January 2024 (UTC)
::::There is no API for this. You download the HTML of the short URL page, and the long form is there towards the top (view source search on "long link"). The GitHub code is old, but you can see it [https://github.com/greencardamom/WaybackMedic/blob/master/WaybackMedic%202.1/modules/archiveis/archiveis.awk here] at line 173. If the long form URL goes to a different version of the HTML, as in this case, I would need to download both the short and long HTML page, and run a string comparison to see if they are approximately the same HTML. Thus downloading HTML twice. -- GreenC 22:28, 2 January 2024 (UTC)
:::::Ah okay, I suspected it could just be plain web scraping. Anyway, what I was trying to suggest was just comparing the date in the URL with the date on the HTML page (so there would be no need to resolve the long link). However, I had missed that the date in the long URL you retrieved was the correct one—the issue was entirely that archive.today redirects it. --SnorlaxMonster 23:34, 2 January 2024 (UTC)
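The scrape-and-compare approach discussed above might look like this in outline. A sketch only: the real bot is written in awk, and the exact archive.today markup is an assumption here; the pattern simply looks for the first URL carrying a YYYY.MM.DD-HHMMSS long-form timestamp.

```python
import re

# Long-form archive.today URLs embed a YYYY.MM.DD-HHMMSS timestamp segment.
LONG_LINK_RE = re.compile(
    r'https?://archive\.\w+/(\d{4})\.(\d{2})\.(\d{2})-\d{6}/[^"\s<>]+')

def extract_long_link(html):
    """Return the first long-form URL found in the short-URL page's HTML."""
    m = LONG_LINK_RE.search(html)
    return m.group(0) if m else None

def long_link_date(url):
    """Return the (year, month, day) embedded in a long-form URL."""
    m = LONG_LINK_RE.match(url)
    return tuple(int(x) for x in m.groups()) if m else None
```

Comparing `long_link_date()` against the snapshot date shown on the short-URL page would catch a date mismatch without downloading the long-form page a second time.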
bug report
At this edit, GreenC bot copied a malformed wayback machine url from {{para|url}} into {{para|archive-url}}. It ought not to have done it like that.
The wayback machine url is malformed because its timestamp is not an acceptable length (14 digits preferred, 4 or 6 tolerated). cs1|2 emits an error message for single-digit timestamps and another error message when the values assigned to {{para|url}} and {{para|archive-url}} are the same.
—Trappist the monk (talk) 01:46, 30 January 2024 (UTC)
:Also, not clear where {{para|archive-date|2007-06-15}} came from.
:—Trappist the monk (talk) 01:49, 30 January 2024 (UTC)
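The timestamp-length rule described above (14 digits preferred, 4 or 6 tolerated) is straightforward to check. A sketch, assuming plain `/web/<timestamp>/` Wayback Machine URLs and ignoring modifier flags such as `im_`:

```python
import re

def wayback_timestamp_ok(url):
    """Check that a web.archive.org URL carries an acceptable timestamp.

    cs1|2 prefers 14 digits (YYYYMMDDhhmmss) and tolerates 4 (YYYY) or
    6 (YYYYMM); anything else, such as a single digit, is malformed.
    """
    m = re.match(r'https?://web\.archive\.org/web/(\d+)/', url)
    if not m:
        return False
    return len(m.group(1)) in (4, 6, 14)
```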
Bug report: Incorrect archive-date
Hi there! In this edit, the bot added {{para|archive-date|18990101080101}}. Is there something you could add to the bot to prevent the addition of incorrect dates such as this? Thanks! GoingBatty (talk) 18:22, 30 January 2024 (UTC)
:I do have warnings but apparently was lazy and forgot to check the logs. -- GreenC 20:08, 30 January 2024 (UTC)
bug report (2)
{{cl|CS1 errors: archive-url}} recently bloomed. I have just fixed these four articles broken by Wayback Medic 2.5:
Every error was a {{para|archive-date}} mismatch with the {{para|archive-url}} timestamp. {{para|archive-date}} was always off by one day; always earlier than the time stamp except for this one from 2024 Noto earthquake.
—Trappist the monk (talk) 18:57, 1 February 2024 (UTC)
:And then there is this one that is off by a couple of weeks, this one off by a year. So it looks like what I wrote above may not hold much water...
:—Trappist the monk (talk) 19:08, 1 February 2024 (UTC) 19:37, 1 February 2024 (UTC)
The date mismatch error preexisted. The bot only made it more obvious, so that CS1|2 error-checking is now able to see it. I would prefer to fix the archive-date at the same time as expanding archive.today URLs from short to long form (per RfC requirement). However, this task is universal: it operates on many wiki language sites and has no knowledge of template names or arguments in other languages. It only expands a URL wherever it may be; it doesn't look at templates. That would require another universal bot, I guess, one that can operate on CS1|2 templates in multiple languages. If you want to write one, I have the approval to run it. The reason the dates are frequently offset by 1 day: users add an archive.today link they just created and set {{para|archive-date}} per their local time zone, but archive.today uses UTC time, which has already passed into a new day. The ones offset by a week or year are user entry errors. -- GreenC 21:49, 1 February 2024 (UTC)
:User:Trappist the monk: I have written a separate bot that fixes the date mismatch error populating {{cl|CS1 errors: archive-url}}. Example Special:Diff/1248926553/1248972462. It retrieves the date from the "suggested" date, generated by CS1|2 in the HTML warning message. This way it can run on other language wikis without needing to deal with language differences. It falls back to ISO mode if it can't get a suggestion. Do you think it is OK to rely on the "suggested" date generated by CS1|2? -- GreenC 14:25, 2 October 2024 (UTC)
::The suggested date is simply the date portion of the archive-url timestamp formatted according to the format specified by {{para|df}} → the global {{tld|use xxx dates}} → format of the date in {{para|archive-date}} → YYYY-MM-DD. Getting the date from the html seems a reasonable thing to do; the grunt work has already been done.
::—Trappist the monk (talk) 15:05, 2 October 2024 (UTC)
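The ISO fallback mentioned above amounts to taking the date portion of the 14-digit timestamp; a sketch (illustrative only, not the bot's actual code):

```python
from datetime import datetime

def archive_date_from_timestamp(ts):
    """Derive an ISO |archive-date= from a 14-digit YYYYMMDDhhmmss
    |archive-url= timestamp: the same date the "suggested" value is
    based on, before any df-driven reformatting."""
    return datetime.strptime(ts[:8], "%Y%m%d").date().isoformat()
```

Parsing via `strptime` (rather than slicing digits into a string) also rejects impossible dates such as month 13, which is a cheap sanity check on the timestamp itself.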
bug report (3) Bot ignores cbignore
Here https://en.wikipedia.org/w/index.php?title=Scott_Boman&diff=1226780032&oldid=1226698153 I noticed that the bot edited an external link with cbignore after it. I compared the links before and after the edit to see why the cbignore template was there. The long and short links are from different dates and display different content. The altered link no longer contained the relevant content. This would not matter if the bot observed the cbignore.--198.111.57.100 (talk) 17:05, 4 June 2024 (UTC)
:OK this problem is complicated. There are multiple things going on.
:* All short-form archive.today links need to be expanded to long form. This is required because Wikipedia does not allow URL shorteners, which pose security problems.
:* Archive.today has a bug: for snapshots saved from WebCite, it reports an incorrect long form.
:::Incorrect: http://archive.today/UfV6G --> https://archive.today/20121120012223/http://romeoareateaparty.org/wordpress/2012-candidates-2/races/u-s-senate/
:::Correct: http://archive.today/UfV6G --> https://archive.today/20121120012223/https://www.webcitation.org/6CIutMLaZ?url=http://romeoareateaparty.org/wordpress/2012-candidates-2/races/u-s-senate/
:Notice the "Correct" version includes the original WebCite URL. The "Incorrect" version excludes the WebCite URL.
:* GreenC bot has a bug in that it can't see cbignore when making these changes.
:* GreenC bot has a bug in so far as it doesn't detect the Archive.today bug
:So I need to make some adjustments to work around the Archive.today bug. I also need to report the bug to Archive.today though there is no guarantee they will fix it. -- GreenC 17:28, 4 June 2024 (UTC)
::*Update the bug is reported to Archive.today -- GreenC 18:14, 4 June 2024 (UTC)
::*:Archive.today fixed it. -- GreenC 21:01, 4 June 2024 (UTC)
:::Thank you!--198.111.57.100 (talk) 16:27, 6 June 2024 (UTC)
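The WebCite long-form bug described in this thread is detectable by inspecting what the long form embeds after the timestamp segment; a sketch, using the example URLs above (illustrative Python, not the bot's actual awk code):

```python
def embedded_target(long_url):
    """Return the original URL embedded in a long-form archive.today link,
    i.e. everything after the timestamp path segment.

    Long form: https://archive.today/<timestamp>/<original-url>
    """
    parts = long_url.split("/", 4)
    return parts[4] if len(parts) > 4 else ""

def preserves_webcite(long_url):
    """For snapshots made from WebCite, the embedded target should still
    point at webcitation.org; if it doesn't, the long form is the buggy
    one that dropped the WebCite wrapper."""
    return embedded_target(long_url).startswith(
        ("http://www.webcitation.org/", "https://www.webcitation.org/"))
```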
Please don't convert old Google patents links to archive.today
This is a very unhelpful change: special:diff/1227937929. The links on the archived page to PDFs and drawings all 404, meaning that the actual content of the patent is not accessible. Nor are any of the other features originally presented by Google patent search. This type of archive page should not ever be used for patents. You should either fix the Google patent URLs, which is fairly trivial (you can see the fix for this page at special:diff/1227941924), or switch to links to the US patent office or similar.
Can you please revert or properly fix all of the similar recent edits you have made across Wikipedia? (Judging from your recent contribution list it seems like there were a lot.) Otherwise you're just creating work for someone else / leaving confused readers. –jacobolus (t) 16:41, 8 June 2024 (UTC)
:1. You should post this in the forum linked in the edit summary: WP:URLREQ#google.com/patents - that's the community forum for this task that everyone is reading.
:2. There is nothing my bot can't do. And there is nothing that is permanent or can't be changed or undone. Do not panic or become upset.
:3. Give me details. I will do it. But I need information. You gave a diff saying it's trivial, but how do I determine that https://archive.today/20121211035219/http://www.google.com/patents?id=lvNwAAAAEBAJ is the same as https://patents.google.com/patent/US417831A ? There is a code in the second URL that does not exist in the first URL.
:Anyway, please follow at URLREQ so others can know what's going on. -- GreenC 16:52, 8 June 2024 (UTC)
Job 18 showing up in WPCleaner
I'm running WPCleaner and noticed that Error 95 (Editor's signature or link to user space) has flagged the bot, specifically Job 18, on a ton of pages (Arundhathi Subramaniam is one, to give an example). It looks like the bot signature is in the "reason" field of the template.
I don't have a count of the pages, but it's not an insignificant amount from what I can see. Lindsey40186 (talk) 02:16, 11 June 2024 (UTC)
:I don't know about WPCleaner, or what the error message means. It was an old bot job, that no longer runs. It was a peculiar and difficult situation. -- GreenC 03:56, 11 June 2024 (UTC)
Typo
After Wikipedia:Link_rot/URL_change_requests#deccanchronicle.com, the bot is adding links to Deccan Chronical instead of Deccan Chronicle. See [https://en.wikipedia.org/w/index.php?title=Karthik_Subbaraj&diff=1227476768&oldid=1227217248] and [https://en.wikipedia.org/wiki/Special:WhatLinksHere/Deccan_Chronical]. DareshMohan (talk) 18:59, 14 June 2024 (UTC)
:Oh sheesh, thanks. Fixed Special:Diff/1228320785/1229089609 in 829 pages. -- GreenC 20:17, 14 June 2024 (UTC)
Thanks
Hey, I just want to say thank you for using the Wayback Machine for MTV News for my citations. Can you do that for Drag-On's album Hell and Back? Ill post the original link. JuanBoss105 (talk) 13:30, 2 July 2024 (UTC)
:Hey, I found a link to a MTV.com source that can be used for Rocafella. Can you add it using the wayback machine?
:https://www.mtv.com/news/c1psz3/state-property-members-stress-independence-dont-take-orders&ved=2ahUKEwiS1cGYwIiHAxUdD1kFHf0oCVYQFnoECCIQAQ&usg=AOvVaw1m9yMSZqvcQC7xuV2PKS9D JuanBoss105 (talk) 13:53, 2 July 2024 (UTC)
::User:JuanBoss105: I found an archive URL with a different source URL: https://web.archive.org/web/20150122173241/http://www.mtv.com/news/1498885/state-property-members-stress-independence-dont-take-orders/
::I found it using the archive's search feature: [https://web.archive.org/mtv.com/search/%22state%20property%20members%20stress%20independence%22 Search: "State Property Members Stress Independence"].
::You can find other archive URLs at MTV.com this way.
::For example in Special:Diff/1231668617/1232196891 you added https://www.mtv.com/news/v0uzg8/norah-jones-tops-a-mil-at-1-kanye-west-settles-for-2 you can find the archive URL by going to this search page: [https://web.archive.org/mtv.com/search/%22norah%20jones%20tops%20a%20mil%22 Search: "Norah Jones tops a mil"]. -- GreenC 16:07, 2 July 2024 (UTC)
Tampabay.com
Stop running this right now on tampabay.com links. Every one I've checked is wrong. It is adding archive links (okay) to currently live articles, and tagging them as dead (wrong). It is also overriding an explicit {{para|url-status|dead}} to {{para|url-status|live}} when it encounters redirects to the main page of tampabay.com. Tired of fixing these because GreenC bot is on a roll. ▶ I am Grorp ◀ 00:21, 12 July 2024 (UTC)
: Clarification: Not every single instance, but too many, for sure. ▶ I am Grorp ◀ 00:31, 12 July 2024 (UTC)
Oh shoot, looks like they used an exotic redirect mechanism, it fooled the bot. I have a way around it, but this is the first I became aware of it. I'll have to reprocess. Anyway, thanks for the info. BTW you should post error reports in the section linked in the edit summary, that is the discussion for this job. -- GreenC 00:38, 12 July 2024 (UTC)
: {{reply|GreenC}} That was gibberish to me so I found this talk page. I just now put a link from there to here. You're welcome to copy this over there, and delete this thread, if that makes more sense. I'll watchlist both. ▶ I am Grorp ◀ 00:42, 12 July 2024 (UTC)
:: Not all of the edits were incorrect or needed correcting. If you want a list of which ones I corrected, then they're in my contributions list from 22:10, 11 July 2024 to 00:37, 12 July 2024 (UTC). All but the first of my corrections have "GreenC bot" in the edit summary. (I edit in a topic area that relies heavily on tampabay.com, many of which are on my watchlist.) ▶ I am Grorp ◀ 00:53, 12 July 2024 (UTC)
Grorp,
- Special:Diff/1233941553/1233989259 - this appears to be a one-off, maybe a network transient. When I run the page again (locally) the problem does not happen. I'd be surprised there are more like this. It can happen but I don't think it's systematic or common. If you see more, let me know.
- Special:Diff/1233948702/1233989465 - exotic redirect problem noted above
- Special:Diff/1233957098/1233990527 - ditto
- Special:Diff/1233959661/1233991011 - archive.today I manually verify beforehand. This one is a manual verification error, which is rare, but not impossible. I can provide a list of the archive.today URLs that were added (193).
I can redress the exotic redirect, which looks to be limited to URLs ending in .ece -- GreenC 01:29, 12 July 2024 (UTC)
:Update: I found 29 instances of the exotic redirect, among the set of 6,846 pages, or less than 1/2 of one percent. Of the archive.today error, there was one in 193, or about the same 1/2 of one percent. Thanks for the report, find any other problems let me know. -- GreenC 02:42, 12 July 2024 (UTC)
:: Thanks. Will do. ▶ I am Grorp ◀ 05:45, 12 July 2024 (UTC)
{{od}}I have no idea how to decipher/restore/resurrect these old pqarchiver links (like in your fourth example above). If there's a writeup, or some tips, please point me in the right direction. I do come across these [https://en.wikipedia.org/w/index.php?search=insource%3A%22pqarchiver%22+scientology&title=Special:Search&profile=advanced&fulltext=1&ns0=1 fairly regularly] in this topic area I edit; many point to old sptimes.com news articles (St Petersburg Times was bought out by Tampa Bay Times). If there is any way I can resurrect an actual copy of some of these old articles, I'd like to try to fix some of them. ▶ I am Grorp ◀ 05:45, 12 July 2024 (UTC)
:I found 63 pqarchiver links (out of the 193 archive.today links added) and they all worked, except this one. If it doesn't exist at archive.org or archive.today it's probably gone forever; you'd probably need to find an alternate source. -- GreenC 06:09, 12 July 2024 (UTC)
Other wikis
Do you ever deploy the bot to other wikis to assist with link maintenance and updates? Imzadi 1979 → 18:20, 28 July 2024 (UTC)
:It's a very big job to internationalize the bot for templates, dates etc - I'd like to eventually. But it does update links in the IABot database (iabot.org), and IABot then updates 300+ wikis based on the contents of the database. Thus when my bot discovers a dead link on enwiki, it updates enwiki adding an archive URL, then also updates the IABot database changing the status to "dead" and adding the archive URL into the database. Then IABot scans the 300+ other wikis and when it finds that link, it adds the archive URL, taken from the database. -- GreenC 18:55, 28 July 2024 (UTC)
::I was curious if it would work on the [https://wiki.aaroads.com AARoads Wiki], which uses the same templates as the English Wikipedia, so no internationalization needed. Imzadi 1979 → 19:12, 28 July 2024 (UTC)
:::IABot would be better since it continuously scans pages and fully automatically replaces dead links. WaybackMedic does more specialized work on a per-domain basis for many types of issues, with manual oversight. A good place to post a request is https://meta.wikimedia.org/wiki/User_talk:InternetArchiveBot -- GreenC 20:49, 28 July 2024 (UTC)
bot destructive
I just had to do a manual purge on Eyjafjallajökull after the bot had visited, as the page was displaying, from the top, "The time allocated for running scripts has expired." repeated several times. This is a complex page calling in a couple of data-rich templates, usually rendered well within the normal parsing allowance of 10 seconds, but if the Wikipedia infrastructure is under load it can fail on an edit. The bot accordingly presently needs a (? manual) check of page output after every use. Often the failure is towards the end of such a page with the references, so it is only obvious on a full-page manual skim. Please ensure you do this, as many high-quality pages have reference lists running into the hundreds, with processing times around the 5-second mark. ChaseKiwi (talk) 21:16, 3 August 2024 (UTC)
Bug report - templates in images in infoboxes
Just wanted to flag Special:Diff/1239809626, doesn't seem to recognise there's a template in that URL. Primefac (talk) 12:03, 12 August 2024 (UTC)
:Oops my regex was stopping at "}" instead of "{" had it reversed. Thanks. -- GreenC 18:23, 12 August 2024 (UTC)
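The fix amounts to scanning for the opening braces rather than the closing ones; a trivial sketch of the check (the example URL is hypothetical):

```python
def url_contains_template(url):
    """True when a URL string still contains an unexpanded wiki template.

    The bug above came from a scanner stopping at "}" instead of "{";
    looking for the opening "{{" directly avoids the direction mixup.
    """
    return "{{" in url
```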
Job 15 GA mismatches stoppage
User:GreenC bot/Job 15 (GA mismatches) has stopped after Wikipedia:Good articles/all was [https://en.wikipedia.org/w/index.php?title=Wikipedia%3AGood_articles%2Fall&diff=1237436963&oldid=1229147724 edited]. Adabow (talk) 10:07, 13 August 2024 (UTC)
:User:Adabow, because of Special:Diff/1229147724/1237436963 by User:Beland. The bot is not aware of Wikipedia:Good articles/all2. It aborted because the number of entries in Wikipedia:Good articles/all is below a magic number, i.e. it looks suspicious. Everything worked, except I neglected to add an email reminder (only logs), so I didn't notice. Thanks for the ping. -- GreenC 16:17, 13 August 2024 (UTC)
::User:Beland could you verify the lists are correct? There appears to be duplication at the top with two tables of contents, for example two entries for "Agriculture, food, and drink". There is also a line that says "View the entire list of all good articles or", which points to Wikipedia:Good articles/all .. is that still accurate? -- GreenC 16:22, 13 August 2024 (UTC)
:::The duplicate TOCs were being transcluded from the per-topic pages. I suppressed them with "noinclude" tags. The link from subpages still points to /all, but once readers get there they will see "all" is split between /all and /all2. I think that's probably fine for now, unless we want to just stop altogether with combining multiple per-topic pages into one or two massive scrollable lists. -- Beland (talk) 20:33, 13 August 2024 (UTC)
::I think this change could break three bots: FACBot, LivingBot, and GreenC bot. There is a message in the page that says changes to the page layout will break the bots (GreenC bot is not mentioned; I will add it later). Bots should be notified and given time to adjust. (Looks like the two bots were notified, ty.) There might be other tools and bots as well. -- GreenC 16:34, 13 August 2024 (UTC)
::Actually it looks like the creation of "all2" was in February: Special:Diff/1066123344/1229147724 .. so my bot has not been running properly since. Trying this to better communicate: Special:Diff/1237436963/1240124928 -- GreenC 16:46, 13 August 2024 (UTC)
Broke 139 archive.ph links! They are clearly labeled.
Your bot took the {{para|url}} with the live link & altered 139 {{para|archive-url}} parameters that had archive.ph links, changing them to "archive.today/[url of live link]". Not only is archive.today a DEAD site, but all my archive links were live.
This is an error when you visit that site:
This site can’t be reached
https://archive.today/ is unreachable.
ERR_ADDRESS_UNREACHABLE
What is the purpose of this? Ɠɧơʂɬɛɖ (talk) 23:35, 15 September 2024 (UTC)
:Archive.today is not dead; it works fine [https://www.isitdownrightnow.com/archive.today.html for me and everyone else]. Your local machine's DNS resolver is having temporary problems; see Archive.today#Cloudflare_DNS_availability. Use a different DNS resolver and the problem will be solved. Please use archive.today: it is the main gateway host for the site, which redirects to one of the backend sites like archive.ph .. the site is literally called "Archive.today", not "Archive ph"; the .ph is an internal thing they do to protect against domain-name hijacking. -- GreenC 00:15, 16 September 2024 (UTC)
Archive.today isn't accessible from Italy
:Hi, I saw your bot replaced archive.is links with the respective archive.today ones in some pages on Italian Wikipedia ([https://it.wikipedia.org/w/index.php?title=Capitoli_de_Le_bizzarre_avventure_di_JoJo&diff=141129393&oldid=141126904 here is an example]). However archive.today redirects to archive.ph, which has apparently been [https://www.commissariatodips.it/profilo/centro-nazionale-contrasto-pedopornografia-on-line/index.html blocked] by Italian Internet providers after being reported by police for hosting illegal content. [https://imgur.com/a/archive-ph-status-italy-as-of-09-17-24-Dg9GfZY This] is a screenshot I took and [https://feddit.it/post/6613497 here] are other people talking about it. I wanted to warn you about this because now archived URLs aren't accessible and can't be checked without using proxies. Hope you can fix this. Un mondo a stelle e strisce (talk) 15:55, 17 September 2024 (UTC)
:User:Un mondo a stelle e strisce, thank you for this information. Archive.today has problems sometimes. They created multiple domains: archive.is, .fo, .li, .today, .vn, .md, .ph .. do you know if all are blocked in Italy? I read the [https://feddit.it/post/6613497 discussion] (6 months old) and this appears to be something done by the postal police? You could also try using a different DNS resolver that isn't going through Cloudflare; this is the problem for most people, due to a policy disagreement between Archive.today and Cloudflare. -- GreenC 16:36, 17 September 2024 (UTC)
::archive.ph is the only one blocked; the others are all fine and working, except for .today, which redirects to it and therefore isn't accessible either. According to the warning displayed when trying to reach the address, the postal police took this measure because they found child sexual abuse material on the website. I don't think the problem has anything to do with Cloudflare, as the page is still accessible via proxy. Un mondo a stelle e strisce (talk) 21:12, 17 September 2024 (UTC)
:::If you want, we can change everything to .is or whichever. In the meantime, I have disabled the twice-monthly process that converts everything to .today -- GreenC 21:24, 17 September 2024 (UTC)
::::Yes, replacing things with .is would be great, thanks for your help. Un mondo a stelle e strisce (talk) 08:26, 18 September 2024 (UTC)
:::::User:Un mondo a stelle e strisce, changed the first 3,000 pages, which is about 10%; then there is a wait time before continuing ([https://it.wikipedia.org/w/index.php?title=Levitating&diff=prev&oldid=141266255 example]). -- GreenC 01:32, 26 September 2024 (UTC)
User:Un mondo a stelle e strisce, this job is complete. Keep in mind, archive.today will continue to be added in many ways, by editors and bots. If you want to clear them out again, drop me a note. Or if this ban is ever lifted, drop me a note. Cheers. -- GreenC 15:18, 17 October 2024 (UTC)
:Yes, I'll let you know about any further developments. Thanks very much for your help. Un mondo a stelle e strisce (talk) 16:14, 17 October 2024 (UTC)
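A host-only rewrite like the itwiki job above can be sketched as follows (an illustrative sketch under stated assumptions; the real bot's implementation is not shown here and may differ):

```python
import re

def switch_archive_host(text, new_host="archive.is"):
    """Rewrite archive.today / archive.ph links to another mirror host,
    leaving the timestamp and the embedded original URL untouched."""
    return re.sub(r'(https?://)archive\.(?:today|ph)/',
                  rf'\g<1>{new_host}/', text)
```

Because only the hostname segment is matched, archive.today links embedded as the target of another archive URL are rewritten too, and links already on the new host are left alone.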
"url-status=usurped" causes a CS1 message
Hi GreenC!
I just noticed that the GreenC bot has flagged many refs as part of an effort to combat the passive spamming of the Judi gambling syndicate.
However, {{para|url-status|usurped}} is currently causing a CS1 maintenance message. I am seeing these messages because I opted to make them visible through my common.css. Normally, they cannot be seen.
When I preview a page with a <code>usurped</code> ref, it shows this warning at the top:
:"Script warning: One or more (...) templates have maintenance messages; messages may be hidden (help)."
Also, with me, the altered refs have this bit tagged at the end:
:"CS1 maint: unfit URL (link)"
See :Category:CS1 maint: unfit URL, which currently has 48,594 entries.
Again, the maintenance message is normally not visible, not even to logged-in users. So this isn't an acute problem.
I believe the maintenance message is shown incorrectly. If the URL has been usurped, but the original page was properly archived, then the ref as used on Wikipedia is probably not "unfit", right? What can be done about this?
Cheers, Manifestation (talk) 19:09, 24 September 2024 (UTC)
:It looks like we are tracking all usages of unfit/usurped even legitimate uses and this automatically creates a maintenance message. I don't know what the rationale is. Maybe someone wants to know where the usurped URLs are? -- GreenC 19:35, 24 September 2024 (UTC)
::I have [https://en.wikipedia.org/w/index.php?title=Help_talk:Citation_Style_1&diff=1247745526 started] a thread about this at Help talk:Citation Style 1. This has to be a bug. Cheers, Manifestation (talk) 19:41, 25 September 2024 (UTC)
Oil for your bot
A hard working bot deserves a refreshing glass of motor oil! Big Blue Cray(fish) Twins (talk) 09:26, 18 November 2024 (UTC)
Backlinks report not updating
Hi there! After several months taking a break from Backlinks, I've recently started using the report again. I noticed that the bot last created a report on December 3. Could you please jump start the bot for me? Thanks! GoingBatty (talk) 17:03, 6 December 2024 (UTC)
:Thanks for restarting the bot! GoingBatty (talk) 05:03, 7 December 2024 (UTC)
::User:GoingBatty, No problem, thanks for the report, because this problem (a missing symbolic link, the result of moving to a new computer) was affecting multiple tools. -- GreenC 17:41, 8 December 2024 (UTC)
Question about the Wikipedia:Good_articles/all page
Please see my question at Wikipedia_talk:Good_articles/all#Question to the bots. (I wrote it there because I am asking the same question to 3 bots.) Thank you. Prhartcom (talk) 20:03, 9 December 2024 (UTC)
bot "reformatting" valid dates to URL strings
I noticed a few edits such as [https://en.wikipedia.org/w/index.php?title=Beautiful_(Mariah_Carey_song)&diff=prev&oldid=1263681013 this one] where the bot replaced the date with a portion of the URL. Looks like it's specifically happening with webcitation.org URLs that don't have actual dates within the URL itself. = paul2520 💬 19:57, 18 December 2024 (UTC)
:User:Paul2520, problem in the bot fixed. I see you corrected the pages, about 27 of them. -- GreenC 21:34, 18 December 2024 (UTC)
:(BTW that string is actually a date encoded in base62 - this [https://github.com/greencardamom/WebCiteBase62Decoder repo] will decode) -- GreenC 21:37, 18 December 2024 (UTC)
::TIL! Thanks for clarifying (and fixing). = paul2520 💬 22:19, 18 December 2024 (UTC)
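For the curious, base62 decoding of this kind can be sketched in a few lines of Python. This is my own illustration, not code from the linked repo; the alphabet order (digits, then lowercase, then uppercase) and the interpretation of the decoded integer as a Unix timestamp in microseconds are assumptions.

```python
import string
from datetime import datetime, timezone

# Assumed alphabet: digits, lowercase, uppercase, mapping to values 0-61.
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_decode(s: str) -> int:
    """Interpret s as a big-endian base62 number."""
    n = 0
    for ch in s:
        n = n * 62 + ALPHABET.index(ch)
    return n

def webcite_id_to_date(snapshot_id: str) -> datetime:
    """Assume the decoded integer is a Unix timestamp in microseconds."""
    micros = base62_decode(snapshot_id)
    return datetime.fromtimestamp(micros / 1_000_000, tz=timezone.utc)
```

With a different alphabet ordering the arithmetic is identical; only the `ALPHABET` string changes.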
Bot makes error when seeing non-CS1 already-archived URLs
The domain xenu-directory.net was usurped, and a few days ago I manually checked/fixed all occurrences in mainspace. In the case where a citation uses the square bracket method like {{code|...}}:
The following recent edits by GreenC bot include an incorrect tag (and it's still running and adding more):
*https://en.wikipedia.org/w/index.php?title=Aaron_Saxton&curid=26660450&diff=1264164606&oldid=1262817586
*https://en.wikipedia.org/w/index.php?title=APA_Task_Force_on_Deceptive_and_Indirect_Methods_of_Persuasion_and_Control&curid=7589571&diff=1264169964&oldid=1262815584
*https://en.wikipedia.org/w/index.php?title=Brain-Washing_(book)&curid=7881283&diff=1264176579&oldid=1262818821
*https://en.wikipedia.org/w/index.php?title=Citizens_Commission_on_Human_Rights&curid=20949376&diff=1264181897&oldid=1262810467
*https://en.wikipedia.org/w/index.php?title=Hubbard_v_Vosper&curid=37647003&diff=1264202456&oldid=1262815517
*https://en.wikipedia.org/w/index.php?title=Inside_Scientology:_How_I_Joined_Scientology_and_Became_Superhuman&curid=9497749&diff=1264203752&oldid=1262819362
*https://en.wikipedia.org/w/index.php?title=List_of_Masonic_buildings_in_the_United_States&curid=31726117&diff=1264221816&oldid=1262815969
I had to check about 2 dozen of your recent run (shown on my watchlist) and the above errors will need to be fixed by changing their citations into CS1 cite style. I really don't enjoy doing double work! Fix your bot to recognize when a citation is from archive.org instead of a usurped domain. ▶ I am Grorp ◀ 03:48, 21 December 2024 (UTC)
:It's correct. See documentation for {{tlx|usurped}}. -- GreenC 03:51, 21 December 2024 (UTC)
Issues with comment inside reference
Happy New Year! I'm working through :Category:CS1 errors: dates and ran across a couple edits by your bot like [https://en.wikipedia.org/w/index.php?title=Juneau_Monument&diff=1268917726&oldid=1263467142 this one] where the citation template has the {{para|access-date}} commented out for some reason, and your bot doesn't seem to be expecting that. Happy editing! GoingBatty (talk) 21:32, 14 January 2025 (UTC)
:GoingBatty, Sorry, did you find many? I spent time looking at this, and have concluded the bot can't support this without significant work. For now I detect and skip the template. It can edit templates with wikicomments, just not edit fields within templates where the wikicomment co-exists. Also, I'm in the middle of a large batch that was started before this came up. Do you mind if I complete this batch? Hopefully it won't be too many. -- GreenC 03:02, 15 January 2025 (UTC)
::I found two: the one I mentioned earlier and [https://en.wikipedia.org/w/index.php?title=Peter_Hessler_bibliography&diff=prev&oldid=1268927993 this one]. If I find a lot more, I'll let you know. Happy editing! GoingBatty (talk) 04:15, 15 January 2025 (UTC)
Bot contributing to CS1 errors URL backlog
On this edit, the bot populated the url field with wrong input that causes a CS1 url error.––kemel49(connect)(contri) 16:48, 17 January 2025 (UTC)
:User:KEmel49: It's GIGO (Garbage In / Garbage Out). Notice it has "https:/filmograph" .. there is only one slash when there should be two. BTW, the point of error tracking categories is to catch errors; if the bot contributed to the category, it did the right thing - now we see the error where before it was obscure. In the 9 years I have been running this bot, it's the first time I have seen this error. I could try to check for and fix it, I'll take a look, but I think this is an extremely rare error.
:A better question is who made the error. The citation was originally fine, then deleted from the article Special:Diff/1261399458/1261401626. Then re-added a few days later in its broken form Special:Diff/1261446328/1261487369. Even more weirdly, the citation has the fingerprints of reFill (the "website=web.archive.org"), so it looks like it was copied in from a different article. Overall I think the edits made by User:Vialeoncino (who is a WP:SPA, probably a COI - the article has a history of COI problems) look poor quality, and someone might consider reviewing or reverting the whole thing back to August 2024. -- GreenC 17:37, 17 January 2025 (UTC)
::I reverted the article back to August 2024, with a talk page section. -- GreenC 17:49, 17 January 2025 (UTC)
::I fixed all articles on Enwiki that have this problem (about 1,300) per Wikipedia:Link_rot/URL_change_requests#missing_slash (Example Special:Diff/1264570681/1270059998) -- GreenC 18:46, 17 January 2025 (UTC)
::I added a new feature to the bot to detect and fix, if encountered. -- GreenC 19:09, 17 January 2025 (UTC)
:::@GreenC, Thanks for taking time on that. I appreciate your work.––kemel49(connect)(contri) 00:50, 18 January 2025 (UTC)
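For reference, the kind of detect-and-fix described above can be sketched with a regular expression. This is a hypothetical illustration, not the bot's actual implementation:

```python
import re

# Match "http:/" or "https:/" where the colon is followed by exactly one slash.
SINGLE_SLASH = re.compile(r"\b(https?):/(?!/)")

def fix_scheme_slashes(text: str) -> str:
    """Repair scheme separators like "https:/example.com" -> "https://example.com"."""
    return SINGLE_SLASH.sub(r"\1://", text)
```

The negative lookahead `(?!/)` keeps correctly formed `https://` URLs untouched, so the substitution is safe to run across whole article wikitext.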
Bot not escaping single-quote pairs in URLs
This is from July, so it may already be fixed, but FYI just in case. Gamapamani (talk) 08:21, 25 January 2025 (UTC)
:OK. This may be a one-off situation I've never encountered before, probably because it bypassed some functions during the Google URL conversion process. I think your solution Special:Diff/1236608894/1271703664 works so long as the page is being displayed on a Wikimedia website, but it won't work anywhere else, because it mixes two encoding schemes in the URL: percent-encoding and wiki-encoding. It violates one of the core principles of URL encoding, that there be only one encoding scheme (percent encoding), discussed in [https://datatracker.ietf.org/doc/html/rfc3986 RFC 3986]. This is an unfortunate situation on Wikipedia broadly, because many tools, reports, processes etc. use URLs from Wikipedia and often have trouble distinguishing which encoding scheme is in use, since one character might be percent-encoded and the next wiki-encoded; it's ambiguous what "{{" means outside the context of Wikimedia, and even there it can create conflicting information since some URLs actually use those characters. Anyway I changed it here Special:Diff/1271703664/1271793897, thanks for the report. -- GreenC 19:18, 25 January 2025 (UTC)
::Gave you thanks earlier for fixing the "fix" :). I absentmindedly used {{tl|''}} and of course didn't intend to insert a double prime in there... Google seems to be pretty resilient, though (I did try the link out prior to submitting, but didn't notice the visual difference in the URL). I somewhat disagree with (or am missing something in) your point about encoding, though. {{}} is part of the syntax for generating HTML, it's not going to be in the URL (unless by mistake); if anyone is going to use raw wikitext instead, they better be able to parse it as well. It's just that the template didn't actually generate what I thought it would. Ironically (in terms of mixing encodings), the way to generate exactly the URL the goo.gl redirect gives out would seem to be &. %27%27 is of course technically sound, but not textually identical. Gamapamani (talk) 09:16, 27 January 2025 (UTC)
:::You should always use percent encoding in URLs; RFC 3986 is clear on this. Nobody has the resources to correctly determine whether every instance of {{ in a URL is actually a wiki template or part of the URL. Wikimedia rendering (usually) does, but bot and tool writers will and do have a hard time with that. Don't assume anything; percent encoding when creating URLs is the correct approach. -- GreenC 15:15, 27 January 2025 (UTC)
::::The RFC has no bearing on how URLs are generated, otherwise we'd end up disallowing string concatenation in any programming language, etc. I can sympathize with the pain of having to parse wikitext without the official parser, and it might be considerate toward bot writers to not use template syntax inside wikilinks, but I don't think it's a requirement, as long as the meaning is obvious for a human editor and it's valid wikitext resulting in a compliant URL after parsing. Single quotes are not %-encoded by encodeURIComponent() for example, because it's not generally needed (and the RFC has a provision for that). The bot didn't encode them either, and it was only a problem because of wikitext parsing, not for any other reason. Gamapamani (talk) 13:49, 30 January 2025 (UTC)
:::::I disagree that URLs in wikitext are wikitext. It would be like saying URLs in markup are markup. Or even that URLs in HTML are HTML. The entire reason for the IETF RFC is precisely for this problem that arises with multiple encoding schemes inside URLs. -- GreenC 14:38, 30 January 2025 (UTC)
:::::::"https" in wikitext may look like a URL, but it's actually the equivalent of "https" in some other language. The RFC would apply if wikitext were a "final form" directly interpreted by user agents (as HTML is), but it's not (at least as used on Wikipedia). Gamapamani (talk) 18:33, 30 January 2025 (UTC)
:::::::I asked Perplexity.ai - It confirms what you say:
:::::::----
:::::::Key points:
:::::::* The wikitext source file, including templates like {{tld|!}}, is a separate layer of markup that is processed before the final HTML output is generated. This processing occurs within the Wikipedia/MediaWiki system and is not part of the URL itself.
:::::::* The wikitext source file is not the final form of the content and is not directly interpreted by web browsers or other URI consumers. It's an intermediate representation used by the wiki software.
:::::::* The RFC is concerned with the format and encoding of URIs as they are used in web protocols and documents, not with the internal representations or templates used in content management systems.
:::::::In conclusion, the use of wikitext templates like {{tld|!}} in Wikipedia's source files does not violate RFC 3986 because:
:::::::* It's part of an intermediate markup language, not the final URL.
:::::::* The final rendered URL complies with RFC 3986.
:::::::* The wikitext source is outside the scope of what RFC 3986 aims to standardize.
:::::::The mixing of percent-encoding and wikitext-encoding in this context is not a violation, as they operate at different layers of the content creation and rendering process.
:::::::----
:::::::Looks like I've been wrong for 10 years, in multiple discussions. Nobody had an answer before. -- GreenC 00:45, 1 February 2025 (UTC)
::::::::I think this is the first time (that I've been told of, anyway) that an AI has been used to adjudicate the validity of my points, I'm glad it agreed. :) Interesting times! But you've been quite right that all "plain" URL representations in wikitext – the overwhelming majority – should comply with the RFC since they'll be included as-is in HTML; those containing intentional markup to alter them are after all pretty rare – but it can be needed sometimes. Gamapamani (talk) 08:19, 1 February 2025 (UTC)
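To illustrate the percent-encoding point debated above with Python's standard library (my own example, not anything from the thread): characters that collide with wiki markup, such as paired single quotes, can be percent-encoded so the URL text carries no wiki syntax at all.

```python
from urllib.parse import quote, unquote

# A path fragment containing '' -- which wikitext would render as italics.
raw = "file_with_''quotes''.pdf"

# Percent-encode everything outside the unreserved set (safe="" encodes even "/").
encoded = quote(raw, safe="")

# The single quotes become %27, so no wiki markup survives in the URL text,
# and the original string round-trips cleanly.
assert "%27%27" in encoded
assert unquote(encoded) == raw
```

This is the textually-different-but-technically-sound `%27%27` option Gamapamani mentions; tools that consume the wikitext then see a single, unambiguous encoding scheme.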
Bot messing up archive url field
The way bot done this edit, i fear it could mess around with full confidence. kindly fix that.––kemel49(connect)(contri) 12:54, 26 January 2025 (UTC)
:User:KEmel49, another GIGO: [http://www.imir-bg.org/imir/books/malcinstvena%2520politika.pdf%257Curl-status=dead%257Carchive-url=https://web.archive.org/web/20070926235751/http://www.imir-bg.org/imir/books/malcinstvena%2520politika.pdf%257Carchive-date=26 the url] has embedded citation parameters (pipes encoded as %7C), which confused the bot. The solution is to fix the URL. You did: Special:Diff/1271888544/1271948540 .. this is a bigger problem I posted about here. -- GreenC 17:19, 26 January 2025 (UTC)
BOT caused cite error
Hello, in [https://en.wikipedia.org/w/index.php?title=Roundhay_School&diff=prev&oldid=1269978928 this edit] the BOT caused a "Cite error: The named reference ":0" was defined multiple times with different content". The BOT added {{para|archive-url}} and {{para|archive-date}} to the named reference ":0", but there was a second instance of this ref with all of the fields completed which it did not add the extra fields to. It needs to add the fields to both instances or, better still, remove the second instance and just leave the named ref tag with a slash closer. Keith D (talk) 23:38, 4 February 2025 (UTC)
Odd revert acting as anti-vandalism.
The bot made [https://en.wikipedia.org/w/index.php?title=List_of_supertall_skyscrapers&diff=prev&oldid=1276470044 this edit] reverting vandalism. Not complaining, but it doesn't look like this was meant as an anti-vandalism bot. Is there a reason it reverted here? TornadoLGS (talk) 02:31, 19 February 2025 (UTC)
:Honestly I'd prefer not to speak on the specifics in the open, but I think you'll agree the bot did the correct thing. -- GreenC 02:59, 19 February 2025 (UTC)
probably gigo but ...
Special:Diff/1271516870. Thought you should know.
—Trappist the monk (talk) 17:03, 12 March 2025 (UTC)
:I fixed manually a few days before, it was flagged in the logs: Special:Diff/1266394266/1271222183 .. but overlooked the second identical one. -- GreenC 19:13, 12 March 2025 (UTC)
Alone (i-Ten song)
Not sure what happened at Alone (i-Ten song), but this was a weird edit by the bot I had to revert. ✗plicit 05:39, 16 March 2025 (UTC)
:Thanks for fixing. It's a one-off. I know what happened. Ashok Bhadra and Alone (i-Ten song) were assigned the same internal identifier by the bot, and thus shared the same data directory, with overlapping content. I prevent this by using very large numbers based on nanosecond time strings and random numbers; it looks to be about a 1 in 999 trillion possibility that this would occur. Possibly there are conditions related to the computer's clock that make it not as random as I think. In 10 years and millions of edits, I think this is only the second time it has been reported. -- GreenC 06:53, 16 March 2025 (UTC)
::Did more research: a collision would actually be expected about every 1 million pages. This is too frequent. I added two digits to the identifier, making it every 100 million pages on average. I could make it every billion pages but that would increase the size of everything using the identifier; it's a tradeoff. -- GreenC 03:03, 17 March 2025 (UTC)
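The gap between "1 in 999 trillion" per pair and "expected about every 1 million pages" is birthday-problem math: with an ID space of size N, a collision becomes likely once roughly sqrt(2N) identifiers have been drawn. A quick sketch of the standard approximation (my own, not the bot's code):

```python
import math

def collision_probability(k: int, space: float) -> float:
    """Approximate P(at least one collision) among k uniform random draws
    from an ID space of the given size (birthday approximation:
    1 - exp(-k*(k-1) / (2*N))."""
    return 1.0 - math.exp(-k * (k - 1) / (2.0 * space))

# Classic sanity check: 23 people, 365 birthdays -> roughly 50%.
p = collision_probability(23, 365)
```

Adding two digits to the identifier multiplies the space by 100, which pushes the expected-collision point out by a factor of 10 (the square root of 100), matching the 1 million to 100 million improvement described above.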
Archive.today short URLs
Just a quick question: why is there a need to convert the short URL format to long form? I read WP:WEBARCHIVES#Archive.Today, but the reason for doing so doesn't seem to jump out at me. Could someone explain this in more detail? Thanks! --GoneIn60 (talk) 18:24, 17 March 2025 (UTC)
:It's URL shortening, which hides the actual URL, allowing bad actors to insert blacklisted URLs since the filters can't see the actual URL. It also helps real people to see the actual URL. And if/when archive.today ever closed down, we wouldn't know what the URL used to be (for square and bare links) and thus wouldn't be able to move to another archive provider. The WP:WEBARCHIVES link is a technical document meant for bot operators. Better information is at Help:Archiving a source and Help:Using archive.today specifically Help:Using_archive.today#Use_within_Wikipedia. -- GreenC 18:52, 17 March 2025 (UTC)
::That helps greatly, thank you! -- GoneIn60 (talk) 19:05, 17 March 2025 (UTC)
Issue with reference template:!
It seems the bot had an issue at this {{diff|Samuel Alito|1279362071|1277507056}} diff, changing "|" to the {{tl|!}} template.
:Thanks for the report. The problem was a missing "]" Special:Diff/1283472039/1283518020 .. normally it should not matter since it's inside a Wikicomment but in this case it confused the bot which parsed up to the next available "]" which was 2 cites further down. -- GreenC 00:44, 2 April 2025 (UTC)
::My kingdom for a "]". Thanks for looking into it. meamemg (talk) 00:57, 2 April 2025 (UTC)
Stop linking newspapers
Can this bot be told to stop linking newspapers in references? It’s unnecessary and unwanted. - SchroCat (talk) 05:12, 2 April 2025 (UTC)
:MOS:REFLINK Οἶδα (talk) 21:36, 2 April 2025 (UTC)
::That isn’t an answer; let’s try and pretend we can write in sentences, shall we? Bots should not be doing tasks that are not beneficial, which this isn’t. I’ve reverted a few of the changes this made recently and put in a bots deny template, but it shouldn’t have been doing this in the first place. - SchroCat (talk) 04:44, 3 April 2025 (UTC)
:::Apologies, I was merely refuting your assertion that reflinks are "unnecessary and unwanted", which is not borne out by any direct evidence you have thus provided. But more importantly, you seem to be implying that the bot is going around performing edits in which the only revision is the addition of reflinks. That is simply not true in my experience, and not even true for the pages you reverted edits on. Οἶδα (talk) 09:03, 3 April 2025 (UTC)
::::Then you misunderstand. In addition to changing the URL (which is beneficial), it is also wikilinking the newspaper, which it should not be doing. - SchroCat (talk) 09:26, 3 April 2025 (UTC)
:::::No, you misunderstood my comment. You implied it is performing this for all newspapers and it is the ONLY revision being performed. When in reality, the addition of reflinks was alongside the migration of thetimes.co.uk URLs. That was not correct. You were wrong. Οἶδα (talk) 21:44, 3 April 2025 (UTC)
::::::I misunderstood nothing. You were wrong. - SchroCat (talk) 01:11, 4 April 2025 (UTC)
::User:GreenC, I have stopped the bot running as it is undertaking a task that is down to editorial discretion, and not something that should be done en masse by a bot. Just because wikilinking newspapers can be done, doesn't mean it should be done: that is not for an individual running a bot to decide, but for the editors on the individual articles. - SchroCat (talk) 07:03, 3 April 2025 (UTC)
:::Pardon me if I am wrong, but I believe this only applies to the url changes for The Times, which migrated from thetimes.co.uk to thetimes.com. Is it adding reflinks for other newspapers? Οἶδα (talk) 09:14, 3 April 2025 (UTC)
::::As above, you've misunderstood the comment. It's not about changing the URL, it's about adding in an unnecessary wikilink, which it should not be doing. - SchroCat (talk) 09:26, 3 April 2025 (UTC)
:::::You didn't respond to what I wrote (about The Times). But it doesn't matter anymore given the RfC. Οἶδα (talk) 21:45, 3 April 2025 (UTC)
Feature removed. -- GreenC 13:57, 3 April 2025 (UTC)
:RfC: Wikipedia:Village_pump_(proposals)#RfC:_work_field_and_reflinks -- GreenC 20:07, 3 April 2025 (UTC)
bot tripped-up by html comment markup
—Trappist the monk (talk) 19:24, 4 April 2025 (UTC)
:Thanks. I believe this is fixed at the core function. Sometimes wikicomments are within the bounds of an argument, and sometimes cross over into another. -- GreenC 22:54, 4 April 2025 (UTC)
tcm.com
The bot changed url-status from dead to live, even though the link is still dead: [https://en.wikipedia.org/w/index.php?title=List_of_Brian_Blessed_performances&diff=prev&oldid=1285133983]. Mika1h (talk) 16:39, 14 April 2025 (UTC)
:The new link https://www.tcm.com/tcmdb/person/17537%7C23930/brian-blessed#filmography is indeed live. You may be outside the USA, in which case there may be a regional block and there is nothing we can do about it, TCM is difficult that way. In this case, navigate the best you can. Recommend to click on the Wayback Machine link. -- GreenC 16:49, 14 April 2025 (UTC)
Bot error in [[Special:Diff/1287280421]]
The bot erroneously edited the archive link fields out of an empty ref template inside an invisible comment which was placed there for the convenience of editors. silviaASH (inquire within) 05:45, 25 April 2025 (UTC)