Wikipedia:Typo Team/moss/Archive#Statistics

DNA

DNA sequences, like those in

Hmm, I will have to ask around on MoS or something. Thanks for finding that. -- Beland (talk) 01:38, 19 July 2018 (UTC)

:If not, we could make one: a template with, at bare minimum, {{{1}}}. Do similarly for the poem structure patterns. We did this with trade designations for horticultural plants, and it has worked out well: {{tl|tdes}}. It turns out the nomenclature authority requires them (in a scientific name) to be in a different font, so we used kerned monospace (the template supports extra options, but that part was probably a bad idea). Anyway, [https://en.wikipedia.org/w/index.php?title=Special:Search&limit=500&offset=0&ns10=1&search=intitle%3Adna&advancedSearch-current=%7B%22namespaces%22%3A%5B10%5D%7D&searchToken=3s8sojjhrqykzj693lgjfamh2 here's] all "Template:"-namespace pages with "dna" in their titles, and [https://en.wikipedia.org/w/index.php?title=Special:Search&limit=500&offset=0&ns10=1&search=intitle%3Agene&advancedSearch-current=%7B%22namespaces%22%3A%5B10%5D%7D&searchToken=3gd9zep6jzejljw6l7am39o8m here's] those with "gene", in case there's already a template for this (I have not pored over them).  — SMcCandlish ¢ 😼  14:48, 21 July 2018 (UTC)

::Oof, that has resulted in some pretty ugly plant name typography; I wish we hadn't followed the typographical conventions of that source. I do like the idea of a template, though - that would make it easy for anyone who is interested to find all of the DNA sequences on Wikipedia...which is a thing that could happen? I put your code in {{tl|DNA sequence}} and applied that to this article; thanks for throwing that together! I'll ponder poem patterns a bit more. -- Beland (talk) 05:44, 25 July 2018 (UTC)

Fixed

These are due to difficult-to-parse mixtures of tables and templates. ::sigh:: I think I can fix this in code. -- Beland (talk) 00:49, 19 July 2018 (UTC)

:These should be ignored in the next run (20 July 2018 dump or later). -- Beland (talk) 22:28, 25 July 2018 (UTC)

:::ghola is either related to West Bengal, Pakistan, Afghanistan, or related to the Dune universe; Wiktionary does not have either wikt:ghola or wikt:gholas;

::::  24 - "gholas" : of 24 matches only one (Hasnabad (community development block)) is not from the Dune universe

::::427 - "ghola"

::::  59 - "ghola" -"bengal"

:::::of 59, only 7 are not about the Dune universe: Ghoul; Prem Pujari/List of songs recorded by Kishore Kumar (a song title); Mount Paiko/Kharkoo (places); Bogeyman; List of rampage killers/List of rampage killers (familicides) (a town)

:::So this is the plural of a word that is most often a made-up term from the Dune universe, not exactly ready for Wiktionary! What to do? Shenme (talk) 19:12, 28 August 2018 (UTC)

::::Ah, we have a redirect from ghola; I can add redirects to the exclusion list. I'll have to be careful of those with {{tl|R from misspelling}} and variations, and we'll have to go through all untagged redirects and tag those that are also misspellings. (In the end, I think all redirects will be tagged; categorizing them helps projects decide whether or not they are worthy for inclusion in a print version or CD, etc.) -- Beland (talk) 19:53, 13 September 2018 (UTC)

:::::Oh, redirects are already included in the dictionary. I just created a redirect from gholas, so this should be ignored on the next run. -- Beland (talk) 04:43, 24 September 2018 (UTC)

Notes from Apr 2018

Poems

These are patterns used to describe poetry. Not sure they are appropriate for Wiktionary; if not, I will whitelist them. -- Beland (talk) 00:49, 19 July 2018 (UTC)

:There may be a better and even conventionally marked-up way to represent these. Check poetry sources? Maybe they're done as c-d-c-d or whatever.  — SMcCandlish ¢ 😼  14:38, 21 July 2018 (UTC)

Oh, there are lots more where that came from. Maybe these should be tagged or maybe I can fix in code with a pattern recognizer or something. I'll have to ponder. -- Beland (talk) 01:42, 19 July 2018 (UTC)

From longest:

  • 63 [https://en.wikipedia.org/w/index.php?search=abacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacaba abacabadabacabaeabacabadabacabafabacabadabacabaeabacabadabacaba]
  • 48 [https://en.wikipedia.org/w/index.php?search=abaabaababaaabaabababaabaabaababaabaaababaabaaab abaabaababaaabaabababaabaabaababaabaaababaabaaab]
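One way a "pattern recognizer" for strings like these could work is sketched below. This is only an illustration, not the moss code: the function name, the minimum length, and the rule itself (each letter's first appearance must come in alphabetical order) are all assumptions.

```python
def looks_like_letter_pattern(word: str, min_length: int = 4) -> bool:
    """Guess whether `word` is a letter pattern such as a rhyme scheme.

    Heuristic (an assumption, not moss's actual rule): the word is all
    lowercase ASCII letters, and each letter's first appearance comes in
    alphabetical order (a, then b, then c, ...).
    """
    if len(word) < min_length:
        return False
    if not (word.isascii() and word.isalpha() and word.islower()):
        return False
    next_new = 0  # alphabet index of the next letter allowed to appear for the first time
    for ch in word:
        idx = ord(ch) - ord("a")
        if idx > next_new:
            return False  # a letter was skipped, e.g. "abdd" uses d before c
        if idx == next_new:
            next_new += 1
    return True
```

Both strings listed above pass this check, but so do a few real words (for example "abaca"), so a recognizer like this would probably still need a dictionary lookup before anything is suppressed.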

I think these should be capitalized or enclosed in quotes, either of which would prevent them from showing up here as spelling errors. I started a discussion at {{section link|Wikipedia talk:Manual of Style|Rhyme scheme patterns}}. -- Beland (talk) 22:26, 25 July 2018 (UTC)

:Continued at Wikipedia:Typo Team/moss#Repeating patterns. -- Beland (talk) 02:04, 17 August 2018 (UTC)
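To illustrate why capitalizing or quoting would keep these strings out of the report, here is a minimal sketch of a word filter with those two exemptions. It is not the actual moss tokenizer; the regular expressions and the function name are hypothetical, and only straight double quotes are handled.

```python
import re

QUOTED = re.compile(r'"[^"]*"')   # straight double-quoted spans
WORD = re.compile(r"[^\W\d_]+")   # runs of letters (Unicode-aware)

def candidate_words(text: str) -> list[str]:
    """Return the words a spell checker would still look at.

    Assumed rules: anything inside double quotes is skipped, and words
    starting with a capital letter are treated as proper nouns.
    """
    unquoted = QUOTED.sub(" ", text)
    return [w for w in WORD.findall(unquoted) if not w[0].isupper()]
```

With this filter, `candidate_words('The scheme "abab" follows Fakhrmousavi closely')` keeps only `scheme`, `follows`, and `closely`.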

Notes from Jan 2019

  • 1 - Babak Hamidian - wikt:fakhrmousavi - this good enough? "fakhrmousavi" -> "Fakhrmousavi"
  • {{ping|Shenme}} Yes, that fixes it; capitalized words are assumed to be proper nouns and are ignored. -- Beland (talk) 05:28, 24 September 2018 (UTC)
  • 1 - Babakotia - wikt:antipronograde - wikt has wikt:pronograde, not the 'anti-' tho; see definitions [http://www.metaprimate.com/primate-glossary/#antipronograde antipronograde] and [http://www.metaprimate.com/primate-glossary/#pronograde pronograde]
  • 1 - Côn Sơn Island - wikt:oversped - When I corrected this one, a user on my talk page said it was a niche term used for "over-powering" a motor. References are included at my talk page. eggofreason (talk) 15:57, 20 December 2018 (UTC)

Statistics

=2018-04 to 2018-09=

{| class=wikitable
! Misspellings per article
! 2018-04-01 dump (moss 4933ad4)
! 2018-07-01 dump (moss 4933ad4)
! 2018-07-20 dump (moss 5e6b2ce)
! 2018-08-01 dump (moss 0f7ddbf)
! 2018-08-20 dump (moss 032a6be)
! 2018-09-01 dump (moss 816c025)
! 2018-09-20 dump (moss 7e26fe6)
! Total change (to 2018-09-20)
|-
| 0 || 4839889 || 4910541 (+70652) || 4948698 (+38157) || 4956727 (+8029) || 4975895 (+19168) || 4986531 (+10636) || 5066713 (+80182) || (+226824)
|-
| 1 || 319509 || 319315 (-194) || 315926 (-3389) || 312871 (-3055) || 311641 (-1230) || 309785 (-1856) || 268592 (-41193) || (-50917)
|-
| 2 || 104405 || 104591 (+186) || 90630 (-13961) || 89861 (-769) || 89701 (-160) || 89286 (-415) || 71105 (-18181) || (-33300)
|-
| 3 || 40270 || 40099 (-171) || 38430 (-1669) || 37891 (-539) || 37832 (-59) || 37669 (-163) || 29796 (-7873) || (-10474)
|-
| 4 || 22793 || 22739 (-54) || 21069 (-1670) || 20900 (-169) || 20909 (+9) || 20859 (-50) || 16180 (-4679) || (-6613)
|-
| 5 || 13355 || 13331 (-24) || 12561 (-770) || 12392 (-169) || 12357 (-35) || 12315 (-42) || 9483 (-2832) || (-3872)
|-
| 6 || 9398 || 9422 (+24) || 8700 (-722) || 8620 (-80) || 8625 (+5) || 8574 (-51) || 6411 (-2163) || (-2987)
|-
| 7 || 6599 || 6614 (+15) || 6150 (-464) || 6095 (-55) || 6098 (+3) || 6076 (-22) || 4573 (-1503) || (-2026)
|-
| 8 || 5314 || 5312 (-2) || 4854 (-458) || 4832 (-22) || 4812 (-20) || 4839 (+27) || 3474 (-1365) || (-1840)
|-
| 9 || 3992 || 3985 (-7) || 3723 (-262) || 3643 (-80) || 3665 (+22) || 3631 (-34) || 2640 (-991) || (-1352)
|-
| 10-19 || 16753 || 16879 (+126) || 15508 (-1371) || 15437 (-71) || 15497 (+60) || 15458 (-39) || 10260 (-5198) || (-6493)
|-
| 20-29 || 4997 || 4992 (-5) || 4597 (-395) || 4594 (-3) || 4524 (-70) || 4512 (-12) || 2596 (-1916) || (-2401)
|-
| 30-39 || 2169 || 2211 (+42) || 1976 (-225) || 1962 (-14) || 1934 (-28) || 1929 (-5) || 1011 (-918) || (-1158)
|-
| 40-49 || 1177 || 1205 (+28) || 1061 (-144) || 1061 (0) || 1031 (-30) || 1027 (-4) || 525 (-502) || (-652)
|-
| 50-59 || 674 || 695 (+21) || 619 (-74) || 618 (-1) || 560 (-58) || 553 (-7) || 296 (-257) || (-378)
|-
| 60-69 || 453 || 476 (+23) || 420 (-56) || 419 (-1) || 378 (-41) || 377 (-1) || 179 (-198) || (-274)
|-
| 70-79 || 299 || 326 (+27) || 243 (-83) || 241 (-2) || 214 (-27) || 218 (+4) || 119 (-99) || (-180)
|-
| 80-89 || 213 || 218 (+5) || 179 (-39) || 181 (+2) || 177 (-4) || 179 (+2) || 81 (-98) || (-132)
|-
| 90-99 || 140 || 153 (+13) || 131 (-21) || 126 (-5) || 126 (0) || 128 (+2) || 61 (-67) || (-79)
|-
| 100-199 || 456 || 521 (+65) || 434 (-87) || 435 (+1) || 416 (-19) || 414 (-2) || 196 (-218) || (-260)
|-
| 200-299 || 90 || 113 (+23) || 93 (-20) || 95 (+2) || 91 (-4) || 96 (+5) || 44 (-52) || (-46)
|-
| 300-399 || 44 || 45 (+1) || 41 (-4) || 42 (+1) || 41 (-1) || 42 (+1) || 27 (-15) || (-17)
|-
| 400-499 || 19 || 26 (+7) || 21 (-5) || 22 (+1) || 18 (-4) || 18 (0) || 9 (-9) || (-10)
|-
| 500-599 || 12 || 13 (+1) || 13 (0) || 13 (0) || 16 (+3) || 16 (0) || 7 (-9) || (-5)
|-
| 600-699 || 8 || 9 (+1) || 9 (0) || 8 (-1) || 6 (-2) || 7 (+1) || 2 (-5) || (-6)
|-
| 700-799 || 2 || 3 (+1) || 3 (0) || 5 (+2) || 5 (0) || 5 (0) || 1 (-4) || (-1)
|-
| 800-899 || 2 || 3 (+1) || 3 (0) || 3 (0) || 3 (0) || 2 (-1) || 0 (-2) || (-2)
|-
| 900-999 || 6 || 7 (+1) || 6 (-1) || 6 (0) || 5 (-1) || 5 (0) || 0 (-5) || (-6)
|-
| 1000-1999 || 25 || 27 (+2) || 27 (0) || 26 (-1) || 24 (-2) || 23 (-1) || 0 (-23) || (-25)
|-
| 2000-2999 || 3 || 5 (+2) || 5 (0) || 5 (0) || 5 (0) || 3 (-2) || 0 (-3) || (-3)
|-
| 4000-4999 || 0 || 2 (+2) || 2 (0) || 2 (0) || 3 (+1) || 1 (-2) || 0 (-1) || (0)
|-
| Parse failed || 193671 || 194777 (+1106) || 191813 (-2964) || 195147 (+3334) || 203588 (+8441) || 203583 (-5) || 201420 (-2163) || (+7749)
|}

The spell checker has been getting smarter over time, so more recent versions report fewer false alarms. This explains most of the drop in the number of possible typos reported. Most of the gains for pages with more than 100 possible typos are due to changes that ignore pages with {{tl|cleanup}} and similar tags, which indicate the page may not be ready for spell checking. I have been specifically tagging pages with a high number of possible typos to bring them to the attention of interested editors. Pages tagged for cleanup are reported in the statistics of cleanup-related work queues.
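As a rough illustration of skipping tagged pages, the sketch below scans wikitext for template names and checks them against a skip list. It is not the moss implementation: the regular expression, the function name, and the contents of `SKIP_TEMPLATES` (only `cleanup` is named above; the "similar tags" are not listed here) are assumptions.

```python
import re

# Hypothetical skip list; the real set of "similar tags" is not given above.
SKIP_TEMPLATES = {"cleanup"}

# Captures the template name at the start of {{name}} or {{name|...}}.
TEMPLATE_NAME = re.compile(r"\{\{\s*([^|{}]+?)\s*(?:\||\}\})")

def should_skip(wikitext: str, skip: set[str] = SKIP_TEMPLATES) -> bool:
    """Return True if the page carries any template on the skip list."""
    for match in TEMPLATE_NAME.finditer(wikitext):
        if match.group(1).strip().lower() in skip:
            return True
    return False
```

A page containing `{{cleanup|date=July 2018}}` would be skipped, while one that only mentions the tag via `{{tl|cleanup}}` would not, because the template actually being invoked there is `tl`.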

Some variation in the number of typos fixed between runs is also explained by the differences in the amount of time between runs. The biggest sources of variance are the unusually long time between the first two runs and the fact that dumps snapshotted on the first day of the month (which have a lot of additional data the spell checker doesn't need) take longer for Wikimedia servers to generate than the dumps snapshotted on the twentieth day of the month. There is also considerable activity from other editors writing new material and correcting typos as they find them while reading or editing articles.

moss project participants have been correcting hundreds or thousands of typos per month (yay!) mostly in articles with a single typo. We have also been adding somewhere from handfuls to dozens of entries to Wiktionary a month. Looking only at the generated reports, these numbers are difficult to separate from the other changes in data and code, but we do see progress as we strike through or remove items from the todo lists.

Since figuring out which words are not typos is such a big part of the problem to be solved, the code may need to get smarter in the future, but we're probably going to have an upcoming period of relative stability as we work through some low-hanging fruit. Hopefully upcoming statistics will reflect progress in actually reducing typos more than changes in spell checker code. -- Beland (talk) 18:20, 12 October 2018 (UTC)

=2018-09 to 2019-03=

At least 10% of possible typos reported in the old statistics are definitely misspellings, but it's unclear how many of the remaining 90% are. Below is a new way of breaking down possible typos, by type instead of count per article. The "T1" items are almost all typos, and those are what we've been working on in the main "by article" section. Some of the other types have their own reports on this page, but most will require further analysis to either automatically distinguish typos vs. legitimate strings, or produce a more useful report for human editors.

class="wikitable sortable"
Reporting symbol

!Explanation

!Instances/Unique strings, 2018-09-20 dump (7e26fe6)

!Instances/Unique strings, 2018-10-20 dump (7649023)

!Instances/Unique strings, 2018-11-01 dump (0aa8575)

!Instances/Unique strings, 2018-12-20 dump (03be966)

!Instances/Unique strings, 2019-01-20 dump (1bcf51c)

!Instances/Unique strings, 2019-02-01 dump (c6ce3ab)

!Instances/Unique strings, 2019-03-01 dump (ff8b9d2)

!Instances/Unique strings, 2019-03-20 dump (692642d)

TSMissing or extra whitespace or dash (or new compound)152985/84720194758/114535194711/114518195044/114675192811/114167193752/114734191701/113928183795/109989
T1Edit distance 1 from common English word111429/70527104280/68352103043/6765296081/6451389549/6101889355/6087983353/5748375941/53339
T2Edit distance 2 from common English word82638/5351781793/5314681721/5319181536/5309381170/5298082727/5394581410/5332672093/47849
T3Edit distance 3 from common English word91844/6133290769/6071390778/6076090382/6057489841/6039791893/6156690328/6082579609/54610
T4Edit distance 4 from common English word76336/5268475139/5209075006/5210174757/5182874536/5175276323/5293875335/52296-
T5Edit distance 5 from common English word52071/3645050970/3580750882/3581250614/3564950571/3562451785/3644650852/36022-
T6Edit distance 6 from common English word30437/2192729755/2148129704/2147829490/2130229440/2128030134/2175929685/21506-
T7Edit distance 7 from common English word15392/1109514972/1085414977/1085814858/1073614765/1069815153/1093914929/10790-
T8Edit distance 8 from common English word7138/50606966/49366970/49476911/49026863/48816967/49596811/4886-
T9Edit distance 9 from common English word2450/18682383/18232380/18222349/18222348/18192407/18672386/1848-
T10Edit distance 10 from common English word1027/721987/705986/706995/702978/697992/708960/693-
T11Edit distance 11 from common English word399/324390/317389/316380/312378/309386/315388/316-
T12Edit distance 12 from common English word122/105119/102119/102120/103117/101118/101118/101-
T13Edit distance 13 from common English word44/2944/2944/2944/2945/3045/3045/30-
T14Edit distance 14 from common English word15/1314/1214/1213/111/16/55/5-
T15Edit distance 15 from common English word1/11/11/10/01/10/00/0-
T16Edit distance 16 from common English word2/20/00/00/00/01/11/1-
RA-Z only, not near a common English word168446/121107165841/119452165960/119619165403/119208165091/119086169103/121936166235/120111101178/77389
ILetters with accents or mixed with punctuation (other than hyphen)266937/143960261310/144833261653/145040263654/145754263679/146027275444/153887229579/14930393902/70014
WNot in English Wiktionary, in non-English Wiktionary-------82548/48389
LProbable Romanization (transLiteration)-------4294/2610
MEProbable coMpound, English-------51279/33301
MIProbable coMpound, non-English (International) in English Wiktionary-------194949/133055
MWProbable coMpound, found in non-English Wiktionary-------51656/36961
MLProbable coMpound, transLiteration-------4010/2791
CChemistry words6581/46046597/46196613/46296631/46386633/46246618/46186637/46251853/1399
DDNA sequences (a, c, g, t)51/1815/316/416/415/315/32/20/0
NA-Z plus numbers and hyphens25061/2011425728/2085425702/2084625748/2089925582/2073726201/2125525969/2113026620/21685
PPatterns (e.g. rhyme schemes)808/461796/484790/484778/478736/439744/443493/42347/33
HHTML/XML/SGML tag------3389/15923519/1593
HBKnown bad HTML tag, like ------14417/4915366/49
HLBad HTML-like linking, like ------519/5516/5
Parse failureMismatched punctuation???202583203044203611214525199130 articles
Total1092214/6906391113627/7151481112459/7149271105804/7112321095150/7066711120169/7233341075547/7112961043175/695061

=2019-03 to 2020-02=

From 2018-09-20 to 2019-03-01, the number of typos classified as T1 (edit distance 1 from an English word, the most likely to be actual misspellings) dropped by 35,488, or 32%, and this appears to be due to the hard work of editors participating in the moss project fixing typos on the T1 lists. Amazing progress! The numbers for categories we aren't fixing have remained relatively stable, though for all categories there is some bouncing around as new typos are created and fixed in the normal course of writing and editing articles.
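The 35,488 figure matches the difference between the first and last T1 instance counts in the table above (111429 and 75941). A small sketch of the arithmetic, with a hypothetical helper name:

```python
def change(old: int, new: int) -> tuple[int, int]:
    """Return (absolute change, percent change rounded to a whole number)."""
    delta = new - old
    return delta, round(100 * delta / old)

# T1 instance counts taken from the first and last columns of the table above.
t1_first, t1_last = 111429, 75941
delta, percent = change(t1_first, t1_last)
print(delta, percent)  # -35488 -32
```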

While processing the 2019-03-01 dump, I made a major change to how typos are classified. (You can see the old method in the archived statistics.) I've dropped categories with an edit distance greater than 3 from an English word (T4 thru T16) since these are quite unlikely to be misspellings. Most of the reported typos that are not likely English misspellings are either compound words or non-English words. (Some of the non-English words are also misspelled.) Some English compounds end up as TS, if they are caught by a conventional spell checker; the rest are now classified as ME. (There are various other categories for compounds, all starting with M, and these will all need to be refined later because a fair number of words are up there that don't belong.) In an effort to exclude as many non-English words as possible, I've started looking at non-English Wiktionaries; any words found there but not in the English Wiktionary are classified as W. Romanizations are not eligible for Wiktionary; words native to non-Latin writing systems are entered under those other systems. I've written some code that attempts to perform transliteration from any given writing system. It's starting to catch a few thousand words (classified as L) but is obviously missing a lot and so will need to be further refined. I've also added some categories for bad HTML tags and similar problems.

Since the classification changes make the new numbers incomparable with the old numbers, I've started a new table below. I've started posting some TS typos as well as T1s, so expect both of those numbers to improve significantly in the coming months. -- Beland (talk) 07:30, 23 March 2019 (UTC)

class="wikitable sortable"
Reporting symbol

! Explanation

! Change from 2019-03-01 to 2020-02-20

! Instances, 2019-03-01 dump (692642d)

! Instances, 2019-03-20 dump (802b6c0)

! Instances, 2019-04-01 dump (ab3fabd)

! Instances, 2019-04-20 dump (7bb97ba)

! Instances, 2019-05-01 dump (dcb388a)

! Instances, 2019-05-20 dump (dcb388a)

! Instances, 2019-06-01 dump (30a59f6)

! Instances, 2019-07-01 dump (2fc381f)

! Instances, 2019-07-20 dump (41f99ab)

! Instances, 2019-08-01 dump (bc954d6)

! Instances, 2019-08-20 dump (c600526)

! Instances, 2019-09-01 dump (4660042)

! Instances, 2019-09-20 dump (18f7307)

! Instances, 2019-10-01 dump (08a1438)

! Instances, 2019-10-20 dump (e07a89f)

! Instances, 2019-11-01 dump (e07a89f)

! Instances, 2019-11-20 dump (e07a89f)

! Instances, 2019-12-01 dump (95d1a53)

! Instances, 2019-12-20 dump (0434c67)

! Instances, 2020-01-20 dump (99af116)

! Instances, 2020-02-20 dump (99af116)

bgcolor=red| TSMissing or extra whitespace or dash (or new compound)-39368 (-21%)183795182018 (-1777/.97%)178591 (-3427/1.9%)177391176266175163173312170828168401166966164205161344160707157832155980155218152621147666146591144424144427
bgcolor=red| T1Edit distance 1 from common English word-36192 (-48%)7594173600 (-2341/3.1%)70756 (-2844/3.9%)692616879066099647326125557141551605198748904459264427540436392853910639721393013873739749
bgcolor=red| T2Edit distance 2 from common English word-7560 (-10%)7209371615 (-478/.66%)70949 (-666/.93%)709097068470247697416962969365692666914668748686576716166173655896495264890648866469164533
bgcolor=red| T3Edit distance 3 from common English word-5276 (-7%)7960978925 (-684/.86%)78209 (-716/.91%)781397804677541769547688776672766917666375998760617509674636743277399574030745517441974333
bgcolor=yellow| RRegular word (A-Z only) not near a common English word-3525 (-3%)101178100067 (-1111/1.1%)99491 (-576/.58%)997229969499236988569878898646984989841197438975889686596775967469649096593969489734297653
bgcolor=yellow| IDefinitely not English (International) due to accents or mixed with punctuation (other than hyphen)-22196 (-24%)9390290875 (-3027/3.2%)88564 (-2311/2.5%)877488792584690810428128482263824128243171982712407024870349703857051070468707147085671706
bgcolor=lightblue| WNot in English Wiktionary, in non-English Wiktionary-6764 (-8%)8254882519 (-29/.04%)80041 (-2478/3.0%)796647948677888763107630976224761777614275508762487526374906748167485174991752947566375784
bgcolor=lightblue| LProbable Romanization (transLiteration)+81 (+2%)42944306 (+12/.28%)4206 (-100/2.3%)421942374197416841814189418841914191423441154126413241824195422842824375
bgcolor=lightblue| MEProbable coMpound, English (with and without dash)+976 (+2%)5127951052 (-227/.44%)50845 (-207/.41%)509325090250659502635035250439504195070050606507085039251830517915178251830520265217352255
bgcolor=lightblue| MIProbable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash)-18475 (-9%)194949192743 (-2206/1.1%)189661 (-3082/1.6%)189758190172187870184497185101185733185960186074175904176069174746173592173700173611173710174881175528176474
bgcolor=lightblue| MWProbable coMpound, found in non-English Wiktionary-5544 (-11%)5165651240 (-416/.81%)50288 (-952/1.9%)500264978548728476414764247544478314755546854468504634246232460264594445968460314594746112
bgcolor=lightblue| MLProbable coMpound, transLiteration-124 (-3%)40103964 (-46/1.1%)3925 (-39/.98%)388138923835382938273826385738533849385237793750375937863798383438633886
bgcolor=lightblue| CChemistry words-176 (-9%)18531855 (+2/.11%)1863 (+8/.43%)186218581864156915591554156015611552155116651662165116351639165716621677
bgcolor=lightblue| DDNA sequences (a, c, g, t)000 (-)0 (-)000111000000000000
bgcolor=yellow| NA-Z plus numbers and hyphens-1391 (-5%)2662025854 (-766/2.8%)25711 (-143/.56%)257392626326134259452584125703256502566426664257762555725245250722494224993251192510725229
bgcolor=yellow| PPatterns (e.g. rhyme schemes)-20 (-43%)4750 (+3/6.4%)49 (-1/2.0%)504847504945423837391718161719211927
bgcolor=yellow| HHTML/XML/SGML tag-539 (-15%)35193459 (-60/1.7%)3423 (-36/1.0%)342034043237319731603173318031903059307830033016367330123019301929782980
bgcolor=red| HBKnown bad HTML tag, like -1080 (-7%)1536614837 (-529/3.4%)14541 (-296/2.0%)147761462216313162861681816816?1555814620155251526214494148911487215003151161416414286
bgcolor=red| HLBad HTML-like linking, like -98 (-19%)516510 (-6/1.2%)501 (-9/1.8%)500497492491496492493492474482459448449441441446433418
bgcolor=yellow| UURL-94 (-7%, from 2019-03-20)-12841242 (-42/3.3%)123512221225121812251227121312001219121311921197119611941199120511921190
bgcolor=yellow| BCBad characters-12678 (-6%, from 2019-09-01)-----------205046*196231194847194674194281192895192845192679192523192368
bgcolor=yellow| BWBad words-6542 (-5%, from 2019-09-20)-----------306181*120289*115983116073115612115522117419115418114602113747
Total-39115 (-3%, from 2019-09-20)1043175 instances1030773 instances (-12402/1.2%)1012856 instances (-17917/1.7%)100923210077939954659801029752329694549648289590611440178* instances1242324* instances1224099 instances1215612 instances1212615 instances1206360 instances1204437 instances1203965 instances1200605 instances1203209 instances
bgcolor=red| Parse failureMismatched punctuation-5145 (-3%)199130 articles200032 articles (+902/.45%)195598 articles (-4434/2.2%)195995 articles196330 articles196566 articles196882 articles197380 articles197810 articles198086 articles198442 articles158283 articles + 40465 MOS:STRAIGHT violations158564 articles + 40523 MOS:STRAIGHT violations151604 articles + 39214 MOS:STRAIGHT violations151827 articles + 39333 MOS:STRAIGHT violations152017 articles + 39428 MOS:STRAIGHT violations152167 articles + 39590 MOS:STRAIGHT violations152254 articles + 39727 MOS:STRAIGHT violations152557 articles + 39971 MOS:STRAIGHT violations152835 articles + 40112 MOS:STRAIGHT violations153494 articles + 40491 MOS:STRAIGHT violations

* Affected by significant algorithm changes. 1 Sep 2019: Added BC and BW. (Parse failures dropped due to JWB-powered MOS:STRAIGHT cleanup.) 20 Sep 2019: BC and BW restricted to lowercase; added TS+COMMA, TS+BRACKET, TS+EXTRA.

  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing

=2020 statistics=

In the year from March 2019 to March 2020, moss volunteers fixed over 94,000 typos! The most impressive progress is in the T1 category (single-letter misspellings), where we eliminated about half from the English Wikipedia. During this period we also started fixing missing spaces (focusing on those around punctuation) and those have dropped by about one-fifth. As we make progress, clear misspellings are increasingly mixed in with unclear cases; I'll be doing some more work on separation algorithms to keep the typo reports useful, so you'll probably see some more changes to typo classifications. Thanks to everyone who has been helping out! -- Beland (talk) 16:54, 28 April 2020 (UTC)

class="wikitable sortable"
Reporting symbol

! Explanation

! Change from 2019-03-01 to 2020-02-20

! Instances, 2020-04-01 dump (9f6d726)

! Instances, 2020-04-20 dump (5ff589d)

! Instances, 2020-05-01 dump (1a96ded)

! Instances, 2020-05-20 dump (e511f74)

! Instances, 2020-06-01 dump (509f79a)

! Instances, 2020-06-20 dump (825ceb4)

! Instances, 2020-07-01 dump (db9db23)

! Instances, 2020-07-20 dump (caa619f)

! Instances, 2020-08-01 dump (cf76e8c)

! Instances, 2020-08-20 dump (f104e58)

! Instances, 2020-09-01 dump (4654d88)

! Instances, 2020-09-20 dump (a26ccca)

! Instances, 2020-10-01 dump (686f5db)

! Instances, 2020-10-20 dump (4f90810)

! Instances, 2020-11-01 dump (ac54580)

! Instances, 2020-11-20 dump (6dbd61d)

! Instances, 2020-12-01 dump (917bcc8)

! Instances, 2020-12-20 dump (0b3409d)

bgcolor=red| TSMissing or extra whitespace or dash (or new compound)-39368 (-21%)145297144673331658**330624328249325399324179322282321801318621317183315825314747312110310537309386308280308977
bgcolor=red| T1Edit distance 1 from common English word-36192 (-48%)410904108139967394523878338379384363827137803367833597634036335393376432347330973355933427
bgcolor=red| T2Edit distance 2 from common English word-7560 (-10%)645266326360690603215958958603586495852158200580855784557329571525748757387575115738657348
bgcolor=red| T3Edit distance 3 from common English word-5276 (-7%)743967325570516700396888768192681496802067769677886748267226670256710167002672136729867399
bgcolor=yellow| RRegular word (A-Z only) not near a common English word-3525 (-3%)977269691694793938559325291537914899174691521917299151391613913399181392329932469337793493
bgcolor=yellow| IDefinitely not English (International) due to accents or mixed with punctuation (other than hyphen)-22196 (-24%)721516911865842648276363061844618886178261899621136191662003620496227462287623906223462471
bgcolor=yellow| WNot in English Wiktionary, in non-English Wiktionary-6764 (-8%)759137435186935856048317381894819468217381943821708191281968817928125681052812248113181192
bgcolor=lightblue| LProbable Romanization (transLiteration)+81 (+2%)443544864266419941204122410441134137414041514164416542074203423442404260
bgcolor=lightblue| MEProbable coMpound, English (with and without dash)+976 (+2%)522694876147187471534683046856469674716347052471704700947070470664704547023471934714247302
bgcolor=yellow| MIProbable coMpound, non-English (International) in English Wiktionary (both A-Z and non-ASCII characters, with and without dash)-18475 (-9%)177646176929171484169592166216164828165140165351165605166016166208166499166572167349167961169044168953169409
bgcolor=yellow| MWProbable coMpound, found in non-English Wiktionary-5544 (-11%)461134510343501429314043641383413254144041173412344099040956407954035340272404544041140338
bgcolor=lightblue| MLProbable coMpound, transLiteration-124 (-3%)390938743707366336723575358935933628363936583717372437793769382538303822
bgcolor=lightblue| CChemistry words-176 (-9%)178275647530764476407655765876597660766276547644765976617665765976747700
bgcolor=yellow| NA-Z plus numbers and hyphens-1391 (-5%)252092381322650225112229022020220522205321971220092196021923218792185621885218982189321943
bgcolor=yellow| ZDecimal fraction missing leading Zero-47*0*11405**114181141411398114021142111455115301154611578115981166911683117031172811762
bgcolor=yellow| PPatterns (e.g. rhyme schemes)-20 (-43%)27287977322454554555
bgcolor=yellow| HHTML/XML/SGML tag-539 (-15%)301028862938290329042848269326972680274727572729256525692542253825402572
bgcolor=red| HBKnown bad HTML tag, like -1080 (-7%)14465141211290313928129191473314022114281167011198101918860875688429725110881016410556
bgcolor=red| HLBad HTML-like linking, like -98 (-19%)414418377394394421408425420413373359356329324315318328
bgcolor=yellow| UURL-94 (-7%, from 2019-03-20)117911521118113411171122112911241120112411241103110110991091109610501055
bgcolor=yellow| BCBad characters-12678 (-6%, from 2019-09-01)192230190482186651186517185572178698175325166116159095124158112959112755112695112633112479110608110025109808
bgcolor=yellow| BWBad words-6542 (-5%, from 2019-09-20)113682106327381288**380259378710374982375107375206375431375306374622374740374560375010375008375557374989375663
Total-39115 (-3%, from 2019-09-20)1207516 instances1188601 instances1647413** instances1638977 instances1619804 instances1600496 instances1595660 instances1582586 instances1574035 instances1535639 instances1519034 instances1514101 instances1511139 instances1510211 instances1508575 instances1511284 instances1508227 instances1510830 instances
bgcolor=red| Parse failureMismatched punctuation-5145 (-3%)154084 articles + 40705 MOS:STRAIGHT violations153033 articles + 40838 MOS:STRAIGHT violations214365 articles + 37697 MOS:STRAIGHT violations214463 articles + 37667 MOS:STRAIGHT violations214101 articles + 37607 MOS:STRAIGHT violations214465 articles + 37767 MOS:STRAIGHT violations214732 articles + 37849 MOS:STRAIGHT violations215081 articles + 37993 MOS:STRAIGHT violations215447 articles + 38067 MOS:STRAIGHT violations215915 articles + 38169 MOS:STRAIGHT violations216227 articles + 38210 MOS:STRAIGHT violations216472 articles + 38205 MOS:STRAIGHT violations216738 articles + 38213 MOS:STRAIGHT violations216991 articles + 38246 MOS:STRAIGHT violations217192 articles + 38338 MOS:STRAIGHT violations217660 articles + 38498 MOS:STRAIGHT violations217861 articles + 38625 MOS:STRAIGHT violations| 218207 articles + 38789 MOS:STRAIGHT violations

  • red = Probably need to fix
  • yellow = Unsorted
  • blue = Probably OK (but may need to verify)
  • bold = actively working on fixing

* Identification of Z was broken

** Affected by major bug fix for counting inter-word typos (e.g. involving punctuation)

=2021 statistics=

class="wikitable sortable"
Dump (moss version)

! Parse failures (articles + articles with MOS:STRAIGHT violations)

! TOTAL (instances) || BC || BW || C || H || HB || HL || I || L || ME || MI || ML || MW || N || P || R || T1 || T2 || T3 || TS || U || W || Z || D

2021-01-01 (b4af24a)218317 + 3884115058081086613758757705255010726311625834262472741695043841401312195449337332968569036681930644510548111211753
2021-01-20 (a249b2d)218455 + 3893015069401080303760797679261611036298627464298470441702343885399602195949346733598566886668830677610428104911764
2021-02-01 (8279235)218833 + 3896015060041070003759797677259511729298628294305470531710053888397712197129372633237568226670730557310358107911723
2021-02-20 (2f00c51)218991 + 3903515040641065343759097682260211697275629424342470361713133897397322200939395932705565296661730446310208104111757
2021-03-01 (248159a)219198 + 391551494162106421376305766926249291281629784328468301696663876391892193649222132762561976606930237710208033811780
2021-03-20 (57aaae7)219556 + 393711492923106284375853769526109965278630554331470641704533880391722199829272132523560526608729975110028030511842
2021-04-01 (d47c725)219692 + 39478148487910567037575776972620885720562842430946966170369388438886219640925753216055810657062960099957973611862
2021-04-20 (d169566)220014 + 39634147647710450537454876862648886319962668432747036170547387838644219734923363056055284651912931709857948711938
2021-05-01 (7719363)219292 + 39601144581910325336723676612387768217859749396644397165787377438591216974914483066656556652572839679807863411949
2021-05-20 (c6359fc)219284 + 39761144457010279436825876782271787817659913397844514166538380438629217254918872920556341651712820939837865112079
2021-06-01 (076f14c)219111 + 39759144176910240936804676892275782716659876394344658166622381838567217555920772850756157649192806459757868212151
2021-06-20 (ffbc72f)219625 + 39935143533010192636752276942276710816259650396444692167038381938298216878923652802055983646882765389557862112316
2021-07-01 (cb3d5e8)219791 + 39990143341510191636758177042263692116959663396044770167508383738299216748926002736955755643012750249467872012427
2021-07-20 (5c3b9e9)220086 + 40132142962710151836795476882136670213759995395544805167818382438179216467926602646955565641712721479507862412677
2021-08-01 (86e7022)220338 + 40213142444810122936755277082123625212161727376744851168279381236769216430931462655555547641242714069537418912695
2021-08-20 (33a14e3)220370 + 40254141485410097336717277192047573611959520374644729167010381137772215372927632414654950635712667619607707512735
2021-09-01 (90e0a3b)220449 + 40268141119410011336711077142046580112059567373344623167222382437710215252928332331054796634552650449537692612767
2021-09-20 (c71a444)220781 + 4032814121409963536728677132040565012159595376644828167997384337719215610937012292454661635752647759487696612836
2021-10-01 (cdd699c)221094 + 40362140544899065367498768320605774111595463710445791673573831376962138129302722576542686313426146395276883128511

A major upgrade to word categorization was made in October 2021. The same dump (2021-10-01) is shown under both the old and new systems for comparison. The categories R, I, W, MI, MW, and ML were eliminated; their entries are now sorted by language into TE or TF instead. New categories:

  • A = mAth
  • T/ = Suspected MOS:SLASH violation
  • TE = AI thinks it's trying to be English
  • TF = AI thinks it's trying to be a non-English language (Foreign to English Wikipedia), sorted by language (e.g. TF+el)
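As a rough illustration of the "sorted by language" codes above, a code such as TF+el can be split into its category and language parts. This is only a sketch: the helper name `split_category` is hypothetical and not taken from moss.

```python
def split_category(code):
    """Split a category code like 'TF+el' into (category, language).

    Hypothetical helper for illustration only; moss itself may
    represent these codes differently.
    """
    base, sep, lang = code.partition("+")
    # Codes without a '+' (e.g. 'TE', 'T/') carry no language suffix.
    return base, (lang if sep else None)


print(split_category("TF+el"))
print(split_category("T/"))
```

Keeping the language as a separate field would make it easy to group TF entries per language.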

Dump (moss version)

! Parse failures (articles + articles with MOS:STRAIGHT violations)

! TOTAL (instances) || A || BC || BW || C || H || HB || HL || L || ME || N || P || T/ || T1 || TE || TF || TS || U || Z

2021-10-01 (2ec07e4)221094 + 40362145764417030175488367537404920605774111542823795923293732375410810076439099118822164912851
2021-10-20 (b44e087)221396 + 404151452333224331737013817767762203253419553992194822351632525367910151438103112265161312892
2021-11-01 (0786728)221592 + 40396147699622385974234817997793157351229753992196382297932465354610145440061111957160712899
2021-11-20 (34069e9)153165 + 42992149100023808999454979957816160955871115688222435234093373535169847426498116119164212662
2021-12-01 (0fc2fb3)153177 + 42994148902523727997824969057828155856021045702222571234683359534059816425937116070162712678
2021-12-20 (d20f520)153289 + 42902148855023761990744969047845156156011085715223063235143337535809806425623115890161812709

=2022 statistics=

Dump (moss version)

! Parse failures (articles + articles with MOS:STRAIGHT violations)

! TOTAL (instances) || A || BC || BW || C || D || H || HB || HL || L || ME || N || P || T/ || T1 || TE || TF || TS || U || Z

2022-01-01 (92506e2)153265 + 429191488043237309894949687278720156157121085744222842235583337530209801425923115845160812756
2022-01-20 (f63dc78)153371 + 428941490532237299843349731578751160361581085794223402234553325530579667426560116722159412839
2022-02-01 (8fbf720)153444 + 43002162162723804983664975517934115796051108600724021623811333345872411652531477117630159913200
2022-02-20 (8245233)153724 + 43135162245923835980834977667956116045177102599924049723701432815938411661531576118343161613194
2022-03-01 (8245233)153733 + 43208162442723837981074978557989115715815102602724078923711632785974411669531890118567160813191
2022-03-20 (fb66b79)153882 + 43327162450923823979614984667996115524746106605924119223631533116005811638531382119054160113185
2022-04-01 (fb66b79)153932 + 43430162645223823978284980858000115944793105606324171823751633276057211642532088119684159113147
2022-04-20 (fb66b79)154017 + 43596163048623789978414986118012116074990105606524294023741733376097711649532927120483158713174
2022-05-01 (fb66b79)153825 + 43698163128723793978014986328020116095048104607324330623842033376145311694533878119359157913196
2022-05-20 (cc63e5f)153870 + 43814163517423851977184980908043116364925107610324398623851933375955011866538310120406157413267
2022-05-20 (ae346b0)*164831 + 29862162079723846925224877928099116314930110607624485123081833356017011838538751119670158013269
2022-06-01 (6090418)164899 + 29887162020923786924024875128099116204620113609024501723091633316031811803538115120085158713385
2022-06-20 (97d23b9)164770 + 29816161795223775917994867128102016114705116608724519023191333005966611763538585119215156813426
2022-06-20 (1432a2f)†164877 + 29821167785523781918165475348102016114706116607124515323181332975965911764537643119292155413425
2022-07-01 (9ab6dad)164769 + 298551674273237329158554788181130164446571166110244376229514332615928611657535628118761155913469
2022-07-20 (06d752b)164636 + 298501674512236059117254755881110166348561266127244725229414432725885711659536841118429155013523
2022-08-01 (622271d)164730 + 298651675287235939091254759080800166049261276144244829228414532735890811604537355118773155313531
2022-08-20 (597dbd2)163908 + 298081667614235089056154471080810165351371216136243853228712232345816311473536597117099153513344
2022-08-20 (5ee7ffd)‡162500 + 2958012105781068186656540463798101611513612220731826721964114230743457658220607297829152213336
2022-08-20 (6965e1f)⹋162432 + 2956712058691066986557538964797901610513112220411814811963114229843278654020457597689152013338
2022-09-01 (cda0784)161909 + 294681198769106638616153644079900160353991201977180548194599227042927644520265196760148513286
2022-09-20 (4689b50)162154 + 295941199166106768592453659979810162167301251985180428195099226742279638320232796972148713333
2022-10-01 (e725bbd)161370 + 294501193722106468499953442979810162369881231964179378193499225942089635620154796530146613311
2022-10-20 (e725bbd)161347 + 2954611925911063284851534850799801623698712119811785001921101227141414626420135896915145413350
2022-11-01 (ebbea0e)161388 + 2960311924551063484376535156803601633650511619761785461917102227041341621720146397334145013383
2022-11-20 (84f0fc4)161548 + 2968311934781065984327535811811201614662211519701788171918102225941326618720118097563144413452
2022-12-01 (d57116b)161334 + 2974111936261065084229536307812401604650311019811788441913102226241018618120109097779144613483
2022-12-20 (003741b)161351 + 29828118903510658839725350958218015924957110197117883119171223641413617719880798124143113525

* ae346b0 started ignoring content inside curly quotes

† 1432a2f excluded more end sections

‡ 5ee7ffd started ignoring italicised content

⹋ 6965e1f started ignoring content inside single quotes
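The footnotes above only say that moss started ignoring certain spans; they do not show how. A minimal sketch of one way to do it, assuming the text is a plain string with wikitext-style `''italics''`, is to blank out those spans before tokenizing. All names and patterns here are hypothetical, not moss's actual code.

```python
import re

# Hypothetical patterns, for illustration only.
CURLY_QUOTED_RE = re.compile(r"“[^”]*”")                 # “...”
ITALIC_RE = re.compile(r"(?<!')''(?!')[^']*''(?!')")     # ''...'' (not bold)
SINGLE_QUOTED_RE = re.compile(r"(?<!\w)'[^']+'(?!\w)")   # '...' but not don't


def strip_ignored_spans(text):
    """Replace quoted and italicised spans with a space so that the
    words inside them are never seen by the spell checker."""
    for pattern in (CURLY_QUOTED_RE, ITALIC_RE, SINGLE_QUOTED_RE):
        text = pattern.sub(" ", text)
    return text


print(strip_ignored_spans("He said “teh cat” loudly").split())
print(strip_ignored_spans("The ''Ursus arctos'' bear").split())
```

Replacing each span with a space rather than an empty string keeps the neighbouring words from being fused into one token.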

=2023 statistics=

Dump (moss version)

! Parse failures (articles + articles with MOS:STRAIGHT violations)

! TOTAL (instances) || A || BC || BW || C || D || H || HB || HL || L || ME || N || P || T/ || T1 || TE || TF || TS || U || Z

2023-01-01 (c2370a5)161163 + 29891118787010615839815342648233014984601110197517920619055222941525611519881497810142813556
2023-01-20 (36ce94e)161298 + 29949118283310598838135344118235015254965116195817857818896219638722605519844196321140213602
2023-02-01 (90a97fc)161048 + 29944118048510602838425341218245015005011111193617816318626218338247605019704796542139213625
2023-02-20 (f606b45)161111 + 30009118017610609836645347828249015095224108193017770918614207137810599719647897105138313683
2023-03-01 (75cbca7)161224 + 30095117937810613835705347928206015105286100191817756818605207637445597019636097010138213707
2023-03-20 (56a3811)161344 + 3016911770451056683245535523821401509520299191117695518615209236281581119630996321136113780
2023-04-01 (no run)
2023-04-20 (57a4619)161810 + 30162117815610577830765362158241015415473105190417585320435204936561574019652896979137013896
2023-05-01 (77de75d)162001 + 3015011718711041882887536140817001535463398189017306620285205036282578119508296960136113485
2023-05-20 (73bb66d)162329 + 3013811718171037982480536386816101470491388189017190520370206436364581719513297814136713550
2023-05-20 (d0a8560)163084 + 29893117026610186819555298118192014734902891879173759204212064380445842194194100920136613547
2023-06-01 (040dd4d)163371 + 29818116915010189814515296528200014745163901895172815203112052379975827193963101375136513610
2023-06-20 (50a82ce)163664 + 29771116973210189810865298928232015195624861879171891205012059383425785194184101817136413732
2023-07-01 (8533535)163877 + 29747116942010201809785296648242015645806831873171484204232061384465814193933102073137313780
2023-07-20 (9812c05)164115 + 29742117048210174804565298758255015535943801872171720203632057389565806194057102367136113911
2023-08-01 (7468187)164308 + 29748117092810136802305297398249015496036791873171743203752061391825811194411102497135113939
2023-08-20 (7170d29)164473 + 29635117193210148801375298048263015566132801874171627204882062392805856194769102930134414014


! Dump (moss version)

! Parse failures (articles + articles with MOS:STRAIGHT violations)

! TOTAL (instances) || A || BC || BW || C || D || H || HB || HL || L || ME || N || P || T+gcld3_broken || T/ || T1 || TS || U || Z
2023-09-01 (8c03bd1)*164600 + 2959311731191013580154530301824501567569287187517182320619200991205739595103147133714043
2023-09-20 (8c03bd1)*164777 + 2961111730981018380123530578824001583477585187017171120648201138206439874103376133914087
2023-10-01 (d531b95)*164779 + 2958611731931016480017530906823801577471987186017130020619201083204739886103784132814127
2023-10-20 (9c53721)*164889 + 2966711735481017879977531174824313815844762871860171070204811201277204239910103702132314162
2023-11-01 (9c53721)*165069 + 2966811747101016479988531412825213815774738901844171440203311201449205940250103724133814203
2023-11-20 (1edb851)*165362 + 2974811770781019679995531684826213815974859931856171957203410202060205440847103797132314316
2023-12-01 (1edb851)*165429 + 2978811790431020879941531789829413816104950931867172253202812202513205641284104336131014361
2023-12-20 (1edb851)*165685 + 29862118018110205797625316328362138160348951031868172415202212203189204241499104750130114383

* Due to software issues, language detection wasn't working for these runs.

Likely new words by frequency (non-English)

From 2019-02-01 dump:

From 2019-02-01 dump, but clearly not foreign words (need to figure out what to do with them):

  • 81 - wikt:₤100 - 19th-century London, Agio, Arthur Machen, Auckland Baptist Tabernacle, Australian native police ... [https://en.wikipedia.org/w/index.php?search=%22₤100%22&ns0=1 find all]
  • I should probably put in a code change to exclude money patterns like this. -- Beland (talk) 01:25, 30 August 2019 (UTC)
  • Ah, the problem is that ₤ (two horizontal bars) is not allowed by MOS:CURRENCY. It needs to be changed to £ (one horizontal bar) if it represents the British pound, and it's unclear to me what to do for Italian lira. [https://en.wikipedia.org/w/index.php?sort=relevance&search=insource%3A%2F%E2%82%A4%2F&title=Special%3ASearch&ns0=1 find all ₤] -- Beland (talk) 06:17, 31 August 2019 (UTC)
  • OK, clarified with MOS folks that the lira should use ₤, so I'll clean up all the GBP instances that use ₤ and then add ₤ to moss's list of allowed currency symbols. -- Beland (talk) 16:08, 4 September 2019 (UTC)
  • I have changed all those that mean Pounds. Some that mean Lira still remain. Graeme Bartlett (talk) 04:03, 14 September 2019 (UTC)
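The thread above mentions excluding money patterns and keeping a list of allowed currency symbols. A minimal sketch of that idea follows; the symbol list, regex, and function names are assumptions made for illustration, since the thread only confirms that ₤ was to be added to moss's list.

```python
import re

# Hypothetical allow-list; only the addition of ₤ is confirmed by the thread.
ALLOWED_CURRENCY_SYMBOLS = "$£€¥₤"

# A currency symbol followed by digits, optional thousands commas,
# and an optional decimal part, e.g. ₤100 or £1,000.50.
MONEY_RE = re.compile(
    "[" + re.escape(ALLOWED_CURRENCY_SYMBOLS) + r"]\d[\d,]*(\.\d+)?"
)


def is_money_token(token):
    """Return True if the whole token looks like a currency amount."""
    return MONEY_RE.fullmatch(token) is not None


def drop_money_tokens(tokens):
    """Filter currency amounts out of a list of candidate words."""
    return [t for t in tokens if not is_money_token(t)]


print(drop_money_tokens(["₤100", "Agio", "£1,000.50"]))
```

Using `fullmatch` means a token is only skipped when it consists entirely of a currency amount, so ordinary words that merely contain a digit are still checked.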

Case notes from 2019-06-01 dump

  • 1 - QueA RNA motif - wikt:preQ -- this appears as preQ1, which does have a Wiktionary entry (wikt:preQ1), so why is it included here?
  • Weird, I'll have to debug that. -- Beland (talk) 08:47, 16 June 2019 (UTC)
  • Oh, of course, because sup and sub tags cause text on either side to be in different tokens. I'll try changing that and see if it is an overall improvement. That should also fix things like chemical formulas, so I think it will be good. -- Beland (talk) 02:10, 9 May 2020 (UTC)
  • This is confirmed fixed. -- Beland (talk) 18:42, 27 May 2020 (UTC)
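The sup/sub problem described above can be sketched as follows. This assumes a simple regex tokenizer, which is a stand-in and not moss's actual parser: deleting the tags outright keeps text such as preQ<sub>1</sub> in one token, whereas treating them as separators splits it.

```python
import re

# Matches opening and closing <sub>/<sup> tags, with optional attributes.
SUB_SUP_TAG_RE = re.compile(r"</?\s*su[bp]\b[^>]*>", re.IGNORECASE)


def tokenize(text, join_sub_sup=True):
    """Split text into word tokens (hypothetical tokenizer for illustration).

    With join_sub_sup=True the tags are deleted, so text on either side
    stays in the same token; with False they act as token separators.
    """
    replacement = "" if join_sub_sup else " "
    text = SUB_SUP_TAG_RE.sub(replacement, text)
    return re.findall(r"\w+", text)


print(tokenize("preQ<sub>1</sub> riboswitch", join_sub_sup=False))
print(tokenize("preQ<sub>1</sub> riboswitch"))
print(tokenize("H<sub>2</sub>O"))
```

The same joining behaviour covers chemical formulas, as the comment above anticipates.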