Automated Similarity Judgment Program

{{Short description|Computational comparative linguistics program}}

{{use mdy dates|date=September 2021}}

{{Use American English|date = February 2019}}

{{Infobox bibliographic database

| title = Automated Similarity Judgment Program

| image =

| caption =

| producer = Max Planck Institute for the Science of Human History

| country = Germany

| history =

| languages = English

| providers =

| cost = Free

| disciplines = Quantitative comparative linguistics

| depth =

| formats =

| temporal =

| geospatial =

| number =

| updates =

| p_title =

| p_dates =

| ISSN =

| web = {{URL|1=http://asjp.clld.org}}

| titles =

}}

The Automated Similarity Judgment Program (ASJP) is a collaborative project applying computational approaches to comparative linguistics using a database of word lists. The database is open access and consists of 40-item basic-vocabulary lists for well over half of the world's languages.{{Cite web |title=The ASJP Database - |url=https://asjp.clld.org/ |access-date=2024-02-15 |website=asjp.clld.org}} It is continuously being expanded. In addition to isolates and languages of demonstrated genealogical groups, the database includes pidgins, creoles, mixed languages, and constructed languages. Words of the database are transcribed into a simplified standard orthography (ASJPcode).{{Cite web |last1=Brown |first1=Cecil H |last2=Holman |first2=Eric W. |last3=Wichmann |first3=Søren |last4=Velupillai |first4=Viveka |date=2008 |title=Automated classification of the world's languages: A description of the method and preliminary results |url=https://www.researchgate.net/publication/40853551 |website=STUF – Language Typology and Universals}} The database has been used to estimate dates at which language families have diverged into daughter languages by a method related to but still different from glottochronology,{{Cite web |date=2011 |title=Automated dating of the world's language families based on lexical similarity |url=http://pubman.mpdl.mpg.de/pubman/item/escidoc:2395214/component/escidoc:2432001/shh768.pdf |website=pubman.mpdl.mpg.de}} to determine the homeland (Urheimat) of a proto-language,{{Cite web |date=2010 |title=Homelands of the world's language families: A quantitative approach |url=https://www.researchgate.net/publication/300467626 |website=www.researchgate.net}} to investigate sound symbolism,{{Cite journal |last1=Wichmann |first1=Søren |last2=Holman |first2=Eric W. |last3=Brown |first3=Cecil H. 
|date=April 2010 |title=Sound Symbolism in Basic Vocabulary |journal=Entropy |language=en |volume=12 |issue=4 |pages=844–858 |doi=10.3390/e12040844 |doi-access=free |issn=1099-4300}} to evaluate different phylogenetic methods,{{Cite journal |last1=Pompei |first1=Simone |last2=Loreto |first2=Vittorio |last3=Tria |first3=Francesca |date=2011-06-03 |title=On the Accuracy of Language Trees |journal=PLOS ONE |language=en |volume=6 |issue=6 |pages=e20109 |doi=10.1371/journal.pone.0020109 |doi-access=free |issn=1932-6203 |pmc=3108590 |pmid=21674034|arxiv=1103.4012 |bibcode=2011PLoSO...620109P }} and several other purposes.

ASJP is not widely accepted among historical linguists as an adequate method to establish or evaluate relationships between language families. Cf. comments by Adelaar, Blust and Campbell in Holman, Eric W., et al. (2011) "Automated Dating of the World’s Language Families Based on Lexical Similarity." Current Anthropology, vol. 52, no. 6, pp. 841–875.

It is part of the Cross-Linguistic Linked Data project hosted by the Max Planck Institute for the Science of Human History.{{cite web| url=http://clld.org | title=Cross-Linguistic Linked Data |access-date=2020-02-22}}

History

= Original goals =

ASJP was originally developed as a means of objectively evaluating the similarity of words with the same meaning from different languages, with the ultimate goal of classifying languages computationally on the basis of the lexical similarities observed. In the first ASJP paper, two semantically identical words from the compared languages were judged similar if they showed at least two identical sound segments. Similarity between the two languages was calculated as the percentage of compared words that were judged similar. This method was applied to 100-item word lists for 250 languages from language families including Austroasiatic, Indo-European, Mayan, and Muskogean.
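The original similarity judgment can be illustrated with a minimal Python sketch. It assumes that "at least two identical sound segments" can be approximated by counting the symbols two transcriptions have in common, regardless of position; the first ASJP paper may have applied further conditions that are not described here. The function names and the word lists are hypothetical and are not taken from the ASJP software or database.

```python
from collections import Counter


def shared_segments(word_a: str, word_b: str) -> int:
    """Count the symbols two transcriptions share (multiset overlap).

    Assumption: one character stands for one sound segment.
    """
    return sum((Counter(word_a) & Counter(word_b)).values())


def judged_similar(word_a: str, word_b: str, threshold: int = 2) -> bool:
    """Judge two same-meaning words similar if they share >= threshold segments."""
    return shared_segments(word_a, word_b) >= threshold


def percent_similar(list_a: dict[str, str], list_b: dict[str, str]) -> float:
    """Percentage of compared meanings whose word pair is judged similar."""
    meanings = [m for m in list_a if m in list_b]
    if not meanings:
        raise ValueError("the two word lists share no meanings")
    hits = sum(judged_similar(list_a[m], list_b[m]) for m in meanings)
    return 100.0 * hits / len(meanings)


# Made-up transcriptions, for illustration only.
lang_a = {"water": "wat3r", "fire": "fayr", "stone": "ston"}
lang_b = {"water": "vas3r", "fire": "foya", "stone": "piedra"}
print(percent_similar(lang_a, lang_b))
```

In this toy data, two of the three word pairs share at least two symbols, so the similarity is two thirds, expressed as a percentage.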

= ASJP Consortium =

The ASJP Consortium, founded around 2008,{{when|date=June 2018}} came to involve around 25 professional linguists and other interested parties who work as volunteer transcribers or support the project in other ways. The main driving force behind the founding of the consortium was Cecil H. Brown. Søren Wichmann is the project's day-to-day curator. A third central member of the consortium is Eric W. Holman, who has created most of the software used in the project.

= Shorter word lists =

While the word lists used were originally based on the 100-item Swadesh list, statistical testing showed that a subset of 40 of the 100 items produced classificatory results at least as good as, if not slightly better than, those of the whole list.{{Cite web |last1=Holman |first1=Eric W. |last2=Wichmann |first2=Søren |last3=Brown |first3=Cecil H. |last4=Velupillai |first4=Viveka |last5=Müller |first5=André |last6=Bakker |first6=Dik |date=2008 |title=Explorations in automated language classification |url=https://www.researchgate.net/publication/40853552 |website=Folia Linguistica}} Word lists gathered subsequently therefore contain only 40 items (or fewer, when attestations for some are lacking).

= Levenshtein distance =

In papers published since 2008, ASJP has employed a similarity judgment program based on Levenshtein distance (LD). This approach was found to produce better classificatory results, as measured against expert opinion, than the method used initially. LD is defined as the minimum number of successive changes necessary to convert one word into another, where each change is the insertion, deletion, or substitution of a symbol. Within the Levenshtein approach, differences in word length can be corrected for by dividing LD by the number of symbols of the longer of the two compared words. This produces normalized LD (LDN). An LDN divided (LDND) between the two languages is calculated by dividing the average LDN for all the word pairs involving the same meaning by the average LDN for all the word pairs involving different meanings. This second normalization is intended to correct for chance similarity. Wichmann, Søren, Eric W. Holman, Dik Bakker, and Cecil H. Brown. 2010. Evaluating linguistic distance measures. Physica A 389: 3632–3639 ({{doi|10.1016/j.physa.2010.05.011}}).
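The three measures defined above can be written out as a short Python sketch. It is an illustration of the definitions, not the ASJP software itself: the function names are hypothetical, each character is assumed to be one symbol, and word lists are assumed to be dictionaries mapping a meaning to a single transcription.

```python
from itertools import product


def levenshtein(a: str, b: str) -> int:
    """LD: minimum number of insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def ldn(a: str, b: str) -> float:
    """LDN: LD divided by the length of the longer word."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def ldnd(list_a: dict[str, str], list_b: dict[str, str]) -> float:
    """LDND: mean LDN of same-meaning pairs over mean LDN of different-meaning pairs."""
    shared = [m for m in list_a if m in list_b]
    same = [ldn(list_a[m], list_b[m]) for m in shared]
    diff = [ldn(list_a[m], list_b[n])
            for m, n in product(shared, shared) if m != n]
    if not same or not diff:
        raise ValueError("need at least two shared meanings")
    mean_diff = sum(diff) / len(diff)
    if mean_diff == 0:
        raise ValueError("different-meaning pairs are all identical")
    return (sum(same) / len(same)) / mean_diff


print(levenshtein("kitten", "sitting"))  # 3
print(ldn("kat", "kwat"))                # 0.25
```

Because the denominator is the average distance between words with unrelated meanings, an LDND close to 1 indicates that same-meaning words are no more alike than would be expected by chance.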

Word list

The ASJP uses the following 40-word list.{{Cite web |title=Guidelines |url=http://asjp.clld.org/static/Guidelines.pdf |website=asjp.clld.org}} It is similar to the Swadesh–Yakhontov list, but has some differences.

{{div col|colwidth=22em}}

;Body parts

  • eye
  • ear
  • nose
  • tongue
  • tooth
  • hand
  • knee
  • blood
  • bone
  • breast (woman’s)
  • liver
  • skin

;Animals and plants

  • louse
  • dog
  • fish (noun)
  • horn (animal part)
  • tree
  • leaf

;People

  • person
  • name (noun)

;Nature

  • sun
  • star
  • water
  • fire
  • stone
  • path
  • mountain
  • night (dark time)

;Verbs and adjectives

  • drink (verb)
  • die
  • see
  • hear
  • come
  • new
  • full

;Numerals and pronouns

  • one
  • two
  • I
  • you
  • we

{{div col end}}

ASJPcode

The ASJP version from 2016{{Citation needed|date=November 2022|reason=The list cites a 2008 reference}} uses the following symbols to encode phonemes: p b f v m w 8 t d s z c n r l S Z C j T 5 y k g x N q X h 7 L 4 G ! i e E 3 a u o

They represent 7 vowels and 34 consonants, all found on the standard QWERTY keyboard.

{| class="wikitable sortable"
|+ Sounds represented by ASJPcode
! ASJPcode !! Description !! IPA
|-
| i || high front vowel, rounded and unrounded || {{IPA|i, ɪ, y, ʏ}}
|-
| e || mid front vowel, rounded and unrounded || {{IPA|e, ø}}
|-
| E || low front vowel, rounded and unrounded || {{IPA|a, æ, ɛ, ɶ, œ, e}}
|-
| 3 || high and mid central vowel, rounded and unrounded || {{IPA|ɨ, ɘ, ə, ɜ, ʉ, ɵ, ɞ}}
|-
| a || low central vowel, unrounded || {{IPA|ɐ}}, ä
|-
| u || high back vowel, rounded and unrounded || {{IPA|ɯ, u, ʊ}}
|-
| o || mid and low back vowel, rounded and unrounded || {{IPA|ɤ, ʌ, ɑ, o, ɔ, ɒ}}
|-
| p || voiceless bilabial stop and fricative || {{IPA|p, ɸ}}
|-
| b || voiced bilabial stop and fricative || {{IPA|b, β}}
|-
| m || bilabial nasal || {{IPA|m}}
|-
| f || voiceless labiodental fricative || {{IPA|f}}
|-
| v || voiced labiodental fricative || {{IPA|v}}
|-
| 8 || voiceless and voiced dental fricative || {{IPA|θ, ð}}
|-
| 4 || dental nasal || {{IPA|n̪}}
|-
| t || voiceless alveolar stop || {{IPA|t}}
|-
| d || voiced alveolar stop || {{IPA|d}}
|-
| s || voiceless alveolar fricative || {{IPA|s}}
|-
| z || voiced alveolar fricative || {{IPA|z}}
|-
| c || voiceless and voiced alveolar affricate || {{IPA|t͡s, d͡z}}
|-
| n || voiceless and voiced alveolar nasal || {{IPA|n}}
|-
| S || voiceless postalveolar fricative || {{IPA|ʃ}}
|-
| Z || voiced postalveolar fricative || {{IPA|ʒ}}
|-
| C || voiceless palato-alveolar affricate || {{IPA|t͡ʃ}}
|-
| j || voiced palato-alveolar affricate || {{IPA|d͡ʒ}}
|-
| T || voiceless and voiced palatal stop || {{IPA|c, ɟ}}
|-
| 5 || palatal nasal || {{IPA|ɲ}}
|-
| k || voiceless velar stop || {{IPA|k}}
|-
| g || voiced velar stop || {{IPA|ɡ}}
|-
| x || voiceless and voiced velar fricative || {{IPA|x, ɣ}}
|-
| N || velar nasal || {{IPA|ŋ}}
|-
| q || voiceless uvular stop || {{IPA|q}}
|-
| G || voiced uvular stop || {{IPA|ɢ}}
|-
| X || voiceless and voiced uvular fricative, voiceless and voiced pharyngeal fricative || {{IPA|χ, ʁ, ħ, ʕ}}
|-
| 7 || voiceless glottal stop || {{IPA|ʔ}}
|-
| h || voiceless and voiced glottal fricative || {{IPA|h, ɦ}}
|-
| l || voiced alveolar lateral approximant || {{IPA|l}}
|-
| L || all other laterals || {{IPA|ʟ, ɭ, ʎ}}
|-
| w || voiced bilabial-velar approximant || {{IPA|w}}
|-
| y || palatal approximant || {{IPA|j}}
|-
| r || voiced apico-alveolar trill and all varieties of “r-sounds” || {{IPA|r, ʀ,}} etc.
|-
| ! || all varieties of “click-sounds” || {{IPA|ǃ, ǀ, ǁ, ǂ}}
|}
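The correspondences listed for ASJPcode can be applied mechanically. The following Python sketch transliterates an IPA string using only a subset of those correspondences. The dictionary name, the function name, and the greedy longest-match strategy are illustrative assumptions, not the procedure followed by ASJP transcribers, who also make judgments that a lookup table cannot capture.

```python
# Partial IPA -> ASJPcode lookup covering only some of the listed symbols.
IPA_TO_ASJP = {
    "i": "i", "ɪ": "i", "y": "i", "ʏ": "i",
    "e": "e", "ø": "e",
    "ɛ": "E", "æ": "E",
    "ə": "3", "ɨ": "3",
    "ɐ": "a",
    "u": "u", "ʊ": "u", "ɯ": "u",
    "o": "o", "ɔ": "o", "ɑ": "o",
    "p": "p", "ɸ": "p", "b": "b", "β": "b", "m": "m",
    "t": "t", "d": "d", "s": "s", "z": "z", "n": "n",
    "θ": "8", "ð": "8",
    "ʃ": "S", "ʒ": "Z",
    "t\u0361\u0283": "C",  # t + tie bar + ʃ
    "d\u0361\u0292": "j",  # d + tie bar + ʒ
    "ɲ": "5", "ŋ": "N",
    "k": "k", "x": "x", "ɣ": "x",
    "ʔ": "7", "h": "h", "ɦ": "h",
    "l": "l", "w": "w", "j": "y", "r": "r",
}


def to_asjp(ipa: str) -> str:
    """Transliterate IPA to ASJPcode, trying the longest IPA keys first."""
    keys = sorted(IPA_TO_ASJP, key=len, reverse=True)
    out = []
    i = 0
    while i < len(ipa):
        for key in keys:
            if ipa.startswith(key, i):
                out.append(IPA_TO_ASJP[key])
                i += len(key)
                break
        else:
            # No key matched at position i: the symbol is outside the subset.
            raise ValueError(f"no ASJPcode mapping for {ipa[i]!r}")
    return "".join(out)


print(to_asjp("ʃiŋ"))  # SiN
```

Matching the longest keys first lets a tie-barred affricate map to a single ASJPcode symbol instead of being split into a stop and a fricative.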

A {{code|~}} mark follows two consonants so that they are considered to be in the same position.

Thus, {{IPA|kʷat}} becomes {{code|kw~at}}.

Syllables like {{code|kat}}, {{code|wat}}, {{code|kaw}} and {{code|kwi}} are considered lexically similar to {{code|kw~at}}.

Similarly, a {{code|$}} mark follows three consonants so that they are considered to be in the same position.

{{code|ndy$im}} is considered similar to {{code|nim}}, {{code|dam}} and {{code|yim}}.

{{code|"}} marks the preceding consonant as glottalized.
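How these modifier marks group symbols into positions can be sketched in Python. The function name is hypothetical, and the sketch only implements the three marks described above: `~` merges the two preceding symbols into one position, `$` merges the three preceding symbols, and `"` attaches to the preceding symbol. It makes no claim about how the ASJP software handles them internally.

```python
def positions(word: str) -> list[str]:
    """Split an ASJPcode word into positions, honoring the ~, $ and " marks."""
    slots: list[str] = []
    for ch in word:
        if ch in "~$":
            count = 2 if ch == "~" else 3
            if len(slots) < count:
                raise ValueError(f"{ch!r} needs {count} preceding symbols")
            merged = "".join(slots[-count:])
            del slots[-count:]
            slots.append(merged)   # the merged symbols share one position
        elif ch == '"':
            if not slots:
                raise ValueError("glottalization mark without a preceding symbol")
            slots[-1] += ch        # mark the preceding symbol as glottalized
        else:
            slots.append(ch)
    return slots


print(positions("kw~at"))   # ['kw', 'a', 't']
print(positions("ndy$im"))  # ['ndy', 'i', 'm']
```

Under this reading, `kw~at` occupies three positions, the same number as `kat` or `wat`, which is consistent with those forms being treated as comparable.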

References

{{Reflist}}

Sources

  • Søren Wichmann, Jeff Good (eds). 2014. [https://books.google.com/books?id=NJ6XCgAAQBAJ&dq=ASJP+code&pg=PA203 Quantifying Language Dynamics: On the Cutting Edge of Areal and Phylogenetic Linguistics], p. 203. Leiden: Brill.
  • Brown, Cecil H., et al. 2008. [https://www.researchgate.net/profile/Soren_Wichmann/publication/40853551_Automated_Classification_of_the_World%27s_Languages_A_Description_of_the_Method_and_Preliminary_Results/links/546373360cf2837efdb30a6e/Automated-Classification-of-the-Worlds-Languages-A-Description-of-the-Method-and-Preliminary-Results.pdf Automated Classification of the World's Languages: A Description of the Method and Preliminary Results]. Language Typology and Universals 61(4). November 2008. {{doi|10.1524/stuf.2008.0026}}
  • Wichmann, Søren, Eric W. Holman, and Cecil H. Brown (eds.). 2018. [http://asjp.clld.org/ The ASJP Database] (version 18).