Ancient text corpora
{{short description|All known writing up to 300 CE}}
{{Use shortened footnotes|date=June 2023}}
[[File:Library of Ashurbanipal.jpg|thumb|The Library of Ashurbanipal, one of the largest components of the Akkadian corpus]]
Ancient text corpora are the entire collection of texts from the period of ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature, history, linguistics, and other fields, and are a fundamental component of the world's cultural heritage.
Chinese, Latin, and Greek are examples of ancient languages with significant text corpora, although most of these corpora are known via transmission (frequently via medieval manuscript copies) rather than in their original form. These texts – both transmitted and original – provide valuable insights into the history and culture of different regions of the world, and have been studied for centuries by scholars and researchers. Other ancient texts – particularly stone inscriptions and papyrus scrolls – have been published following archaeological research, notably the cuneiform corpus of {{circa}}10 million words and the {{circa}}5 million words in ancient Egyptian.
Through advances in technology and digitization, ancient text corpora are more accessible than ever before. Tools such as the Perseus Digital Library and the Digital Corpus of Sanskrit{{Cite web |title=Digital Corpus of Sanskrit (DCS) - Online Sanskrit dictionary and annotated corpus |url=http://www.sanskrit-linguistics.org/dcs/ |access-date=2023-06-03 |website=www.sanskrit-linguistics.org |archive-date=2023-06-03 |archive-url=https://web.archive.org/web/20230603171957/http://www.sanskrit-linguistics.org/dcs/ |url-status=live }} have made it easier for researchers to access and analyze these texts.
Quantifying the corpora
Two types of ancient texts are known to modern scholars – those that have survived only in later manuscripts, but whose great age is undisputed (this applies to the bulk of the Chinese, Brahmi, Greek, Latin, Hebrew and Avestan tradition), and those known from original inscriptions, papyri and other manuscripts.{{sfn|Peust|2000|pp=252-253}}
Counting the words in each corpus presents significant methodological challenges. In principle, every occurrence of a word in a text is counted separately, but where a literary text survives in parallel transmissions, only a single transmission is taken into account. Just as the Book of the Dead and the Coffin Texts are included only once in the figure given for Egyptian, Greek and Latin literary works should be counted according to one manuscript only. If, on the other hand, tomb inscriptions, royal inscriptions or economic documents in certain ancient languages show a more or less identical form, this is not treated as purely parallel transmission, so each text is counted. Attached prepositions are counted as separate words; the exception is the definite article in Hebrew, Aramaic and Greek, which is not counted separately, since it has no equivalent in most languages and its frequency would significantly affect the comparability of the numbers.{{sfn|Peust|2000|pp=252-253}}
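The counting conventions described above can be illustrated with a minimal Python sketch. It is not Peust's procedure or data: the witness records, the <code>=</code> marker for attached elements and the set of article forms are hypothetical conventions assumed purely for illustration.

```python
# Hypothetical toy data: each witness is (work_id, genre, text).
# Manuscripts of the same literary work share a work_id.
WITNESSES = [
    ("book_of_the_dead", "literary", "spell one for going forth by day"),
    ("book_of_the_dead", "literary", "spell one for going forth by day"),
    ("royal_inscription_a", "documentary", "the king built this temple"),
    ("royal_inscription_b", "documentary", "the king built this temple"),
]


def count_tokens(text, articles=frozenset()):
    """Count words in one text.

    Assumption: attached elements are joined to their host with "=",
    e.g. "b=ha=bayit". Each segment counts as a word unless it is one
    of the given definite-article forms.
    """
    count = 0
    for token in text.split():
        segments = token.split("=")
        count += sum(1 for seg in segments if seg not in articles)
    return count


def count_corpus_words(witnesses, articles=frozenset()):
    """Sum word counts, keeping only one witness per literary work."""
    seen_literary = set()
    total = 0
    for work_id, genre, text in witnesses:
        if genre == "literary":
            if work_id in seen_literary:
                continue  # parallel transmission: count the work once
            seen_literary.add(work_id)
        # Near-identical documentary texts are all counted.
        total += count_tokens(text, articles)
    return total


if __name__ == "__main__":
    print(count_corpus_words(WITNESSES))
    print(count_tokens("b=ha=bayit gadol", articles={"ha"}))
```

In this sketch the second copy of the literary work is skipped, while both near-identical royal inscriptions contribute to the total. The attached preposition in <code>b=ha=bayit</code> is counted, but the article segment is not.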
=Languages with known size estimates=
=South Asian=
- Sanskrit (Vedic Sanskrit and Classical Sanskrit)
- Indus script (3,800 items, c. 20,000 characters){{cite journal |last1=Rao |first1=Rajesh P. N. |last2=Yadav |first2=Nisha |last3=Vahia |first3=Mayank N. |last4=Joglekar |first4=Hrishikesh |last5=Adhikari |first5=R. |last6=Mahadevan |first6=Iravatham |title=A Markov model of the Indus script |journal=Proceedings of the National Academy of Sciences |date=18 August 2009 |volume=106 |issue=33 |pages=13685–13690 |doi=10.1073/pnas.0906237106 |pmid=19666571 |pmc=2721819 |bibcode=2009PNAS..10613685R |language=en |issn=0027-8424 |doi-access=free }}
- Brahmi script
- Old Tamil
- Early Indian epigraphy and Indian epic poetry
- Kharosthi[https://stefanbaums.com/publications/baums_glass_2013.pdf Die Lexikographie der Gandharī-Sprache] {{Webarchive|url=https://web.archive.org/web/20230408130720/https://stefanbaums.com/publications/baums_glass_2013.pdf |date=2023-04-08 }}, Akademie Aktuell Jahrgang 2013 - Ausgabe Nr. 44, 44-47 (translated from German): "Since the appearance of Bailey's article, the material basis for Gāndhārī lexicography has grown many times over through extensive new finds of manuscripts, as well as inscriptions, administrative documents and coins: the Catalog of Gāndhārī Texts compiled by us (http://gandhari.org/ {{Webarchive|url=https://web.archive.org/web/20131205201723/http://gandhari.org/ |date=2013-12-05 }} catalog) currently lists 77 extensive scrolls, 330 manuscript fragments, 834 inscriptions, 792 Niya documents and 335 different coin legends, with an estimated total text volume of 120,000 word occurrences."
- Pali literature{{cite book | last=Kingsbury | first=P. | title=The Chronology of the Pali Canon: The Case of the Aorists | publisher=University of Pennsylvania | year=2002 | isbn=978-0-493-92911-8 | url=https://books.google.com/books?id=ZHmmnQEACAAJ | access-date=2023-05-03 | quote= The early Buddhist canon written in Pali comprises some 4 million words of text written across several centuries in early India. As such, it is of interest not only to scholars of Buddhism but also linguists and historians for the insight it gives into the social, linguistic, and religious culture of the time.}}
- List of historic Indian texts
=Mesoamerican=
=East Asian=
- Old Chinese
- Chinese classics
- The pre-Qin corpus: a collection of ancient Chinese texts written before the Qin dynasty (221 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
- The pre-Han corpus: a collection of ancient Chinese texts written before the Han dynasty (202 BCE). The corpus includes texts from Confucianism, Taoism, Legalism, and other schools of thought.
- See the Chinese Text Project
- Chinese bronze inscriptions, Oracle bone script, Seal script, Clerical script
=Middle Iranian languages=
- Prior to 300 AD, the Middle Iranian languages are attested mainly in Sassanid stone inscriptions in two closely related idioms, Middle Persian (written in Pahlavi script) and Parthian (written in Inscriptional Parthian script).{{cite book | last=Gignoux | first=P. | title=Corpus Inscriptionum Iranicarum: Glossaire des inscriptions pehlevies et parthes | publisher=School of Oriental and African Studies | year=1972 | url=https://books.google.com/books?id=ICdUmwEACAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184742/https://books.google.com/books?id=ICdUmwEACAAJ | url-status=live }} The Middle Persian corpus (mostly 3rd, but also 4th/5th centuries) comprises about 5000 words and the Parthian corpus (3rd century) about 3000 words. To what extent some of the Manichaean Middle Persian literary texts may date back to the 3rd century is difficult to estimate; Mani is said to have personally written the Shabuhragan (MacKenzie, D. N., and Mani. “Mani’s ‘Šābuhragān.’” Bulletin of the School of Oriental and African Studies, University of London 42, no. 3 (1979): 500–534. http://www.jstor.org/stable/615572 {{Webarchive|url=https://web.archive.org/web/20230505014730/https://www.jstor.org/stable/615572 |date=2023-05-05 }}; and “Mani’s ‘Šābuhragān’--II.” Bulletin of the School of Oriental and African Studies, University of London 43, no. 2 (1980): 288–310. http://www.jstor.org/stable/616043 {{Webarchive|url=https://web.archive.org/web/20221008091936/https://www.jstor.org/stable/616043 |date=2022-10-08 }}; and {{cite book | last=Hutter | first=M. | title=Manis Kosmogonische Šābuhragān-Texte: Edition, Kommentar und literaturgeschichtliche Einordnung der manichäisch-mittelpersischen Handschriften M 98/99 I und M 7980-7984 | publisher=Otto Harrassowitz | series=Studies in Oriental religions | year=1992 | isbn=978-3-447-03227-8 | url=https://books.google.com/books?id=EQmnVWCPCrQC | language=de | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184741/https://books.google.com/books?id=EQmnVWCPCrQC | url-status=live }}), totaling about 5000 words. Combined, Middle Persian and Parthian come to over 10,000 words.{{sfn|Peust|2000|pp=257-258}}
=Proto-Sinaitic=
- The Proto-Sinaitic corpus comprises no more than about 400 letters (the number of words is unknown since the script has not been fully interpreted).{{cite book | last=Sass | first=Benjamin | title=The genesis of the alphabet and its development in the second millennium B.C. | publisher=In Kommission bei O. Harrassowitz | publication-place=Wiesbaden | date=1988 | isbn=3-447-02860-2 | oclc=21033775}} The approximately contemporaneous Proto-Canaanite inscriptions are probably of a similar extent (ibid.).{{sfn|Peust|2000|pp=257}}
=Anatolian=
- Luwian cuneiform,{{cite book | last=Starke | first=Frank | title=Die keilschrift-luwischen Texte in Umschrift | publisher=O. Harrassowitz | publication-place=Wiesbaden | date=1985 | isbn=3-447-02349-X | oclc=12170509 | language=de}} approx. 3000 words{{sfn|Peust|2000|pp=255}}
- the Palaic language{{cite book | last=Carruba | first=O. | title=Das Palaische | publisher=Harrassowitz | series=Studien zu den Bogazkoy-Texten; hrsg. von der Kommission fur den Alten Orient der Akademie der Wissenschaften und der Literatur, Heft 10 | year=1970 | isbn=978-3-447-01283-6 | url=https://books.google.com/books?id=FSc26gcqaZgC | language=de | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184753/https://books.google.com/books?id=FSc26gcqaZgC | url-status=live }} a few hundred words.{{sfn|Peust|2000|pp=255}}
- Hieroglyphic Luwian (the relevant corpus of hieroglyphic Luwian inscriptions was published by H. Cambel after Peust's article){{sfn|Peust|2000|pp=255}}
- the Lycian alphabet (the best attested Anatolian successor language written in alphabetic script){{cite book | title=Tituli Asiae Minoris: Tituli Lyciae lingua lycia conscripti | publisher=R.M. Rohrer | year=1901 | url=https://archive.org/details/gri_33125010455224/mode/1up }} and {{cite book | last=Neumann | first=G. | title=Neufunde lykischer Inschriften seit 1901 | publisher=Verlag der Österreichischen Akademie der Wissenschaften | series=Denkschriften (Österreichische Akademie der Wissenschaften. Philosophisch-Historische Klasse) | issue=v. 135-137 | year=1979 | isbn=978-3-7001-0283-0 | url=https://books.google.com/books?id=0Ip-ugEACAAJ | language=de | access-date=2023-05-01 | archive-date=2023-05-13 | archive-url=https://web.archive.org/web/20230513091841/https://books.google.com/books?id=0Ip-ugEACAAJ | url-status=live }} with about 5000 words{{sfn|Peust|2000|pp=255}}
- The Lydian alphabet{{cite book|title = Lydisches Wörterbuch. Mit grammatischer Skizze und Inschriftensammlung |author= Roberto Gusmani |publisher= Ergänzungsband 1-3, Heidelberg | year= 1980–1986 |language=German}} and {{cite book |last1=Gusmani |first1=Roberto |title=Lydisches Wörterbuch |date=1964 |publisher=C. Winter |oclc=582362214 |url=https://archive.org/details/gusmani-lydisches-worterbuch-1964 |language=de }} 109 inscriptions comprising about 1500 words{{sfn|Peust|2000|pp=255}}
- The Phrygian alphabet: the tomb inscriptions from the 2nd and 3rd centuries AD{{cite book | last=Haas | first=O. | title=Die phrygischen Sprachdenkmäler | publisher=Académie bulgare des sciences | series=Académie bulgare des sciences linguistiques balkanique | year=1966 | url=https://books.google.com/books?id=buydQAAACAAJ | language=de | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184815/https://books.google.com/books?id=buydQAAACAAJ | url-status=live }} (approx. 1000 words) and the so-called "Old Phrygian" inscriptions{{cite book | last1=Brixhe | first1=C. | last2=Lejeune | first2=M. | title=Corpus des inscriptions paléo-phrygiennes | publisher=Editions Recherche sur les civilisations | issue=v. 1-2 | year=1984 | isbn=978-2-86538-089-3 | url= https://archive.org/details/corpusdesinscrip0001brix | language=fr | access-date=2023-05-01}} (fewer than 300 words){{sfn|Peust|2000|pp=255}}
- The Carian alphabets{{cite book | last1=Lajara | first1=I.J.A. | last2=Neumann | first2=G. | title=Studia carica: investigaciones sobre la escritura y lengua carias | publisher=PPU | year=1993 | isbn=978-84-477-0236-7 | url=https://books.google.com/books?id=12uzPAAACAAJ | language=es | access-date=2023-05-01 | archive-date=2023-05-13 | archive-url=https://web.archive.org/web/20230513091843/https://books.google.com/books?id=12uzPAAACAAJ | url-status=live }} whose texts, mainly from Egypt, contain around 600 words.{{sfn|Peust|2000|pp=255}}
=Old Italic=
- the Umbrian language{{cite book | last=Vetter | first=E. | title=Handbuch der italischen Dialekte | publisher=C. Winter | series=1. Reihe: Lehr und Handbücher | issue=v. 2 | year=1953 | isbn=978-3-8253-5952-2 | url=https://books.google.com/books?id=iw8MMQAACAAJ | language=de | access-date=2023-05-01 | page= | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184745/https://books.google.com/books?id=iw8MMQAACAAJ | url-status=live }} attested essentially by the sacrificial instructions of the Iguvine Tables, with 5000 words{{sfn|Peust|2000|pp=258}}
- the Oscan language (ibid.) with 2000 words{{sfn|Peust|2000|pp=258}}
- the Messapic language (O. Haas, Messapische Studien, Heidelberg 1962, and C. Santoro, Nuovi studi messapici, 2 vols., Lecce 1982/3 and Supplement 1984), with probably a good 1000 words (the estimate is difficult because most texts in this poorly understood language do not use word separators){{sfn|Peust|2000|pp=258}}
- the Venetic language{{cite book | last=Lejeune | first=M. | title=Manuel de la langue vénète | publisher=Winter | series=Indogermanische Bibliothek / Lehr- und Handbücher | issue=v. 59 | year=1974 | isbn=978-3-533-02353-1 | url= https://archive.org/details/manueldelalangue0000leje/page/n4/mode/1up | language=fr }} a few hundred words{{sfn|Peust|2000|pp=258}}
- the Faliscan language{{cite book | last=Giacomelli | first=G. | title=La lingua falisca | publisher=L.S. Olschki | series=Biblioteca di "Studi etruschi" | year=1963 | url= https://archive.org/details/lalinguafalisca0000giac | language=it }} a few hundred words{{sfn|Peust|2000|pp=258}}
- Cisalpine Celtic inscriptions amount to approximately 2000 words, to which are added a number of glosses by classical authors{{cite book | last=Whatmough | first=Joshua | title=Dialects of Ancient Gaul. | publisher=HUP | publication-place=Cambridge | date=1969 | isbn=978-0-674-86413-9 | oclc=935283757 | url= https://archive.org/details/dialectsofancien0000what}}{{sfn|Peust|2000|pp=258}}
=Iberia=
- Iberian scripts, more rarely written in Greek or Latin script, approx. 2500 words{{sfn|Peust|2000|pp=258}}{{cite book | last=Untermann | first=J. | authorlink=Jürgen Untermann | title=Monumenta linguarum Hispanicarum | publisher=Ludwig Reichert Verlag | issue=6 volumes | year=1975 | url=https://books.google.com/books?id=uC3azgEACAAJ | language=de | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184739/https://books.google.com/books?id=uC3azgEACAAJ | url-status=live }}
- Celtiberian script: Celtic-language texts from Spain written in Iberian script, and also in Latin script (approx. 1000 words){{sfn|Peust|2000|pp=258}}
- Southwest Paleohispanic script, 78 inscriptions, a few hundred words{{sfn|Peust|2000|pp=258}}
- Lusitanian language, three monuments in Latin script, approx. 60 words{{sfn|Peust|2000|pp=258}}
=Germanic Northern Europe=
- Runic inscriptions dated before the 4th century number about 30 and contain no more than 50 words in total{{cite book | last1=Krause | first1=W. | last2=Jankuhn | first2=H. | title=Die Runeninschriften im älteren Futhark | publisher=Vandenhoeck u. Ruprecht | series=Abhandlungen der Akademie der Wissenschaften in Göttingen, Philologisch-Historische Klasse | issue=v. 1-2 | year=1966 | url=https://books.google.com/books?id=3O4dAQAAIAAJ | language=de | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184747/https://books.google.com/books?id=3O4dAQAAIAAJ | url-status=live }} and M. Stoklund, Neue Runenfunde in Illerup and Vimose, in Germania 64, 1986, 75ff{{sfn|Peust|2000|pp=259}}
=Africa=
- Geʽez script: comparatively few inscriptions with a total of around 1,000 words before 300 AD.{{cite book | last1=Bernand | first1=E. | last2=Drewes | first2=A.J. | last3=Schneider | first3=R. | title=Recueil des inscriptions de l'Ethiopie des périodes pré-axoumite et axoumite | publisher=Académie des inscriptions et belles-lettres | series=Publication of the De Goeje Fund | issue=2 volumes | year=1991 | isbn=978-3-447-11316-8 | url=https://books.google.com/books?id=WXT3MAAACAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184754/https://books.google.com/books?id=WXT3MAAACAAJ | url-status=live }} Following Christianization in the 4th century, more extensive texts are known.{{sfn|Peust|2000|pp=259}}
- Libyco-Berber alphabet: over 1,000 inscriptions from the Maghreb,{{cite book | last=Chabot | first=J.B. | title=Recueil des inscriptions libyques | publisher=Imprimerie nationale | series=Gouvernement général de l'Algérie | issue=v. 1-3 | year=1940 | url=https://archive.org/details/RILRecueilDesInscriptionsLibyques | language=fr }} and {{cite book | last=Galand | first=L. | title=Inscriptions antiques du Maroc | publisher=Editions du Centre national de la recherche scientifique | series=Etudes d'Antiquités africaines | issue=v. 1 | year=1966 | url=https://books.google.com/books?id=c9RWAAAAMAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184753/https://books.google.com/books?id=c9RWAAAAMAAJ | url-status=live }} which are dated to Roman times. Most texts do not use a word separator; Peust estimates that the total number of words could be around 5,000{{sfn|Peust|2000|pp=259}}
- Meroitic script (Ancient Nubian): about 900 texts are known, which Peust estimates may contain approximately 10,000 words, albeit with uncertainty from the fact that the word separator is not used consistently in the Meroitic script.{{cite book | last=Török | first=L. | title=The Kingdom of Kush: Handbook of the Napatan-Meroitic Civilization | publisher=Brill | series=Handbook of Oriental Studies / 1: Der Nahe und der Mittlere Osten | year=1997 | isbn=978-90-04-10448-8 | url=https://books.google.com/books?id=g0BtAAAAMAAJ | page=64}}{{sfn|Peust|2000|pp=259}}
=Aegean=
- The Cretan Linear A inscriptions, which have not yet been deciphered,{{cite book | last1=Godart | first1=L. | last2=Olivier | first2=J.P. | title=Recueil des inscriptions en linéaire A | publisher=P. Geuthner | issue=5 volumes | year=1976–85 | isbn=978-2-86958-470-9 | url=https://books.google.com/books?id=wEVXDQEACAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184744/https://books.google.com/books?id=wEVXDQEACAAJ | url-status=live }} survive in about 2500 texts containing a total of around 20,000 characters. The total number of words can hardly be determined; Peust tentatively puts it in the same order of magnitude as for Meroitic.{{sfn|Peust|2000|pp=259}}
- In addition to the Linear A texts, there are also inscriptions in Cretan hieroglyphs totaling a few hundred characters{{cite book | last1=Poursat | first1=J.C. | last2=Godart | first2=L. | last3=Olivier | first3=J.P. | title=Le Quartier Mu: Introduction générale. Ecriture hiéroglyphique crétoise / par Louis Godart et Jean-Pierre Olivier. 1 | publisher=P. Geuthner | series=Fouilles Exécutées à Mallia | year=1978 | url=https://books.google.com/books?id=3wrUXwAACAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184746/https://books.google.com/books?id=3wrUXwAACAAJ | url-status=live }} and texts written in the Greek alphabet, but not in Greek, with a few dozen words{{cite book | last=Duhoux | first=Y. | title=L'étéocrétois: les textes, la langue | publisher=J.C. Gieben | year=1982 | isbn=978-90-70265-05-2 | url=https://books.google.com/books?id=W_stAAAAMAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184815/https://books.google.com/books?id=W_stAAAAMAAJ | url-status=live }}{{sfn|Peust|2000|pp=259}}
- The Cypriot syllabary was used in the first millennium BC mostly to record Greek texts.{{cite book | last=Masson | first=O. | title=Les inscriptions chypriotes syllabiques: recueil critique et commenté | publisher=E. de Boccard | series=École français d'Athènes: Études chypriotes | year=1961 | url=https://books.google.com/books?id=v6L2nQEACAAJ | language=fr | access-date=2023-05-01 | archive-date=2023-05-01 | archive-url=https://web.archive.org/web/20230501184739/https://books.google.com/books?id=v6L2nQEACAAJ | url-status=live }} The texts relevant here, presumably those not in Greek, comprise around 100 to 200 words.{{sfn|Peust|2000|pp=259}}
=Micro corpora=
There are a significant number of ancient micro-corpus languages. Estimating the total number of attested ancient languages may be as difficult as estimating their corpus size. For example, Greek and Latin sources transmit an enormous number of foreign-language glosses, the reliability of which is not always certain.{{sfn|Peust|2000|pp=259}}
Preservation and curation
{{see also| Archival science| Conservation and restoration of cultural property}}
Preserving and maintaining ancient text corpora presents several challenges, including physical conservation, translation, and digitization. Many ancient texts have been lost over time, and those that survive may be damaged or fragmentary. Translating ancient languages and scripts requires specialized expertise, and digitizing texts can be time-consuming and resource-intensive.
Corpus linguistics
{{main|Corpus linguistics}}
The field of corpus linguistics studies language as expressed in text corpora. This includes the analysis of word frequency, collocations, grammar, and semantics. Ancient text corpora provide a valuable resource for corpus linguistics research, enabling scholars to explore the evolution of language and culture over time.
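Two of the analyses mentioned above, word frequency and collocation, can be sketched in a few lines of Python. This is only a minimal illustration on an invented English sample sentence: real corpus work on ancient languages would normally start from lemmatized or otherwise annotated text, and treating adjacent word pairs as "collocations" is a simplifying assumption.

```python
from collections import Counter


def word_frequencies(tokens):
    """Return how often each word form occurs."""
    return Counter(tokens)


def bigram_collocations(tokens):
    """Return how often each pair of adjacent words occurs."""
    return Counter(zip(tokens, tokens[1:]))


# Invented sample text, used only to show the calls.
SAMPLE = "the king built the temple and the king built the wall".split()

if __name__ == "__main__":
    print(word_frequencies(SAMPLE).most_common(3))
    print(bigram_collocations(SAMPLE).most_common(3))
```

Counting raw adjacent pairs favours frequent function words; association measures such as pointwise mutual information are commonly used to correct for that, but they are left out of this sketch.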
See also
References
{{reflist}}
Bibliography
- {{cite book|first=Carsten|last=Peust|url=http://archiv.ub.uni-heidelberg.de/propylaeumdok/1893/1/Peust_Ueber_aegyptische_Lexikographie_2000.pdf|chapter=Über ägyptische Lexikographie. 1: Zum Ptolemaic Lexikon von Penelope Wilson; 2: Versuch eines quantitativen Vergleichs der Textkorpora antiker Sprachen|title=Lingua Aegyptia 7|date=2000|pages=245–260}}
- {{cite book|first=Michael P.|last=Streck|chapter=Großes Fach Altorientalistik. Der Umfang des keilschriftlichen Textkorpus|title=Mitteilungen der Deutschen Orientgesellschaft 142|date=2010|pages=35–58|url=https://www.orient-gesellschaft.de/repositorium/MDOG/MDOG_142.pdf}}
{{Writing systems}}
{{list of writing systems}}