List of text corpora

{{Short description|Overview of data sets of languages}}

Text corpora (singular: text corpus) are large, structured sets of texts that have been systematically collected. They are used by AI developers to train large language models, and by corpus linguists and researchers in other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency.{{cite book |author1=Leech, Geoffrey |author1-link=Geoffrey Leech |editor1-last=Wichmann |editor1-first=A. |display-editors=etal |title=Teaching and Language Corpora |date=2007 |publisher=Longman |location=London |page=9 |chapter=Teaching and language corpora: a convergence}}

English language

  • American National Corpus
  • Bank of English
  • BookCorpus
  • British National Corpus
  • Bergen Corpus of London Teenage Language (COLT)
  • Brown Corpus, forming part of the "Brown Family" of corpora, together with LOB, Frown and F-LOB
  • Corpus of Contemporary American English (COCA), 425 million words, 1990–2011. Freely searchable online.
  • Corpus Resource Database (CoRD), more than 80 English language corpora.{{cite web|title=Corpus Resource Database (CoRD)|url=http://www.helsinki.fi/varieng/CoRD/corpora/|publisher=Department of English, University of Helsinki}}
  • [https://ruc.udc.es/dspace/handle/2183/21846 Coruña Corpus], a corpus of late Modern English scientific writing covering the period 1700–1900, developed by the [http://www.udc.es/grupos/muste Muste] research group at the University of A Coruña
  • [https://github.com/jpwahle/lrec22-d3-dataset/issues DBLP Discovery Dataset (D3)], a corpus of computer science publications with scholarly metadata.{{Cite journal |last1=Wahle |first1=Jan Philip |last2=Ruas |first2=Terry |last3=Mohammad |first3=Saif |last4=Gipp |first4=Bela |date=2022 |title=D3: A Massive Dataset of Scholarly Metadata for Analyzing the State of Computer Science Research |url=https://aclanthology.org/2022.lrec-1.283 |journal=Proceedings of the Thirteenth Language Resources and Evaluation Conference |location=Marseille, France |publisher=European Language Resources Association |pages=2642–2651|arxiv=2204.13384 }}
  • [https://gucorpling.org/gum/ GUM corpus], the open source Georgetown University Multilayer corpus, with many annotation layers
  • [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Google Books Ngram Corpus]. Professor Mark Davies at BYU created an online tool to search Google's English language corpus, drawn from Google Books, at http://googlebooks.byu.edu/x.asp. PhraseFinder{{cite web|title=PhraseFinder|url=http://phrasefinder.io/|archive-url=https://web.archive.org/web/20160917201220/http://phrasefinder.io/|url-status=usurped|archive-date=September 17, 2016}} is a search engine for the Google Books Ngram Corpus that supports wildcard queries and offers an API.
  • International Corpus of English
  • Oxford English Corpus
  • [https://github.com/dstl/re3d RE3D (Relationship and Entity Extraction Evaluation Dataset)]
  • [https://www.linguistics.ucsb.edu/research/santa-barbara-corpus Santa Barbara Corpus of Spoken American English]
  • Scottish Corpus of Texts & Speech
  • Strathy Corpus of Canadian English

European languages

  • [https://www.linguateca.pt/cetenfolha/index_info.html CETENFolha]
  • Basque: [https://www.ehu.eus/en/web/eins/goenkale-corpusa], [https://www.ehu.eus/en/web/eins/egungo-testuen-corpusa-etc- Basque corpora]
  • The Corpus of Electronic Texts
  • Corpus Inscriptionum Insularum Celticarum (CIIC), covering Primitive Irish inscriptions in Ogham
  • [http://storage.googleapis.com/books/ngrams/books/datasetsv2.html Google Books Ngram Corpus]
  • [http://corpora.iliauni.edu.ge/ The Georgian Language Corpus]
  • Thesaurus Linguae Graecae (Ancient Greek)
  • [http://eanc.net/ Eastern Armenian National Corpus] (EANC), 110 million words. Freely searchable online.
  • Spanish text corpus by Molino de Ideas, which contains 660 million words.{{in lang|es}} {{cite web|url=http://www.molinolabs.com/corpus.html|title=Molinolabs - corpus|publisher=molinolabs.com|access-date=12 January 2014}}
  • CorALit: the Corpus of Academic Lithuanian. Academic texts published in 1999–2009 (approx. 9 million words), compiled at the University of Vilnius, Lithuania.{{cite web|url=http://coralit.lt/en/node/18|title=CorALit – CorALit - Lietuvių mokslo kalbos tekstynas|publisher=coralit.lt|access-date=12 January 2014}}
  • Reference Corpus of Contemporary Portuguese (CRPC)
  • Turkish National Corpus{{cite web|url=http://www.tnc.org.tr|title=Turkish National Corpus - Türkçe Ulusal Derlemi - Homepage|publisher=tnc.org.tr|access-date=12 January 2014}}
  • [http://corola.racai.ro/ CoRoLa - The Reference Corpus of the Contemporary Romanian Language (Corpus reprezentativ al limbii române contemporane)]
  • [https://tscorpus.com/ TS Corpus] - A large set of Turkish corpora. TS Corpus is a free and independent project that aims to build Turkish corpora, NLP tools and linguistic datasets...
  • [http://www.nilc.icmc.usp.br/macmorpho MacMorpho] - an annotated corpus of Brazilian Portuguese text

= Slavic =

== East Slavic ==

  • [http://bnkorpus.info/ Belarusian N-korpus]
  • Russian National Corpus
  • General Internet Corpus of Russian
  • [http://uacorpus.org/ General regionally annotated corpus of Ukrainian]
  • [http://www.mova.info/corpus.aspx?l1=209/ Ukrainian Language Corpus on the Mova.info Linguistic Portal]
  • [https://web.archive.org/web/20190706091839/http://korpus.org.ua/ Ukrainian Language Corpus]
  • [http://unesco.uniba.sk/guest/ Araneum Russicum]
  • [https://archive.ics.uci.edu/ml/datasets/Russian+Corpus+of+Biographical+Texts Russian Corpus of Biographical Texts]{{Cite journal|last=Glazkova|first=A|title=Topical Classification of Text Fragments Accounting for Their Nearest Context|url=https://www.researchgate.net/publication/348432654|journal=Automation and Remote Control|date=2020|volume=81|issue=12|pages=2262–2276|doi=10.1134/S0005117920120097|s2cid=231929892}}
  • [https://study.mokoron.com/#download RuTweetCorp]{{Cite journal|last=Rubtsova|first=Yu|title=Constructing a corpus for sentiment classification training|date=2015|url=http://www.swsys.ru/index.php?page=article&id=3962&lang=&lang=en|journal=Software & Systems|volume=1|pages=72–78|doi=10.15827/0236-235X.109.072-078}}
  • [https://www.kaggle.com/oldaandozerskaya/fiction-corpus-for-agebased-text-classification RusAge: Corpus for Age-Based Text Classification]

== South Slavic ==

  • Bulgarian National Corpus{{cite web|url=http://search.dcl.bas.bg|title=Under Update|publisher=search.dcl.bas.bg|access-date=12 January 2014}}
  • Macedonian Electronic Corpus{{Cite web | url=http://drmj.manu.edu.mk/%d0%b5%d0%bb%d0%b5%d0%ba%d1%82%d1%80%d0%be%d0%bd%d1%81%d0%ba%d0%b8-%d0%ba%d0%be%d1%80%d0%bf%d1%83%d1%81-%d0%bd%d0%b0-%d0%bc%d0%b0%d0%ba%d0%b5%d0%b4%d0%be%d0%bd%d1%81%d0%ba%d0%b8-%d0%ba%d0%bd%d0%b8/ | title= Електронски корупус на македонски книжевни текстови}}
  • Croatian Language Corpus
  • Croatian National Corpus
  • Slovenian National Corpus

== West Slavic ==

= German =

  • German Reference Corpus (DeReKo), more than 4 billion words of contemporary written German.
  • [https://rauschii.github.io/DysListGerman/ Free corpus of German mistakes from people with dyslexia]

Middle Eastern Languages

Turkic languages

  • [https://uzbekcorpus.uz/enVer Uzbek national corpus] (20 million words)

Devanagari

  • [https://ieee-dataport.org/open-access/nepali-text-corpus Nepali Text Corpus] (90+ million running words/6.5+ million sentences)

East Asian Languages

  • Kotonoha Japanese language corpus{{cite web|url=http://www.kotonoha.gr.jp/shonagon/|title=KOTONOHA「現代日本語書き言葉均衡コーパス」 少納言|publisher=kotonoha.gr.jp|access-date=12 January 2014}}
  • LIVAC Synchronous Corpus (Chinese)

South Asian Languages

  • Hindi: {{cite web | url=https://wortschatz.uni-leipzig.de/en/download/Hindi | title=Download Corpora Hindi }}
  • [https://osf.io/a5quv/ SinMin] dataset (Sinhala). D. Upeksha, C. Wijayarathna, M. Siriwardena, L. Lasandun, C. Wimalasuriya, N. de Silva, and G. Dias. 2015. [https://www.researchgate.net/publication/306264442_Implementing_a_Corpus_for_Sinhala_Language Implementing a Corpus for Sinhala Language]. In Symposium on Language Technology for South Asia.

African languages

  • Amharic: Glossa (uio.no)
  • Creole (Gulf of Guinea): {{cite web | url=https://aclanthology.org/L14-1376/ | title=The Gulf of Guinea Creole Corpora | date=May 2014 | pages=523–529 }}
  • Hausa: https://arxiv.org/pdf/2102.06991.pdf, https://wortschatz.uni-leipzig.de/en/download/Hausa
  • Igbo: {{cite web | url=https://www.sketchengine.eu/igtenten-igbo-corpus/ | title=IgTenTen – Igbo corpus from the web | Sketch Engine | date=20 June 2022 }}
  • Oromo: {{cite web | url=https://www.sketchengine.eu/corpora-and-languages/oromo-text-corpora/ | title=Oromo text corpora | Sketch Engine | date=15 January 2019 }}
  • Yoruba: https://www.researchgate.net/publication/336274457_Digital_Yoruba_Corpus, https://www.sketchengine.eu/corpora-and-languages/yoruba-text-corpora/
  • Zulu: {{cite web | url=https://wortschatz.uni-leipzig.de/en/download/Zulu | title=Download Corpora Zulu }}

Parallel corpora of diverse languages

  • [https://digital.lib.hkbu.edu.hk/cepic/ Chinese/English Political Interpreting Corpus (CEPIC)] {{Cite web|last=Pan|first=Jun|date=2019|title=The Chinese/English Political Interpreting Corpus (CEPIC). Hong Kong Baptist University Library|url=https://digital.lib.hkbu.edu.hk/cepic/|access-date=January 3, 2022}}{{Cite journal|last=Pan|first=Jun|date=2019-10-30|title=The Chinese/English Political Interpreting Corpus (CEPIC): A New Electronic Resource for Translators and Interpreters|journal=Proceedings of the Second Workshop Human-Informed Translation and Interpreting Technology Associated with RANLP 2019|pages=82–88 |publisher=Incoma Ltd., Shoumen, Bulgaria|doi=10.26615/issn.2683-0078.2019_010|s2cid=211257773 |doi-access=free}} consists of transcripts of speeches delivered by top political figures from Hong Kong, Beijing, Washington DC and London, as well as their translated/interpreted texts. Developed by Jun Pan and HKBU Library.
  • Europarl Corpus - proceedings of the European Parliament from 1996 to 2012
  • EUR-Lex corpus - collection of documents in all official languages of the European Union, created from the EUR-Lex database{{cite web|url=https://www.sketchengine.co.uk/eurlex-corpus/|title=EUR-Lex Corpus|date=2 June 2016|publisher=sketchengine.co.uk|access-date=27 October 2016}}
  • OPUS: Open source Parallel Corpus in many languages{{cite web|url=http://opus.lingfil.uu.se/|title=OPUS - an open source parallel corpus|publisher=opus.lingfil.uu.se|access-date=12 January 2014}}
  • Tatoeba, a parallel corpus that contains over 8.9 million sentences in multiple languages; 107 languages have more than 1,000 sentences each; a further 81 languages have from 100 to 1,000 sentences each.{{cite web|url=http://tatoeba.org/eng/stats/sentences_by_language|title=Tatoeba - Number of sentences per language|publisher=tatoeba.org|access-date=23 November 2020}}
  • [https://compling.upol.cz/ntumc/ NTU-Multilingual Corpus] in 7 languages (ara, eng, ind, jpn, kor, cmn, vie){{cite journal|url=http://www.colips.org/journal/volume22/22.4.2.NTU-MC%20Tan%20final.pdf|title=Building and Annotating the Linguistically Diverse NTU-MC (NTU — Multilingual Corpus)|author=Liling Tan and Francis Bond|date=14 May 2012|journal=International Journal of Asian Language Processing|volume=22|issue=4|pages=161–174|access-date=12 January 2014|archive-url=https://web.archive.org/web/20140116120131/http://www.colips.org/journal/volume22/22.4.2.NTU-MC%20Tan%20final.pdf|archive-date=16 January 2014|url-status=dead}} ([https://github.com/alvations/NTU-MC legacy repo])
  • [https://github.com/alvations/SeedLing SeedLing] corpus - A Seed Corpus for the Human Language Project with 1000+ languages from various sources. Guy Emerson, Liling Tan, Susanne Fertmann, Alexis Palmer and Michaela Regneri. 2014. [http://anthology.aclweb.org/W/W14/W14-22.pdf#page=87 SeedLing: Building and using a seed corpus for the Human Language Project]. In Proceedings of the use of Computational methods in the study of Endangered Languages (ComputEL) Workshop. Baltimore, USA.
  • [http://www-gewi.uni-graz.at/gralis/korpusarium/gralis_korpus.html GRALIS] parallel texts for various Slavic languages, compiled by the institute for Slavic languages at Graz University (Branko Tošović et al.)
  • [https://actres.unileon.es/wordpress/?page_id=33&lang=en The ACTRES Parallel Corpus] (P-ACTRES 2.0) is a bidirectional English-Spanish corpus consisting of original texts in one language and their translation into the other. P-ACTRES 2.0 contains over 6 million words considering both directions together. H. Sanjurjo-González and M. Izquierdo. 2019. [https://benjamins.com/catalog/scl.90.13san P-ACTRES 2.0: A parallel corpus for cross-linguistic research]. In Parallel Corpora for Contrastive and Translation Studies: New resources and applications (pp. 215-231). John Benjamins Publishing.

{{#section-h::Parallel_text|Parallel corpora}}

Comparable Corpora

  • [https://digital.lib.hkbu.edu.hk/corpus/index.php Corpus of Political Speeches] contains four collections of political speeches in English and Chinese from The Corpus of U.S. Presidential Speeches (1789–2015), The Corpus of Policy Address by Hong Kong Governors (1984–1996) and Hong Kong Chief Executives (1997–2014), The Corpus of Speeches given on New Year's days and Double Tenth days by Taiwan Presidents (1978–2014), and The Corpus of Report on the Work of the Government by Premiers of the People's Republic of China (1984–2013). Developed by HKBU Library.
  • [http://wacky.sslmit.unibo.it/doku.php?id=corpora WaCky - The Web-As-Corpus Kool Yinitiative Web as Corpus] (eng, fre, deu, ita)
  • [https://bitbucket.org/alvations/dslsharedtask2014 Disambiguating Similar Language Corpora Collection (DSLCC)] (Bosnian, Croatian, Serbian, Indonesian, Malay, Czech, Slovak, Brazilian Portuguese, European Portuguese, Peninsular Spanish, Argentine Spanish). Liling Tan, Marcos Zampieri, Nikola Ljubešić, and Jörg Tiedemann. [http://comparable.limsi.fr/bucc2014/4.pdf Merging comparable data sources for the discrimination of similar languages: The DSL corpus collection.] In Proceedings of the 7th Workshop on Building and Using Comparable Corpora (BUCC). 2014.
  • [http://linguatools.org/tools/corpora/wikipedia-comparable-corpora/ Wikipedia Comparable Corpora]{{registration required}} (41 million aligned Wikipedia articles for 253 language pairs)
  • The TenTen Corpus Family – comparable web corpora of target size 10 billion words. These corpora are available in the corpus management system Sketch Engine. TenTen corpora currently exist for more than 30 languages, such as the English TenTen corpus,{{Cite book | doi=10.1007/978-3-642-32790-2_1|chapter = Getting to Know Your Corpus|title = Text, Speech and Dialogue| volume=7499| pages=3–15|series = Lecture Notes in Computer Science|year = 2012|last1 = Kilgarriff|first1 = Adam| isbn=978-3-642-32789-6| citeseerx=10.1.1.452.8074}} the Arabic TenTen corpus (Belinkov, Y., Habash, N., Kilgarriff, A., Ordan, N., Roth, R., & Suchomel, V. (2013). [https://www.sketchengine.co.uk/wp-content/uploads/arTenTen_corpus_for_Arabic_2013.pdf arTen-Ten: a new, vast corpus for Arabic]. Proceedings of WACL), the Spanish TenTen corpus (Kilgarriff, A., & Renau, I. (2013). [https://www.sketchengine.co.uk/wp-content/uploads/esTenTen_web_corpus_of_Peninsular_and_American_Spanish_2013.pdf esTenTen, a vast web corpus of Peninsular and American Spanish]. Procedia - Social and Behavioral Sciences, 95, 12-19) and the Russian TenTen corpus (Хохлова, М. В. (2016). [http://openbooks.ifmo.ru/ru/file/4106/4106.pdf Обзор больших русскоязычных корпусов текстов]. In Материалы научной конференции "Интернет и современное общество" (pp. 74-77); Khokhlova, M. (2016). [https://nlp.fi.muni.cz/raslan/raslan16.pdf#page=17 Comparison of High-Frequency Nouns from the Perspective of Large Corpora]. RASLAN 2016 Recent Advances in Slavonic Natural Language Processing, 9). An overview of existing TenTen corpora can be found at https://www.sketchengine.co.uk/documentation/tenten-corpora/
  • Timestamped JSI web corpora – web corpora of news articles crawled from a list of RSS feeds. Newsfeed corpora are being prepared in the framework of a project implemented by the Jožef Stefan Institute, a Slovenian scientific research institute, and published in Sketch Engine. Trampuš, M., & Novak, B. (2012, October). [http://ailab.ijs.si/dunja/SiKDD2012/Papers/Trampus_Newsfeed.pdf Internals of an aggregated web news feed]. In Proceedings of the Fifteenth International Information Science Conference IS SiKDD 2012 (pp. 431-434). More information about the project is on the [http://newsfeed.ijs.si/ project website].

L2 (English) Corpora

  • Cambridge Learner Corpus{{Citation|title=Cambridge English Corpus|date=2019-09-27|url=https://en.wikipedia.org/w/index.php?title=Cambridge_English_Corpus&oldid=918173927|work=Wikipedia|language=en|access-date=2020-01-07}}
  • Corpus of Academic Written and Spoken English (CAWSE),{{Cite web|url=https://www.nottingham.edu.cn/en/education-and-english/research/cawse/cawse-corpus.aspx|title=CAWSE Corpus - The University of Nottingham Ningbo China - 宁波诺丁汉大学|website=nottingham.edu.cn|access-date=2020-01-07}} a collection of Chinese students’ English language samples in academic settings. Freely downloadable [https://cawse.transcribear.com/dataaccess.asp online].  
  • English as a Lingua Franca in Academic Settings (ELFA),{{Cite web|url=https://www.helsinki.fi/en/researchgroups/english-as-a-lingua-franca-in-academic-settings|title=English as a Lingua Franca in Academic Settings|date=2018-03-23|website=University of Helsinki|language=en|access-date=2020-01-07}} an academic ELF corpus.{{Citation|title=English as a lingua franca|date=2019-12-14|url=https://en.wikipedia.org/w/index.php?title=English_as_a_lingua_franca&oldid=930727312|work=Wikipedia|language=en|access-date=2020-01-07}}{{Cite journal|last=Mauranen|first=A|title=English as an academic lingua franca: The ELFA project|journal=English for Specific Purposes|year=2010|volume=29|issue=3|pages=183–190|doi=10.1016/j.esp.2009.10.001}}
  • International Corpus of Learner English (ICLE),{{Cite web|url=https://uclouvain.be/en/research-institutes/ilc/cecl/icle.html|title=ICLE|website=UCLouvain|language=en|access-date=2020-01-07}} a corpus of learner written English.
  • Louvain International Database of Spoken English Interlanguage (LINDSEI),{{Cite web|url=https://uclouvain.be/fr/node/11968|title=LINDSEI|website=UCLouvain|language=fr|access-date=2020-01-07}} a corpus of learner spoken English.
  • Trinity Lancaster Corpus, one of the largest corpora of L2 spoken English.{{Cite web|url=http://cass.lancs.ac.uk/trinity-lancaster-corpus/|title=Trinity Lancaster Corpus {{!}} ESRC Centre for Corpus Approaches to Social Science (CASS)|language=en-US|access-date=2020-01-07}}{{Cite journal|last=Gablasova|first=D|date=2019|title=The Trinity Lancaster Corpus: Development, Description and Application.|journal=International Journal of Learner Corpus Research|volume=5|issue=2|pages=126–158|doi=10.1075/ijlcr.19001.gab|doi-access=free}}
  • University of Pittsburgh English Language Institute Corpus (PELIC). Juffs, A., Han, N-R., & Naismith, B. (2020). The University of Pittsburgh English Language Corpus (PELIC) [Data set]. {{doi|10.5281/zenodo.3991977}}
  • Vienna-Oxford International Corpus of English (VOICE),{{Cite web|url=https://www.univie.ac.at/voice/page/index.php|title=Project|website=univie.ac.at|access-date=2020-01-07}} an ELF corpus.

References

{{Reflist|2}}

See also