Text Encoding Initiative

{{short description|Academic community concerned with text encoding}}

{{CS1 config|mode=cs1}}

The Text Encoding Initiative (TEI) is a text-centric community of practice in the academic field of digital humanities, operating continuously since the 1980s. The community currently runs a mailing list, meetings and conference series, and maintains the TEI technical standard, a journal,{{cite news |title=Journal of the Text Encoding Initiative |url=https://journals.openedition.org/jtei/ |website=Open Edition Journals |access-date=29 June 2022}} a wiki, a GitHub repository and a toolchain.

TEI guidelines

The TEI Guidelines collectively define a type of XML format, and are the defining output of the community of practice. The format differs from other well-known open formats for text (such as HTML and OpenDocument) in that it is primarily semantic rather than presentational: the semantics and interpretation of every tag and attribute are specified. The Guidelines define some 500 different textual components and concepts, including {{mono|word}},{{Cite web|url=https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-w.html|title=TEI element w (word)|website=tei-c.org}} {{mono|sentence}},{{Cite web|url=https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-s.html|title=TEI element s (s-unit)|website=tei-c.org}} {{mono|character}},{{Cite web|url=https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-c.html|title=TEI element c (character)|website=tei-c.org}} {{mono|glyph}},{{Cite web|url=https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-g.html|title=TEI element g (character or glyph)|website=tei-c.org}} and {{mono|person}}.{{Cite web|url=https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-person.html|title=TEI element person (person)|website=tei-c.org}} Each is grounded in one or more academic disciplines and illustrated with examples.

=Technical details=

The standard is split into two parts: a discursive textual description with extended examples and discussion, and a set of tag-by-tag definitions. Schemata in most of the modern formats (DTD, RELAX NG and W3C XML Schema) are generated automatically from the tag-by-tag definitions. A number of tools support the production of the guidelines and their application to specific projects.

A number of special tags are used to circumvent restrictions imposed by the underlying Unicode: {{mono|glyph}} to allow representation of characters that do not qualify for Unicode inclusion{{Failed verification|date=March 2025 |reason=This probably wants to be, instead of a reference to the page for word, the g or glyph pages, or perhaps https://tei-c.org/release/doc/tei-p5-doc/en/html/WD.html ; but in none of these places is this exact rationale for 𝚐𝚕𝚢𝚙𝚑 given, anyway. If anything, the page just linked suggests that you 𝘴𝘩𝘰𝘶𝘭𝘥 submit your new characters to Unicode for inclusion!}} and {{mono|choice}} to overcome the required strict linearity.{{cite web|url=http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-choice.html|title=Element choice|website=www.tei-c.org}}

Most users of the format do not use the complete range of tags, but produce a customization using a project-specific subset of the tags and attributes defined by the Guidelines. The TEI defines a sophisticated customization mechanism known as ODD for this purpose. In addition to documenting and describing each TEI tag, an ODD specification specifies its content model and other usage constraints, which may be expressed using Schematron.

TEI Lite is an example of such a customization. It defines an XML-based file format for exchanging texts. It is a manageable selection from the extensive set of elements available in the full TEI Guidelines.

As an XML-based format, TEI cannot directly deal with overlapping markup and non-hierarchical structures. A variety of options to represent this sort of data is suggested by the guidelines.{{cite web |url= https://tei-c.org/release/doc/tei-p5-doc/en/html/NH.html |title=20 Non-hierarchical Structures - TEI P5: — Guidelines for Electronic Text Encoding and Interchange |work=tei-c.org |year=2019 |access-date=19 March 2019}}
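The limitation can be illustrated with a minimal sketch in Python, using only the standard library's `xml.etree.ElementTree`. A sentence that runs across a verse-line boundary cannot be encoded with two sets of ordinary container elements, whereas marking one of the hierarchies with empty boundary elements (here {{mono|lb}}, TEI's line-beginning element) keeps the document well formed. The sample text and the helper names `is_well_formed` and `count_elements` are illustrative assumptions, not taken from the Guidelines, and boundary marking is only one of the options the Guidelines discuss.

```python
import xml.etree.ElementTree as ET

# Two hierarchies that overlap: the second sentence (s) starts in one
# verse line (l) and ends in the next. This is not well-formed XML.
OVERLAPPING = (
    "<lg>"
    "<l><s>The first sentence ends here.</s> <s>The second one</l>"
    "<l>runs on into the next line.</s></l>"
    "</lg>"
)

# One workaround: keep the sentences as containers and mark the line
# boundaries with empty lb (line beginning) elements instead.
MILESTONES = (
    "<p>"
    "<s><lb/>The first sentence ends here.</s> "
    "<s>The second one<lb/>runs on into the next line.</s>"
    "</p>"
)


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def count_elements(xml_text: str, tag: str) -> int:
    """Count elements named tag anywhere in the parsed document."""
    return sum(1 for _ in ET.fromstring(xml_text).iter(tag))


if __name__ == "__main__":
    print(is_well_formed(OVERLAPPING))
    print(is_well_formed(MILESTONES))
    print(count_elements(MILESTONES, "lb"))
```

In the second encoding the verse lines are no longer elements with content; they have to be reconstructed from the positions of the boundary markers, which is the usual trade-off of this approach.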

=Examples=

The text of the TEI guidelines is rich in examples. There is also a samples page on the TEI wiki,{{cite web |url= http://wiki.tei-c.org/index.php/Samples_of_TEI_texts |title=Samples of TEI texts |work=wiki.tei-c.org |year=2011 |access-date=17 April 2012}} which gives examples of real-world projects that expose their underlying TEI.

==Prose tags==

TEI allows texts to be marked up syntactically at any level of granularity, or mixture of granularities. For example, this paragraph (p) has been marked up into sentences (s) and clauses (cl).{{cite web |url= http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AI.html#AILCW |title=17 Simple Analytic Mechanisms - TEI P5: — Guidelines for Electronic Text Encoding and Interchange |work=tei-c.org |year=2012 |access-date=15 April 2012}}

It was about the beginning of September, 1664,
that I, among the rest of my neighbours,
heard in ordinary discourse
that the plague was returned again to Holland;
for it had been very violent there, and particularly at
Amsterdam and Rotterdam, in the year 1663,
whither, they say, it was brought,
some said from Italy, others from the Levant, among some goods
which were brought home by their Turkey fleet;
others said it was brought from Candia;
others from Cyprus.
It mattered not from whence it came;
but all agreed it was come into Holland again.
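The passage as reproduced here does not show the tags themselves. As a rough sketch, the final sentence might be tagged as below, with subordinate clauses nested inside main clauses; the exact nesting is an illustrative assumption rather than the encoding given in the Guidelines, and `clause_depths` is a hypothetical helper. The Python code reads the markup back with the standard library and reports how deeply each clause is nested.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of the passage's final sentence: one s element
# containing clauses (cl), with subordinate clauses nested inside.
SAMPLE = """
<p>
  <s>
    <cl>It mattered not <cl>from whence it came;</cl></cl>
    <cl>but all agreed <cl>it was come into Holland again.</cl></cl>
  </s>
</p>
""".strip()


def clause_depths(xml_text: str) -> list[tuple[int, str]]:
    """Return (nesting depth, text) for every cl element, in document order."""
    root = ET.fromstring(xml_text)
    result = []

    def walk(element, depth):
        for child in element:
            if child.tag == "cl":
                # itertext() gathers the clause's own text and the text
                # of any clauses nested inside it; split/join tidies spaces.
                text = " ".join("".join(child.itertext()).split())
                result.append((depth + 1, text))
                walk(child, depth + 1)
            else:
                walk(child, depth)

    walk(root, 0)
    return result


if __name__ == "__main__":
    for depth, text in clause_depths(SAMPLE):
        print("  " * (depth - 1) + text)
```

Because granularity is a per-project decision, the same code works unchanged on a text tagged only at sentence level: it simply finds no {{mono|cl}} elements.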

==Verse==

TEI has tags for marking up verse. This example (taken from the French translation of the TEI Guidelines) shows a sonnet.{{cite web |url=http://www.tei-c.org/release/doc/tei-p5-doc/fr/html/ref-lg |title=TEI element lg (groupe de vers) |work=tei-c.org |year=2012 |access-date=15 April 2012 |archive-date=6 June 2012 |archive-url=https://web.archive.org/web/20120606011418/http://www.tei-c.org/release/doc/tei-p5-doc/fr/html/ref-lg |url-status=dead }}

Les amoureux fervents et les savants austères
Aiment également, dans leur mûre saison,
Les chats puissants et doux, orgueil de la maison,
Qui comme eux sont frileux et comme eux sédentaires.

Amis de la science et de la volupté
Ils cherchent le silence et l'horreur des ténèbres ;
L'Érèbe les eût pris pour ses coursiers funèbres,
S'ils pouvaient au servage incliner leur fierté.

Ils prennent en songeant les nobles attitudes
Des grands sphinx allongés au fond des solitudes,
Qui semblent s'endormir dans un rêve sans fin ;

Leurs reins féconds sont pleins d'étincelles magiques,
Et des parcelles d'or, ainsi qu'un sable fin,
Étoilent vaguement leurs prunelles mystiques.
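The verse markup is not visible in the text as reproduced here. A minimal sketch of how the first quatrain might be encoded uses {{mono|lg}} (line group) and {{mono|l}} (verse line) in the TEI namespace; the `type` values "sonnet" and "quatrain" and the helper name `stanzas` are assumptions made for illustration. The Python code lists each line group that directly contains verse lines.

```python
import xml.etree.ElementTree as ET

TEI = "http://www.tei-c.org/ns/1.0"
TEI_NS = {"tei": TEI}

# Assumed encoding: an outer lg for the whole poem, an inner lg per stanza.
SAMPLE = """
<lg xmlns="http://www.tei-c.org/ns/1.0" type="sonnet">
  <lg type="quatrain">
    <l>Les amoureux fervents et les savants austères</l>
    <l>Aiment également, dans leur mûre saison,</l>
    <l>Les chats puissants et doux, orgueil de la maison,</l>
    <l>Qui comme eux sont frileux et comme eux sédentaires.</l>
  </lg>
</lg>
""".strip()


def stanzas(xml_text: str) -> list[tuple[str, list[str]]]:
    """Return (type, lines) for each lg that directly contains l elements."""
    root = ET.fromstring(xml_text)
    found = []
    # iter() visits the root lg as well as the nested ones.
    for group in root.iter(f"{{{TEI}}}lg"):
        lines = [line.text or "" for line in group.findall("tei:l", TEI_NS)]
        if lines:
            found.append((group.get("type", ""), lines))
    return found


if __name__ == "__main__":
    for kind, lines in stanzas(SAMPLE):
        print(kind, len(lines))
```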

==Choice tag==

The {{mono|choice}} tag is used to represent sections of text that might be encoded or tagged in more than one possible way. In the following example, based on one in the standard, {{mono|choice}} is used twice, once to indicate an original and a corrected number, and once to indicate an original and regularised spelling.{{cite web |url= http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-choice.html |title=TEI element choice |work=tei-c.org |year=2012 |access-date=15 April 2012}}

Lastly, That, upon his solemn oath to observe all the above
articles, the said man-mountain shall have a daily allowance of
meat and drink sufficient for the support of
1724
1728
of our subjects,
with free access to our royal person, and other marks of our
favour
favor
.
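A hedged sketch of how such an encoding can be consumed: assuming the two alternatives are tagged with the TEI elements {{mono|sic}}/{{mono|corr}} (original and corrected) and {{mono|orig}}/{{mono|reg}} (original and regularised), a program can produce either a diplomatic or a normalised reading from the same source. The shortened sample and the helper name `reading` are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened sample: each choice holds two alternative encodings of one span.
SAMPLE = (
    "<p>meat and drink sufficient for the support of "
    "<choice><sic>1724</sic><corr>1728</corr></choice> of our subjects, "
    "with free access to our royal person, and other marks of our "
    "<choice><orig>favour</orig><reg>favor</reg></choice>.</p>"
)


def reading(xml_text: str, preferred: set[str]) -> str:
    """Flatten the text, keeping only the preferred child of each choice."""

    def flatten(element) -> str:
        parts = [element.text or ""]
        for child in element:
            if child.tag == "choice":
                # Keep the first alternative whose tag is preferred;
                # the other alternatives are dropped from this reading.
                for alternative in child:
                    if alternative.tag in preferred:
                        parts.append(flatten(alternative))
                        break
            else:
                parts.append(flatten(child))
            # Text following the child belongs to the parent's content.
            parts.append(child.tail or "")
        return "".join(parts)

    return flatten(ET.fromstring(xml_text))


if __name__ == "__main__":
    print(reading(SAMPLE, {"sic", "orig"}))  # text as it stands in the source
    print(reading(SAMPLE, {"corr", "reg"}))  # corrected and regularised text
```

Keeping both alternatives in one {{mono|choice}} element means neither reading is privileged in the file itself; the selection is made when the text is processed.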

ODD

One Document Does it all ("ODD") is a literate programming language for XML schemas.{{cite conference|year=2004|last1=Bauman|first1=Syd|last2=Flanders|first2=Julia|title=ODD customizations|conference=Extreme Markup Languages 2004|url=http://conferences.idealliance.org/extreme/html/2004/Bauman01/EML2004Bauman01.html|access-date=2012-04-15|archive-date=2012-03-29|archive-url=https://web.archive.org/web/20120329180310/http://conferences.idealliance.org/extreme/html/2004/Bauman01/EML2004Bauman01.html|url-status=dead}}{{cite conference|year=2004|last1=Burnard|first1=Lou|author2-link=Sebastian Rahtz|last2=Rahtz|first2=Sebastian|title=RelaxNG with Son of ODD|conference=Extreme Markup Languages 2004|url=http://conferences.idealliance.org/extreme/html/2004/Burnard01/EML2004Burnard01.html|access-date=2012-04-15|archive-date=2012-03-29|archive-url=https://web.archive.org/web/20120329180319/http://conferences.idealliance.org/extreme/html/2004/Burnard01/EML2004Burnard01.html|url-status=dead}}{{cite conference|title=Literate Documentation for XML|last=Reiss|first=Kevin M.|conference=Digital Humanities 2007|location=Urbana-Champaign, Illinois|year=2007|url=http://dhcommons.tamu.edu/sites/default/files/poster_208_reiss.pdf|access-date=2012-04-15|archive-date=2016-03-03|archive-url=https://web.archive.org/web/20160303234829/http://dhcommons.tamu.edu/sites/default/files/poster_208_reiss.pdf|url-status=dead}}{{Cite journal|title=A complete schema definition language for the Text Encoding Initiative|first1=Lou|last1=Burnard|author2-link=Sebastian Rahtz|first2=Sebastian|last2=Rahtz|date=June 2013|journal=XML London 2013|doi=10.14337/XMLLondon13.Rahtz01|url=http://xmllondon.com/2013/presentations/rahtz/|pages=152–161|doi-broken-date=1 November 2024 |isbn=978-0-9926471-0-0|doi-access=free}}

In literate-programming style, ODD documents combine human-readable documentation and machine-readable models using the Documentation Elements module of the Text Encoding Initiative. Tools generate localised and internationalised HTML, ePub, or PDF human-readable output and DTDs, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax machine-readable output.

The [https://roma2.tei-c.org/ Roma web application] is built around the ODD format and can use it to generate schemas in DTD, W3C XML Schema, RELAX NG Compact Syntax, or RELAX NG XML Syntax formats, as used by many XML validation tools and services.

ODD is the format used internally by the Text Encoding Initiative for the TEI technical standard.{{cite web|year=2007|editor1-first=Lou|editor1-last=Burnard|editor2-first=Syd|editor2-last=Bauman|title=TEI P5: Guidelines for Electronic Text Encoding and Interchange|publisher=TEI Consortium|location=Charlottesville, Virginia, USA|url=http://www.tei-c.org/Guidelines/P5/}} Although ODD files generally describe the difference between a customized XML format and the full TEI model, ODD also can be used to describe XML formats that are entirely separate from the TEI. One example of this is the W3C's Internationalization Tag Set which uses the ODD format to generate schemas and document its vocabulary.{{cite web | editor-last1=Lieske | editor-first1=Christian | editor-last2=Sasaki |editor-first2=Felix | date=3 April 2007 | title=Internationalization Tag Set (ITS) Version 1.0 | publisher=World Wide Web Consortium | at=§1.5 Development of this specification | url=https://www.w3.org/TR/2007/REC-its-20070403/#spec-development}}{{cite web|title=Best Practices for XML Internationalization|editor1-last=Savourel|editor1-first=Yves|editor2-last=Kosek|editor2-first=Jirka|editor3-last=Ishida|editor3-first=Richard|publisher=W3C Working Group|year=2008|url=http://www.w3.org/TR/xml-i18n-bp/|at=5.2 ITS and TEI}}

TEI customizations

TEI customizations are specializations of the TEI XML specification for use in particular fields or by specific communities.

  • EpiDoc (Epigraphic Documents)
  • Charters Encoding Initiative{{Cite web|url=http://www.cei.lmu.de/|title=Charters Encoding Initiative - Ludwig-Maximilians-Universität München|website=www.cei.lmu.de}}
  • Medieval Nordic Text Archive (Menota){{Cite web|url=https://www.menota.org/forside.xhtml|title=Medieval Nordic Text Archive (Menota)|website=www.menota.org}}

Customization in the TEI is done through the ODD mechanism described above. Since the P5 version, all so-called "TEI Conformant" uses of the TEI Guidelines have in fact been based on a TEI customization documented in a TEI ODD file. Even the off-the-shelf pre-generated schemas that users may choose to validate against have been created from freely available customization files.

Projects

The format is used by many projects worldwide. Practically all projects are associated with one or more universities. Some well-known projects that encode texts using TEI include:

{| class="wikitable"
|+TEI projects
|-
! scope="col"| Project
! scope="col"| URL
! scope="col"| Subject(s)
|-
! scope="row"| British National Corpus
| http://www.natcorp.ox.ac.uk
| 100-million-word snapshot of current English-language usage
|-
! scope="row"| Oxford Text Archive
| https://ota.bodleian.ox.ac.uk/repository/xmlui/
| >1 GB of linguistic data and electronic texts in 25 languages
|-
! scope="row"| Perseus Project
| https://www.perseus.tufts.edu/
| Greek and Latin texts
|-
! scope="row"| EpiDoc
| https://sourceforge.net/p/epidoc/wiki/Home/
| Epigraphy and papyrology
|-
! scope="row"| Women Writers Project
| https://wwp.northeastern.edu/
| Early modern women writers (Margaret Cavendish, Eliza Haywood, etc.)
|-
! scope="row"| New Zealand Electronic Text Centre
| http://www.nzetc.org/
| New Zealand and Pacific Islands texts
|-
! scope="row"| The SWORD Project
| https://www.crosswire.org/sword/
| Bible software, dictionaries, Christian literature
|-
! scope="row"| FreeDict
| https://freedict.org/
| Bilingual dictionaries
|-
! scope="row"| Text Creation Partnership
| https://textcreationpartnership.org/
| Early British and American books
|-
! scope="row"| CELT
| https://celt.ucc.ie/publishd.html
| Ancient and medieval Irish manuscripts
|-
! scope="row"| ISTEX
| https://www.istex.fr/
| Archives of scientific publications
|-
! scope="row"| CAB
| https://cab.geschkult.fu-berlin.de/
| An edition of the Zoroastrian rituals of the Avesta, in the Avestan languages
|}

History

Prior to the creation of TEI, humanities scholars had no common standards for encoding electronic texts in a manner that would serve their academic goals (Hockey 1993, p. 41). In 1987, a group of scholars representing fields in humanities, linguistics, and computing convened at Vassar College to put forth a set of guidelines known as the “Poughkeepsie Principles”. These guidelines directed the development of the first TEI standard, "P1".{{cite journal|last=Ahronheim|first=J.R.|title=Descriptive metadata: Emerging standards.|journal=Journal of Academic Librarianship|year=1998|volume=24|issue=5|pages=395–403|doi=10.1016/S0099-1333(98)90079-9}}{{cite journal|last=Cantara |first=L.|title=The text-encoding initiative: Part 1|journal=OCLC Systems & Services|year=2005|volume=21|issue=1|pages=36–39|doi=10.1108/10650750510578136}}

  • 1987 – Work started by the Association for Computers and the Humanities,{{Cite web|url=https://ach.org/|title=The Association for Computers and the Humanities |website=ach.org}} the Association for Computational Linguistics, and the Association for Literary and Linguistic Computing on what would become the TEI (see "Historical background", [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/AB.html#ABTEI section iv.2] of TEI P5: Guidelines for Electronic Text Encoding and Interchange). This culminated in the Closing statement of the Vassar Planning Conference.{{cite web |url= http://www.tei-c.org/Vault/SC/teipcp1.txt |title= Closing statement of the Vassar Planning Conference |work=tei-c.org |year=2009 |access-date=15 April 2012}}
  • 1994 – TEI P3 released,{{cite web|url=http://www.tei-c.org/Guidelines/|title=TEI Guidelines|access-date=2010-06-18}} co-edited by Lou Burnard (at Oxford University) and Michael Sperberg-McQueen (then at the University of Illinois at Chicago, later at the W3C).
  • 1999 – TEI P3 updated.
  • 2002 – TEI P4 released, moving from SGML to XML; adoption of Unicode, which XML parsers are required to support.{{cite book |title=XML Basics|chapter=2|chapter-url=http://www.xmlnews.org/docs/xml-basics.html |access-date=2011-07-09}}
  • 2007 – TEI P5 released, including integration with the xml:lang and xml:id attributes from the W3C{{cite web|url=http://www.w3.org/TR/REC-xml/|title=Extensible Markup Language (XML) 1.0 (Fifth Edition)|work=w3.org}} (these had previously been attributes in the TEI namespace), regularization of local pointing attributes to use the hash (as used in HTML) and unification of the ptr and xptr tags. Together with many other additions, these changes make P5 more regular and bring it closer to current XML practice as promoted by the W3C and as used by other XML variants. Maintenance and feature update versions of TEI P5 have been released at least twice a year since 2007.
  • 2011 – TEI P5 v2.0.1 released with support for genetic editing{{cite web |url= http://www.tei-c.org/release/doc/tei-p5-doc/readme-2.0.1.html |title=P5 version 2.0.1 release notes |work=tei-c.org |year=2012 |access-date=15 April 2012}} (among many other additions, the genetic-editing features allow encoding of texts without interpretation as to their specific semantics).
  • 2017 – TEI was awarded the Antonio Zampolli Prize from the Alliance of Digital Humanities Organizations.{{Cite web | url=https://tei-c.org/ | title=TEI: Text Encoding Initiative}}

References

{{reflist|30em}}