Page Analysis and Ground Truth Elements

Page Analysis and Ground Truth Elements (PAGE) is an XML-based standard for encoding digitised documents.{{Cite web|url=https://github.com/PRImA-Research-Lab/PAGE-XML|title=PAGE-XML|date=July 12, 2022|via=GitHub}} Like the comparable ALTO format, it describes the layout and structure of a page together with its contents.

PAGE XML can be used to describe:{{Citation needed|date=July 2023}}

  • page content (regions, lines of text, words, glyphs, reading order, text content, ...)
  • the evaluation of the layout analysis (evaluation profiles, evaluation results, ...)
  • the dewarping of the document image (dewarping grids)
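
As an illustration of the page content layer, the following minimal sketch reads regions, text lines, text content and reading order from a small hand-written PAGE XML document using Python's standard library. The sample document, its identifiers, coordinates and text, and the helper names `parse_points` and `parse_page` are invented for this illustration. The namespace shown is that of the 2019-07-15 schema version, and the sample has not been validated against the schema.

```python
import xml.etree.ElementTree as ET

# Hand-written sample; ids, coordinates and text are invented for illustration.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Metadata>
    <Creator>example</Creator>
    <Created>2023-07-01T00:00:00</Created>
    <LastChange>2023-07-01T00:00:00</LastChange>
  </Metadata>
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r2">
      <Coords points="100,400 900,400 900,500 100,500"/>
      <TextLine id="r2l1">
        <Coords points="100,400 900,400 900,500 100,500"/>
        <TextEquiv><Unicode>Body text</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>Heading</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a PAGE 'points' attribute ("x,y x,y ...") into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def parse_page(xml_text):
    """Collect text lines per region and the region reading order."""
    root = ET.fromstring(xml_text)
    # The root tag looks like "{namespace-uri}PcGts"; reuse whatever
    # namespace the file declares, since it differs between schema versions.
    ns = {"pc": root.tag[1:].split("}")[0]}
    page = root.find("pc:Page", ns)

    regions = {}
    for region in page.findall("pc:TextRegion", ns):
        lines = []
        for line in region.findall("pc:TextLine", ns):
            coords = line.find("pc:Coords", ns)
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=ns)
            lines.append({
                "id": line.get("id"),
                "polygon": parse_points(coords.get("points")),
                "text": text,
            })
        regions[region.get("id")] = lines

    # Reading order is stored separately from the regions themselves.
    refs = page.findall("pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", ns)
    order = [r.get("regionRef") for r in sorted(refs, key=lambda r: int(r.get("index")))]
    return {"image": page.get("imageFilename"), "regions": regions, "order": order}


result = parse_page(SAMPLE)
for region_id in result["order"]:
    for line in result["regions"][region_id]:
        print(region_id, line["id"], line["text"])
```

In this sample the regions appear in the file in a different order from the one given under `ReadingOrder`, which is why the sketch follows the indexed references rather than document order.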

The format is developed by the Pattern Recognition and Image Analysis Research Lab (PRImA) at the University of Salford in Greater Manchester, England.{{Citation needed|date=July 2023}}

It was designed for use with automatic segmentation and transcription techniques such as optical character recognition (OCR) and handwritten text recognition (HTR). PAGE aims to support each step of the document image analysis pipeline, from image enhancement through layout analysis to OCR.{{Citation needed|date=July 2023}}

The PAGE XML schema is notably used as an export and import format by automatic transcription software such as eScriptorium{{Cite web|url=https://escripta.hypotheses.org/|title=eScripta – Digital Tools and Techniques for the Study of Ancient Writing}} and Transkribus.{{Cite web|url=https://readcoop.eu/transkribus/howto/how-to-export-documents-from-transkribus/|title=How To Export Documents from Transkribus|website=READ-COOP}} It is also an export format used by Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts.{{Cite web|url=https://kraken.re/|title=The Kraken OCR system|first=Benjamin|last=Kiessling|date=April 5, 2022|via=GitHub}}
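
To give a rough idea of what an exporting tool has to produce, the following sketch builds a minimal PAGE-style document with Python's standard library. It is not taken from any of the programs named above: the function `build_page`, the helper `q`, the region and line identifiers, and the choice of a single region covering the whole image are assumptions made for illustration. The required `Metadata` element is left out for brevity, so the output would not pass schema validation as it stands.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 schema version; other versions use other URIs.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def q(tag):
    """Return the namespace-qualified tag name ElementTree expects."""
    return f"{{{PAGE_NS}}}{tag}"


def build_page(image_filename, width, height, lines):
    """Serialise (polygon, text) pairs as text lines of one text region."""
    # Emit PAGE_NS as the default namespace instead of an "ns0:" prefix.
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(q("PcGts"))
    page = ET.SubElement(root, q("Page"), {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })
    region = ET.SubElement(page, q("TextRegion"), {"id": "r1"})
    # Assumption: the single region simply spans the whole image.
    ET.SubElement(region, q("Coords"), {
        "points": f"0,0 {width},0 {width},{height} 0,{height}",
    })
    for i, (polygon, text) in enumerate(lines, start=1):
        line = ET.SubElement(region, q("TextLine"), {"id": f"r1l{i}"})
        points = " ".join(f"{x},{y}" for x, y in polygon)
        ET.SubElement(line, q("Coords"), {"points": points})
        equiv = ET.SubElement(line, q("TextEquiv"))
        ET.SubElement(equiv, q("Unicode")).text = text
    return ET.tostring(root, encoding="unicode")


xml_text = build_page(
    "page_0001.png", 1000, 1500,
    [([(100, 100), (900, 100), (900, 200), (100, 200)], "Heading")],
)
print(xml_text)
```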

References

{{Reflist}}