Page Analysis and Ground Truth Elements
Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents.{{Cite web|url=https://github.com/PRImA-Research-Lab/PAGE-XML|title=PAGE-XML|date=July 12, 2022|via=GitHub}} Like ALTO, another XML format, it describes the organisation and structure of a page and its contents.
PAGE XML can be used to describe:{{Citation needed|date=July 2023}}
- page content (regions, lines of text, words, glyphs, reading order, text content, ...)
- the evaluation of the layout analysis (evaluation profiles, evaluation results, ...)
- the dewarping of the document image (dewarping grids)
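The page content hierarchy listed above (regions, text lines, reading order, text content) can be illustrated with a short Python sketch. It is a minimal example, not a reference implementation: the embedded sample document, the file name `page_0001.png`, the element ids and the helper names `parse_points` and `lines_in_reading_order` are all invented for illustration, the sample has not been validated against the schema, and the namespace assumes the 2019-07-15 version of the page content schema.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 page content schema (assumed version for this sketch).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, hand-written sample: two text regions whose reading order
# differs from their order in the file.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Metadata>
    <Creator>example</Creator>
    <Created>2023-07-01T00:00:00</Created>
    <LastChange>2023-07-01T00:00:00</LastChange>
  </Metadata>
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r2"/>
        <RegionRefIndexed index="1" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1">
      <Coords points="100,400 900,400 900,600 100,600"/>
      <TextLine id="r1l1">
        <Coords points="100,400 900,400 900,500 100,500"/>
        <TextEquiv><Unicode>Body text line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r2">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r2l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>Heading</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' polygon string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def lines_in_reading_order(xml_text):
    """Return (region id, line id, text, polygon) tuples, following the reading order."""
    root = ET.fromstring(xml_text)
    page = root.find("pc:Page", NS)
    regions = {r.get("id"): r for r in page.findall("pc:TextRegion", NS)}

    # Sort the region references by their index attribute.
    refs = page.findall("pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    refs.sort(key=lambda ref: int(ref.get("index")))
    # Fall back to file order when the page declares no reading order.
    ordered_ids = [ref.get("regionRef") for ref in refs] or list(regions)

    result = []
    for region_id in ordered_ids:
        for line in regions[region_id].findall("pc:TextLine", NS):
            text = line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
            polygon = parse_points(line.find("pc:Coords", NS).get("points"))
            result.append((region_id, line.get("id"), text, polygon))
    return result


if __name__ == "__main__":
    for region_id, line_id, text, polygon in lines_in_reading_order(SAMPLE):
        print(region_id, line_id, text, polygon)
```

The sketch only reads top-level text regions and a flat ordered group; nested regions, words, glyphs and unordered groups would need additional handling.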
The format is developed by the Pattern Recognition & Image Analysis Research Lab (PRImA) at the University of Salford, Greater Manchester.{{Citation needed|date=July 2023}}
It was designed for use with automatic segmentation and transcription techniques such as optical character recognition (OCR) and handwritten text recognition (HTR): PAGE aims to support each step in the document image analysis processing chain, from image enhancement through layout analysis to OCR.{{Citation needed|date=July 2023}}
The PAGE XML schema is notably used as an export and import format by automatic transcription software such as eScriptorium{{Cite web|url=https://escripta.hypotheses.org/|title=eScripta – Digital Tools and Techniques for the Study of Ancient Writing}} and Transkribus.{{Cite web|url=https://readcoop.eu/transkribus/howto/how-to-export-documents-from-transkribus/|title=How To Export Documents from Transkribus|website=READ-COOP}} It is also an export format used by Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts.{{Cite web|url=https://kraken.re/|title=The Kraken OCR system|first=Benjamin|last=Kiessling|date=April 5, 2022|via=GitHub}}
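As a rough illustration of what exporting to PAGE XML involves, the following Python sketch builds a minimal document from a list of transcribed lines. It does not reproduce how eScriptorium, Transkribus or Kraken write their files: the function `build_page_xml`, its parameters, the default creator string and the generated ids are hypothetical, all lines are placed in a single text region whose outline is simply their bounding box, and the output is not validated against the schema.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Namespace of the 2019-07-15 page content schema (assumed version for this sketch).
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def _tag(name):
    """Qualify an element name with the PAGE namespace."""
    return f"{{{PAGE_NS}}}{name}"


def _points(polygon):
    """Serialise [(x, y), ...] as the 'x1,y1 x2,y2 ...' string used by Coords."""
    return " ".join(f"{x},{y}" for x, y in polygon)


def build_page_xml(image_filename, width, height, lines, creator="example-exporter"):
    """Build a PAGE XML string from `lines`, a list of (text, polygon) pairs.

    Hypothetical helper: every line goes into one TextRegion with id "r1".
    """
    if not lines:
        raise ValueError("at least one line is required")

    # Serialise PAGE elements with a default namespace instead of an ns0: prefix.
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(_tag("PcGts"))

    now = datetime.now().isoformat(timespec="seconds")
    metadata = ET.SubElement(root, _tag("Metadata"))
    ET.SubElement(metadata, _tag("Creator")).text = creator
    ET.SubElement(metadata, _tag("Created")).text = now
    ET.SubElement(metadata, _tag("LastChange")).text = now

    page = ET.SubElement(root, _tag("Page"), {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })

    # The region outline is the bounding box of all line polygons.
    xs = [x for _text, polygon in lines for x, _y in polygon]
    ys = [y for _text, polygon in lines for _x, y in polygon]
    box = [(min(xs), min(ys)), (max(xs), min(ys)), (max(xs), max(ys)), (min(xs), max(ys))]

    region = ET.SubElement(page, _tag("TextRegion"), {"id": "r1"})
    ET.SubElement(region, _tag("Coords"), {"points": _points(box)})

    for number, (text, polygon) in enumerate(lines, start=1):
        line = ET.SubElement(region, _tag("TextLine"), {"id": f"r1l{number}"})
        ET.SubElement(line, _tag("Coords"), {"points": _points(polygon)})
        equiv = ET.SubElement(line, _tag("TextEquiv"))
        ET.SubElement(equiv, _tag("Unicode")).text = text

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(build_page_xml(
        "page_0001.png", 1000, 1500,
        [("First line", [(100, 100), (900, 100), (900, 200), (100, 200)])],
    ))
```

A real exporter would typically also record reading order, region types and baselines where available.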
References
{{Reflist}}
External links
- [https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/documentation/XML%20File%20Structure.pdf Documentation]
- [https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2016-07-15/Simple%20PAGE%20XML%20Example.pdf Encoding example]
- [https://ocr-d.de/en/gt-guidelines/trans/trPage.html Documentation of the PAGE XML Format for Page Content] in the OCR-D project, funded by Deutsche Forschungsgemeinschaft.
- [http://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd Documentation "Page Content - Ground Truth and Storage"]
- [http://www.primaresearch.org/schema/PAGE/eval/layout/2013-07-15/layouteval.xsd Documentation "Evaluation - Metadata, Profile and Results"]
- [http://www.primaresearch.org/schema/PAGE/gts/dewarping/2014-08-26/dewarping.xsd Documentation "Dewarping - Ground Truth and Storage"]