Page Analysis and Ground Truth Elements

Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents.{{Cite web|url=https://github.com/PRImA-Research-Lab/PAGE-XML|title=PAGE-XML|date=July 12, 2022|via=GitHub}} Like ALTO, another XML format, it describes the layout and structure of a page and its contents.

PAGE XML can be used to describe:{{Citation needed|date=July 2023}}

page content (regions, lines of text, words, glyphs, reading order, text content, ...)
the evaluation of the layout analysis (evaluation profiles, evaluation results, ...)
the dewarping of the document image (dewarping grids)
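
As an illustration of the page content part, the following minimal sketch reads text regions, their outlines, their text lines and the reading order from a small PAGE XML document using Python's standard library. The sample document is hand-written for this example and has not been validated against the schema; the file name, identifiers, coordinates and text in it are made up, and the helper names <code>parse_points</code> and <code>read_regions</code> are hypothetical. The namespace is assumed to be that of the 2019-07-15 page content schema; files written against other schema versions carry a different date in the namespace.

```python
import xml.etree.ElementTree as ET

# Namespace of the 2019-07-15 page content schema (assumption: other
# schema versions use a different date in this URI).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written illustrative sample, not taken from a real export.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Metadata>
    <Creator>example</Creator>
    <Created>2023-07-01T00:00:00</Created>
    <LastChange>2023-07-01T00:00:00</LastChange>
  </Metadata>
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r2">
      <Coords points="100,300 900,300 900,500 100,500"/>
      <TextLine id="r2l1">
        <Coords points="100,300 900,300 900,400 100,400"/>
        <TextEquiv><Unicode>First line of the body.</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r2l2">
        <Coords points="100,400 900,400 900,500 100,500"/>
        <TextEquiv><Unicode>Second line of the body.</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,200 100,200"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>Chapter One</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn a Coords 'points' string such as '10,20 30,40' into a list of (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_regions(xml_text):
    """Return (region id, data) pairs sorted by the page's reading order."""
    page = ET.fromstring(xml_text).find("pc:Page", NS)

    regions = {}
    for region in page.findall("pc:TextRegion", NS):
        lines = []
        for line in region.findall("pc:TextLine", NS):
            text = line.find("pc:TextEquiv/pc:Unicode", NS)
            lines.append(text.text if text is not None and text.text else "")
        coords = region.find("pc:Coords", NS)
        regions[region.get("id")] = {
            "polygon": parse_points(coords.get("points")),
            "lines": lines,
        }

    # The reading order refers to regions by id; sort the references by index.
    refs = page.findall("pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    ordered_ids = [r.get("regionRef") for r in sorted(refs, key=lambda r: int(r.get("index")))]
    return [(rid, regions[rid]) for rid in ordered_ids]


if __name__ == "__main__":
    for region_id, data in read_regions(SAMPLE):
        print(region_id, data["polygon"], data["lines"])
```

In the sample, region <code>r2</code> appears before <code>r1</code> in the file, but the reading order lists <code>r1</code> first, so the sketch returns the heading before the body text. Words and glyphs, which the format also allows inside text lines, are left out to keep the example short.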

The format is developed by the Pattern Recognition & Image Analysis Lab (PRImA) at the University of Salford, Greater Manchester.{{Citation needed|date=July 2023}}

It was designed for use with automatic segmentation and transcription techniques such as optical character recognition (OCR) and handwritten text recognition (HTR): PAGE aims to support each step in the processing chain for document image analysis, from image enhancement through layout analysis to OCR.{{Citation needed|date=July 2023}}

The PAGE XML schema is notably used as an export and import format by automatic transcription software such as eScriptorium{{Cite web|url=https://escripta.hypotheses.org/|title=eScripta – Digital Tools and Techniques for the Study of Ancient Writing}} and Transkribus.{{Cite web|url=https://readcoop.eu/transkribus/howto/how-to-export-documents-from-transkribus/|title=How To Export Documents from Transkribus|website=READ-COOP}} It is also an export format used by Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts.{{Cite web|url=https://kraken.re/|title=The Kraken OCR system|first=Benjamin|last=Kiessling|date=April 5, 2022|via=GitHub}}

References

External links

[https://github.com/PRImA-Research-Lab/PAGE-XML/blob/master/documentation/XML%20File%20Structure.pdf Documentation]
[https://www.primaresearch.org/schema/PAGE/gts/pagecontent/2016-07-15/Simple%20PAGE%20XML%20Example.pdf Encoding example]
[https://ocr-d.de/en/gt-guidelines/trans/trPage.html Documentation of the PAGE XML Format for Page Content] in the OCR-D project, funded by Deutsche Forschungsgemeinschaft.
[http://www.primaresearch.org/schema/PAGE/gts/pagecontent/2019-07-15/pagecontent.xsd Documentation "Page Content - Ground Truth and Storage"]
[http://www.primaresearch.org/schema/PAGE/eval/layout/2013-07-15/layouteval.xsd Documentation "Evaluation - Metadata, Profile and Results"]
[http://www.primaresearch.org/schema/PAGE/gts/dewarping/2014-08-26/dewarping.xsd Documentation "Dewarping - Ground Truth and Storage"]

Category:XML-based standards

Category:Optical character recognition

Category:Handwriting recognition