Page Analysis and Ground Truth Elements
Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents.[1] Like ALTO, another XML-based format, it describes the organisation and structure of a page and its contents.
PAGE XML can be used to describe:[citation needed]
- page content (regions, lines of text, words, glyphs, reading order, text content, ...)
- the evaluation of the layout analysis (evaluation profiles, evaluation results, ...)
- the dewarping of the document image (dewarping grids)
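The page-content hierarchy listed above (regions, lines, words, reading order, text content) can be illustrated with a short parsing sketch. The sample document below is hand-written for illustration and is not taken from a real dataset; the namespace version (2019-07-15) is an assumption, since files produced by different tools may declare a different schema version. The function name `read_page` and the shape of the returned dictionary are hypothetical choices made for this example.

```python
import xml.etree.ElementTree as ET

# Namespace of one PAGE content schema version (assumed for this sketch;
# real files may declare another version).
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hand-written sample, for illustration only. The regions are deliberately
# stored out of order so that the ReadingOrder element matters.
SAMPLE = """
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="800" imageHeight="600">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
        <RegionRefIndexed index="1" regionRef="r2"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r2">
      <Coords points="50,300 750,300 750,360 50,360"/>
      <TextLine id="r2l1">
        <Coords points="50,300 750,300 750,360 50,360"/>
        <TextEquiv><Unicode>Second region</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
    <TextRegion id="r1">
      <Coords points="50,50 750,50 750,110 50,110"/>
      <TextLine id="r1l1">
        <Coords points="50,50 750,50 750,110 50,110"/>
        <Word id="r1l1w1">
          <Coords points="50,50 200,50 200,110 50,110"/>
          <TextEquiv><Unicode>First</Unicode></TextEquiv>
        </Word>
        <Word id="r1l1w2">
          <Coords points="220,50 400,50 400,110 220,110"/>
          <TextEquiv><Unicode>region</Unicode></TextEquiv>
        </Word>
        <TextEquiv><Unicode>First region</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>
"""


def parse_points(points):
    """Turn a 'x1,y1 x2,y2 ...' string into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def read_page(xml_text):
    """Return the image name and the text regions, sorted by reading order."""
    root = ET.fromstring(xml_text.strip())
    page = root.find("pc:Page", NS)

    # Map each region id to its position in the declared reading order.
    refs = page.iterfind("pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    rank = {ref.get("regionRef"): int(ref.get("index")) for ref in refs}

    regions = []
    for region in page.iterfind("pc:TextRegion", NS):
        lines = []
        for line in region.iterfind("pc:TextLine", NS):
            lines.append({
                "id": line.get("id"),
                # Direct child TextEquiv of the line, not those of its words.
                "text": line.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS),
                "words": [
                    word.findtext("pc:TextEquiv/pc:Unicode", default="", namespaces=NS)
                    for word in line.iterfind("pc:Word", NS)
                ],
            })
        regions.append({
            "id": region.get("id"),
            "polygon": parse_points(region.find("pc:Coords", NS).get("points")),
            "lines": lines,
        })

    # Regions missing from the reading order are placed last.
    regions.sort(key=lambda r: rank.get(r["id"], len(rank)))
    return {"image": page.get("imageFilename"), "regions": regions}


if __name__ == "__main__":
    for region in read_page(SAMPLE)["regions"]:
        for line in region["lines"]:
            print(region["id"], line["id"], line["text"])
```

The sketch only reads text regions; other region types, glyphs and nested reading-order groups would need additional handling.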
The format is developed by the Pattern Recognition & Image Analysis Lab (PRIMA) at the University of Salford in Manchester.[citation needed]
It was designed to be used alongside automatic segmentation and transcription techniques (OCR and HTR): PAGE aims to support each step in the processing chain for document image analysis, from image enhancement through layout analysis to OCR.[citation needed]
The PAGE XML schema is notably used as an export and import format by automatic transcription software such as eScriptorium[2] and Transkribus.[3] It is also an export format used by Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts.[4]
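As a rough idea of what exporting to PAGE XML involves, the following minimal sketch serialises line-level transcriptions into a single text region. It does not reproduce how eScriptorium, Transkribus or Kraken implement their exports. The function name `export_page`, the identifiers it generates, the creator string and the namespace version are all assumptions made for this example, and the output has not been validated against the official schema.

```python
import xml.etree.ElementTree as ET

# Assumed schema version; other versions use a different namespace URI.
PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"


def q(tag):
    """Qualify a tag name with the PAGE namespace."""
    return f"{{{PAGE_NS}}}{tag}"


def points(coords):
    """Format (x, y) tuples as the 'x1,y1 x2,y2 ...' string used by Coords."""
    return " ".join(f"{x},{y}" for x, y in coords)


def export_page(image_filename, width, height, lines, timestamp):
    """Build a PAGE XML string with one text region holding the given lines.

    `lines` is a list of (polygon, text) pairs, where polygon is a list of
    (x, y) tuples. `timestamp` is an ISO 8601 date-time string.
    """
    # Serialise PAGE elements with a default namespace instead of a prefix.
    ET.register_namespace("", PAGE_NS)
    root = ET.Element(q("PcGts"))

    metadata = ET.SubElement(root, q("Metadata"))
    ET.SubElement(metadata, q("Creator")).text = "example-exporter"  # hypothetical
    ET.SubElement(metadata, q("Created")).text = timestamp
    ET.SubElement(metadata, q("LastChange")).text = timestamp

    page = ET.SubElement(root, q("Page"), {
        "imageFilename": image_filename,
        "imageWidth": str(width),
        "imageHeight": str(height),
    })

    # One region covering the whole page, as a simplification.
    region = ET.SubElement(page, q("TextRegion"), {"id": "r1"})
    full_page = [(0, 0), (width, 0), (width, height), (0, height)]
    ET.SubElement(region, q("Coords"), {"points": points(full_page)})

    for i, (polygon, text) in enumerate(lines, start=1):
        line = ET.SubElement(region, q("TextLine"), {"id": f"r1l{i}"})
        ET.SubElement(line, q("Coords"), {"points": points(polygon)})
        equiv = ET.SubElement(line, q("TextEquiv"))
        ET.SubElement(equiv, q("Unicode")).text = text

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    xml_text = export_page(
        "page_0001.png", 800, 600,
        [([(50, 50), (750, 50), (750, 110), (50, 110)], "First line"),
         ([(50, 120), (750, 120), (750, 180), (50, 180)], "Second line")],
        "2022-07-12T00:00:00",
    )
    print(xml_text)
```

Keeping the polygon and the transcription together on each `TextLine` is what lets a consuming tool re-import both the segmentation and the text from the same file.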
References
- [1] "PAGE-XML". July 12, 2022. https://github.com/PRImA-Research-Lab/PAGE-XML.
- [2] "eScripta – Digital Tools and Techniques for the Study of Ancient Writing". https://escripta.hypotheses.org/.
- [3] "How To Export Documents from Transkribus". https://readcoop.eu/transkribus/howto/how-to-export-documents-from-transkribus/.
- [4] Kiessling, Benjamin (April 5, 2022). "The Kraken OCR system". https://kraken.re/.
External links
- Documentation
- Encoding example
- Documentation of the PAGE XML Format for Page Content in the OCR-D project, funded by Deutsche Forschungsgemeinschaft.
- Documentation "Page Content - Ground Truth and Storage"
- Documentation "Evaluation - Metadata, Profile and Results"
- Documentation "Dewarping - Ground Truth and Storage"
Original source: https://en.wikipedia.org/wiki/Page Analysis and Ground Truth Elements.