Page Analysis and Ground Truth Elements


Page Analysis and Ground Truth Elements (PAGE) is an XML standard for encoding digitised documents.[1] Comparable to ALTO, it describes the organisation and structure of a page and its contents.

The format covers:


  • page content (regions, lines of text, words, glyphs, reading order, text content, ...)
  • the evaluation of the layout analysis (evaluation profiles, evaluation results, ...)
  • the cutting of the document image (cutting grids)
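As an illustration of the page content part, the following minimal sketch reads text lines and the reading order from a small PAGE XML fragment using Python's standard library. The sample document is hypothetical and trimmed for brevity (it omits the Metadata element, so it is not schema-valid), the namespace shown is assumed to be the 2019-07-15 version of the page content schema (real files may declare a different version), and the helper names `read_lines` and `reading_order` are invented for this example.

```python
import xml.etree.ElementTree as ET

# Assumption: the 2019-07-15 page content namespace; other versions exist.
NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15"}

# Hypothetical, trimmed sample (no Metadata element, so not schema-valid).
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page_0001.png" imageWidth="1000" imageHeight="1500">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="0" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1">
      <Coords points="100,100 900,100 900,300 100,300"/>
      <TextLine id="r1l1">
        <Coords points="100,100 900,100 900,200 100,200"/>
        <TextEquiv><Unicode>First line</Unicode></TextEquiv>
      </TextLine>
      <TextLine id="r1l2">
        <Coords points="100,200 900,200 900,300 100,300"/>
        <TextEquiv><Unicode>Second line</Unicode></TextEquiv>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def read_lines(xml_text):
    """Return (region id, line id, text) for every text line, in document order."""
    root = ET.fromstring(xml_text)
    lines = []
    for region in root.iterfind(".//pc:TextRegion", NS):
        for line in region.iterfind("pc:TextLine", NS):
            # Line-level transcription lives in TextEquiv/Unicode.
            unicode_el = line.find("pc:TextEquiv/pc:Unicode", NS)
            has_text = unicode_el is not None and unicode_el.text
            lines.append((region.get("id"), line.get("id"), unicode_el.text if has_text else ""))
    return lines


def reading_order(xml_text):
    """Return region ids sorted by the index given in the first-level ordered group."""
    root = ET.fromstring(xml_text)
    refs = root.findall(".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    return [ref.get("regionRef") for ref in sorted(refs, key=lambda r: int(r.get("index")))]


if __name__ == "__main__":
    for region_id, line_id, text in read_lines(SAMPLE):
        print(region_id, line_id, text)
    print(reading_order(SAMPLE))
```

The sketch only handles text regions and a flat ordered group; words, glyphs, nested groups and other region types would need additional handling.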


The PAGE XML schema is used as an import and export format by automatic transcription software such as eScriptorium[2] and Transkribus.[3] It is also an export format of Kraken, a turnkey OCR system optimised for documents in historical and non-Latin scripts,[4] and of the OCR software Tesseract.[5]

References