PageXML: Document Layout Annotation Schema

Updated 5 December 2025

PageXML is a specialized XML schema enabling structured layout and region annotation for digitized documents, vital for reliable OCR pipelines.
It employs a hierarchical structure with core elements like <PcGts>, Metadata, and various region tags to capture text and image zones clearly.
Its interoperability with major OCR systems and configurable workflows underpins advanced region extraction, segmentation, and post-processing in digitization projects.

PageXML is a specialized XML-based schema designed for the layout analysis and region annotation of digitized page images. It enables the structured representation of page content (text regions, image zones, metadata, and reading order) and has become a de facto exchange format in document image analysis and OCR post-processing pipelines, particularly for historical or complex print layouts. The schema, originally developed by the PRImA Research Lab and also known as the PAGE (Page Analysis and Ground-truth Elements) format, underpins interoperability among an ecosystem of layout analysis, region extraction, and OCR tools (Reul et al., 2017).

1. Schema Structure and Core Elements

PageXML encodes document layout using a hierarchical structure, beginning with the <PcGts> root element, which contains:

- A <Metadata> block recording creator, creation date, and optional update information.
- One <Page> element per image, each annotated with attributes:
  - imageFilename (string)
  - imageWidth, imageHeight (integers, pixels)
- Within <Page>, a list of region elements (usually flat in tool output, although the schema also permits nested regions):
  - <TextRegion id="…" type="…">
  - <ImageRegion id="…" type="…">
  - Optional: <GraphicRegion>, <LineDrawingRegion>, and other semantic or layout categories (not emitted by all tools).

Each region contains a <Coords> child, whose points attribute specifies the polygonal outline in image pixel coordinates as a space-separated list of "x,y" pairs. Reading order, if present, is encoded separately in a <ReadingOrder> element under <Page>, which references regions by their id.
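The two conventions just described can be illustrated with a minimal Python sketch using only the standard library. It assumes the 2013-07-15 namespace and a reading order expressed as a single <OrderedGroup> of <RegionRefIndexed> entries; the sample document, its ids, and the helper names are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET

NS = {"pc": "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"}

# Hypothetical sample: ids and ordering are illustrative, not taken from a real export.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="folio_012.png" imageWidth="2500" imageHeight="3400">
    <ReadingOrder>
      <OrderedGroup id="ro1">
        <RegionRefIndexed index="1" regionRef="r4"/>
        <RegionRefIndexed index="0" regionRef="r1"/>
      </OrderedGroup>
    </ReadingOrder>
    <TextRegion id="r1" type="paragraph">
      <Coords points="120,250 2380,250 2380,600 120,600"/>
    </TextRegion>
    <ImageRegion id="r4">
      <Coords points="400,700 2100,700 2100,2200 400,2200"/>
    </ImageRegion>
  </Page>
</PcGts>"""


def parse_points(points):
    """Turn 'x1,y1 x2,y2 ...' into a list of (x, y) integer tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def bounding_box(polygon):
    """Axis-aligned bounding box (x_min, y_min, x_max, y_max) of a polygon."""
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def regions_in_reading_order(xml_text):
    """Return region ids sorted by their RegionRefIndexed index (single group only)."""
    root = ET.fromstring(xml_text)
    refs = root.findall(".//pc:ReadingOrder/pc:OrderedGroup/pc:RegionRefIndexed", NS)
    return [ref.get("regionRef") for ref in sorted(refs, key=lambda r: int(r.get("index")))]


if __name__ == "__main__":
    print(regions_in_reading_order(SAMPLE))
    print(bounding_box(parse_points("120,250 2380,250 2380,600 120,600")))
```

Nested or unordered groups, which the schema also allows, are deliberately not handled in this sketch.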

A canonical PageXML snippet:

<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata>
    <Creator>LAREX v1.2</Creator>
    <Created>2024-06-18T09:33:12</Created>
  </Metadata>
  <Page imageFilename="folio_012.png" imageWidth="2500" imageHeight="3400">
    <TextRegion id="r1" type="paragraph">
      <Coords points="120,250 2380,250 2380,600 120,600"/>
    </TextRegion>
    <ImageRegion id="r4" type="image">
      <Coords points="400,700 2100,700 2100,2200 400,2200"/>
    </ImageRegion>
  </Page>
</PcGts>

(Reul et al., 2017)
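As a rough illustration of how this hierarchy is traversed, the following Python sketch reads the snippet above with the standard library and summarizes its metadata, page attributes, and regions. The function names are hypothetical. Deriving the namespace from the root tag is a design choice made here on the assumption that files may use different PAGE schema versions.

```python
import xml.etree.ElementTree as ET

SNIPPET = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata>
    <Creator>LAREX v1.2</Creator>
    <Created>2024-06-18T09:33:12</Created>
  </Metadata>
  <Page imageFilename="folio_012.png" imageWidth="2500" imageHeight="3400">
    <TextRegion id="r1" type="paragraph">
      <Coords points="120,250 2380,250 2380,600 120,600"/>
    </TextRegion>
    <ImageRegion id="r4" type="image">
      <Coords points="400,700 2100,700 2100,2200 400,2200"/>
    </ImageRegion>
  </Page>
</PcGts>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def summarize_page(xml_text):
    """Collect creator, image attributes, and top-level regions of one PageXML file."""
    root = ET.fromstring(xml_text)
    # Read the namespace URI from the root tag "{uri}PcGts" instead of hard-coding it.
    ns = {"pc": root.tag[1:root.tag.index("}")]}
    page = root.find("pc:Page", ns)
    regions = [
        {"kind": local_name(child.tag), "id": child.get("id"), "type": child.get("type")}
        for child in page
        if local_name(child.tag).endswith("Region")
    ]
    return {
        "creator": root.findtext("pc:Metadata/pc:Creator", namespaces=ns),
        "image": page.get("imageFilename"),
        "size": (int(page.get("imageWidth")), int(page.get("imageHeight"))),
        "regions": regions,
    }


if __name__ == "__main__":
    summary = summarize_page(SNIPPET)
    print(summary["creator"], summary["image"], summary["size"])
    for region in summary["regions"]:
        print(region)
```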

2. Region Extraction and Mapping to PageXML

In layout analysis tools such as LAREX, the mapping from connected components to the PageXML data model is realized via rule-based segmentation and morphological processing. The pipeline typically operates as follows:

- The page image is downscaled to a standard working resolution (e.g., height 1600 px).
- Two passes of morphological region-growing over binary connected components are used to extract polygonal candidate regions, each with an associated area and rectangular bounding box.
- Each candidate region is classified using user-configurable rules:
  - minArea (minimum region area)
  - Allowed page sub-rectangles (e.g., {left:0, right:1, top:0, bottom:0.25} for the top strip)
  - maxOccurrence (maximum number of regions of a type per page)
  - Positional priorities in ambiguity resolution
- Candidate regions are rescaled back to original coordinates using:

$s = \frac{H_{\mathrm{orig}}}{H_{\mathrm{work}}}, \qquad x_i^{\mathrm{orig}} = s x_i^{\mathrm{work}},\qquad y_i^{\mathrm{orig}} = s y_i^{\mathrm{work}}$

where $H_{\mathrm{orig}}$ and $H_{\mathrm{work}}$ are the heights of the original and working images, and $(x_i^{\mathrm{work}}, y_i^{\mathrm{work}})$ is the $i$-th polygon vertex at working resolution.

- Regions are instantiated as <TextRegion> or <ImageRegion> elements within PageXML, populated with scaled polygonal outlines and semantic type fields according to the rule result.

This logic underpins the semi-automatic production of richly annotated PageXML outputs, as illustrated in before/after comparisons in the LAREX demonstration (Reul et al., 2017).
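The rescaling step can be sketched in a few lines of Python. This is a minimal illustration of the scale factor s = H_orig / H_work, not LAREX's actual code; the function names are hypothetical, and rounding to integer pixels is an assumption made here so the result fits the points attribute.

```python
def rescale_polygon(points_work, h_orig, h_work):
    """Map polygon vertices from working resolution back to original pixels."""
    s = h_orig / h_work  # single scale factor, assuming the aspect ratio is preserved
    # Rounding to whole pixels is an assumption of this sketch.
    return [(round(s * x), round(s * y)) for x, y in points_work]


def to_points_attr(points):
    """Serialize vertices into the 'x1,y1 x2,y2 ...' form used by <Coords points="...">."""
    return " ".join(f"{x},{y}" for x, y in points)


if __name__ == "__main__":
    # A 3400 px high original processed at 1600 px gives s = 2.125.
    work_polygon = [(80, 160), (800, 160), (800, 1600), (80, 1600)]
    print(to_points_attr(rescale_polygon(work_polygon, h_orig=3400, h_work=1600)))
```

With these inputs the vertex (80, 160) maps to (170, 340), since 80 × 2.125 = 170 and 160 × 2.125 = 340.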

3. Configuration and Customization Paradigms

PageXML production is influenced by user-editable configuration profiles. In the case of LAREX, all parameters are maintained in a JSON-like structure comprising:

- regionTypes[]: List of semantic types with per-type constraints for area, allowed subrectangles, maximum occurrences, and tiebreaking priorities.
- dilation: Kernel parameters for morphological operations.
- imageDetection: Minimum area and kernel definitions for image regions.
- global ROI: An optional region-of-interest crop applied to every page.

Sample configuration fragment:

{
  "regionTypes": [
    {
      "typeName": "page_number",
      "minArea": 500,
      "maxOccurrence": 1,
      "positionRects": [{ "left":0, "right":1, "top":0, "bottom":0.25 }],
      "priority": 10
    },
    {
      "typeName": "paragraph",
      "minArea": 2000,
      "maxOccurrence": -1,
      "positionRects": [{ "left":0, "right":1, "top":0, "bottom":1 }],
      "priority": 1
    }
  ]
}

Types and region constraints can be extended arbitrarily to represent more complex zone taxonomies such as "signature_mark" or "heading," by updating the configuration (Reul et al., 2017).
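To make the role of these fields concrete, here is a hedged Python sketch of a greedy rule matcher over such a configuration. It is not LAREX's algorithm. It assumes that a larger priority value wins, that maxOccurrence = -1 means unlimited, and that a candidate "lies in" a positionRect when its bounding-box center does. The candidate dictionaries and function names are hypothetical.

```python
import json

CONFIG = json.loads("""
{
  "regionTypes": [
    {"typeName": "page_number", "minArea": 500, "maxOccurrence": 1,
     "positionRects": [{"left": 0, "right": 1, "top": 0, "bottom": 0.25}], "priority": 10},
    {"typeName": "paragraph", "minArea": 2000, "maxOccurrence": -1,
     "positionRects": [{"left": 0, "right": 1, "top": 0, "bottom": 1}], "priority": 1}
  ]
}
""")


def center_in_rect(cx, cy, rect):
    """True if the normalized center (cx, cy) lies inside a relative page rectangle."""
    return rect["left"] <= cx <= rect["right"] and rect["top"] <= cy <= rect["bottom"]


def classify(candidates, config, page_w, page_h):
    """Assign each candidate the first matching type, trying higher priorities first."""
    rules = sorted(config["regionTypes"], key=lambda r: r["priority"], reverse=True)
    counts = {}
    labels = []
    for cand in candidates:
        x0, y0, x1, y1 = cand["bbox"]
        cx = (x0 + x1) / 2 / page_w  # normalized center, assumption of this sketch
        cy = (y0 + y1) / 2 / page_h
        label = None
        for rule in rules:
            name, limit = rule["typeName"], rule["maxOccurrence"]
            if cand["area"] < rule["minArea"]:
                continue
            if not any(center_in_rect(cx, cy, r) for r in rule["positionRects"]):
                continue
            if limit != -1 and counts.get(name, 0) >= limit:
                continue
            label = name
            counts[name] = counts.get(name, 0) + 1
            break
        labels.append(label)  # None means no rule accepted the candidate
    return labels


if __name__ == "__main__":
    candidates = [
        {"bbox": (1200, 100, 1300, 150), "area": 3000},
        {"bbox": (120, 250, 2380, 600), "area": 791000},
        {"bbox": (10, 3000, 20, 3010), "area": 100},
    ]
    print(classify(candidates, CONFIG, page_w=2500, page_h=3400))
```

Because the matcher is greedy, the result depends on candidate order: the first candidate in the top strip claims the single page_number slot. A real tool would likely resolve such conflicts more carefully.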

4. Integration and Interoperability Within OCR Pipelines

The principal advantage of exporting schema-conformant PageXML lies in its interoperability. OCR engines such as OCRopus, Kraken, Calamari, and Tesseract (LSTM and legacy engines), together with platforms and frameworks such as Transkribus, eScriptorium, and OCR-D, can consume or produce PageXML's <TextRegion>, <TextLine>, and related definitions, in some cases through wrapper or conversion tooling rather than natively.

Interoperability mechanisms:

- OCR engines process only those regions annotated in PageXML, enabling selective OCR, region-specific models, and hierarchical reading order constraints.
- Aggregation frameworks (e.g., the OCR-D workflow engine) use PageXML to coordinate multi-tool pipelines for line segmentation, region annotation, and post-correction.
- Downstream semantic or language modeling steps reference the type attribute, permitting type-specific handling (for example, treating headings or page numbers differently from running text).

A typical pipeline consumes .xml (PageXML) and .png (image) pairs, ensuring that segmentation, per-region attribute semantics, and layout relationships are preserved in further processing steps (Reul et al., 2017).
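A minimal Python sketch of how such pairs might be collected is shown below. It assumes that each PageXML file sits next to its image and that the imageFilename attribute holds a path relative to the XML file; `find_pairs` is a hypothetical helper, not part of any OCR tool.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def find_pairs(directory):
    """Return (xml_path, image_path) tuples for PageXML files whose image exists."""
    pairs = []
    for xml_path in sorted(Path(directory).glob("*.xml")):
        root = ET.parse(xml_path).getroot()
        # Skip XML files that are not PageXML (the root must be a namespaced PcGts).
        if not root.tag.endswith("}PcGts"):
            continue
        ns = {"pc": root.tag[1:root.tag.index("}")]}
        page = root.find("pc:Page", ns)
        name = page.get("imageFilename") if page is not None else None
        if not name:
            continue
        # Assumption: imageFilename is relative to the directory of the XML file.
        image_path = xml_path.parent / name
        if image_path.is_file():
            pairs.append((xml_path, image_path))
    return pairs


if __name__ == "__main__":
    for xml_path, image_path in find_pairs("."):
        print(xml_path.name, "->", image_path.name)
```

Resolving the image through imageFilename rather than by matching file stems keeps the pairing tied to what the annotation itself declares.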

5. Empirical Workflow: Before/After Demonstration

The effect of running a PageXML-producing tool on a page can be summarized by comparing the file before and after processing:

| State | Description | Key Elements |
| --- | --- | --- |
| Before | Blank PageXML with only metadata and `<Page>` | No regions annotated |
| After | Populated PageXML post-tool (e.g., LAREX) run | Multiple `<TextRegion>`, `<ImageRegion>` with `type` and `<Coords>` |

The transformation demonstrates region creation, precise polygon addition, semantic type assignment, and full compatibility with subsequent pipeline steps. This structure enables both human and automated verification, correction, and specialized post-processing (Reul et al., 2017).
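The before/after transition can be mimicked with a short Python sketch that inserts regions into a blank page using the standard library. The region values mirror the snippet in Section 1. `add_region` and `populate` are hypothetical helpers and do not reproduce how LAREX writes its output.

```python
import xml.etree.ElementTree as ET

PAGE_NS = "http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15"
# Serialize with a default namespace instead of generated "ns0:" prefixes.
ET.register_namespace("", PAGE_NS)

BLANK = f"""<PcGts xmlns="{PAGE_NS}">
  <Metadata>
    <Creator>LAREX v1.2</Creator>
    <Created>2024-06-18T09:33:12</Created>
  </Metadata>
  <Page imageFilename="folio_012.png" imageWidth="2500" imageHeight="3400"/>
</PcGts>"""


def add_region(page, tag, region_id, points, region_type=None):
    """Append one region element with a <Coords> child to the <Page> element."""
    region = ET.SubElement(page, f"{{{PAGE_NS}}}{tag}", {"id": region_id})
    if region_type is not None:
        region.set("type", region_type)
    coords = ET.SubElement(region, f"{{{PAGE_NS}}}Coords")
    coords.set("points", " ".join(f"{x},{y}" for x, y in points))
    return region


def populate(blank_xml, region_specs):
    """Return the 'after' document: the blank page plus the given regions."""
    root = ET.fromstring(blank_xml)
    page = root.find(f"{{{PAGE_NS}}}Page")
    for spec in region_specs:
        add_region(page, **spec)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    specs = [
        {"tag": "TextRegion", "region_id": "r1", "region_type": "paragraph",
         "points": [(120, 250), (2380, 250), (2380, 600), (120, 600)]},
        {"tag": "ImageRegion", "region_id": "r4", "region_type": "image",
         "points": [(400, 700), (2100, 700), (2100, 2200), (400, 2200)]},
    ]
    print(populate(BLANK, specs))
```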

6. Context, Limitations, and Adoption

PageXML underpins historical print digitization at scale, supporting robust semantic zoning and detailed layout modeling. A notable limitation is that not all tools emit the entire range of possible region types, and post-processing steps may be required to enrich or correct initial segmentation. However, the schema's extensible design and widespread adoption in OCR-D, Transkribus, and similar environments suggest a sustainable role for PageXML as the exchange layer for document image understanding.

A plausible implication is that PageXML’s schema and ecosystem serve as a general blueprint for structured page-level annotation, informing future advances in document layout understanding and complex workflow integration (Reul et al., 2017).

References

Reul et al. (2017). LAREX - A semi-automatic open-source Tool for Layout Analysis and Region Extraction on Early Printed Books.
