SCORE Architecture: Robust Evaluation
- SCORE Architecture is a multi-dimensional, interpretation-agnostic framework designed to robustly evaluate generative document parsing by integrating content fidelity, table, and hierarchy assessments.
- It employs an adjusted edit distance and token diagnostics that tolerate reordering and representation ambiguities, thereby ensuring semantic consistency.
- Through parallelized modules and standardized input normalization, SCORE achieves resilient benchmarking and interpretable diagnostics for diverse, multimodal document outputs.
The SCORE architecture, introduced as Structural and COntent Robust Evaluation, is a multi-dimensional, interpretation-agnostic framework for the evaluation of generative document parsing systems. It is designed to address the semantic and representational diversity inherent to multi-modal and generative outputs, where deterministic, layout-based metrics often mischaracterize meaningful variance as error. SCORE systematically integrates adjusted edit distance, fine-grained token diagnostics, table evaluation with spatial and semantic flexibility, and hierarchy-aware structure checks, providing robust and interpretable diagnostics for both content fidelity and functional document structure (Li et al., 16 Sep 2025).
1. System Architecture and Workflow
SCORE operates as a modular evaluation pipeline with the following canonical stages:
- Input normalization: Accepts arbitrary parser outputs (HTML, JSON lists, Markdown, plain text, coordinate tables, etc.) and normalizes them into a format-agnostic structure composed of canonical element types: paragraphs, figures, tables, list items, and headings. For tables, a tuple-based representation (row, col, content) abstracts away encoding-specific details.
- Module execution sequence:
  - Content Fidelity Module (including adjusted edit distance and diagnostics)
  - Table Evaluation Module (detection F₁, cell alignment, hierarchical TEDS)
  - Hierarchy Consistency Module (label mapping, completion, F₁)
  - Aggregation and structured reporting (JSON or CSV output per page)
- Parallelization: Modules consume the shared normalized element list and may execute in parallel. Caching is used to reduce redundant normalization overhead.
- End-to-end output: Per-page metrics and aggregated corpus-level statistics (mean, standard deviation, and rate of ambiguous/alternative interpretations) are collated into a final report.
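The normalization stage can be illustrated with a minimal sketch. The `Element` record, the `table_from_rows` and `table_from_markdown` helpers, and the input shapes are hypothetical names chosen for illustration, not SCORE's actual schema. The sketch only shows how two different encodings of the same table collapse to identical (row, col, content) tuples.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Canonical element types named in the text.
ELEMENT_TYPES = {"paragraph", "figure", "table", "list_item", "heading"}

Cell = Tuple[int, int, str]  # (row, col, content)


@dataclass
class Element:
    """Hypothetical format-agnostic element; not the paper's actual schema."""
    kind: str
    text: str = ""
    cells: List[Cell] = field(default_factory=list)

    def __post_init__(self):
        if self.kind not in ELEMENT_TYPES:
            raise ValueError(f"unknown element type: {self.kind}")


def table_from_rows(rows: List[List[str]]) -> Element:
    """Flatten a row-major grid into (row, col, content) tuples."""
    cells = [
        (r, c, content.strip())
        for r, row in enumerate(rows)
        for c, content in enumerate(row)
    ]
    return Element(kind="table", cells=cells)


def table_from_markdown(md: str) -> Element:
    """Parse a simple pipe-delimited Markdown table (no escaped pipes)."""
    rows = []
    for line in md.strip().splitlines():
        parts = [p.strip() for p in line.strip().strip("|").split("|")]
        # Skip the header separator row such as |---|:--:|
        if all(p and set(p) <= set("-:") for p in parts):
            continue
        rows.append(parts)
    return table_from_rows(rows)
```

Once both inputs are reduced to the same tuple list, downstream modules no longer need to know which encoding the parser emitted.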
2. Adjusted Edit Distance and Token Diagnostics
Traditional normalized edit distance (NED) is insufficient for generative settings because it is sensitive to structural permutations: a correct paragraph emitted in a different position is counted as an error. SCORE therefore introduces an adjusted NED built from per-class similarity terms, where the term $S_c$ for element class $c$ is the sum over its elements of weight times similarity:
$$S_c = \sum_{e \in c} w_e \, s_e$$
with $s_e$ being NED-based for paragraphs and captions and a fuzzy cell-alignment score for tables. This design tolerates reordering and alternate yet semantically valid element arrangements.
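The effect of order tolerance can be seen in a small sketch. This is not the paper's exact adjusted-NED formula: the greedy best-match pairing, the token-count weights, and the function names (`ned_similarity`, `order_tolerant_similarity`) are assumptions made for illustration.

```python
from typing import List, Sequence


def levenshtein(a: Sequence, b: Sequence) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def ned_similarity(a: str, b: str) -> float:
    """1 - normalized edit distance, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def order_tolerant_similarity(pred: List[str], gold: List[str]) -> float:
    """Weighted mean of best-match similarities (assumed aggregation).

    Each gold element is greedily paired with its most similar unused
    predicted element, so reordering alone costs nothing. Weights are
    gold token counts, one of the weighting options the text mentions.
    """
    unused = list(pred)
    total_weight = 0.0
    weighted_sum = 0.0
    for g in gold:
        weight = max(len(g.split()), 1)
        total_weight += weight
        if unused:
            best = max(unused, key=lambda p: ned_similarity(p, g))
            weighted_sum += weight * ned_similarity(best, g)
            unused.remove(best)
    return weighted_sum / total_weight if total_weight else 1.0
```

With `gold = ["alpha beta", "gamma"]` and the same two elements predicted in reverse order, the order-tolerant score stays at 1.0, whereas plain NED over the concatenated strings drops below 1. Spurious predicted elements are not penalized in this sketch; that signal is left to the token diagnostics.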
Token-level diagnostics separate content recall (TokensFound) from hallucination (TokensAdded), using multiset token frequencies so that repeated tokens are counted as many times as they occur.
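A minimal sketch of multiset-based token diagnostics follows. The whitespace tokenizer and the choice of denominators (gold token count for TokensFound, predicted token count for TokensAdded) are assumptions; the source names the two quantities but their exact normalization is not reproduced here.

```python
from collections import Counter
from typing import Dict


def token_diagnostics(pred_text: str, gold_text: str) -> Dict[str, float]:
    """Compare token multisets of a prediction and its reference."""
    pred = Counter(pred_text.split())
    gold = Counter(gold_text.split())
    overlap = sum((pred & gold).values())  # multiset intersection: min counts
    extra = sum((pred - gold).values())    # predicted tokens beyond gold counts
    gold_total = sum(gold.values())
    pred_total = sum(pred.values())
    return {
        # Share of reference tokens recovered by the prediction.
        "TokensFound": overlap / gold_total if gold_total else 1.0,
        # Share of predicted tokens with no counterpart in the reference.
        "TokensAdded": extra / pred_total if pred_total else 0.0,
    }
```

For example, a prediction `"a a b c"` against a reference `"a b b"` recovers two of three reference tokens and adds two of its four tokens, so the two diagnostics move independently of element order.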
3. Table Structure Evaluation and Alignment
SCORE's design for table evaluation is robust to representational ambiguity and structural divergence, eliminating the limitations of bounding-box or detector-driven metrics.
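- Semantic-first detection is operationalized as a set-level matching of predicted and ground-truth tables, with the detection score defined as $F_\beta = (1+\beta^2)\,\frac{P \cdot R}{\beta^2 P + R}$, with $P$ and $R$ as table precision and recall and default $\beta = 1$ (which yields F₁).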
- Granular cell-level analysis incorporates spatial tolerance when comparing the predicted and gold cell sets:
  - Alignment uses bipartite matching under a bounded row/column index shift and a fuzzy content-similarity threshold (typically 0.8).
  - Alignment metrics are reported as ContentAcc and IndexAcc.
- Hierarchical TEDS quantifies deep structural (tree-edit) differences between the predicted and gold table trees.
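The shift-tolerant cell alignment can be sketched with the standard library alone. Several things here are assumptions: a greedy one-to-one matching stands in for the bipartite (Hungarian) matching described above, `difflib.SequenceMatcher` stands in for the fuzzy content similarity, and the parameter names `max_shift` and `threshold` as well as the definitions of the two returned accuracies are illustrative rather than SCORE's own.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Tuple

Cell = Tuple[int, int, str]  # (row, col, content)


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Standard F-measure; beta = 1 gives F1."""
    denom = beta ** 2 * precision + recall
    return (1 + beta ** 2) * precision * recall / denom if denom else 0.0


def align_cells(pred: List[Cell], gold: List[Cell],
                max_shift: int = 1, threshold: float = 0.8) -> Dict[str, float]:
    """Greedy one-to-one cell alignment (a stand-in for bipartite matching)."""
    candidates = []
    for gi, (gr, gc, gtext) in enumerate(gold):
        for pi, (pr, pc, ptext) in enumerate(pred):
            # Only consider cells within the allowed row/column shift.
            if abs(gr - pr) > max_shift or abs(gc - pc) > max_shift:
                continue
            sim = SequenceMatcher(None, gtext, ptext).ratio()
            if sim >= threshold:
                shift = abs(gr - pr) + abs(gc - pc)
                candidates.append((sim, -shift, gi, pi))
    # Highest similarity first; a smaller shift breaks ties.
    candidates.sort(reverse=True)
    used_gold, used_pred, exact = set(), set(), 0
    for sim, neg_shift, gi, pi in candidates:
        if gi in used_gold or pi in used_pred:
            continue
        used_gold.add(gi)
        used_pred.add(pi)
        exact += neg_shift == 0  # matched at the identical index
    matched = len(used_gold)
    return {
        # Assumed definition: gold cells matched by content within the window.
        "content_acc": matched / len(gold) if gold else 1.0,
        # Assumed definition: gold cells matched at exactly the same index.
        "index_acc": exact / len(gold) if gold else 1.0,
    }
```

If a parser emits an extra title row so that every cell is shifted down by one, `max_shift=1` still matches every cell by content while the index accuracy falls to zero. That split is the kind of distinction the two alignment metrics are meant to expose.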
4. Hierarchy-Aware Consistency and Label Mapping
To address representational ambiguity in element semantics, SCORE maps all raw labels to a fixed set of semantic categories. Evaluation proceeds via:
- Partial and completed matching: Matched pairs are extended to include unmatched elements, each paired with a NOMATCH token.
- Confusion matrix and micro-averaged F₁ over the label space, extended with NOMATCH.
This formulation penalizes missing and spurious elements in a structurally aware manner.
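The completion step and the micro-averaged score can be sketched as follows. The `LABEL_MAP` contents are hypothetical, and the counting convention is an assumption: a pair counts as a true positive only when both labels agree and neither is NOMATCH.

```python
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

NOMATCH = "NOMATCH"

# Hypothetical raw-label mapping; the actual category set is defined by SCORE.
LABEL_MAP = {"h1": "heading", "title": "heading", "p": "paragraph",
             "text": "paragraph", "li": "list_item", "table": "table"}


def canonical(label: Optional[str]) -> str:
    """Map a raw parser label to a canonical category."""
    if label is None:
        return NOMATCH
    return LABEL_MAP.get(label.lower(), label.lower())


def complete_pairs(matched: Iterable[Tuple[str, str]],
                   unmatched_pred: Iterable[str],
                   unmatched_gold: Iterable[str]) -> List[Tuple[str, str]]:
    """Extend matched (pred, gold) pairs with NOMATCH for leftovers."""
    pairs = [(canonical(p), canonical(g)) for p, g in matched]
    pairs += [(canonical(p), NOMATCH) for p in unmatched_pred]  # spurious
    pairs += [(NOMATCH, canonical(g)) for g in unmatched_gold]  # missing
    return pairs


def micro_f1(pairs: List[Tuple[str, str]]) -> Dict[str, float]:
    """Micro-averaged precision, recall, and F1 from the confusion counts."""
    confusion = Counter(pairs)
    tp = sum(n for (p, g), n in confusion.items() if p == g and p != NOMATCH)
    pred_total = sum(n for (p, _), n in confusion.items() if p != NOMATCH)
    gold_total = sum(n for (_, g), n in confusion.items() if g != NOMATCH)
    precision = tp / pred_total if pred_total else 0.0
    recall = tp / gold_total if gold_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```

Under this convention, a spurious element lowers precision and a missing element lowers recall, while two parsers that use different raw labels for the same canonical category score identically.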
5. Aggregation, Reporting, and Interpretation
SCORE produces both per-page and corpus-level metrics, including all scalar outputs from fidelity, table, and hierarchy modules. Reports are structurally organized as JSON objects with entries such as NED, AdjNED, TokensFound, TokensAdded, TableDetF1, ContentAcc, IndexAcc, TEDS, and HierarchyF1.
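Corpus-level aggregation of per-page reports can be sketched as below. The metric names come from the list above; the report layout, the `aggregate` function, and the use of population standard deviation are assumptions, and any input values used with it are placeholders rather than published results.

```python
import json
from statistics import mean, pstdev
from typing import Dict, List

# Metric names as listed in the text.
METRICS = ["NED", "AdjNED", "TokensFound", "TokensAdded", "TableDetF1",
           "ContentAcc", "IndexAcc", "TEDS", "HierarchyF1"]


def aggregate(pages: List[Dict[str, float]]) -> Dict[str, Dict[str, float]]:
    """Mean and population standard deviation per metric across pages."""
    summary: Dict[str, Dict[str, float]] = {}
    for name in METRICS:
        # Pages without tables, for example, may omit table metrics.
        values = [page[name] for page in pages if name in page]
        if values:
            summary[name] = {"mean": mean(values), "std": pstdev(values)}
    return summary


def to_json(pages: List[Dict[str, float]]) -> str:
    """Serialize per-page and corpus-level results into one JSON report."""
    return json.dumps({"pages": pages, "corpus": aggregate(pages)}, indent=2)
```

Skipping absent metrics rather than treating them as zero keeps pages without tables from dragging down corpus-level table scores.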
Historical benchmarks show that, in 2–5% of pages with ambiguous tables, traditional metrics penalized valid outputs by 12–25%, distorting downstream rankings. SCORE corrected these discrepancies, supporting evaluation equivalence for alternative, semantically justified interpretations (Li et al., 16 Sep 2025).
By normalizing all outputs into a single canonical representation, SCORE can replicate object-detection table F₁ (up to 0.93) using only generative outputs, without specialized detector pipelines.
6. Hyperparameters, Parallelization, and Implementation Notes
The architecture exposes the following main hyperparameters:
- $\beta$ in the F-measure calculations (default 1)
- Maximum table-cell index shift (typically 1–2)
- Cell-content match threshold (typically 0.8)
- Element weighting in NED (commonly uniform or by token count)
Normalization is performed once, with element lists cached and supplied to all modules. Levenshtein libraries (for sequence alignment), Hungarian algorithms (for cell bipartite matching), and tree-edit libraries (for TEDS) are directly applicable.
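Data flow can be concisely represented as: parser output → input normalization → {content fidelity, table evaluation, hierarchy consistency} → aggregation → report. All modules may run in parallel, and both raw and adjusted metrics are exposed to highlight the effect of interpretation-tolerant evaluation.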
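The normalize-once, fan-out pattern can be sketched with the standard library. The `normalize` placeholder, the `evaluate_page` function, and the module signature are hypothetical. Real modules would be CPU-bound, so a process pool may suit them better than the thread pool used here for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import lru_cache
from typing import Callable, Dict, Tuple

Elements = Tuple[Tuple[str, str], ...]  # (element type, text) pairs
Module = Callable[[Elements, Elements], Dict[str, float]]


@lru_cache(maxsize=None)
def normalize(raw: str) -> Elements:
    """Placeholder normalizer: each non-empty line becomes a paragraph."""
    return tuple(
        ("paragraph", line.strip()) for line in raw.splitlines() if line.strip()
    )


def evaluate_page(pred_raw: str, gold_raw: str,
                  modules: Dict[str, Module]) -> Dict[str, float]:
    """Normalize once, fan out to all modules, merge their metric dicts."""
    pred, gold = normalize(pred_raw), normalize(gold_raw)
    report: Dict[str, float] = {}
    with ThreadPoolExecutor(max_workers=max(len(modules), 1)) as pool:
        futures = [pool.submit(fn, pred, gold) for fn in modules.values()]
        for future in futures:
            report.update(future.result())
    return report
```

Because every module receives the same immutable normalized tuples, the modules have no shared mutable state and can be scheduled in any order.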
7. Significance and Impact
SCORE formalizes foundational principles for document parsing evaluation, explicitly quantifying and tolerating interpretation diversity while enforcing semantic rigor. It surfaces performance patterns that classical metrics miss and supports fair benchmarking across systems with divergent but plausible generative outputs. By decoupling evaluation from rigid structure and using multi-dimensional, interpretable metrics, SCORE enables more meaningful comparative assessment and is extensible to future modalities and structural targets (Li et al., 16 Sep 2025).