
SCORE Architecture: Robust Evaluation

Updated 31 May 2026
  • SCORE Architecture is a multi-dimensional, interpretation-agnostic framework designed to robustly evaluate generative document parsing by integrating content fidelity, table, and hierarchy assessments.
  • It employs an adjusted edit distance and token diagnostics that tolerate reordering and representation ambiguities, thereby ensuring semantic consistency.
  • Through parallelized modules and standardized input normalization, SCORE achieves resilient benchmarking and interpretable diagnostics for diverse, multimodal document outputs.

The SCORE architecture, introduced as Structural and COntent Robust Evaluation, is a multi-dimensional, interpretation-agnostic framework for the evaluation of generative document parsing systems. It is designed to address the semantic and representational diversity inherent to multi-modal and generative outputs, where deterministic, layout-based metrics often mischaracterize meaningful variance as error. SCORE systematically integrates adjusted edit distance, fine-grained token diagnostics, table evaluation with spatial and semantic flexibility, and hierarchy-aware structure checks, providing robust and interpretable diagnostics for both content fidelity and functional document structure (Li et al., 16 Sep 2025).

1. System Architecture and Workflow

SCORE operates as a modular evaluation pipeline with the following canonical stages:

  • Input normalization: Accepts arbitrary parser outputs (HTML, JSON lists, Markdown, plain text, coordinate tables, etc.) and normalizes them into a format-agnostic structure composed of canonical element types: paragraphs, figures, tables, list items, and headings. For tables, a tuple-based representation (row, col, content) abstracts away encoding-specific details.
  • Module execution sequence:
  1. Content Fidelity Module (including adjusted edit distance and diagnostics)
  2. Table Evaluation Module (detection F₁, cell alignment, hierarchical TEDS)
  3. Hierarchy Consistency Module (label mapping, completion, F₁)
  4. Aggregation and structured reporting (JSON or CSV output per page)
  • Parallelization: Modules consume the shared normalized element list and may execute in parallel. Caching is used to reduce redundant normalization overhead.
  • End-to-end output: Per-page metrics and aggregated corpus-level statistics (mean, standard deviation, and rate of ambiguous/alternative interpretations) are collated into a final report.
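The staged workflow above can be sketched as a shared, immutable element list that independent modules consume concurrently. This is a minimal illustration only: the `Element` dataclass, the three stub module functions, and `run_modules` are hypothetical names, and the stubs return simple counts rather than the real metrics.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass(frozen=True)
class Element:
    """Format-agnostic element (hypothetical schema, not the authors' own)."""
    kind: str            # "paragraph", "figure", "table", "list_item", "heading"
    text: str = ""
    cells: tuple = ()    # tables only: ((row, col, content), ...)


# Stub modules: each reads the shared element list and returns a metrics dict.
def content_fidelity(elements):
    return {"n_text_elements": sum(e.kind != "table" for e in elements)}


def table_evaluation(elements):
    return {"n_tables": sum(e.kind == "table" for e in elements)}


def hierarchy_consistency(elements):
    return {"n_headings": sum(e.kind == "heading" for e in elements)}


def run_modules(elements):
    """Run all modules in parallel over one normalized page."""
    modules = {
        "fidelity": content_fidelity,
        "table": table_evaluation,
        "hierarchy": hierarchy_consistency,
    }
    with ThreadPoolExecutor() as pool:
        futures = {name: pool.submit(fn, elements) for name, fn in modules.items()}
        return {name: fut.result() for name, fut in futures.items()}


page = (
    Element("heading", "Results"),
    Element("paragraph", "Scores are listed below."),
    Element("table", cells=((0, 0, "Name"), (0, 1, "Score"))),
)
report = run_modules(page)
```

Because the elements are frozen and the modules only read them, normalization can be done once and the result reused without copying.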

2. Adjusted Edit Distance and Token Diagnostics

Traditional normalized edit distance (NED) is insufficient for generative settings due to its sensitivity to structural permutations. Thus, SCORE introduces an adjusted NED:

$$\mathrm{NED}_{\rm adj}(s,g) = \max\left( \mathrm{NED}(s,g),\ \frac{\sum_{k\in\mathcal K}W_k}{W_{\rm total}} \right)$$

where $W_k$ is the sum over elements of class $k$ of weights times similarity:

$$W_k = \sum_{e_i\,\in\,k} w_i \,\mathrm{Sim}_k(e_i,g), \quad W_{\rm total} = \sum_{k} W_k$$

with $\mathrm{Sim}_k$ being NED for paragraphs/captions and a fuzzy cell-alignment score for tables. This design tolerates reordering and alternate yet semantically valid element arrangements.
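For reference, the baseline NED that the adjusted variant starts from can be sketched as a character-level Levenshtein distance divided by the longer string's length. The normalization by `max(len(s), len(g))` is a common convention assumed here, not confirmed by the source, and `levenshtein` and `ned` are illustrative names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (ca != cb) # substitution (free if characters match)
            ))
        prev = curr
    return prev[-1]


def ned(s: str, g: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means identical strings."""
    if not s and not g:
        return 0.0
    return levenshtein(s, g) / max(len(s), len(g))


# Same two elements, different order: plain NED reports a nonzero distance
# even though no content was lost.
gold = "Intro text. Table caption."
pred = "Table caption. Intro text."
reorder_penalty = ned(pred, gold)
```

The reordered pair illustrates the sensitivity to structural permutations that motivates the element-wise adjustment.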

Token-level diagnostics disambiguate content recall (TokensFound) and hallucination (TokensAdded), using multiset token frequencies:

$$\mathrm{TokensFound}(s,g) = \frac{\sum_t \min(\mathrm{freq}_s(t),\,\mathrm{freq}_g(t))}{\sum_t \mathrm{freq}_g(t)}$$

$$\mathrm{TokensAdded}(s,g) = \frac{\sum_t \max(0,\,\mathrm{freq}_s(t)-\mathrm{freq}_g(t))}{\sum_t \mathrm{freq}_s(t)}$$
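Both diagnostics follow directly from multiset token counts. The sketch below assumes lowercased whitespace tokenization and returns 0.0 when a denominator is empty; neither choice is specified in the source.

```python
from collections import Counter


def tokenize(text: str) -> Counter:
    """Assumed tokenizer: lowercase, split on whitespace."""
    return Counter(text.lower().split())


def tokens_found(s: str, g: str) -> float:
    """Share of gold tokens (with multiplicity) recovered in the prediction."""
    fs, fg = tokenize(s), tokenize(g)
    total = sum(fg.values())
    if total == 0:
        return 0.0                       # convention for an empty reference
    return sum(min(fs[t], fg[t]) for t in fg) / total


def tokens_added(s: str, g: str) -> float:
    """Share of predicted tokens (with multiplicity) absent from the gold text."""
    fs, fg = tokenize(s), tokenize(g)
    total = sum(fs.values())
    if total == 0:
        return 0.0                       # convention for an empty prediction
    return sum(max(0, fs[t] - fg[t]) for t in fs) / total


gold = "the cat sat on the mat"
pred = "the cat sat on the mat mat"     # one duplicated token
found = tokens_found(pred, gold)        # 6 of 6 gold tokens recovered
added = tokens_added(pred, gold)        # 1 of 7 predicted tokens is extra
```

Reading the two values together separates recall problems (low TokensFound) from hallucination (high TokensAdded), which a single edit distance cannot do.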

3. Table Structure Evaluation and Alignment

SCORE's design for table evaluation is robust to representational ambiguity and structural divergence, eliminating the limitations of bounding-box or detector-driven metrics.

  • Semantic-first detection is operationalized as a set-level matching of predicted and ground-truth tables, scored with $F_\beta$ (which reduces to F₁ at the default $\beta=1$):

$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2\,P + R}$$

with $P$ and $R$ denoting table precision and recall.

  • Granular cell-level analysis incorporates spatial tolerance. For predicted and gold cell sets:

    • Alignment uses bipartite matching under a bounded index shift and a fuzzy content-similarity threshold (typically 0.8).
    • Alignment metrics are reported as ContentAcc and IndexAcc.

    [Equations not recoverable from the source: definitions of the alignment metrics.]

  • Hierarchical TEDS quantifies deep structural (tree-edit) differences:

[Equation not recoverable from the source: hierarchical TEDS.]
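The detection score and the shift-tolerant cell alignment described above can be sketched with the standard library alone. Several things here are assumptions: cells are `(row, col, content)` tuples, content similarity is `difflib.SequenceMatcher.ratio()`, and matching is maximum-cardinality bipartite matching via augmenting paths rather than a weighted Hungarian assignment. `f_beta` and `align_cells` are illustrative names.

```python
from difflib import SequenceMatcher


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta from precision and recall; returns 0.0 when both are zero."""
    denom = beta ** 2 * precision + recall
    return 0.0 if denom == 0 else (1 + beta ** 2) * precision * recall / denom


def align_cells(pred, gold, max_shift=1, threshold=0.8):
    """Match predicted to gold cells; returns sorted (pred_idx, gold_idx) pairs."""
    # A pair is admissible if both indices are within max_shift and the
    # contents are similar enough.
    candidates = [
        [
            j for j, (gr, gc, gtext) in enumerate(gold)
            if abs(pr - gr) <= max_shift
            and abs(pc - gc) <= max_shift
            and SequenceMatcher(None, ptext, gtext).ratio() >= threshold
        ]
        for (pr, pc, ptext) in pred
    ]
    owner = {}  # gold index -> predicted index currently matched to it

    def try_assign(i, seen):
        # Augmenting-path step: claim a free gold cell or displace its owner.
        for j in candidates[i]:
            if j in seen:
                continue
            seen.add(j)
            if j not in owner or try_assign(owner[j], seen):
                owner[j] = i
                return True
        return False

    for i in range(len(pred)):
        try_assign(i, set())
    return sorted((i, j) for j, i in owner.items())


gold = [(0, 0, "Name"), (0, 1, "Score"), (1, 0, "Alice"), (1, 1, "93")]
# Same table, but the parser emitted an extra leading row (all rows shifted by 1).
pred = [(1, 0, "Name"), (1, 1, "Score"), (2, 0, "Alice"), (2, 1, "93")]

tolerant = align_cells(pred, gold, max_shift=1)
strict = align_cells(pred, gold, max_shift=0)
```

With a shift tolerance of 1 every cell is matched, whereas exact index matching finds none, which is the kind of representational difference the spatial tolerance is meant to absorb.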

4. Hierarchy-Aware Consistency and Label Mapping

To address representational ambiguity in element semantics, SCORE maps all raw labels to a fixed set of semantic categories. Evaluation proceeds via:

  • Partial and completed matching: Matched pairs are extended to include unmatched elements, each paired with a NOMATCH token.
  • Confusion matrix and micro-averaged F₁ over the resulting label space:

[Equations not recoverable from the source: micro-averaged precision, recall, and F₁.]

This formulation penalizes missing and spurious elements in a structurally aware manner.
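One way to see how NOMATCH completion penalizes both error types is the sketch below. It assumes a particular counting convention that the source does not spell out: a pair counts as a true positive when both labels agree, a non-NOMATCH predicted label in a disagreeing pair is a false positive, and a non-NOMATCH gold label in a disagreeing pair is a false negative. The function name and the label strings are illustrative.

```python
NOMATCH = "NOMATCH"


def micro_f1(pairs):
    """Micro-averaged precision, recall, F1 over (pred_label, gold_label) pairs.

    Unmatched predictions appear as (label, NOMATCH); unmatched gold
    elements appear as (NOMATCH, label). Assumed convention, see lead-in.
    """
    tp = sum(p == g and p != NOMATCH for p, g in pairs)
    fp = sum(p != g and p != NOMATCH for p, g in pairs)  # wrong or spurious
    fn = sum(p != g and g != NOMATCH for p, g in pairs)  # wrong or missing
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


pairs = [
    ("heading", "heading"),      # correct
    ("paragraph", "paragraph"),  # correct
    ("paragraph", "heading"),    # mislabelled: one FP and one FN
    (NOMATCH, "table"),          # missing element: FN
    ("figure", NOMATCH),         # spurious element: FP
]
precision, recall, f1 = micro_f1(pairs)
```

Under this convention a missing table lowers recall and a spurious figure lowers precision, so neither kind of error can be hidden by dropping or inventing elements.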

5. Aggregation, Reporting, and Interpretation

SCORE produces both per-page and corpus-level metrics, including all scalar outputs from fidelity, table, and hierarchy modules. Reports are structurally organized as JSON objects with entries such as NED, AdjNED, TokensFound, TokensAdded, TableDetF1, ContentAcc, IndexAcc, TEDS, and HierarchyF1.
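A corpus-level summary of such per-page reports might be assembled as follows. The metric keys are those named above; the page values are invented for illustration, and the use of population standard deviation (`statistics.pstdev`) is an assumption.

```python
import json
from statistics import mean, pstdev


def aggregate(pages):
    """Collapse per-page metric dicts into per-metric mean and std."""
    keys = sorted({k for page in pages for k in page})
    summary = {}
    for key in keys:
        values = [page[key] for page in pages if key in page]
        summary[key] = {"mean": mean(values), "std": pstdev(values)}
    return summary


# Illustrative per-page reports (values are made up).
pages = [
    {"NED": 0.1, "AdjNED": 0.05, "TokensFound": 0.9, "HierarchyF1": 0.8},
    {"NED": 0.3, "AdjNED": 0.15, "TokensFound": 0.7, "HierarchyF1": 0.6},
]
corpus = aggregate(pages)
report_json = json.dumps({"pages": pages, "corpus": corpus}, indent=2)
```

Keeping NED and AdjNED side by side in the same report makes the effect of the interpretation-tolerant adjustment visible per page.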

Historical benchmarks show that, in 2–5% of pages with ambiguous tables, traditional metrics penalized valid outputs by 12–25%, distorting downstream rankings. SCORE corrected these discrepancies, supporting evaluation equivalence for alternative, semantically justified interpretations (Li et al., 16 Sep 2025).

By normalizing all outputs into a single canonical representation, SCORE can replicate object-detection table F1 (up to 0.93) using only generative outputs, without specialized detector pipelines.

6. Hyperparameters, Parallelization, and Implementation Notes

The architecture exposes the following main hyperparameters:

  • $\beta$ in F₁ calculations (default 1)
  • Maximum table-cell index shift (typically 1–2)
  • Cell-content match threshold (typically 0.8)
  • $w_i$: element weighting in NED (commonly uniform or by token count)
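These settings could be gathered into a single validated configuration object, as in the sketch below. The class name, field names, and the two weighting labels are hypothetical; only the defaults and typical ranges come from the list above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScoreConfig:
    """Hypothetical container for the main hyperparameters."""
    beta: float = 1.0                   # F-beta weight; 1 gives F1
    max_index_shift: int = 1            # allowed row/column offset for cells
    cell_match_threshold: float = 0.8   # minimum fuzzy content similarity
    element_weighting: str = "uniform"  # or "token_count"

    def __post_init__(self):
        if self.beta <= 0:
            raise ValueError("beta must be positive")
        if self.max_index_shift < 0:
            raise ValueError("max_index_shift must be non-negative")
        if not 0.0 <= self.cell_match_threshold <= 1.0:
            raise ValueError("cell_match_threshold must lie in [0, 1]")
        if self.element_weighting not in ("uniform", "token_count"):
            raise ValueError("unknown element_weighting")


default_config = ScoreConfig()
lenient_config = ScoreConfig(max_index_shift=2, cell_match_threshold=0.7)
```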

Normalization is performed once, with element lists cached and supplied to all modules. Levenshtein libraries (for sequence alignment), Hungarian algorithms (for cell bipartite matching), and tree-edit libraries (for TEDS) are directly applicable.

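Data flow follows the pipeline stages: parser output → input normalization → content fidelity, table evaluation, and hierarchy consistency modules → aggregation and reporting. All modules may run in parallel, and both raw and adjusted metrics are exposed to highlight the effect of interpretation-tolerant evaluation.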

7. Significance and Impact

SCORE formalizes foundational principles for document parsing evaluation, explicitly quantifying and tolerating interpretation diversity while enforcing semantic rigor. It discovers latent performance patterns missed by classical metrics and supports fair benchmarking across systems with divergent but plausible generative outputs. By decoupling evaluation from rigid structure and utilizing multi-dimensional, interpretable metrics, SCORE enables more meaningful comparative assessment and is extensible to future modalities and structural targets (Li et al., 16 Sep 2025).

References (1)
