RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing

Published 26 Apr 2026 in cs.CV and cs.AI | (2604.23644v1)

Abstract: Intelligent document processing pipelines extract structured entities (tables, images, and text) from documents for use in downstream systems such as knowledge bases, retrieval-augmented generation, and analytics. A persistent limitation of existing pipelines is that extraction output is produced without any intrinsic mechanism to verify whether it faithfully represents the source. Model-internal confidence scores measure inference certainty, not correspondence to the document, and extraction errors pass silently into downstream consumers. We present Reconstruction as Validation (RaV-IDP), a document processing pipeline that introduces reconstruction as a first-class architectural component. After each entity is extracted, a dedicated reconstructor renders the extracted representation back into a form comparable to the original document region, and a comparator scores fidelity between the reconstruction and the unmodified source crop. This fidelity score is a grounded, label-free quality signal. When fidelity falls below a per-entity-type threshold, a structured GPT-4.1 vision fallback is triggered and the validation loop repeats. We enforce a bootstrap constraint: the comparator always anchors against the original document region, never against the extraction, preventing the validation from becoming circular. We further propose a per-stage evaluation framework pairing each pipeline component with an appropriate benchmark. The code pipeline is publicly available at https://github.com/pritesh-2711/RaV-IDP for experimentation and use.


Authors (1)

Pritesh Jha

Summary

The paper introduces a reconstruction-as-validation method that quantitatively compares reconstructed entities with original document regions for enhanced extraction fidelity.
It employs a modular pipeline with specialized extractors, reconstructors, and a GPT-4.1 fallback to robustly handle text, table, and image entities.
Empirical results show improved extraction accuracy and cost efficiency, with high fidelity scores and effective error recovery mechanisms.


Motivation and Problem Setting

Faithful and reliable intelligent document processing (IDP) is foundational for downstream systems that depend on accurate extraction of structured entities (tables, images, and text) from documents. Existing systems, both open-source and commercial, lack any grounded, document-aligned validation mechanism: extraction outputs are accepted or rejected on model confidence, which reflects prediction certainty rather than correspondence to the source. Errors therefore propagate silently, because the pipeline cannot distinguish erroneous extractions from faithful ones, and they degrade knowledge bases, retrieval-augmented generation (RAG) systems, and analytics.

RaV-IDP introduces Reconstruction-as-Validation (RaV) as a fundamental architectural principle: each entity extraction is validated by reconstructing the extracted representation and comparing it quantitatively to the original document region (pixel crop). This provides a completely label-free, model-agnostic, document-grounded fidelity signal, directly addressing the faithfulness problem without reliance on ground-truth annotation or model internals.

Architecture and Methodology

RaV-IDP is implemented as a modular pipeline comprising:

Document Quality Classifier: Assigns a quality class (e.g., clean, scanned, photographed) to each page, serving as a policy router for downstream region-level pre-processing.
Layout Detector: Identifies and classifies spatial regions (tables, images, text, etc.) and records immutable pixel crops, strictly enforcing the bootstrap constraint necessary for non-circular validation.
Pre-processor: Applies class-specific correction (e.g., deskew, deblur) to layout detector output.
Entity Router and Extractors: Routes each region to an entity-type-specific extractor (Docling/TableTransformer for tables, PyMuPDF for images, Docling/TrOCR for text).
Reconstructors: Render the extracted representation back to a comparable form (image or text).
Comparators (with Fidelity Scoring): Compute fidelity between reconstructions and the original crop using domain-relevant metrics per entity type:
- Tables: $f_\text{table} = 0.4 \cdot \text{SSIM}_\text{binarized} + 0.6 \cdot f_\text{struct}$, with $f_\text{struct}$ reflecting row/column match and cell-level CER.
- Images: Weighted combination of pHash similarity, sharpness ratio, and spatial caption verification.
- Text: $f_\text{text} = \max(0, 1 - \text{CER})$, measured against independent re-OCR or embedded stream.
Fidelity Gate & Fallback: Entities with fidelity below an empirically chosen threshold are routed to a GPT-4.1 vision fallback, with the validation loop repeated.
Context and Semantic Enricher: Provides spatial, semantic, and provenance context; image entities are semantically enriched with type, description, extracted text, and chart structure (via GPT-4.1).
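
The text and table fidelity formulas above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, CER is assumed to be Levenshtein distance divided by reference length, and the SSIM and structural scores for tables are assumed to be computed elsewhere and passed in as numbers in [0, 1].

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # drop a character of a
                               current[j - 1] + 1,       # drop a character of b
                               previous[j - 1] + cost))  # match or substitute
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def text_fidelity(reference: str, extracted: str) -> float:
    """f_text = max(0, 1 - CER); the reference is the re-OCR or embedded stream."""
    return max(0.0, 1.0 - cer(reference, extracted))


def table_fidelity(ssim_binarized: float, f_struct: float) -> float:
    """f_table = 0.4 * SSIM_binarized + 0.6 * f_struct (weights from the text)."""
    return 0.4 * ssim_binarized + 0.6 * f_struct
```

The clamp in `text_fidelity` matters because CER can exceed 1 when the extraction is much longer than the reference; without it the score would go negative.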

The critical bootstrap constraint is that the comparator always scores the reconstruction against the original document crop, never against the extraction itself, which prevents circular validation.
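
The gate-and-fallback loop can be sketched as follows. This is a hedged illustration under stated assumptions: `Region`, `GateResult`, `validate_region`, and the callable parameters are hypothetical names, the extractors receive the crop directly (pre-processing is omitted), and `max_fallbacks` is an assumed cap on retries that the source does not specify. The one property taken directly from the text is that the comparator is always handed the original crop.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)  # frozen: the recorded crop cannot be reassigned later
class Region:
    entity_type: str       # e.g. "table", "image", "text"
    original_crop: bytes   # pixel crop recorded by the layout detector


@dataclass
class GateResult:
    extraction: Any
    fidelity: float
    source: str            # "primary" or "fallback"
    passed: bool


def validate_region(
    region: Region,
    primary: Callable[[bytes], Any],
    fallback: Callable[[bytes], Any],
    reconstruct: Callable[[Any], Any],
    compare: Callable[[Any, bytes], float],
    thresholds: Dict[str, float],
    max_fallbacks: int = 1,  # assumed retry cap, not taken from the source
) -> GateResult:
    threshold = thresholds[region.entity_type]
    extraction, source = primary(region.original_crop), "primary"
    fallbacks_used = 0
    while True:
        reconstruction = reconstruct(extraction)
        # Bootstrap constraint: score against the original crop only.
        fidelity = compare(reconstruction, region.original_crop)
        if fidelity >= threshold:
            return GateResult(extraction, fidelity, source, True)
        if fallbacks_used >= max_fallbacks:
            return GateResult(extraction, fidelity, source, False)
        # Below threshold: re-extract with the fallback and validate again.
        extraction, source = fallback(region.original_crop), "fallback"
        fallbacks_used += 1
```

A caller would pass a per-entity-type threshold map such as `{"table": 0.75}`; the table value is the one reported in the results, while thresholds for other entity types would have to be chosen separately.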

Evaluation Protocol

RaV-IDP employs per-component isolation in its evaluation protocol, pairing each pipeline module with a dedicated benchmark and metric. Datasets span DocLayNet (multi-domain layout), PubTabNet (table structure), ScanBank (image extraction), FUNSD and arXiv PDFs (text), and DocVQA (end-to-end entity-driven QA).

Ablation studies test the full pipeline, gate-only (validation without fallback), and no-RaV (primary extractor only) operation.

Empirical Results

Table Extraction on PubTabNet: Row and column accuracy are $0.596$ and $0.584$ respectively, with a cell-level CER of $0.405$. 61.2% of tables pass the fidelity gate at $t = 0.75$.

Image Extraction and Enrichment: On ScanBank, extraction is near-perfect (mean fidelity $0.98$), with 100% description coverage and 62.7% structured-data extraction for charts/diagrams.

Text Extraction: On FUNSD (scanned forms), mean CER is $0.517$; on native arXiv PDFs, mean CER drops to […] with a 97.1% pass rate, confirming the pipeline tracks ground-truth text with high fidelity in native PDFs.

Fidelity Reliability: Spearman $\rho$ between fidelity score and ground-truth quality is […] for tables (p = […]), […] for native PDFs (n = […]), and […] for scanned forms. Optimal binary F1 on table acceptance is […].
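
A reliability check of this kind, rank correlation between a label-free fidelity score and a ground-truth quality measure, can be sketched in pure Python. The function names are hypothetical and the snippet illustrates only the standard Spearman computation (Pearson correlation of average ranks, which handles ties); it does not reproduce the paper's data or analysis code.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(fidelity: Sequence[float], quality: Sequence[float]) -> float:
    """Spearman rho: Pearson correlation computed on average ranks."""
    if len(fidelity) != len(quality) or len(fidelity) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(fidelity), average_ranks(quality))
```

Here `fidelity` would hold the per-entity fidelity scores and `quality` a ground-truth measure for the same entities, for example `1 - CER` against annotated text.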

Fallback Efficacy: GPT-4.1 vision fallback recovers 38.1% of failed table extractions and 24.5% of failed text regions.

End-to-End (DocVQA): The pipeline attains 0.4224 ANLS, outperforming all open-source baselines (Unstructured: 0.3910, Docling: 0.3844, Marker, LlamaParse). With an exclusion-only gate, ANLS collapses to 0.1408 with a 29.7% pipeline error rate, confirming that fallback recovery, not filtering, drives the performance improvement. Direct reading with GPT-4.1 vision (an upper bound) achieves 0.9372 ANLS; RaV-IDP, by contrast, provides structured extraction with provenance and cost control.
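
For readers unfamiliar with the metric, ANLS (Average Normalized Levenshtein Similarity) can be sketched as below. The sketch assumes the definition commonly used for DocVQA: case-insensitive comparison, a per-question score of 1 minus the normalized edit distance when that distance is below a threshold of 0.5 and 0 otherwise, taking the best match over the accepted answers. The source does not state which variant or threshold was used, so both are assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def anls(predictions: Sequence[str],
         references: Sequence[List[str]],
         tau: float = 0.5) -> float:
    """Mean over questions of the best thresholded similarity to any answer."""
    scores = []
    for prediction, answers in zip(predictions, references):
        best = 0.0
        pred = prediction.strip().lower()
        for answer in answers:
            ans = answer.strip().lower()
            longest = max(len(pred), len(ans))
            distance = levenshtein(pred, ans) / longest if longest else 0.0
            # Answers that are too far off earn nothing (threshold tau).
            best = max(best, 1.0 - distance if distance < tau else 0.0)
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold is what makes the metric tolerant of small OCR slips while giving no credit for unrelated answers, which is why extraction errors upstream translate fairly directly into ANLS loss.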

Efficiency: Selective fallback reduces API cost by 71% compared to always-on fallback; only 6.6% of entities invoke it.

Limitations

Bootstrap Constraint Blind Spot: If both extractor and reconstructor share systematic errors (e.g., same OCR implementation misreading a glyph), fidelity cannot identify the error.
FUNSD Pass Rate: Only 6% due to overly strict thresholds for degraded scans; adaptive thresholding is needed for noisy inputs.
Image Fidelity: Unmeasured on non-GT crops; a benchmark with varying image quality is required.
DocVQA Dataset: Composed only of scanned documents; native-PDF advantages are not verifiable end-to-end without an appropriate dataset.
Quality Classifier: Currently rule-based; not yet a learned model.
LLM Dependency: Fallback recovery and enrichment require a vision-capable LLM API; pipeline degrades to primary-only in air-gapped/cost-constrained scenarios.

Theoretical and Practical Implications

RaV-IDP formalizes a document-grounded, model-agnostic validation strategy, making label-free fidelity estimation tractable and reliable in production settings. By disentangling model confidence from document correspondence, the pipeline allows robust error handling, provenance tracking, and cost-aware QA escalation. The per-entity, per-component staging enables fine-grained analysis and targeted improvement, which is not possible with monolithic end-to-end benchmarks.

Semantic enrichment for images closes the gap for RAG and VQA systems, rendering "opaque" pixel crops into structured, retrievable knowledge objects. This architectural innovation positions IDP outputs as first-class, indexable entries for downstream LLM integration.

Directions for Future Work

Design and training of a high-accuracy, learned document quality classifier (SmartDoc-QA or similar).
Expansion of the fidelity reliability study to diversified image corruption benchmarks.
Calibration strategies for per-entity-type acceptance thresholds and cost/model tradeoff optimization.
Evaluation on native-PDF question-answering datasets to surface strengths masked by scanned-only corpora.
Exploration of self-supervised reconstructors for robust error detection even under systematic extractor bias.

Conclusion

RaV-IDP advances IDP pipeline reliability by introducing a reconstruction-based, label-free, empirically validated quality signal, delivering provenance-rich structured outputs with explicit fidelity scoring. Experiments indicate that the fidelity gate combined with selective fallback improves open-source IDP effectiveness, with the gain coming from fallback recovery rather than filtering. The framework's modular, dataset-grounded evaluation protocol supports both component-level analysis and practical deployment. These contributions provide a foundation for trustworthy document extraction, with immediate relevance for knowledge-centric AI pipelines.
