- The paper introduces a reconstruction-as-validation method that quantitatively compares reconstructed entities with original document regions for enhanced extraction fidelity.
- It employs a modular pipeline with specialized extractors, reconstructors, and a GPT-4.1 fallback to robustly handle text, table, and image entities.
- Empirical results show improved extraction accuracy and cost efficiency, with high fidelity scores and effective error recovery mechanisms.
RaV-IDP: A Reconstruction-as-Validation Framework for Faithful Intelligent Document Processing
Motivation and Problem Setting
Faithful and reliable intelligent document processing (IDP) is foundational for downstream systems that depend on accurate extraction of structured entities (tables, images, and text) from documents. Existing systems, encompassing both OSS and commercial pipelines, lack any grounded, document-aligned validation mechanism: extraction outputs are accepted or rejected on model confidence, which reflects prediction certainty rather than correspondence to the source. This leads to silent error propagation, as erroneously extracted content is indistinguishable (by the pipeline) from faithful extractions, degrading knowledge bases, retrieval-augmented generation (RAG) systems, and analytics.
RaV-IDP introduces Reconstruction-as-Validation (RaV) as a fundamental architectural principle: each entity extraction is validated by reconstructing the extracted representation and comparing it quantitatively to the original document region (pixel crop). This provides a completely label-free, model-agnostic, document-grounded fidelity signal, directly addressing the faithfulness problem without reliance on ground-truth annotation or model internals.
Architecture and Methodology
RaV-IDP is implemented as a modular pipeline comprising:
- Document Quality Classifier: Assigns a quality class (e.g., clean, scanned, photographed) to each page, serving as a policy router for downstream region-level pre-processing.
- Layout Detector: Identifies and classifies spatial regions (tables, images, text, etc.) and records immutable pixel crops, strictly enforcing the bootstrap constraint necessary for non-circular validation.
- Pre-processor: Applies class-specific correction (e.g., deskew, deblur) to layout detector output.
- Entity Router and Extractors: Entity-type-specific extractors (Docling/TableTransformer for tables, PyMuPDF for images, Docling/TrOCR for text).
- Reconstructors: Render the extracted representation back to a comparable form (image or text).
- Comparators (with Fidelity Scoring): Compute fidelity between reconstructions and the original crop using domain-relevant metrics per entity type:
- Tables: $f_{\text{table}} = 0.4 \cdot \mathrm{SSIM}_{\text{binarized}} + 0.6 \cdot f_{\text{struct}}$, with $f_{\text{struct}}$ reflecting row/column match and cell-level CER.
- Images: Weighted combination of pHash similarity, sharpness ratio, and spatial caption verification.
- Text: $f_{\text{text}} = \max(0, 1 - \mathrm{CER})$, measured against independent re-OCR or embedded stream.
- Fidelity Gate & Fallback: Entities with fidelity below an empirically chosen threshold are routed to a GPT-4.1 vision fallback, with the validation loop repeated.
- Context and Semantic Enricher: Provides spatial, semantic, and provenance context; image entities are semantically enriched with type, description, extracted text, and chart structure (via GPT-4.1).
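The text and table fidelity scores listed under the comparators can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, CER is assumed to be Levenshtein distance normalized by reference length, and the binarized SSIM and $f_{\text{struct}}$ values are taken as precomputed inputs because the source does not spell out how they are derived.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate (assumed definition: edits / reference length)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


def text_fidelity(reference: str, extracted: str) -> float:
    """f_text = max(0, 1 - CER); the clip matters because CER can exceed 1."""
    return max(0.0, 1.0 - cer(reference, extracted))


def table_fidelity(ssim_binarized: float, f_struct: float) -> float:
    """f_table = 0.4 * SSIM_binarized + 0.6 * f_struct (weights from the text)."""
    return 0.4 * ssim_binarized + 0.6 * f_struct


# Hypothetical strings: the reference stands in for an independent re-OCR pass.
score = text_fidelity("Invoice 2024", "Invoice 2O24")
```

The clip at zero keeps a badly over-generated extraction (more edits than reference characters) from producing a negative score.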
The critical bootstrap constraint prevents circular validation: the comparator's reference is always the original document crop, never the extractor's own output.
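The gate-and-fallback loop can be sketched as follows. This is an illustrative outline under stated assumptions, not the paper's code: `validate_entity`, `GateResult`, and the callable parameters are hypothetical names, the default threshold of 0.75 is borrowed from the table gate reported in the results, and the behavior when the fallback also fails (keep the better candidate but flag it) is an assumption the source does not specify.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class GateResult:
    extraction: Any
    fidelity: float
    source: str      # "primary" or "fallback"
    accepted: bool   # True when fidelity cleared the threshold


def validate_entity(
    crop: Any,
    primary: Callable[[Any], Any],
    fallback: Callable[[Any], Any],
    reconstruct: Callable[[Any], Any],
    compare: Callable[[Any, Any], float],
    threshold: float = 0.75,
) -> GateResult:
    def score(extraction: Any) -> float:
        # Bootstrap constraint: the reference is always the original crop.
        return compare(crop, reconstruct(extraction))

    first = primary(crop)
    first_score = score(first)
    if first_score >= threshold:
        return GateResult(first, first_score, "primary", True)

    # Below threshold: route to the fallback and repeat the validation.
    second = fallback(crop)
    second_score = score(second)
    if second_score >= threshold:
        return GateResult(second, second_score, "fallback", True)

    # Assumption: if both fail, return the better candidate, flagged as rejected.
    if second_score > first_score:
        return GateResult(second, second_score, "fallback", False)
    return GateResult(first, first_score, "primary", False)


# Toy stand-ins: the "crop" is a string and fidelity is the share of
# positions where the reconstruction matches it.
def toy_compare(crop: str, reconstruction: str) -> float:
    matches = sum(a == b for a, b in zip(crop, reconstruction))
    return matches / max(len(crop), len(reconstruction), 1)


result = validate_entity(
    crop="Total: 42",
    primary=lambda c: "Tatol: 4Z",   # simulated faulty primary extraction
    fallback=lambda c: c,            # simulated perfect fallback
    reconstruct=lambda e: e,         # identity reconstruction for the toy
    compare=toy_compare,
)
```

Passing the extractors, reconstructor, and comparator as callables mirrors the modular design described above: the same loop can serve tables, images, and text by swapping the entity-specific components.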
Evaluation Protocol
RaV-IDP employs per-component isolation in its evaluation protocol, pairing each pipeline module with a dedicated benchmark and metric. Datasets span DocLayNet (multi-domain layout), PubTabNet (table structure), ScanBank (image extraction), FUNSD and arXiv PDFs (text), and DocVQA (end-to-end entity-driven QA).
Ablation studies test the full pipeline, gate-only (validation without fallback), and no-RaV (primary extractor only) operation.
Empirical Results
Table Extraction on PubTabNet: Row and column accuracy are $0.596$ and $0.584$ respectively, with a cell-level CER of $0.405$. 61.2% of tables pass the fidelity gate at threshold $t = 0.75$.
Image Extraction and Enrichment: On ScanBank, perfect extraction (mean fidelity $0.98$), 100% description coverage, and 62.7% structured-data extraction for charts/diagrams.
Text Extraction: On FUNSD (scanned forms), mean CER is $0.517$; on native arXiv PDFs, mean CER drops to [value missing] with a 97.1% pass rate, confirming the pipeline tracks ground-truth text with high fidelity in native PDFs.
Fidelity Reliability: Spearman $\rho$ between fidelity score and ground-truth quality is reported for tables (with a p-value), native PDFs (with sample size $n$), and scanned forms, together with the optimal binary F1 on table acceptance [values missing].
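To make the reliability measure concrete, the sketch below computes a Spearman rank correlation between fidelity scores and a ground-truth quality measure using only the standard library. The function names and the sample numbers are made up for illustration and are not results from the paper; ties are assumed to receive average ranks, which is the usual convention.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(x), average_ranks(y))


# Illustrative data only: fidelity scores and a ground-truth quality measure.
fidelity = [0.91, 0.40, 0.78, 0.66, 0.95]
quality = [0.88, 0.35, 0.70, 0.72, 0.97]
rho = spearman_rho(fidelity, quality)
```

A rank correlation fits this use because the gate only needs fidelity to order extractions the same way ground-truth quality does; the two scales need not agree numerically.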
Fallback Efficacy: GPT-4.1 vision fallback recovers 38.1% of failed table extractions and 24.5% of failed text regions.
End-to-End (DocVQA): The pipeline attains 0.4224 ANLS, outperforming all OSS baselines (Unstructured: 0.3910, Docling: 0.3844, Marker, LlamaParse). Exclusion-only gate collapses ANLS to 0.1408, with a 29.7% pipeline error rate, confirming that fallback recovery, not filtering, drives performance improvement. GPT-4.1 vision direct reading (upper bound) achieves 0.9372 ANLS; RaV-IDP provides structured extraction with provenance and cost control.
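For readers unfamiliar with the metric, ANLS (Average Normalized Levenshtein Similarity) can be sketched as below. This follows the commonly used definition with a cutoff of 0.5 and case-insensitive matching; whether the paper's evaluation uses exactly this variant is an assumption, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insertions, deletions and substitutions each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Levenshtein distance divided by the longer string's length."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def anls(predictions, references, tau: float = 0.5) -> float:
    """Mean over questions of the best per-answer similarity.

    predictions: one predicted string per question.
    references:  one list of accepted answers per question.
    A similarity counts only when the normalized distance is below tau.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must align")
    if not predictions:
        return 0.0
    total = 0.0
    for pred, answers in zip(predictions, references):
        best = 0.0
        for answer in answers:
            nl = normalized_distance(pred, answer)
            if nl < tau:
                best = max(best, 1.0 - nl)
        total += best
    return total / len(predictions)


# Illustrative inputs: one near miss and one complete miss.
example = anls(["pariz", "xyz"], [["Paris"], ["abc"]])
```

The cutoff gives partial credit for small OCR-style slips while scoring an unrelated answer as zero, so extraction errors upstream show up directly in the end-to-end score.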
Efficiency: Selective fallback reduces API cost by 71% compared to always-on; only 6.6% of entities invoke fallback.
Limitations
- Bootstrap Constraint Blind Spot: If both extractor and reconstructor share systematic errors (e.g., same OCR implementation misreading a glyph), fidelity cannot identify the error.
- FUNSD Pass Rate: Only 6% due to overly strict thresholds for degraded scans; adaptive thresholding is needed for noisy inputs.
- Image Fidelity: Unmeasured on non-GT crops; a benchmark with varying image quality is required.
- DocVQA Dataset: Composed only of scanned documents; native-PDF advantages are not verifiable end-to-end without an appropriate dataset.
- Quality Classifier: Currently rule-based; not yet a learned model.
- LLM Dependency: Fallback recovery and enrichment require a vision-capable LLM API; pipeline degrades to primary-only in air-gapped/cost-constrained scenarios.
Theoretical and Practical Implications
RaV-IDP formalizes a document-grounded, model-agnostic validation strategy, making label-free fidelity estimation tractable and reliable in production settings. By disentangling model confidence from document correspondence, the pipeline allows robust error handling, provenance tracking, and cost-aware QA escalation. The per-entity, per-component staging enables fine-grained analysis and targeted improvement, which is not possible with monolithic end-to-end benchmarks.
Semantic enrichment for images closes the gap for RAG and VQA systems, rendering "opaque" pixel crops into structured, retrievable knowledge objects. This architectural innovation positions IDP outputs as first-class, indexable entries for downstream LLM integration.
Directions for Future Work
- Design and training of a high-accuracy, learned document quality classifier (SmartDoc-QA or similar).
- Expansion of the fidelity reliability study to diversified image corruption benchmarks.
- Calibration strategies for per-entity-type acceptance thresholds and cost/model tradeoff optimization.
- Evaluation on native-PDF question-answering datasets to surface strengths masked by scanned-only corpora.
- Exploration of self-supervised reconstructors for robust error detection even under systematic extractor bias.
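One simple way to approach the per-entity-type threshold calibration listed above is a sweep that picks, for each entity type, the threshold maximizing F1 on a small labeled calibration set. The sketch below is one possible strategy, not something the paper prescribes: the function names, the accept rule `score >= threshold`, and the sample data are all assumptions made for illustration.

```python
def f1_at_threshold(scores, labels, threshold):
    """F1 of the rule 'accept when score >= threshold'.

    labels[i] is True when extraction i is genuinely correct.
    """
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y)
    fp = sum(1 for s, y in pairs if s >= threshold and not y)
    fn = sum(1 for s, y in pairs if s < threshold and y)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(scores, labels):
    """Return (threshold, f1), trying every observed score as a cutoff.

    Ties in F1 are broken toward the higher (stricter) threshold.
    """
    f1, threshold = max(
        (f1_at_threshold(scores, labels, t), t) for t in sorted(set(scores))
    )
    return threshold, f1


def calibrate(samples):
    """samples: {entity_type: (scores, labels)} -> {entity_type: threshold}."""
    return {
        entity_type: best_threshold(scores, labels)[0]
        for entity_type, (scores, labels) in samples.items()
    }


# Made-up calibration data for two entity types.
thresholds = calibrate({
    "table": ([0.2, 0.5, 0.7, 0.9], [False, False, True, True]),
    "text": ([0.3, 0.6, 0.8], [False, True, True]),
})
```

The gate itself stays label-free at inference time; labels are only needed once, on the calibration set, to choose where each entity type's cutoff sits.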
Conclusion
RaV-IDP advances IDP pipeline reliability by introducing a reconstruction-based, label-free, empirically validated quality signal, delivering provenance-rich structured outputs with explicit fidelity scoring. Experiments confirm that the fidelity gate and selective fallback notably improve open-source IDP effectiveness. The framework's modular, dataset-grounded evaluation protocol ensures both theoretical soundness and practical deployability. These contributions provide a robust foundation for trustworthy document extraction and reasoned downstream application, with immediate relevance for knowledge-centric AI pipelines.