DocIQ Model: Document Quality & Extraction

Updated 28 April 2026
  • DocIQ Model refers to two distinct systems: a no-reference DIQA neural network and an adaptive information extraction toolkit for regulatory documents.
  • The DIQA model combines layout-aware downsampling, a ResNet50 backbone, multi-level feature fusion, and parallel regression heads to assess three quality dimensions: overall quality, sharpness, and color fidelity.
  • The agentic extraction toolkit employs a planner–executor–responder loop with strict safeguards to ensure precise, traceable, and adaptive extraction from diverse regulatory documents.

DocIQ refers to two distinct models that target advanced document understanding in their respective domains: (1) a neural network for document image quality assessment (DIQA), and (2) an agentic toolkit for adaptive information extraction from regulatory documents. Both approaches are characterized by modularity, strong evaluation protocols, and innovations tailored to document-level challenges across modalities, languages, and subjective or structured targets.

1. Document Image Quality Assessment: The DocIQ Model and DIQA-5000

The DocIQ model for DIQA is designed to provide no-reference, multi-dimensional assessment of document image quality, enabling robust evaluation and downstream application in optical character recognition (OCR) and document restoration. The foundational asset, the DIQA-5000 dataset, consists of 5,000 enhanced document images derived from 500 source images subjected to diverse real-world capture distortions and enhancement pipelines. Subjective ratings on three dimensions (overall quality, sharpness, color fidelity) are aggregated from 15 experienced raters per image (Ma et al., 21 Sep 2025).

Key dataset characteristics include:

Attribute | Value | Notes
Number of images | 5,000 | 10 enhanced versions of each of 500 raw images
Distortion types | 5 | Shadow, occlusion, blurring, creases, moiré patterns
Raters per image | 15 | Ratings cleaned per ITU-R BT.500; MOS computed per quality dimension
Enhancement operations | 6 (randomized) | Dewarping, demoiré, occlusion removal, deblurring, etc.
Quality dimensions | 3 | Overall, sharpness, color fidelity

The DIQA-5000 dataset establishes a new subjective benchmark, with mean opinion scores (MOS) computed for each image on all three axes, providing full dynamic range coverage in each quality dimension.
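As a rough illustration of how per-dimension MOS values can be obtained from raw ratings, the sketch below applies the observer-screening rule commonly cited from ITU-R BT.500 (a kurtosis test choosing a 2σ or √20σ bound, then rejecting raters who are out of bounds in more than 5% of cases and in a roughly balanced way) and averages the remaining raters. The array layout and the function names `bt500_keep_mask` and `mos` are assumptions, and this single-session variant is simplified; the DIQA-5000 cleaning procedure may differ in detail.

```python
import numpy as np

def bt500_keep_mask(scores: np.ndarray) -> np.ndarray:
    """Simplified BT.500-style observer screening for ONE quality dimension.

    scores: shape (num_images, num_raters).
    Returns a boolean mask over raters; False marks a rejected rater.
    """
    n_images = scores.shape[0]
    mu = scores.mean(axis=1, keepdims=True)               # per-image mean rating
    dev = scores - mu
    m2 = (dev ** 2).mean(axis=1, keepdims=True)           # 2nd central moment
    m4 = (dev ** 4).mean(axis=1, keepdims=True)           # 4th central moment
    s = scores.std(axis=1, ddof=1, keepdims=True)         # sample std per image
    # Kurtosis beta2; images with zero variance are treated as "normal".
    beta2 = np.divide(m4, m2 ** 2, out=np.full_like(m2, 3.0), where=m2 > 0)
    k = np.where((beta2 >= 2) & (beta2 <= 4), 2.0, np.sqrt(20.0))
    spread = s > 0                                        # skip images with no disagreement
    p = ((scores >= mu + k * s) & spread).sum(axis=0)     # times above the upper bound
    q = ((scores <= mu - k * s) & spread).sum(axis=0)     # times below the lower bound
    total = p + q
    balance = np.divide(np.abs(p - q), total, out=np.ones(total.shape), where=total > 0)
    # Reject raters who are often out of bounds in BOTH directions (erratic, not merely biased).
    reject = (total / n_images > 0.05) & (balance < 0.3)
    return ~reject

def mos(scores: np.ndarray, keep: np.ndarray) -> np.ndarray:
    """Mean opinion score per image over the retained raters."""
    return scores[:, keep].mean(axis=1)
```

The screening and averaging would be run separately for the overall, sharpness, and color-fidelity ratings, yielding three MOS values per image.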

2. Neural Architecture for No-Reference Document IQA

The DocIQ neural model comprises four key modules: (A) Layout Fusion Downsampler, (B) ResNet50 backbone, (C) Multi-level Feature Fusion Module, (D) Parallel Quality Regressors (Ma et al., 21 Sep 2025):

  • (A) Layout Fusion Downsampler: Accepts a high-resolution document image $I$ and a corresponding layout mask $M$ encoding text, table, and figure regions. Features are extracted via spatial and layout-aware downsampling paths and combined element-wise, preserving semantic layout information at reduced computational cost.

$$F_{\text{down}} = \mathcal{D}_{\theta_1}(I) + \mathcal{D}_{\theta_2}([I;M])$$

where $\mathcal{D}_{\theta_i}$ are lightweight strided convolutions and $[I;M]$ denotes channel-wise concatenation of the image and the layout mask.

  • (B) Backbone Network (ResNet50): Operates on the downsampled features to generate a hierarchical stack of feature maps $\{f_1, f_2, f_3, f_4\}$.
  • (C) Multi-Level Feature Fusion: Implements recursive aggregation:

$$\begin{aligned} g_1 &= f_1 \\ g_i &= \phi_i(g_{i-1}) + f_i, \quad i = 2, \dots, L \end{aligned}$$

where $L = 4$ is the number of feature levels and each $\phi_i$ is a bottleneck–spatial–restore block. This structure enables low-level spatial detail to propagate upward and merge with increasingly abstract semantic features.

  • (D) Parallel Quality Regressors: For each quality dimension $d \in \{\text{overall}, \text{sharp}, \text{color}\}$, an independent regressor predicts the distribution of rater scores, from which the MOS is computed:

$$\hat{\mu}^{(d)} = \frac{1}{R} \sum_{r=1}^{R} \hat{y}^{(d)}_r$$

where $\hat{y}^{(d)}_r$ is the predicted score of rater $r$ and $R$ is the number of raters (15 in DIQA-5000).
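To make the data flow through modules (A), (C), and (D) concrete, the following NumPy sketch runs a random-weight forward pass. It is an illustration under assumptions, not the authors' implementation: the strided convolutions are modelled as non-overlapping patch convolutions, the ResNet50 backbone is replaced by a hypothetical four-stage `toy_backbone`, each $\phi_i$ is assumed to be a 1×1 reduce, 2×2 average-pool, 1×1 restore block, and the three-channel layout mask and linear regression heads are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def patch_conv(x, w):
    """Strided convolution with kernel size equal to stride (non-overlapping patches).

    x: (C_in, H, W); w: (C_out, C_in, s, s); H and W must be divisible by s.
    """
    c, h, wd = x.shape
    s = w.shape[-1]
    patches = x.reshape(c, h // s, s, wd // s, s)
    return np.einsum("chiwj,ocij->ohw", patches, w)

def conv1x1(x, w):
    """Pointwise (1x1) convolution. x: (C_in, H, W); w: (C_out, C_in)."""
    return np.einsum("chw,oc->ohw", x, w)

def relu(x):
    return np.maximum(x, 0.0)

# (A) F_down = D_theta1(I) + D_theta2([I; M]), with [I; M] = channel concatenation
def layout_fusion_downsampler(image, mask, w_img, w_lay):
    return patch_conv(image, w_img) + patch_conv(np.concatenate([image, mask], axis=0), w_lay)

# (B) Hypothetical stand-in for ResNet50: four stages, each halving H and W
def toy_backbone(x, stage_weights):
    feats = []
    for w in stage_weights:
        x = relu(patch_conv(x, w))
        feats.append(x)
    return feats                                                 # [f1, f2, f3, f4]

# (C) phi_i assumed to be: 1x1 reduce -> 2x2 average pool -> 1x1 restore
def phi(g_prev, w_reduce, w_restore):
    z = relu(conv1x1(g_prev, w_reduce))                          # bottleneck
    c, h, wd = z.shape
    z = z.reshape(c, h // 2, 2, wd // 2, 2).mean(axis=(2, 4))    # match f_i's spatial size
    return conv1x1(z, w_restore)                                 # restore to f_i's channels

def multilevel_fusion(feats, phi_weights):
    g = feats[0]                                                 # g_1 = f_1
    for f, (w_reduce, w_restore) in zip(feats[1:], phi_weights):
        g = phi(g, w_reduce, w_restore) + f                      # g_i = phi_i(g_{i-1}) + f_i
    return g

# (D) One independent linear head per dimension, each predicting R rater scores
def quality_heads(g, heads):
    pooled = g.mean(axis=(1, 2))                                 # global average pooling -> (C,)
    out = {}
    for dim, (w, b) in heads.items():
        rater_scores = w @ pooled + b                            # y_hat_r^(d), r = 1..R
        out[dim] = (rater_scores, rater_scores.mean())           # MOS = mean over raters
    return out

# Random-weight forward pass on a toy 64x64 input
R = 15
chans = [32, 64, 128, 256]
image = rng.random((3, 64, 64))
mask = (rng.random((3, 64, 64)) > 0.5).astype(float)             # text / table / figure channels
f_down = layout_fusion_downsampler(image, mask,
                                   rng.normal(size=(16, 3, 2, 2)),
                                   rng.normal(size=(16, 6, 2, 2)))
stage_w = [rng.normal(size=(co, ci, 2, 2)) * 0.1 for ci, co in zip([16] + chans[:-1], chans)]
feats = toy_backbone(f_down, stage_w)
phi_w = [(rng.normal(size=(ci // 4, ci)) * 0.1, rng.normal(size=(co, ci // 4)) * 0.1)
         for ci, co in zip(chans[:-1], chans[1:])]
g = multilevel_fusion(feats, phi_w)
heads = {d: (rng.normal(size=(R, chans[-1])) * 0.01, np.full(R, 5.0))
         for d in ("overall", "sharp", "color")}
preds = quality_heads(g, heads)
```

Because each $\phi_i$ maps $g_{i-1}$ to the channel count and resolution of $f_i$, the element-wise sum in the fusion step is always shape-compatible, which is the property the recursive formula relies on.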

3. Learning Objectives and Losses

DocIQ leverages multi-head distribution learning to predict both the full score distribution and the aggregated MOS per dimension. Two primary losses are optimized jointly:

The total loss combines the two terms.

This distribution-learning approach lets the model capture both the central tendency and the variability of human ratings, addressing the multi-dimensional, multi-rater nature of DIQA.
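The exact loss expressions are not given in this summary, so the sketch below is only a hypothetical two-term objective: it assumes mean-squared error for both the per-rater (distribution) term and the MOS term, a made-up weight `lam`, and a plain sum over the three quality heads. It shows why both terms are useful: predictions can match the MOS exactly while still misjudging rater disagreement.

```python
import numpy as np

def joint_quality_loss(pred_raters, true_raters, lam=1.0):
    """Hypothetical objective for one quality dimension.

    pred_raters, true_raters: shape (batch, R), per-rater scores.
    Term 1 fits every rater's score (distribution); term 2 fits the MOS.
    The MSE form and the weight `lam` are assumptions.
    """
    dist_loss = np.mean((pred_raters - true_raters) ** 2)
    mos_loss = np.mean((pred_raters.mean(axis=1) - true_raters.mean(axis=1)) ** 2)
    return dist_loss + lam * mos_loss

def total_loss(preds, targets, lam=1.0):
    """Sum the per-dimension losses over the overall / sharpness / color heads (assumed)."""
    return sum(joint_quality_loss(preds[d], targets[d], lam) for d in preds)
```

For example, predicting alternating +1/-1 offsets around every rater score leaves the MOS term at zero while the distribution term still registers an error of 1.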

4. Training Protocol and Evaluation

Inputs are preprocessed to a fixed input size before downsampling. Training initializes ResNet50 from ImageNet-pretrained weights and uses the Adam optimizer with a step-decay schedule (learning rate multiplied by 0.6 every 10 epochs) for 60 epochs with a batch size of 20.
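The step-decay schedule can be written out explicitly. In this sketch `base_lr` is a hypothetical placeholder, since the initial learning rate is not given here; only the decay factor (0.6), the step (10 epochs), and the 60-epoch horizon come from the text. Epochs are assumed to be 0-indexed.

```python
def step_decay_lr(epoch: int, base_lr: float, gamma: float = 0.6, step: int = 10) -> float:
    """Learning rate at a 0-indexed epoch: base_lr * gamma ** (epoch // step)."""
    return base_lr * gamma ** (epoch // step)

base_lr = 1e-4  # hypothetical initial learning rate, for illustration only
schedule = [step_decay_lr(e, base_lr) for e in range(60)]
# Six plateaus of 10 epochs each; the final plateau runs at base_lr * 0.6 ** 5 (about 7.8% of base_lr).
```

In PyTorch, the same rule corresponds to `torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.6)` stepped once per epoch.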

Experimental evaluation on DIQA-5000 demonstrates state-of-the-art results. Specifically:

Quality dimension | PLCC | SRCC
Overall | 0.9083 | 0.8832
Sharpness | 0.9006 | 0.8615
Color fidelity | 0.8907 | 0.8666
Average (three dimensions) | ~0.8999 | ~0.8704
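For readers reproducing such tables, PLCC is the Pearson correlation between predicted scores and MOS, and SRCC is the Pearson correlation of their ranks (ties receive the average rank). The sketch below computes both directly. Note that many IQA evaluations fit a nonlinear logistic mapping to the predictions before computing PLCC; this sketch omits that step, and whether DocIQ applies it is not stated here.

```python
import numpy as np

def plcc(pred, mos):
    """Pearson linear correlation between predicted scores and MOS."""
    return float(np.corrcoef(pred, mos)[0, 1])

def rankdata(x):
    """Ranks starting at 1, with tied values given their average rank."""
    x = np.asarray(x, dtype=float)
    order = np.argsort(x, kind="mergesort")
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    for v in np.unique(x):             # average the ranks of tied values
        tie = x == v
        ranks[tie] = ranks[tie].mean()
    return ranks

def srcc(pred, mos):
    """Spearman rank correlation: Pearson correlation computed on ranks."""
    return plcc(rankdata(pred), rankdata(mos))

# The reported three-dimension averages are plain means of the per-dimension values.
avg_plcc = np.mean([0.9083, 0.9006, 0.8907])
avg_srcc = np.mean([0.8832, 0.8615, 0.8666])
```

SRCC rewards any monotone agreement with MOS, while PLCC additionally requires the relationship to be linear, which is why the two columns can differ.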

On SmartDoc-QA, a benchmark that links image quality to OCR accuracy, PLCC with character- and word-level OCR accuracy exceeds 0.91, indicating strong alignment with downstream OCR utility. Ablation studies confirm that layout fusion, full-rater distribution learning, and multi-level feature fusion each contribute substantially to overall model accuracy (Ma et al., 21 Sep 2025).

5. Agentic Information Extraction: DocIQ Toolkit

In the context of regulatory information extraction, DocIQ refers to an agent-based system designed for highly variable, multilingual documents, notably PDFs in regulatory compliance workflows (Colakoglu et al., 15 Sep 2025). The architecture is built around a planner–executor–responder loop:

  • Planner: Given the current AgentState, decides the next operation via a structured, zero-temperature LLM call. The planner serializes user intent, document metadata, tool compatibility, and the complete tool-call history. Loop- and misuse-detection logic ensures progress and type safety.
  • Executor: Instantiates and runs the chosen tool (including OCR, direct PDF parsing, LLM-based key-value pair extraction, translation, and more), updating AgentState with outputs and recording tool usage.
  • Responder: Validates and returns verified outputs (KVP or QA) only once quality thresholds and completion criteria are met; incomplete or unverifiable results are never returned.

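The framework’s control flow is formalized as an iterative loop over AgentState: in each cycle the planner selects a tool based on the current state, the executor runs it to produce the next state, and the loop ends once the responder's completion criteria are met or the planning-cycle limit is reached. AgentState transitions guarantee deterministic, type-checked evolution.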
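The planner–executor–responder loop can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the toolkit's code: the rule-based `plan` function stands in for the zero-temperature LLM planner, `parse_pdf` and `extract_kvp` are hypothetical toy tools, and the `AgentState` fields are invented for the example. Each tool returns a new frozen `AgentState`, which keeps transitions deterministic, and the responder releases a result only when it passes validation.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional

@dataclass(frozen=True)
class AgentState:
    """Hypothetical immutable agent state: each tool returns a new state."""
    document: str
    intent: str                              # "kvp" or "qa"
    text: Optional[str] = None
    result: Optional[dict] = None
    history: tuple = ()                      # names of tools already executed

Tool = Callable[[AgentState], AgentState]

def plan(state: AgentState) -> Optional[str]:
    """Rule-based stand-in for the zero-temperature LLM planner."""
    if state.text is None:
        return "parse_pdf"
    if state.result is None:
        return "extract_kvp"
    return None                              # no further tool needed

def respond(state: AgentState) -> Optional[dict]:
    """Responder: release the result only if every extracted value is non-empty."""
    if state.result and all(state.result.values()):
        return state.result
    return None                              # incomplete output is never returned

def run_agent(state: AgentState, tools: dict[str, Tool], max_cycles: int = 8):
    for _ in range(max_cycles):              # hard cap on planning cycles
        name = plan(state)
        if name is None:
            break
        new_state = tools[name](state)       # executor: run the chosen tool
        state = replace(new_state, history=state.history + (name,))  # record tool usage
    return respond(state), state

# Toy tools standing in for direct PDF parsing and LLM-based KVP extraction.
def parse_pdf(s: AgentState) -> AgentState:
    return replace(s, text="Product: Sample A\nStandard: EN 000")

def extract_kvp(s: AgentState) -> AgentState:
    pairs = (line.split(": ", 1) for line in s.text.splitlines())
    return replace(s, result={k: v for k, v in pairs})
```

Running `run_agent(AgentState("dop.pdf", "kvp"), {"parse_pdf": parse_pdf, "extract_kvp": extract_kvp})` returns the extracted pairs together with the final state; with `max_cycles=1` the loop stops after parsing and the responder returns `None` rather than a partial result.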

6. Tool Selection, Safeguards, and Adaptivity

The planner dynamically filters tool options based on document modality (scanned vs. digital), preflight detection routines (language, modality), and user intent (key-value extraction vs. QA). DocIQ enforces mandatory checks:

  • No repeated tool invocation with identical inputs.
  • No incompatible tool–state pairs.
  • Termination once a planning-cycle threshold is exceeded.

These constraints guarantee finite execution, prevent runaway calls (e.g., infinite OCR loops), and keep every tool invocation type-safe and traceable.
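The three mandatory checks can be enforced by a small guard that the planner consults before every tool call. In this sketch the `Safeguards` and `GuardViolation` names, the modality-compatibility table, and the JSON fingerprint of tool inputs are all assumptions made for illustration; the source describes the checks but not their implementation.

```python
import json

class GuardViolation(Exception):
    """Raised when a planned tool call breaks one of the mandatory checks."""

class Safeguards:
    """Hypothetical enforcement of: no duplicate calls, no incompatible
    tool/state pairs, and a cap on planning cycles."""

    def __init__(self, compatible: dict, max_cycles: int = 10):
        self.compatible = compatible         # tool name -> set of allowed modalities
        self.max_cycles = max_cycles
        self.seen = set()                    # (tool, canonical inputs) already executed
        self.cycles = 0

    def check(self, tool: str, inputs: dict, modality: str) -> None:
        self.cycles += 1                     # every planned call counts as a cycle
        if self.cycles > self.max_cycles:
            raise GuardViolation("planning-cycle limit exceeded")
        if modality not in self.compatible.get(tool, set()):
            raise GuardViolation(f"{tool} is incompatible with {modality} documents")
        key = (tool, json.dumps(inputs, sort_keys=True))   # key-order-independent fingerprint
        if key in self.seen:
            raise GuardViolation(f"repeated call to {tool} with identical inputs")
        self.seen.add(key)
```

Canonicalizing inputs with `sort_keys=True` means that two calls whose arguments differ only in key order are still recognized as duplicates.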

Adaptivity is achieved by switching between OCR and native text tools as required, supporting cross-language operation and schema translation for output normalization.
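The modality- and intent-based filtering can be illustrated with a small registry of tool metadata. The tool names (`ocr`, `parse_pdf_text`, `extract_kvp`, `answer_question`, `translate`), the `ToolSpec` fields, and the rule that translation is offered only when the detected language differs from a target language (assumed to be English) are hypothetical choices for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolSpec:
    """Hypothetical tool metadata consulted by the planner before each decision."""
    name: str
    modalities: frozenset    # document modalities the tool can handle
    intents: frozenset       # user intents the tool serves ("kvp", "qa")

BOTH = frozenset({"scanned", "digital"})
REGISTRY = [
    ToolSpec("ocr", frozenset({"scanned"}), frozenset({"kvp", "qa"})),
    ToolSpec("parse_pdf_text", frozenset({"digital"}), frozenset({"kvp", "qa"})),
    ToolSpec("extract_kvp", BOTH, frozenset({"kvp"})),
    ToolSpec("answer_question", BOTH, frozenset({"qa"})),
    ToolSpec("translate", BOTH, frozenset({"kvp", "qa"})),
]

def candidate_tools(modality: str, intent: str, language: str, target_language: str = "en"):
    """Filter the registry using preflight results (modality, language) and user intent."""
    names = [t.name for t in REGISTRY if modality in t.modalities and intent in t.intents]
    if language == target_language:
        names = [n for n in names if n != "translate"]   # translate only when needed
    return names
```

A scanned German document with a KVP request, for example, yields OCR, KVP extraction, and translation as candidates, while a born-digital English document with a QA request yields only native text parsing and question answering.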

7. Empirical Benchmarking and Architectural Contributions

Evaluation on the Declaration of Performance (DoP) dataset (52 annotated PDFs; 96% German; 85% born-digital text PDFs and 15% scanned) covers fixed- and open-schema key-value pair (KVP) extraction as well as nested QA pairs. Metrics include JSON validity, key match ratio, and BLEU and ROUGE-L for value similarity. For KVP extraction, DocIQ attains:

Metric | Value | Notes
JSON validity | 100% | All conditions
KeyMatch (fixed schema) | 1.00 |
KeyMatch (open schema) | 0.56 |
BLEU | ~0.40 | English and German
ROUGE-L | ~0.71 | English and German
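The structural and similarity metrics can be approximated as follows. This sketch assumes that KeyMatch is the fraction of gold keys present in the prediction and that ROUGE-L is the F1 (β = 1) of the longest common subsequence over whitespace tokens; the source does not spell out either definition. BLEU is omitted and would normally come from an established implementation such as sacreBLEU.

```python
import json

def json_valid(raw: str) -> bool:
    """True if the model output parses as a JSON object."""
    try:
        return isinstance(json.loads(raw), dict)
    except json.JSONDecodeError:
        return False

def key_match(pred: dict, gold: dict) -> float:
    """Assumed definition: share of gold keys that also appear in the prediction."""
    return len(pred.keys() & gold.keys()) / len(gold) if gold else 1.0

def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence (row-by-row dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b):
            cur.append(prev[j] + 1 if x == y else max(prev[j + 1], cur[j]))
        prev = cur
    return prev[-1]

def rouge_l(candidate: str, reference: str) -> float:
    """ROUGE-L F1 over whitespace tokens (beta = 1 assumed)."""
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(c), lcs / len(r)
    return 2 * precision * recall / (precision + recall)
```

Under these definitions, a KeyMatch of 0.56 in the open-schema setting would mean that roughly half of the annotated keys were recovered when the schema is not supplied in advance.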

The system remains stable across scanned and born-digital documents by automatically selecting the appropriate toolchain, and it outperforms vision-augmented LLM and prompt-only baselines, especially in German-dominated and nested evaluation scenarios.

Key architectural advances include a modular ToolRegistry with typed schemas for every tool, integrated error recovery, explicit loop-detection logic, and enforced separation between tool outputs and responder activation. This design provides strong safeguards against hallucinated and incomplete results and positions DocIQ as a blueprint for replicable, robust agentic information extraction pipelines (Colakoglu et al., 15 Sep 2025).
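A typed tool registry with error recovery might look like the following minimal sketch. The `ToolSchema` and `ToolRegistry` classes, the result dictionary format, and the split between raising on malformed calls (a planner bug) and returning a structured failure on runtime errors (recoverable) are assumptions for illustration, not the toolkit's actual interface.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass(frozen=True)
class ToolSchema:
    """Hypothetical typed contract: expected argument types and the result type."""
    inputs: dict      # argument name -> Python type
    output: type

class ToolRegistry:
    """Minimal registry that validates every call against the tool's schema."""

    def __init__(self):
        self._tools = {}

    def register(self, name: str, fn: Callable[..., Any], schema: ToolSchema) -> None:
        if name in self._tools:
            raise ValueError(f"tool {name!r} already registered")
        self._tools[name] = (fn, schema)

    def call(self, name: str, **kwargs):
        fn, schema = self._tools[name]
        # Malformed calls indicate planner misuse, so they raise immediately.
        if set(kwargs) != set(schema.inputs):
            raise TypeError(f"{name}: expected arguments {sorted(schema.inputs)}")
        for arg, expected in schema.inputs.items():
            if not isinstance(kwargs[arg], expected):
                raise TypeError(f"{name}: {arg} must be {expected.__name__}")
        # Runtime failures are recoverable: report them as structured results.
        try:
            result = fn(**kwargs)
        except Exception as exc:
            return {"ok": False, "error": f"{name} failed: {exc}"}
        if not isinstance(result, schema.output):
            return {"ok": False, "error": f"{name} returned {type(result).__name__}"}
        return {"ok": True, "value": result}
```

Returning failures as data rather than exceptions lets the planner see them in its tool-call history and choose a fallback, while only `{"ok": True}` results are eligible to reach the responder.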

