DocIQ Model: Document Quality & Extraction
- DocIQ is the name of two separate systems: a no-reference DIQA neural network and an agentic toolkit for adaptive information extraction from regulatory documents.
- The DIQA network combines layout-aware feature fusion, a ResNet50 backbone, and multi-head regression to score three quality dimensions: overall quality, sharpness, and color fidelity.
- The agentic extraction toolkit employs a planner–executor–responder loop with strict safeguards to ensure precise, traceable, and adaptive extraction from diverse regulatory documents.
DocIQ refers to two distinct models that target advanced document understanding in their respective domains: (1) a neural network for document image quality assessment (DIQA), and (2) an agentic toolkit for adaptive information extraction from regulatory documents. Both approaches are characterized by modularity, strong evaluation protocols, and innovations tailored to document-level challenges across modalities, languages, and subjective or structured targets.
1. Document Image Quality Assessment: The DocIQ Model and DIQA-5000
The DocIQ model for DIQA is designed to provide no-reference, multi-dimensional assessment of document image quality, enabling robust evaluation and downstream application in optical character recognition (OCR) and document restoration. The foundational asset, the DIQA-5000 dataset, consists of 5,000 enhanced document images derived from 500 source images subjected to diverse real-world capture distortions and enhancement pipelines. Subjective ratings on three dimensions (overall quality, sharpness, color fidelity) are aggregated from 15 experienced raters per image (Ma et al., 21 Sep 2025).
Key dataset characteristics include:
| Attribute | Value | Notes |
|---|---|---|
| Number of Images | 5,000 | 10 enhanced versions for each of 500 raw images |
| Distortion Types | 5 | Shadow, occlusion, blurring, creases, moiré patterns |
| Raters per Image | 15 | Ratings cleaned per ITU-R BT.500, MOS per quality dimension |
| Enhancement Operations | 6 (randomized) | Dewarping, demoiré, occlusion removal, deblurring, etc. |
| Quality Dimensions | 3 | Overall, sharpness, color fidelity |
The DIQA-5000 dataset establishes a new subjective benchmark, with mean opinion scores (MOS) computed for each image on all three axes, providing full dynamic range coverage in each quality dimension.
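As a minimal sketch of how a per-dimension MOS and an empirical score distribution can be derived from raw ratings: the 1 to 5 rating scale, the example ratings, and the helper names below are assumptions for illustration, and the ITU-R BT.500 screening step is omitted.

```python
from collections import Counter

SCALE = (1, 2, 3, 4, 5)  # assumed rating scale; not stated in the text above


def mean_opinion_score(ratings):
    """Average the raters' scores for one image on one quality dimension."""
    return sum(ratings) / len(ratings)


def rating_distribution(ratings, scale=SCALE):
    """Empirical probability of each score level among the raters."""
    counts = Counter(ratings)
    return [counts[s] / len(ratings) for s in scale]


# Hypothetical sharpness ratings from 15 raters for a single image.
sharpness = [4, 4, 5, 3, 4, 4, 5, 4, 3, 4, 4, 5, 4, 3, 4]
mos = mean_opinion_score(sharpness)
dist = rating_distribution(sharpness)
print(mos, dist)
```

Keeping the full distribution alongside the mean preserves rater disagreement, which the mean alone discards.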
2. Neural Architecture for No-Reference Document IQA
The DocIQ neural model comprises four key modules: (A) Layout Fusion Downsampler, (B) ResNet50 backbone, (C) Multi-level Feature Fusion Module, (D) Parallel Quality Regressors (Ma et al., 21 Sep 2025):
- (A) Layout Fusion Downsampler: Accepts a high-resolution document image and a corresponding layout mask encoding text, table, and figure regions. Features are extracted via a spatial downsampling path and a layout-aware downsampling path, both built from lightweight strided convolutions, and combined element-wise, preserving semantic layout information at reduced computational cost.
- (B) Backbone Network (ResNet50): Operates on the downsampled features to generate a hierarchical stack of feature maps.
- (C) Multi-Level Feature Fusion: Recursively aggregates the backbone's feature maps across levels, with each fusion step implemented as a bottleneck–spatial–restore block. This structure enables low-level spatial detail to propagate upward and merge with increasingly abstract semantic features.
- (D) Parallel Quality Regressors: For each quality dimension (overall quality, sharpness, color fidelity), an independent regressor predicts the distribution of rater scores, from which the MOS is computed.
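The step from a predicted score distribution to a MOS can be illustrated with a short sketch. It assumes each regressor head emits logits over discrete score levels on a 1 to 5 scale and that the MOS is taken as the expectation of the resulting distribution; the scale, the logits, and the function names are illustrative assumptions rather than details confirmed above.

```python
import math

SCALE = (1, 2, 3, 4, 5)  # assumed discrete score levels


def softmax(logits):
    """Convert raw head outputs into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_score(probs, scale=SCALE):
    """MOS estimate as the mean of the predicted score distribution."""
    return sum(s * p for s, p in zip(scale, probs))


# Hypothetical logits from three independent heads.
head_logits = {
    "overall": [-2.0, -1.0, 0.5, 1.5, 0.0],
    "sharpness": [0.0, 0.0, 0.0, 0.0, 0.0],
    "color_fidelity": [-3.0, -2.0, 0.0, 2.0, 1.0],
}
predicted_mos = {dim: expected_score(softmax(z)) for dim, z in head_logits.items()}
print(predicted_mos)
```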
3. Learning Objectives and Losses
DocIQ leverages multi-head distribution learning to predict both the full score distribution and the aggregated MOS per dimension. Two primary losses are optimized jointly:
- Kullback–Leibler divergence loss between the empirical rater distributions and the predicted discrete distributions.
- MOS regression loss, penalizing deviation of the predicted mean opinion score from the ground-truth MOS.
The total loss is a weighted combination of the two terms.
The empirical approach allows the model to capture both central tendency and variability in human subjective ratings, addressing the multi-dimensional, multi-rater nature of DIQA.
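A minimal sketch of a joint objective of this kind follows. The squared-error form of the MOS term, the weight `lam`, and the 1 to 5 scale are assumptions made for illustration; the exact formulation and weighting are not given above.

```python
import math

SCALE = (1, 2, 3, 4, 5)  # assumed discrete score levels


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for discrete distributions; eps guards against log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def mos_loss(pred_mos, true_mos):
    """Squared error between predicted and ground-truth MOS (assumed form)."""
    return (pred_mos - true_mos) ** 2


def total_loss(p_emp, q_pred, true_mos, scale=SCALE, lam=1.0):
    """Distribution term plus a weighted MOS term; lam is a hypothetical weight."""
    pred_mos = sum(s * q for s, q in zip(scale, q_pred))
    return kl_divergence(p_emp, q_pred) + lam * mos_loss(pred_mos, true_mos)


empirical = [0.0, 0.0, 0.2, 0.6, 0.2]  # hypothetical rater distribution, MOS = 4.0
predicted = [0.1, 0.1, 0.2, 0.4, 0.2]  # hypothetical model output
print(total_loss(empirical, predicted, true_mos=4.0))
```

The distribution term penalizes mismatched spread even when the two means agree, which a MOS-only loss would not detect.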
4. Training Protocol and Evaluation
Inputs are resized to a fixed resolution before downsampling. The training pipeline uses an ImageNet-pretrained ResNet50 with the Adam optimizer (step decay of 0.6 every 10 epochs), 60 epochs in total, and a batch size of 20.
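The step-decay schedule can be sketched as below, reading "step decay 0.6 every 10 epochs" as a multiplicative factor of 0.6 applied every 10 epochs. `BASE_LR` is a placeholder value chosen only for illustration, not the learning rate used in the paper.

```python
def step_decay_lr(epoch, base_lr, gamma=0.6, step=10):
    """Learning rate at a given epoch under multiplicative step decay."""
    return base_lr * gamma ** (epoch // step)


BASE_LR = 1e-4  # hypothetical initial learning rate
EPOCHS = 60

schedule = [step_decay_lr(epoch, BASE_LR) for epoch in range(EPOCHS)]
for epoch in range(0, EPOCHS, 10):
    print(epoch, schedule[epoch])
```

Over 60 epochs this applies the decay five times, so the final rate is `0.6 ** 5` (about 7.8%) of the initial one.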
Experimental evaluation on DIQA-5000 demonstrates state-of-the-art results. Specifically:
| Quality Dimension | PLCC | SRCC |
|---|---|---|
| Overall | 0.9083 | 0.8832 |
| Sharpness | 0.9006 | 0.8615 |
| Color Fidelity | 0.8907 | 0.8666 |
| Three-dim average | ~0.8999 | ~0.8704 |
On SmartDoc-QA (OCR-quality linked), PLCC for character and word accuracy exceeds 0.91, exhibiting strong alignment with downstream OCR utility. Ablation studies confirm that layout fusion, full-rater distribution learning, and feature fusion each substantially contribute to overall model accuracy (Ma et al., 21 Sep 2025).
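For reference, PLCC and SRCC can be computed as in the following sketch, written in plain Python so it has no dependencies. It omits the nonlinear logistic mapping that IQA studies often apply before computing PLCC, and the example scores are made up.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = shared
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC): Pearson on ranks."""
    return pearson(ranks(x), ranks(y))


true_mos = [1.0, 2.0, 3.0, 4.0, 5.0]      # hypothetical ground truth
pred_mos = [1.0, 4.0, 9.0, 16.0, 25.0]    # monotonic but nonlinear predictions
print(pearson(true_mos, pred_mos), spearman(true_mos, pred_mos))
```

In this example the predictions preserve the ordering exactly, so SRCC is 1 while PLCC falls below 1 because the relationship is not linear.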
5. Agentic Information Extraction: DocIQ Toolkit
DocIQ in the context of regulatory information extraction refers to an agent-based system designed for highly variable, multilingual documents, notably PDFs in regulatory compliance workflows (Colakoglu et al., 15 Sep 2025). The architecture centralizes a planner–executor–responder loop:
- Planner: Given the current AgentState, decides the next operation via a structured, zero-temperature LLM call. The planner serializes user intent, document metadata, tool compatibility, and a complete tool call history. Loop- and misuse-detection logic ensures progress and type safety.
- Executor: Instantiates and runs the chosen tool (including OCR, direct PDF parsing, LLM-based key-value pair extraction, translation, and more), updating AgentState with outputs and recording tool usage.
- Responder: Validates and returns verified outputs (KVP or QA) only once quality thresholds and completion criteria are met; incomplete or unverifiable results are never returned.
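The framework's control flow is a state-transition loop: the planner maps the current AgentState to the next action, the executor applies that action to produce the updated AgentState, and the loop repeats until the responder's completion criteria are met. AgentState transitions are deterministic and type-checked.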
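The loop can be sketched in a few lines. Everything here is a stand-in: the rule-based `plan` function replaces the LLM planner, the two tools return canned data, and the field names on `AgentState` are hypothetical rather than taken from the toolkit.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentState:
    intent: str
    text: str = ""
    result: Optional[dict] = None
    history: list = field(default_factory=list)


def plan(state):
    """Stand-in for the LLM planner: pick the next action from the state."""
    if not state.text:
        return "parse_pdf"
    if state.result is None:
        return "extract_kvp"
    return "respond"


def parse_pdf(state):
    state.text = "Product: Cement; Class: A1"  # canned document text


def extract_kvp(state):
    pairs = (item.split(":", 1) for item in state.text.split(";"))
    state.result = {key.strip(): value.strip() for key, value in pairs}


TOOLS = {"parse_pdf": parse_pdf, "extract_kvp": extract_kvp}


def run(state, max_cycles=10):
    """Planner -> executor cycles until the responder can return a result."""
    for _ in range(max_cycles):
        action = plan(state)
        if action == "respond":
            return state.result
        TOOLS[action](state)            # executor step
        state.history.append(action)    # record tool usage
    return None  # cycle budget exhausted: return nothing rather than a partial result


state = AgentState(intent="kvp")
print(run(state), state.history)
```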
6. Tool Selection, Safeguards, and Adaptivity
The planner dynamically filters tool options based on document modality (scanned vs. digital), preflight detection routines (language, modality), and user intent (key-value extraction vs. QA). DocIQ enforces mandatory checks:
- No repeated tool invocation with identical inputs.
- No incompatible tool–state pairs.
- Termination after exceeding planning cycle threshold.
These constraints ensure finite execution, prevention of runaway calls (e.g., infinite OCR), and type-safe, traceable tool invocations.
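One way to express these three checks is a guard that runs before every tool call, sketched below. The compatibility table, the cycle limit of 8, and all names are hypothetical and only illustrate the idea.

```python
import json


class PlanRejected(Exception):
    """Raised when a proposed tool call violates a safeguard."""


# Hypothetical table: tool name -> document modalities it accepts.
COMPATIBLE = {
    "ocr": {"scanned"},
    "parse_pdf": {"digital"},
    "extract_kvp": {"scanned", "digital"},
}


def check_call(tool, args, modality, seen, cycle, max_cycles=8):
    """Validate one planned call and record it in `seen` if it passes."""
    if cycle >= max_cycles:
        raise PlanRejected("planning cycle threshold exceeded")
    if modality not in COMPATIBLE.get(tool, set()):
        raise PlanRejected(f"{tool} is incompatible with {modality} input")
    signature = (tool, json.dumps(args, sort_keys=True))
    if signature in seen:
        raise PlanRejected(f"{tool} already ran with identical inputs")
    seen.add(signature)


def is_allowed(tool, args, modality, seen, cycle):
    """Boolean wrapper around check_call for convenience."""
    try:
        check_call(tool, args, modality, seen, cycle)
        return True
    except PlanRejected:
        return False


seen_calls = set()
print(is_allowed("ocr", {"page": 1}, "scanned", seen_calls, cycle=0))  # first call
print(is_allowed("ocr", {"page": 1}, "scanned", seen_calls, cycle=1))  # identical repeat
```

Serializing the arguments with sorted keys makes the duplicate check independent of argument order.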
Adaptivity is achieved by switching between OCR and native text tools as required, supporting cross-language operation and schema translation for output normalization.
7. Empirical Benchmarking and Architectural Contributions
Evaluation on the Declaration of Performance (DoP) dataset (52 annotated PDFs, 96% German, 85% text-PDFs, 15% scanned) covers both fixed and open schema KVP extraction and nested QA pairs. Metrics include JSON validity, key match ratio, BLEU and ROUGE-L for value similarity. For KVP extraction, DocIQ attains:
| Metric | Value | Notes |
|---|---|---|
| JSON validity | 100% | All conditions |
| KeyMatch (fixed) | 1.00 | |
| KeyMatch (open) | 0.56 | |
| BLEU | ~0.40 | English and German |
| ROUGE-L | ~0.71 | English and German |
The system demonstrates stability across scanned and born-digital documents by selecting the appropriate toolchain automatically, and outperforms LLM vision-augmented and prompt-only baselines, especially in German-dominated and nested evaluation scenarios.
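The metrics above can be approximated with the sketch below. The definitions are assumptions: key match is taken as the fraction of reference keys present in the prediction, and ROUGE-L as an LCS-based F1 over whitespace tokens. The paper's exact definitions may differ, and BLEU is left out for brevity.

```python
import json


def is_valid_json(text):
    """True if the model output parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def key_match_ratio(predicted, reference):
    """Fraction of reference keys that also appear in the prediction."""
    if not reference:
        return 1.0
    return len(set(predicted) & set(reference)) / len(set(reference))


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    """LCS-based F1 between two strings, tokenized on whitespace."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


output = '{"Product": "Cement CEM I", "Class": "A1"}'   # hypothetical model output
reference = {"Product": "Cement CEM II", "Standard": "EN 197-1"}
prediction = json.loads(output)
print(is_valid_json(output), key_match_ratio(prediction, reference))
print(rouge_l_f1(prediction["Product"], reference["Product"]))
```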
Key architectural advances include a modular ToolRegistry with typed schemas for every tool, integrated error recovery, explicit loop-detection logic, and enforced separation between tool outputs and responder activation. This approach guards against hallucinated and incomplete results and positions DocIQ as a blueprint for replicable, robust agentic information extraction pipelines (Colakoglu et al., 15 Sep 2025).
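A registry with typed schemas might look roughly like the sketch below. The class layout, the `ToolSpec` fields, and the toy `detect_language` tool are all hypothetical; only the name ToolRegistry and the idea of typed schemas per tool come from the description above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ToolSpec:
    name: str
    input_types: dict    # argument name -> expected Python type
    output_type: type
    fn: Callable


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, spec):
        if spec.name in self._tools:
            raise ValueError(f"tool {spec.name!r} is already registered")
        self._tools[spec.name] = spec

    def call(self, name, **kwargs):
        """Run a tool after checking argument names, input types, and output type."""
        spec = self._tools[name]
        if set(kwargs) != set(spec.input_types):
            raise TypeError(f"{name} expects arguments {sorted(spec.input_types)}")
        for arg, expected in spec.input_types.items():
            if not isinstance(kwargs[arg], expected):
                raise TypeError(f"{name}.{arg} must be {expected.__name__}")
        output = spec.fn(**kwargs)
        if not isinstance(output, spec.output_type):
            raise TypeError(f"{name} returned {type(output).__name__}")
        return output


def detect_language(text):
    """Toy preflight check: treat umlauts or eszett as a sign of German."""
    return "de" if any(ch in text for ch in "äöüß") else "en"


registry = ToolRegistry()
registry.register(ToolSpec("detect_language", {"text": str}, str, detect_language))
print(registry.call("detect_language", text="Leistungserklärung"))
```

Validating both inputs and outputs at the registry boundary keeps type errors out of the agent state instead of surfacing them later in the responder.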
References
- "DocIQ: A Benchmark Dataset and Feature Fusion Network for Document Image Quality Assessment" (Ma et al., 21 Sep 2025)
- "An Agentic Toolkit for Adaptive Information Extraction from Regulatory Documents" (Colakoglu et al., 15 Sep 2025)