Document Image Quality Assessment (DIQA)
- Document Image Quality Assessment (DIQA) uses both subjective ratings and objective metrics to evaluate the legibility and usability of document images.
- Recent methodologies span full-reference, no-reference, and surrogate modeling approaches, achieving high correlations (e.g., LCC/SROCC > 0.90), especially in blur-dominated scenarios.
- Benchmarking on datasets such as DIQA-5000, using fidelity metrics like PSNR and SSIM alongside subjective MOS labels, drives advances in real-time and mobile document processing applications.
Document Image Quality Assessment (DIQA) refers to the set of methodologies, metrics, and computational frameworks designed to objectively or subjectively quantify the visual quality and usability of document images. DIQA is a critical task in applications such as optical character recognition (OCR), information retrieval, document archival, and mobile document capture, where the legibility and recognizability of textual and graphical content are paramount. Techniques range from traditional hand-crafted feature-based models to deep learning approaches, with recent advances leveraging multi-modal LLMs and subjective human rating datasets. DIQA seeks to bridge low-level image degradations (blur, noise, illumination, compression) and high-level functional objectives (OCR accuracy, user readability).
1. DIQA Datasets and Ground-Truth Construction
DIQA research leverages subjective, task-driven, and psychophysical ground-truth labels:
- Subjective human ratings: Datasets such as DIQA-5000 comprise 5,000 document images, each rated by 15 annotators for overall quality, sharpness, and color fidelity on a continuous Mean Opinion Score (MOS) scale; rating reliability is evaluated via statistical measures such as Kendall’s W, with protocols following ITU-R BT.500 to remove outliers (Ma et al., 21 Sep 2025).
- Objective, task-driven ground truth: Several frameworks use downstream OCR accuracy as the criterion. For example, CG-DIQA evaluates performance by running ABBYY FineReader, Tesseract, and OmniPage on each test image and averaging their accuracies (Li et al., 2018). Other datasets (e.g., SmartDoc-QA, DocImg-QA) provide OCR-based scores for thousands of mobile-captured documents (Li et al., 2019).
- Psychophysical acceptability: Datasets for raster image DIQA involve human “acceptability” judgments, binarizing ratings (e.g., mean score ≥ 3.0 on a 1–5 scale) to set a threshold for visually satisfactory images at various scanning resolutions (Yang et al., 2023).
Quality labels may also be constructed synthetically, as with the 52,094-text-line dataset in (Li et al., 2019), where each sample’s label is a parameterized function of applied Gaussian blur.
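As a concrete, hypothetical illustration of such synthetic label construction, the sketch below degrades a clean text-line image with Gaussian blur and assigns a quality label as a simple monotone function of the blur strength; the exact label parameterization in (Li et al., 2019) may differ.

```python
import cv2
import numpy as np

def make_blurred_sample(text_line: np.ndarray, sigma: float, sigma_max: float = 5.0):
    """Degrade a clean text-line image with Gaussian blur and derive a quality label.

    The label is a simple monotone function of the blur strength sigma
    (a hypothetical choice for illustration; the original dataset uses its
    own parameterized mapping).
    """
    # Odd kernel size large enough to cover the Gaussian support.
    ksize = int(2 * round(3 * sigma) + 1)
    blurred = cv2.GaussianBlur(text_line, (ksize, ksize), sigmaX=sigma)
    # Map sigma in [0, sigma_max] to a quality label in [0, 1] (1 = sharp).
    label = float(np.clip(1.0 - sigma / sigma_max, 0.0, 1.0))
    return blurred, label

# Usage with a hypothetical clean line image:
# clean = cv2.imread("line_0001.png", cv2.IMREAD_GRAYSCALE)
# samples = [make_blurred_sample(clean, s) for s in np.linspace(0.5, 5.0, 10)]
```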
2. Metric-Based, No-Reference, and Full-Reference Approaches
DIQA methods are categorized by their dependence on reference images or domain knowledge:
- Full-reference IQA: Methods such as Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index Measure (SSIM/MS-SSIM), Distance Reciprocal Distortion (DRD), and Negative Rate Metric (NRM) quantify fidelity with respect to a clean “reference” image (Singh et al., 2017, Kirsten et al., 2021). These metrics are often used in document enhancement but are inapplicable when ground truth is unavailable (Singh et al., 2017). A minimal computation sketch is given after the table below.
- No-reference (blind) DIQA: Models predict quality directly from the test image alone, often via surrogate features. CG-DIQA computes sharpness via character patch gradients identified by MSER detection, relating the standard deviation of pooled patch gradients to OCR performance (Li et al., 2018). DocIQ fuses multi-level CNN features and semantic layout masks to regress multi-dimensional scores without reference images (Ma et al., 21 Sep 2025).
- Surrogate modeling: Regression surrogates (e.g., SVR, ANN, Gaussian Process) are trained to predict full-reference metrics (such as F-measure for binarization) from precomputed image features, enabling quality feedback where ground truth is missing at test time (Singh et al., 2017). A brief regression sketch follows this list.
- Classification-based methods: Acceptability is treated as a binary classification using hand-crafted features (Tenengrad, Laplacian, GLCM, entropy) and RBF-SVMs (Yang et al., 2023).
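As a minimal illustration of the surrogate-modeling setup, the sketch below trains an RBF-SVR to predict a full-reference metric from precomputed image features; the feature dimensionality, hyperparameters, and placeholder data are assumptions for illustration, not the configuration of (Singh et al., 2017).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# X: precomputed per-image features (e.g., sharpness, stroke-width, noise statistics);
# y: full-reference targets (e.g., binarization F-measure), available only at training time.
X_train = rng.random((200, 6))   # placeholder features
y_train = rng.random(200)        # placeholder FR-metric targets

surrogate = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0, epsilon=0.01))
surrogate.fit(X_train, y_train)

# At test time no reference image is needed: the surrogate predicts the FR metric
# directly from features of the test image.
X_test = rng.random((5, 6))
predicted_f_measure = surrogate.predict(X_test)
```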
The table below summarizes DIQA categories, typical input requirements, and principal output:
| Method Class | Input | Output |
|---|---|---|
| Full-reference (FR-IQA) | Test & reference | Scalar image-level score |
| No-reference (NR-IQA/DIQA) | Test only | Scalar or vector score |
| Surrogate modeling | Test & processed output (no reference) | Predicted FR metric |
| Human subjectivity/MOS-driven | Test only | MOS/class distribution |
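As an illustration of the full-reference setting summarized above, the following sketch computes PSNR and SSIM between a degraded document image and its clean reference using scikit-image; binarization-specific metrics such as DRD and NRM are omitted, and the usage is a placeholder.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(reference: np.ndarray, test: np.ndarray) -> dict:
    """PSNR and SSIM for a grayscale document image against its clean reference.

    Both inputs are uint8 arrays of identical shape; a real pipeline would
    first register/align the degraded capture to the reference.
    """
    assert reference.shape == test.shape, "FR-IQA requires pixel-aligned images"
    psnr = peak_signal_noise_ratio(reference, test, data_range=255)
    ssim = structural_similarity(reference, test, data_range=255)
    return {"PSNR": psnr, "SSIM": ssim}

# Usage (with pixel-aligned grayscale uint8 images ref and deg):
# scores = full_reference_scores(ref, deg)
```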
3. Algorithmic Paradigms and Model Architectures
DIQA frameworks employ varied algorithmic paradigms and neural architectures:
- Character-centric approaches: CG-DIQA extracts character candidates via Maximally Stable Extremal Regions (MSER), computes Sobel-based patch gradients, and uses their standard deviation as a sharpness-driven quality score, showing LCC/SROCC above 0.90 on public benchmarks (Li et al., 2018). A sketch of this pipeline appears after this list.
- Text-line based CNN frameworks: The method of (Li et al., 2019) detects text lines using CTPN, evaluates line-level quality via a ResNet-based regressor trained on synthetic blur, and aggregates line scores (via area-weighted or median pooling) into a document-level score.
- Deep multi-level fusion models: DocIQ processes input images with a ResNet-50 backbone and a dual-path downsampling that fuses semantic layout information, employing independent regression heads for each quality aspect (overall, sharpness, color fidelity) and distribution-matching losses (KL, EMD) (Ma et al., 21 Sep 2025).
- Multi-modal LLM adaptation: DeQA-Doc adapts the DeQA-Score MLLM-based architecture to DIQA, leveraging vision-language alignment, soft-label strategies (Gaussian pseudo-variance, linear interpolation), high-resolution encoders (CLIP-ViT, Qwen2.5-VL), ensemble strategies, and cross-entropy/KL training to produce MOS-matched quality distributions (Gao et al., 17 Jul 2025).
- Quality assessment in MLLMs: Q-Doc benchmarks MLLMs using a three-tiered framework: (i) coarse global scoring, (ii) distortion-type identification, and (iii) fine-grained severity estimation. It systematically evaluates models like DeepSeek-VL2, GPT-4o, and Llama-3.2-11B for both single- and multi-distortion detection (Huang et al., 14 Nov 2025).
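The sketch below illustrates the character-centric idea behind CG-DIQA: detect character-like regions with MSER, pool Sobel gradient magnitudes over those regions, and use their standard deviation as a sharpness score. The pooling and normalization details here are simplified assumptions, not the paper's exact formulation.

```python
import cv2
import numpy as np

def character_gradient_sharpness(gray: np.ndarray) -> float:
    """Sharpness score from character-region gradients (CG-DIQA-style sketch).

    Higher values indicate sharper character strokes; the pooling and
    normalization used in the original method may differ.
    """
    # Sobel gradient magnitude over the whole image.
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    grad_mag = np.sqrt(gx ** 2 + gy ** 2)

    # Character candidate regions via MSER (each region is an (N, 2) array of x, y points).
    mser = cv2.MSER_create()
    regions, _ = mser.detectRegions(gray)
    if len(regions) == 0:
        return 0.0

    # Pool gradient magnitudes over all candidate character pixels.
    pooled = np.concatenate([grad_mag[pts[:, 1], pts[:, 0]] for pts in regions])
    # Standard deviation of the pooled gradients as the sharpness score.
    return float(np.std(pooled))

# score = character_gradient_sharpness(cv2.imread("doc.png", cv2.IMREAD_GRAYSCALE))
```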
4. Evaluation Metrics, Experimental Protocols, and Benchmarks
Evaluation metrics are standardized for direct comparability:
- Correlational analysis: Linear Correlation Coefficient (LCC or PLCC), Spearman Rank Order Correlation Coefficient (SROCC, SRCC), and, where ground-truth distributions are available, Earth Mover’s Distance (EMD) and KL divergence are computed between predicted and reference scores per dimension (overall, sharpness, color fidelity) (Gao et al., 17 Jul 2025, Ma et al., 21 Sep 2025). A short computation sketch follows this list.
- Task-driven metrics: Some methods report correlation with average OCR accuracy (across several engines), using document-wise (median of per-document correlations) and aggregate (over all test images) scoring (Li et al., 2018, Li et al., 2019).
- Classification accuracy: For binary/multiclass schemes, overall accuracy, precision, recall, F1-score, and ROC-AUC are reported, as in the raster-image acceptability framework (Yang et al., 2023).
- Multi-level evaluations for MLLMs: Q-Doc introduces balanced and raw accuracy for distortion-type recognition and severity ratings, as well as SRCC/PLCC for ordinal quality matching (Huang et al., 14 Nov 2025).
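The correlational protocol above can be reproduced in a few lines with SciPy; the sketch below computes PLCC, SROCC, and a one-dimensional Earth Mover's Distance between predicted scores and MOS values, the latter as a simplified stand-in for the distribution-level comparison described above.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, wasserstein_distance

def correlation_report(pred: np.ndarray, mos: np.ndarray) -> dict:
    """PLCC/SROCC between predicted scores and MOS, plus a 1-D EMD between
    the two score distributions."""
    plcc, _ = pearsonr(pred, mos)
    srocc, _ = spearmanr(pred, mos)
    emd = wasserstein_distance(pred, mos)
    return {"PLCC": plcc, "SROCC": srocc, "EMD": emd}

# Example with dummy scores for five documents:
# print(correlation_report(np.array([3.1, 4.0, 2.2, 4.8, 3.5]),
#                          np.array([3.0, 4.2, 2.0, 4.9, 3.6])))
```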
Key benchmarks include DocImg-QA, SmartDoc-QA, DIQA-5000, DIBCO/H-DIBCO, and synthetic line-level sets covering a spectrum of degradations, document types, and layout complexity.
5. Comparative Performance and State-of-the-Art Results
Empirical results indicate:
- CG-DIQA outperforms general-purpose NR-IQA and prior hand-crafted metrics for blur-dominated degradations, achieving median LCC/SROCC >0.90, and competitive per-OCR-engine results (Li et al., 2018).
- The text-line CNN approach of (Li et al., 2019) raises SmartDoc-QA correlation to 74.33%/75.72% (LCC/SROCC with median pooling), exceeding previous DIQA pipelines.
- DocIQ’s multi-head feature fusion delivers PLCC/SRCC up to 0.91/0.88 on DIQA-5000 and >0.90 on SmartDoc-QA, outperforming non-specialized, strong general IQA models by several points (Ma et al., 21 Sep 2025).
- DeQA-Doc ensemble achieves a final score of 0.9288 on DIQA-5000, marking a new level of accuracy and generalizability across diverse document distortions (Gao et al., 17 Jul 2025).
- Q-Doc results show that while MLLMs display limited zero-shot DIQA capability (e.g., DeepSeek-VL2: SRCC 0.4474, GPT-4o 0.1321), Chain-of-Thought prompts give modest but consistent improvements, especially for multi-distortion recognition (DeepSeek-VL2 + CoT: multi Acc_bal 51.20%) (Huang et al., 14 Nov 2025).
- Hand-crafted feature + SVM models, when augmented for class imbalance and variance (e.g., noise simulation), achieve above 93% accuracy on scanning acceptability tasks, with clear resolution-dependent performance trends (Yang et al., 2023).
6. Practical Considerations, Limitations, and Extensions
- Generalization challenges: Hand-crafted and early learning methods are often blur-centric, unable to robustly capture complex degradations (shadows, moiré, occlusion) or generalize across scripts and layouts (Li et al., 2018, Li et al., 2019, Yang et al., 2023).
- Detector dependencies: Text- or line-aware pipelines are bottlenecked by detection recall and precision; errors in region localization cascade to erroneous quality estimation (Li et al., 2019).
- Variance modeling: Most methods regress or classify MOS, but few explicitly estimate annotator variance, leading to possible underestimation of uncertainty in ambiguous cases (Gao et al., 17 Jul 2025, Ma et al., 21 Sep 2025); a soft-label sketch illustrating one variance-aware labeling strategy follows this list.
- Resolution scaling: High-resolution support is critical for document DIQA; models like DeQA-Doc address this by modifying positional embeddings or using inherently flexible backbones (Gao et al., 17 Jul 2025).
- Ensembles and prompt engineering: Ensemble models consistently outperform single baselines; prompt-ensemble adds negligible benefit over model-ensemble in current MLLM-based DIQA (Gao et al., 17 Jul 2025). CoT prompt designs increase MLLM interpretability and accuracy in zero-shot settings (Huang et al., 14 Nov 2025).
- Real-time and embedded applications: SVM- and feature-based approaches can operate at ~10 ms per image, suitable for scanning firmware. CNN- and MLLM-based models, though more accurate, are computationally more intensive and may require tailoring for edge or mobile deployment (Yang et al., 2023, Gao et al., 17 Jul 2025, Ma et al., 21 Sep 2025).
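To make the variance-modeling point concrete, the sketch below converts a MOS and an (assumed) annotator standard deviation into a soft distribution over discrete quality levels by integrating a Gaussian over level bins, in the spirit of DeQA-Doc's Gaussian pseudo-variance soft labels; the level count, score range, and bin edges are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def soft_label(mos: float, std: float, levels: int = 5,
               lo: float = 1.0, hi: float = 5.0) -> np.ndarray:
    """Soft distribution over `levels` discrete quality levels from a MOS and
    a pseudo-standard-deviation (illustrative recipe, not DeQA-Doc's exact one).

    A Gaussian N(mos, std^2) is integrated over equal-width bins spanning
    [lo, hi]; probability mass outside the range is folded into the edge bins.
    """
    edges = np.linspace(lo, hi, levels + 1)
    cdf = norm.cdf(edges, loc=mos, scale=max(std, 1e-6))
    probs = np.diff(cdf)
    probs[0] += cdf[0]            # mass below the lowest level
    probs[-1] += 1.0 - cdf[-1]    # mass above the highest level
    return probs / probs.sum()

# Example: MOS 3.7 with annotator standard deviation 0.6 over 5 levels.
# print(soft_label(3.7, 0.6))
```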
7. Future Directions
Emerging research directions in DIQA include:
- Direct OCR-loss integration: Training DIQA models with OCR accuracy or text readability as part of the loss landscape (Ma et al., 21 Sep 2025).
- Self-supervised pretraining: Leveraging unlabeled document corpora to further improve robustness and transferability (Ma et al., 21 Sep 2025).
- Variance and uncertainty estimation: Dynamic modeling of annotator variance per image (Gao et al., 17 Jul 2025).
- Finer-grained scoring: Moving beyond five-level discrete tokens to score distributions or continuous quality spaces (Gao et al., 17 Jul 2025).
- Cross-modal and multi-task learning: Joint DIQA and OCR training, or hybrid approaches incorporating document analytics, semantic segmentation, and layout understanding (Gao et al., 17 Jul 2025, Huang et al., 14 Nov 2025).
- Distillation for deployment: Compressing MLLM-based DIQA models into lightweight specialist networks for deployment in scanning devices or real-time capture systems (Gao et al., 17 Jul 2025).
- Extension to new domains: Expanding DIQA to multi-script, handwritten, or historical documents, and to multi-page PDF pipelines (Gao et al., 17 Jul 2025, Ma et al., 21 Sep 2025).
The field continues to advance with the synthesis of domain-specific datasets, task-aligned architecture innovations, and rigorous benchmarking protocols, ensuring that DIQA remains a central concern in practical document processing, archiving, and automated capture applications.