DIQA-5000: Benchmark for DIQA Models
- The DIQA-5000 dataset is a large-scale, subjectively rated collection featuring 5,000 document images with varied real-world distortions.
- It employs a comprehensive enhancement pipeline and a rigorous subjective rating protocol across three perceptual dimensions, enabling effective benchmarking of DIQA models such as DocIQ.
- Quantitative metrics like PLCC and SRCC highlight DocIQ’s superior performance compared to other state-of-the-art models in document image quality assessment.
The DIQA-5000 dataset is a large-scale, subjectively rated resource for document image quality assessment (DIQA). It consists of 5,000 enhanced document images derived from real-world printed materials with a range of distortion types and evaluated across three perceptual quality dimensions. The dataset underpins the evaluation and development of DIQA models such as DocIQ, which employs feature fusion and document layout analysis to predict subjective ratings, and it benchmarks both human and machine assessment of document image quality across multiple axes (Ma et al., 21 Sep 2025).
1. Dataset Construction and Distortion Taxonomy
The source images for DIQA-5000 originate from 500 unique photographs of printed paper documents (produced from publicly available PDFs at 300 dpi), covering a representative variety of layouts: pure text, tables, and mixed text–graphics. Each photograph exhibits one of five major distortion classes, with 100 photographs per class:
- Shadow (uneven lighting)
- Occlusion (partial obstruction)
- Blurring (motion blur or defocus)
- Creases (physical folds)
- Moiré patterns (screen/capture artifacts)
An enhancement pipeline built from randomized combinations of six processing stages (dewarp with three implementations, demoiré with two, occlusion removal with two, deblur with three, deshadow with four, and appearance enhancement with nine) generates 10 unique restored variants of each distorted document. This yields 500 × 10 = 5,000 finalized images that span both broad and subtle quality variations, capturing diverse real-world challenges for DIQA systems.
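The variant-generation step can be illustrated with a minimal sketch. The stage names and implementation counts come from the description above; the sampling rule (an independent coin flip per stage, fixed stage order) and the helper names `sample_pipeline` and `sample_variants` are assumptions made for illustration, not the procedure used to build the dataset.

```python
import random

# Stage name -> number of available implementations, as listed in the text.
STAGES = {
    "dewarp": 3,
    "demoire": 2,
    "occlusion_removal": 2,
    "deblur": 3,
    "deshadow": 4,
    "appearance_enhancement": 9,
}

def sample_pipeline(rng):
    """Pick a random non-empty subset of stages and one implementation index per stage.

    Assumption: each stage is included with probability 0.5 and stages keep
    the order in which they are listed above.
    """
    while True:
        chosen = [(name, rng.randrange(n)) for name, n in STAGES.items()
                  if rng.random() < 0.5]
        if chosen:
            return tuple(chosen)

def sample_variants(rng, k=10):
    """Draw k distinct pipelines for one distorted source photograph."""
    variants = set()
    while len(variants) < k:
        variants.add(sample_pipeline(rng))
    return sorted(variants)

rng = random.Random(0)
variants_per_source = {src: sample_variants(rng) for src in range(500)}
total_images = sum(len(v) for v in variants_per_source.values())
```

With 500 source photographs and 10 distinct pipelines each, `total_images` matches the 5,000 images in the dataset.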
2. Subjective Rating Protocol
Rigorous subjective assessment is achieved via a panel of 23 experienced raters. The set of 5,000 images is partitioned into five balanced batches, each containing 1,000 images and evaluated by 15 raters (ensuring all images receive 15 independent judgments per metric). Each image is scored independently for three quality dimensions:
- Overall Quality
- Sharpness
- Color Fidelity
All scores are recorded on a five-point Likert scale (1 = very poor, 5 = excellent). The protocol incorporates outlier and inconsistency removal aligned with ITU-R BT.500 standards, which include normalization and z-score based screening of rater reliability. Following cleaning, a Mean Opinion Score (MOS) is computed for image $i$ in dimension $d$ as:
$$\mathrm{MOS}_i^{(d)} = \frac{1}{N_i^{(d)}} \sum_{r=1}^{N_i^{(d)}} s_{i,r}^{(d)}$$
where $s_{i,r}^{(d)}$ is the score from rater $r$, and $N_i^{(d)}$ is the number of ratings retained after data cleaning (at most 15).
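The averaging step can be sketched as follows. This is a simplified per-rating z-score screen, not the ITU-R BT.500 observer-screening procedure; the function name `mean_opinion_score` and the cutoff `z_threshold` are hypothetical.

```python
from statistics import mean, pstdev

def mean_opinion_score(ratings, z_threshold=2.0):
    """Average the ratings for one image and one dimension after a simple z-score screen.

    ratings: list of 1-5 Likert scores from the raters who saw the image.
    z_threshold: hypothetical cutoff; ratings farther than this many standard
    deviations from the image mean are dropped before averaging.
    """
    mu = mean(ratings)
    sigma = pstdev(ratings)
    if sigma == 0:
        return float(mu)  # all raters agree, nothing to screen
    kept = [s for s in ratings if abs(s - mu) / sigma <= z_threshold]
    if not kept:
        kept = ratings  # guard for very small thresholds
    return sum(kept) / len(kept)

# Fifteen ratings for one image; the single score of 1 lies far from the rest.
example = [4, 4, 5, 4, 3, 4, 5, 4, 4, 4, 5, 4, 4, 1, 4]
mos = mean_opinion_score(example)
```

In this example the lone rating of 1 is screened out, so the MOS is the mean of the remaining 14 scores.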
3. Dataset Structure and Data Splits
DIQA-5000 comprises an equal representation of five distortion classes with 1,000 images each. The recommended data split, utilized throughout the benchmark experiments, is:
- 80% for training (4,000 images)
- 20% for testing (1,000 images)
- Optionally, ~10% of the training set (e.g., 360 images) held out for validation
Splits are distortion-balanced so that each class is proportionally represented in each partition. All three MOS dimensions span the full [1, 5] Likert range; sharpness scores are modestly shifted toward higher values, while color fidelity displays a relatively flat distribution.
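A distortion-balanced 80/20 split can be sketched as below. The label layout, class names, and the helper `balanced_split` are illustrative assumptions. The sketch splits at the image level; whether variants of the same source photograph are kept in the same partition is not stated here.

```python
import random

CLASSES = ["shadow", "occlusion", "blur", "crease", "moire"]

def balanced_split(labels, test_fraction=0.2, seed=0):
    """Return (train_ids, test_ids) with test_fraction of each class in the test set.

    labels: dict mapping image id -> distortion class.
    """
    rng = random.Random(seed)
    by_class = {}
    for image_id, cls in labels.items():
        by_class.setdefault(cls, []).append(image_id)
    train_ids, test_ids = [], []
    for cls in sorted(by_class):
        ids = sorted(by_class[cls])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_fraction)
        test_ids.extend(ids[:n_test])
        train_ids.extend(ids[n_test:])
    return train_ids, test_ids

# Synthetic labels: 1,000 images per distortion class, 5,000 in total.
labels = {i: CLASSES[i // 1000] for i in range(5000)}
train_ids, test_ids = balanced_split(labels)
```

Each class contributes 800 images to training and 200 to testing, which reproduces the 4,000/1,000 partition sizes.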
4. Evaluation Metrics
Performance is quantified by two correlation-based metrics calculated between predicted and subjective scores:
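- Pearson Linear Correlation Coefficient (PLCC):
$$\mathrm{PLCC} = \frac{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}\,\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$
- Spearman Rank Correlation Coefficient (SRCC):
$$\mathrm{SRCC} = 1 - \frac{6 \sum_{i=1}^{N} d_i^2}{N(N^2 - 1)}$$
where $\hat{y}_i$ and $y_i$ are the predicted and subjective scores of the $i$-th sample, $\bar{\hat{y}}$ and $\bar{y}$ are their means over the $N$ test samples, and $d_i$ denotes the difference between the ranks of the prediction and the ground truth for the $i$-th sample.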
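Both metrics can be computed with a few lines of standard-library Python. This is a minimal sketch: the rank-difference form of SRCC used here assumes no tied scores, and the function names are illustrative.

```python
from math import sqrt

def plcc(pred, target):
    """Pearson linear correlation between predicted and subjective scores."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target))
    var_p = sum((p - mean_p) ** 2 for p in pred)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / sqrt(var_p * var_t)

def ranks(values):
    """Ranks from 1 to N in ascending order; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def srcc(pred, target):
    """Spearman rank correlation via squared rank differences (no ties)."""
    n = len(pred)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(pred), ranks(target)))
    return 1 - 6 * d_squared / (n * (n * n - 1))

# Toy scores: two neighbouring samples swap rank order.
pred = [1.2, 2.9, 3.1, 4.8, 4.0]
target = [1.0, 3.0, 2.5, 5.0, 4.5]
rho = srcc(pred, target)
r = plcc(pred, target)
```

For the toy scores the squared rank differences sum to 2, giving an SRCC of 1 - 12/120 = 0.9. When ties are present, a tie-aware implementation should be used instead.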
5. DocIQ Model Architecture
The DocIQ model, designed for no-reference DIQA and evaluated using DIQA-5000, is hierarchically modular:
- Layout Fusion Downsampler: Implements dual-path downsampling. One path applies standard spatial downsampling; the other concatenates the image with its semantic layout mask (delineating text, tables, figures) before downsampling. Both streams are fused to retain structural cues at the reduced resolution.
- Backbone Network: Utilizes an ImageNet-pretrained ResNet-50 to extract hierarchical, multiscale representations.
- Feature Fusion Module: At each stage $k$ of the backbone, the feature maps $F_k$ undergo a series of bottleneck convolutions, spatial transforms, and inverse bottlenecks, then are recursively added to the next stage. This hyper-structure propagates both low-level detail and high-level semantics into a unified global feature $F_g$.
- Parallel Quality Regressors: Three independent head networks, one per quality dimension, simultaneously predict all 15 rater scores $\hat{s}_{i,r}^{(d)}$. Each head consists of a shared linear layer, a dimension-specific linear layer, and outputs 15 scalars.
The training objective is the mean squared error across the $B$ images in a batch, the $D = 3$ dimensions, and the $R = 15$ per-image ratings:
$$\mathcal{L} = \frac{1}{B \, D \, R} \sum_{i=1}^{B} \sum_{d=1}^{D} \sum_{r=1}^{R} \left( \hat{s}_{i,r}^{(d)} - s_{i,r}^{(d)} \right)^2$$
where $s_{i,r}^{(d)}$ is the score that rater $r$ assigned to image $i$ in dimension $d$.
After training, per-dimension MOS is derived by averaging the 15 predicted rater scores for each image.
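The loss and the MOS derivation can be illustrated with plain nested lists. This sketch covers only the arithmetic of the objective and the final averaging, not the network itself; the `[image][dimension][rater]` data layout and the function names are assumptions.

```python
def mse_loss(pred, target):
    """Mean squared error over a batch.

    pred and target are nested lists indexed [image][dimension][rater],
    with 3 dimensions and 15 rater scores per image in the setting above.
    """
    total, count = 0.0, 0
    for pred_img, tgt_img in zip(pred, target):
        for pred_dim, tgt_dim in zip(pred_img, tgt_img):
            for p, t in zip(pred_dim, tgt_dim):
                total += (p - t) ** 2
                count += 1
    return total / count

def predicted_mos(pred_img):
    """Average the predicted rater scores per dimension for one image."""
    return [sum(dim_scores) / len(dim_scores) for dim_scores in pred_img]

# Toy batch: 2 images, 3 dimensions (overall, sharpness, color fidelity),
# 15 rater scores each. Every prediction is off by 0.5.
target = [[[4.0] * 15 for _ in range(3)] for _ in range(2)]
pred = [[[3.5] * 15 for _ in range(3)] for _ in range(2)]
loss = mse_loss(pred, target)
mos = predicted_mos(pred[0])
```

Because every prediction differs from its target by 0.5, the loss is 0.25, and the predicted MOS for each dimension of the first image is 3.5.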
6. Benchmark Performance and Comparative Results
The DIQA-5000 benchmark, using an 80/20 train/test split, enables quantitative comparison between DocIQ and several state-of-the-art image quality assessment models. Summary results in terms of PLCC/SRCC:
| Method | Overall | Sharpness | Color Fidelity |
|---|---|---|---|
| DBCNN | 0.5869/0.5421 | 0.6163/0.6037 | 0.6335/0.6399 |
| HyperIQA | 0.8437/0.8024 | 0.8542/0.8197 | 0.8439/0.8155 |
| MUSIQ | 0.8585/0.8554 | 0.8698/0.8460 | 0.8460/0.8383 |
| RichIQA | 0.8660/0.8541 | 0.8770/0.8357 | 0.8622/0.8557 |
| StairIQA | 0.8502/0.8004 | 0.8671/0.8359 | 0.8691/0.8476 |
| TReS | 0.8628/0.8080 | 0.8800/0.8267 | 0.8658/0.8338 |
| DocIQ | 0.9083/0.8832 | 0.9006/0.8615 | 0.8907/0.8666 |
DocIQ achieves the highest PLCC and SRCC in every dimension, with PLCC between 0.8907 and 0.9083 and SRCC between 0.8615 and 0.8832. On the SmartDoc-QA OCR-focused dataset, DocIQ achieves SRCC = 0.9086 and PLCC = 0.9218 against character-level accuracy, and SRCC = 0.8989 and PLCC = 0.9107 against word-level accuracy. These results support the advantage of DocIQ’s feature and layout fusion, together with its parallel prediction heads, in modeling document-specific quality (Ma et al., 21 Sep 2025).
7. Significance and Applications
DIQA-5000 establishes a rigorous foundation for multi-dimensional DIQA benchmarking, enabling both subjective and objective evaluation pipelines for document restoration, OCR preprocessing, and imaging workflow optimization. The dataset’s architecture supports nuanced analysis by encompassing multiple distortion types, restoration scenarios, and explicit subjective dimensions. The associated DocIQ model exemplifies integrated exploitation of document structure and content-level features, and its architecture supports concurrent prediction of orthogonal perceptual qualities—critical for downstream applications with varying sensitivity to visual artifacts. A plausible implication is the potential for fine-grained quality control and perceptual optimization in digital document workflows.