
iDETEX: Unified Multimodal IQA Model

Updated 27 October 2025
  • iDETEX is a unified multimodal large language model that integrates visual and textual data to perform quality grounding, perception, and description tasks.
  • It employs specialized augmentation strategies, including spatial perturbation, query-style alignment, and score-aware techniques, to enhance distortion detection and explanation.
  • Validated on industry benchmarks like ViDA-UGC and ICCV MIPI, iDETEX delivers robust, high-resolution image quality assessments with transparent, human-aligned justifications.

iDETEX is a unified multimodal LLM (MLLM) for Image Quality Assessment (IQA), capable of performing detailed and interpretable evaluation of visual content. It simultaneously addresses quality grounding, perception, and description tasks, advancing IQA from scalar score prediction to holistic, human-aligned assessment. iDETEX integrates a visual-language backbone, specialized data augmentations, and high-resolution processing to provide robust, accurate, and interpretable outputs validated on industry benchmarks and competitive challenges.

1. Unified Model Architecture

iDETEX employs a visual-language backbone, notably InternVL3, to fuse image and textual modalities. It is designed to operate on three IQA subtasks—quality grounding, quality perception, and quality description—in a single framework. The architecture includes:

  • Task-specific offline augmentation modules:
    • Spatial Perturbation Augmentation enables spatial localization by random cropping ($I' = \mathrm{Crop}(I; \alpha)$, $\alpha \in (0,1]$) and horizontal flipping ($I^f(x,y) = I(W-x-1, y)$) with corresponding bounding box adjustments (see the sketch at the end of this section).
    • Query-Style Aligned Augmentation reforms textual inputs to align training queries with expected test styles.
    • Score-Aware Augmentation refines the description task via score-adjusted inference strategies.
  • Task-Aware Augmented Data Mixing Strategy:

Supervisory signals from the three tasks are fused to promote shared representation learning while retaining task-specific structure.

  • Online High-Resolution Input Enhancement:

During fine-tuning, high-resolution images of up to 2048 visual tokens are used, improving the model’s sensitivity to fine-grained distortions.

This architecture enables simultaneous detection, analysis, and explanation of image degradation in a highly structured and interpretable manner.
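As a rough illustration of the Spatial Perturbation Augmentation described above, the following Python sketch applies a random crop and horizontal flip and adjusts an axis-aligned bounding box accordingly. It is not code from the paper; the crop-ratio range, flip probability, and (x_min, y_min, x_max, y_max) box convention are assumptions made for the example.

```python
import random

def spatial_perturbation(image_w, image_h, bbox,
                         alpha_range=(0.5, 1.0), flip_prob=0.5):
    """Hypothetical crop-and-flip augmentation with bounding-box correction.

    bbox is (x_min, y_min, x_max, y_max) in pixel coordinates of the original
    image; returns the crop window and the adjusted bbox, or None if the
    distorted region falls entirely outside the crop.
    """
    # Random crop I' = Crop(I; alpha), alpha in (0, 1]: keep an alpha-scaled window.
    alpha = random.uniform(*alpha_range)
    crop_w, crop_h = int(image_w * alpha), int(image_h * alpha)
    x0 = random.randint(0, image_w - crop_w)
    y0 = random.randint(0, image_h - crop_h)

    # Shift the box into crop coordinates and clip it to the crop window.
    x_min = max(bbox[0] - x0, 0)
    y_min = max(bbox[1] - y0, 0)
    x_max = min(bbox[2] - x0, crop_w)
    y_max = min(bbox[3] - y0, crop_h)
    if x_min >= x_max or y_min >= y_max:
        return None  # region cropped away; in practice one would resample

    # Horizontal flip I^f(x, y) = I(W - x - 1, y): mirror box x-coordinates.
    if random.random() < flip_prob:
        x_min, x_max = crop_w - x_max, crop_w - x_min

    return (x0, y0, crop_w, crop_h), (x_min, y_min, x_max, y_max)
```

The key point is that every geometric change to the image must be mirrored in the grounding labels, so the localization supervision stays consistent with the augmented input.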

2. Core Tasks and Capabilities

iDETEX is fundamentally differentiated by its joint support for three key IQA tasks:

  1. Quality Grounding: Detects distortion boundaries and localizes spatial regions of degradation. Spatial augmentations improve robustness to distortion location and facilitate precise localization.
  2. Quality Perception: Assesses which elements are degraded by selecting semantically relevant descriptions in a multiple-choice format. Query-style alignment ensures training data matches real-world test scenarios for improved generalizability.
  3. Quality Description: Synthesizes localized cues and perception insights into natural language explanations. Score-aware augmentation strategies, including Score-Driven Inference Simplification and Granularity-Aware Label Refinement, discretize the continuous Mean Opinion Score (MOS) into finer quality levels before mapping them to standard five-level human-readable categories (bad, poor, fair, good, excellent); a minimal sketch of this mapping appears below.

By integrating these subtasks, iDETEX delivers scalar scores alongside comprehensive, causally motivated justifications for its assessments.
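To make the description-side mapping concrete, the sketch below discretizes a continuous MOS into fine intervals and then collapses them onto the five human-readable categories. The MOS range and number of fine bins are assumptions; the paper’s Granularity-Aware Label Refinement may use different boundaries.

```python
def mos_to_quality_level(mos, mos_min=0.0, mos_max=100.0, fine_bins=20):
    """Illustrative score-to-category mapping; range and bin count are assumed."""
    levels = ["bad", "poor", "fair", "good", "excellent"]
    # Normalize the score and assign it to one of `fine_bins` equal-width intervals.
    normalized = (mos - mos_min) / (mos_max - mos_min)
    fine_index = min(int(normalized * fine_bins), fine_bins - 1)
    # Collapse the fine-grained bins onto the five human-readable levels.
    coarse_index = fine_index * len(levels) // fine_bins
    return levels[coarse_index]

print(mos_to_quality_level(73.0))  # -> "good" under these assumed bounds
```

Discretizing first and mapping second keeps the textual label consistent with the underlying score, which is what lets the description task remain score-aware.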

3. Training and Data Augmentation Strategies

Training iDETEX leverages a suite of approaches for efficient and generalizable learning:

  • Offline augmentation modules:
    • Grounding: Random cropping and flipping, with bounding box correction.
    • Perception: Query-style rephrasing to match evaluation formats.
    • Description: Two-step score-aware augmentations—separating score prediction (Score-Driven Inference Simplification) and refining label granularity.
  • Augmented data mixing:

Heterogeneous data sources, enhanced with task-specific augmentations, are mixed in various ratios to foster robust joint learning; a simple sketch of such ratio-based mixing appears at the end of this section.

  • Online enhancements:

High-resolution training images supply richer structural information, particularly benefiting distortion localization.

These strategies collectively reduce annotation requirements while supporting shared visual-language representations and strong task-specific performance.
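The data-mixing step can be pictured as sampling each training batch from the three augmented task pools at fixed ratios. The sketch below is a minimal illustration; the pool contents, ratios, and batch size are placeholders rather than the values used to train iDETEX.

```python
import random

def mix_task_batch(grounding, perception, description,
                   ratios=(0.4, 0.3, 0.3), batch_size=32):
    """Sample one mixed batch from three task-specific example pools (assumed ratios)."""
    pools = {"grounding": grounding, "perception": perception, "description": description}
    # Integer rounding may leave the batch slightly short of batch_size; fine for a sketch.
    counts = {name: int(batch_size * r) for name, r in zip(pools, ratios)}
    batch = []
    for name, n in counts.items():
        batch.extend(random.choices(pools[name], k=n))  # sample with replacement
    random.shuffle(batch)  # interleave tasks so each step sees mixed supervision
    return batch
```

Mixing at the batch level ensures the shared backbone sees grounding, perception, and description supervision in every optimization step.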

4. Evaluation and Quantitative Results

iDETEX’s capabilities are established on the large-scale ViDA-UGC benchmark and in the ICCV MIPI 2025 Challenge, with comprehensive metrics:

| Task        | Metric                  | Reported Value (Challenge) |
|-------------|-------------------------|----------------------------|
| Grounding   | Mean Average Precision  | Improvement with high-res  |
| Perception  | Accuracy                | ~0.81                      |
| Description | Key Distortion Accuracy | 0.43                       |
| Description | Image Quality Accuracy  | 0.83                       |
  • Grounding: Region and distortion mAP quantitatively assess localization quality; both build on box overlap (IoU), illustrated in the sketch below.
  • Perception: Multiple-choice accuracy evaluates semantic understanding of distortion.
  • Description: Description mAP, Key Distortion Accuracy (ACC₀.₅), and overall Image Quality Accuracy measure interpretability and grading precision.

iDETEX consistently outperforms competitor models across all tasks, particularly when high-resolution inputs and the augmentation strategies are used, demonstrating both detailed detection and reliable qualitative assessment.
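Localization metrics such as region and distortion mAP rest on box overlap. The sketch below computes IoU between axis-aligned boxes and a simple accuracy at an IoU threshold of 0.5; it is a simplified, one-to-one-matched stand-in, not the official challenge evaluation code.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x_min, y_min, x_max, y_max) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(ix1 - ix0, 0) * max(iy1 - iy0, 0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def accuracy_at_iou(predictions, ground_truths, threshold=0.5):
    """Fraction of predicted boxes matching their paired ground truth at IoU >= threshold."""
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return hits / len(predictions) if predictions else 0.0
```

Full mAP additionally sweeps confidence thresholds and averages precision over recall, but the IoU test above is the core matching criterion.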

5. Explainability Mechanisms

Explainability is foundational to iDETEX, achieved via:

  • Multi-output framework:

Outputs include spatial distortion localization, perceptual analysis, and human-readable descriptions, enabling transparent, stepwise reasoning from raw image to final score; a hypothetical output record is sketched at the end of this section.

  • Task-specific augmentations and mapping:
    • Bounding box adjustments offer clear visual cues for distortion.
    • Granularity-Aware Label Refinement subdivides MOS into fine intervals and remaps to qualitative levels, directly linking numerical scores to textual judgments.
  • Score-aware strategies:

By separating score prediction (Score-Driven Inference Simplification) and refining the qualitative mapping (Granularity-Aware Label Refinement), the model’s assessments are transparently traceable and interpretable.

This approach closely mirrors human visual evaluation and surpasses scalar-only rating systems in articulating the justification for quality judgments.
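As a way to picture the multi-output framework, the sketch below bundles the coupled outputs into one structured record. The field names and values are hypothetical, not the model’s actual output schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class IQAAssessment:
    """Hypothetical container for a single image's grounded, explained assessment."""
    distortion_boxes: List[Tuple[float, float, float, float]]  # grounding: localized regions
    perception_choice: str   # perception: selected multiple-choice description
    description: str         # free-form, human-readable quality explanation
    mos: float               # predicted Mean Opinion Score
    quality_level: str       # mapped five-level category (bad ... excellent)

report = IQAAssessment(
    distortion_boxes=[(120.0, 40.0, 380.0, 260.0)],
    perception_choice="Motion blur affects the foreground subject",
    description="The foreground subject is blurred by camera motion; exposure elsewhere is acceptable.",
    mos=62.5,
    quality_level="fair",
)
```

Keeping localization, perception, and description in one record is what makes the final score traceable back to concrete visual evidence.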

6. Applications and Implications

iDETEX’s granularity and explainability enable a range of applications:

  • Photography and imaging quality control:

Automated, detailed feedback for both consumer and professional imagery.

  • Guided image restoration:

Precise distortion localization supports targeted enhancement and restoration algorithms.

  • Industrial and medical image testing:

Diagnostic assessment with interpretable outcomes aids process reliability and issue rectification.

  • Content moderation and curation:

Human-aligned scoring and explanations promote user trust and support automated platform workflows.

For research, iDETEX’s unified and interpretable IQA marks a conceptual shift from black-box predictors to transparent, multimodal analysis. Data augmentation, mixing, and resolution strategies open avenues for transfer learning and improved generalization in low-data settings. The fusion of high-resolution multimodal processing stands to inspire advancements in related low-level vision tasks.

7. Conceptual Advances and Future Directions

iDETEX encapsulates key advances in multimodal IQA by embedding explainability and multi-task learning into its architecture. The method’s empirical success suggests further investigation into:

  • Enhanced augmentation and data mixing strategies for larger, more diverse datasets.
  • Development of architectures optimized for related perceptual tasks (e.g., video, medical imaging quality).
  • Integration with content production pipelines, where real-time interpretability and granularity are essential.

The paradigm established by iDETEX encourages ongoing research toward interpretable, causally motivated assessment systems across the vision-language domain.
