ViDA-UGC Benchmark for Explainable IQA
- ViDA-UGC is a framework and dataset suite built through distortion-aware instruction tuning, combining human annotation with multimodal LLM (GPT-4o) assistance.
- It employs rigorous MILP-based image sampling and region-level distortion marking to ensure diverse coverage and robust, fine-grained IQA annotations.
- The benchmark enhances explainability in image quality analysis by integrating chain-of-thought reasoning and expert review of annotations.
The ViDA-UGC Benchmark is a framework and dataset suite developed to advance detailed, explainable image quality analysis for user-generated content (UGC) images. Distinguished from prior IQA resources, ViDA-UGC emphasizes distortion-aware instruction tuning, fine-grained grounding, and causal reasoning, combining extensive human annotation with advanced multimodal LLMs. Its construction incorporates statistically robust sampling, region-level distortion marking, and structured chain-of-thought prompts to empower both qualitative and quantitative evaluation of real-world, diverse UGC image artifacts.
1. Dataset Composition and Annotation Protocol
ViDA-UGC consists of 11,534 UGC images, each annotated through a distortion-oriented, human-in-the-loop pipeline. For each image:
- Image Selection and Human Assessment: Images are selected via a Mixed Integer Linear Programming (MILP) sampling procedure rather than random sampling, achieving uniform coverage of low-level image attributes; five human subjects then provide quality ratings (aggregated into MOS) and manually annotate bounding boxes for distortion regions.
- Distortion Regions: Annotators differentiate between “global” (bounding box area ratio > 0.7) and “local” distortions, the latter refined with Non-Maximum Suppression (NMS) to avoid redundant regional labeling.
- Annotation Dimensions: Each distortion is described along five attributes—type, position, severity, impact, significance—allowing comprehensive low-level visual characterization.
Summary Table:

| Subset | Purpose | Data Types | Example Tasks |
|--------------------|-------------------------------|-----------------------------------|------------------------------|
| ViDA-Grounding | Distortion localization | Boxes, region marks | Grounding, referring, region |
| ViDA-Perception | Low-level attribute analysis | Q/A pairs (multi-choice/VQA) | Type, severity, impact |
| ViDA-Description | Reasoned quality description | Causal chain-of-thought (CoT) | Overall and local causal reasoning |
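To make the annotation dimensions concrete, the sketch below models a single distortion annotation as a small Python data structure. The field names and the `is_global` helper are illustrative assumptions for exposition, not the dataset's released schema.

```python
# Illustrative sketch (assumed field names, not the released schema): one
# distortion annotation carrying the five attributes plus its bounding box.
from dataclasses import dataclass

@dataclass
class DistortionAnnotation:
    distortion_type: str      # e.g., "motion blur", "overexposure"
    bbox: tuple               # (x, y, w, h) in pixels
    position: str             # textual position, e.g., "upper-left region"
    severity: str             # e.g., "mild", "moderate", "severe"
    impact: str               # effect on perceived quality
    significance: str         # how much it drives the overall score

    def is_global(self, img_w: int, img_h: int, threshold: float = 0.7) -> bool:
        """Label as 'global' when the box covers more than 70% of the image."""
        _, _, w, h = self.bbox
        return (w * h) / (img_w * img_h) > threshold

# Example usage with toy values:
ann = DistortionAnnotation(
    distortion_type="motion blur",
    bbox=(0, 0, 1800, 1200),
    position="entire frame",
    severity="severe",
    impact="obscures facial details",
    significance="primary driver of the low score",
)
print(ann.is_global(img_w=1920, img_h=1280))  # True: area ratio ~= 0.88
```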
2. Distortion-Oriented Construction and Chain-of-Thought Framework
Central to ViDA-UGC’s methodology is its instruction-tuning pipeline utilizing Chain-of-Thought (CoT) reasoning:
- Prompt Engineering: Unlike direct prompts, CoT prompts decompose quality assessment into granular substeps: general impression, distortion detection, detailed analysis, selection of key distortions, causal reasoning, and final score assignment.
- Integration with GPT-4o: Human ground-truth marks and attributes are encoded as “set-of-mark” visual tokens and incorporated into the LLM prompt context.
- Causal Analysis: The annotated descriptions are not superficial; each quality judgment is explicitly linked to specific distortions through linguistic reasoning, mirroring how a human expert would explain the score.
CoT logic flow (LaTeX), following the substeps listed above:

$$\text{General impression} \rightarrow \text{Distortion detection} \rightarrow \text{Detailed analysis} \rightarrow \text{Key distortion selection} \rightarrow \text{Causal reasoning} \rightarrow \text{Final score}$$
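As a minimal illustration of how a distortion-aware, set-of-mark CoT prompt might be assembled, the sketch below composes annotated regions and the reasoning substeps into a single prompt string. The `build_cot_prompt` helper and its wording are assumptions for exposition, not the authors' exact prompt.

```python
# Hypothetical sketch of assembling a set-of-mark CoT prompt from human
# annotations; the template wording is illustrative, not the paper's prompt.
def build_cot_prompt(marked_regions):
    """marked_regions: list of dicts with mark id, distortion type, severity."""
    region_lines = [
        f"Region [{r['mark']}]: {r['type']} (severity: {r['severity']})"
        for r in marked_regions
    ]
    steps = [
        "1. Give a general impression of the image.",
        "2. Detect and list the visible distortions.",
        "3. Analyze each marked region in detail.",
        "4. Select the key distortions that dominate perception.",
        "5. Reason causally about how they degrade quality.",
        "6. Assign a final quality score.",
    ]
    return (
        "You are assessing the quality of a user-generated image.\n"
        "Annotated regions (set-of-mark):\n" + "\n".join(region_lines) + "\n\n"
        "Follow these steps:\n" + "\n".join(steps)
    )

prompt = build_cot_prompt([
    {"mark": 1, "type": "underexposure", "severity": "moderate"},
    {"mark": 2, "type": "motion blur", "severity": "severe"},
])
print(prompt)
```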
3. Benchmark Structure and Evaluation Tasks
The ViDA-UGC-Bench is derived from the main dataset and comprises 476 images with 6,149 QA pairs. Its composition is:
- Quality Analysis Instances: Each image has one overall description task demanding step-wise quality reasoning.
- Multi-Choice Perception Tasks: 2,567 QA pairs covering local attributes (e.g., “What is the type of distortion in region 2?”).
- Grounding Tasks: 3,106 region grounding instances require models to both localize and describe distortions.
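A hypothetical example of what a perception QA pair and a grounding instance might look like is given below; the field names and values are assumptions for illustration, not the released file format.

```python
# Illustrative (assumed) record formats for perception and grounding tasks.
perception_item = {
    "image": "ugc_001.jpg",
    "question": "What is the type of distortion in region 2?",
    "choices": ["motion blur", "overexposure", "compression artifacts", "noise"],
    "answer": "motion blur",
}

grounding_item = {
    "image": "ugc_001.jpg",
    "instruction": "Locate the most severe distortion and describe it.",
    "bbox": [412, 108, 233, 190],      # (x, y, w, h) in pixels
    "description": "Severe motion blur over the subject's face.",
}
```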
Distinctive features include expert validation of all GPT-generated annotations, ensuring high accuracy and reducing bias compared to prior automated labeling benchmarks.
4. Technical Sampling, Annotation, and Quality Scoring
The MILP-based sampling ensures diverse feature coverage without over-representation:
- Sampling Objective: Select images such that the distribution of low-level image attributes is as uniform as possible under a fixed sample budget; the MILP formulation balances per-attribute feature distributions rather than relying on random sampling (a sketch follows this list).
- Region Categorization: A distortion is labeled "global" when its bounding-box area ratio exceeds 0.7, i.e., $\frac{A_{\text{box}}}{A_{\text{image}}} > 0.7$; otherwise it is labeled "local", and NMS removes redundant overlapping local boxes.
- Quality Scores: MOS aggregation is only performed where inter-annotator agreement is adequate.
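A minimal sketch of how such MILP-based uniform sampling could be formulated, assuming binned low-level attributes and the PuLP solver, is shown below; the binning scheme, attribute names, and objective are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of MILP-based uniform sampling (assumed formulation): pick N
# images so that, for each low-level attribute, per-bin counts stay close to a
# uniform target. Requires: pip install pulp
import random
import pulp

random.seed(0)

NUM_CANDIDATES = 200          # candidate pool size (toy numbers)
N_SELECT = 50                 # images to select
ATTRIBUTES = ["brightness", "contrast", "blur"]   # assumed attributes
NUM_BINS = 5
TARGET = N_SELECT / NUM_BINS  # uniform per-bin target

# Toy attribute bins: bin index per attribute for each candidate image.
bins = {
    i: {a: random.randrange(NUM_BINS) for a in ATTRIBUTES}
    for i in range(NUM_CANDIDATES)
}

prob = pulp.LpProblem("uniform_sampling", pulp.LpMinimize)

# Binary selection variable per candidate image.
x = pulp.LpVariable.dicts("x", range(NUM_CANDIDATES), cat="Binary")

# Deviation variables: |per-bin count - target| for each (attribute, bin).
dev = pulp.LpVariable.dicts(
    "dev", [(a, b) for a in ATTRIBUTES for b in range(NUM_BINS)], lowBound=0
)

# Exactly N images are selected.
prob += pulp.lpSum(x[i] for i in range(NUM_CANDIDATES)) == N_SELECT

# Linearized absolute-deviation constraints per attribute bin.
for a in ATTRIBUTES:
    for b in range(NUM_BINS):
        count = pulp.lpSum(x[i] for i in range(NUM_CANDIDATES) if bins[i][a] == b)
        prob += dev[(a, b)] >= count - TARGET
        prob += dev[(a, b)] >= TARGET - count

# Objective: total deviation from uniform coverage across all attribute bins.
prob += pulp.lpSum(dev.values())

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(NUM_CANDIDATES) if x[i].value() == 1]
print(len(selected), "images selected")
```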
5. Applications and Implications
ViDA-UGC supports a diversity of quality analysis tasks, directly benefiting multiple domains:
- Quality Control and Monitoring: Enables precise localization and identification of diverse artifacts, automating quality management pipelines.
- Image Restoration Guidance: Grounded cause-and-effect analysis allows for targeted restoration strategies, informed by which distortions most affect perceived quality.
- Explainability in IQA: The dataset’s causal structure and low-level attribute details allow for interpretable outcomes, enhancing trust and usability of automatic IQA systems.
Experimental evidence confirms that instruction tuning MLLMs on ViDA-UGC can yield performance on ViDA-UGC-Bench surpassing even GPT-4o, underscoring the benchmark's utility for advancing explainable and high-performing IQA.
6. Limitations, Expert Validation, and Future Directions
- Expert-Based Corrections: All GPT outputs in the benchmark are reviewed by a team of image-processing professionals to mitigate LLM-induced errors or hallucinations.
- Bias and Consistency: Despite strong performance, potential bias remains due to the reliance on GPT-4o for large-scale generation; expert verification partly addresses this concern.
- Extensions: The framework naturally generalizes to other modalities (e.g., video, cross-modal IQA) and can incorporate additional distortion types and finer attribute distinctions in future releases.
- Research Opportunities: Fine-grained, reasoning-based QA tasks present new challenges for next-generation MLLMs, supporting improvements in restoration, optimization, and end-user explainability.
7. Comparative Context and Significance
Compared with prior benchmarks (e.g., Q-Bench), ViDA-UGC and ViDA-UGC-Bench are unique in their requirement for causal reasoning, fine-grained distortion marking, and statistically uniform sample distribution. This positions ViDA-UGC as the most challenging and comprehensive benchmark for instruction-tuned MLLMs in the UGC domain, representing a paradigm shift in the field’s approach to image quality analysis and restoration guidance.
In conclusion, the ViDA-UGC Benchmark establishes a new standard for explainable, fine-grained IQA in UGC, combining robust statistical design, expert-labeled ground truth, and advanced linguistic reasoning. Its deployment enables broad advancements in automated image quality analysis, restoration, and explainable decision support across real-world user-generated content contexts (Liao et al., 18 Aug 2025).