ViDA-UGC Benchmark for Explainable IQA
- ViDA-UGC is a framework and dataset suite built through distortion-aware instruction tuning, combining human annotation with multimodal LLM (GPT-4o) assistance.
- It employs rigorous MILP-based image sampling and region-level distortion marking to ensure diverse coverage and robust, fine-grained IQA annotations.
- The benchmark enhances explainability in image quality analysis by integrating chain-of-thought reasoning and expert review of annotations.
The ViDA-UGC Benchmark is a framework and dataset suite developed to advance detailed, explainable image quality analysis for user-generated content (UGC) images. Distinguished from prior IQA resources, ViDA-UGC emphasizes distortion-aware instruction tuning, fine-grained grounding, and causal reasoning, combining extensive human annotation with advanced multimodal LLMs. Its construction incorporates statistically robust sampling, region-level distortion marking, and structured chain-of-thought prompts to empower both qualitative and quantitative evaluation of real-world, diverse UGC image artifacts.
1. Dataset Composition and Annotation Protocol
ViDA-UGC consists of 11,534 UGC images, each annotated through a distortion-oriented, human-in-the-loop pipeline. For each image:
- Image Selection and Human Assessment: Images are selected via a Mixed Integer Linear Programming (MILP) sampling procedure rather than random sampling, achieving uniform coverage of low-level image attributes; five human subjects then provide quality ratings (aggregated into MOS) and manually annotate bounding boxes for distortion regions.
- Distortion Regions: Annotators differentiate between “global” (bounding box area ratio > 0.7) and “local” distortions, the latter refined with Non-Maximum Suppression (NMS) to avoid redundant regional labeling.
- Annotation Dimensions: Each distortion is described along five attributes—type, position, severity, impact, significance—allowing comprehensive low-level visual characterization.
Summary Table:

| Subset | Purpose | Data Types | Example Tasks |
|--------------------|-------------------------------|-----------------------------------|------------------------------|
| ViDA-Grounding | Distortion localization | Boxes, region marks | Grounding, referring, region |
| ViDA-Perception | Low-level attribute analysis | Q/A pairs (multi-choice/VQA) | Type, severity, impact |
| ViDA-Description | Reasoned quality description | Causal chain-of-thought (CoT) | Overall and local causal reasoning |
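To make the annotation dimensions concrete, the sketch below models a single distortion annotation as a small Python data structure. The field names and the `is_global` helper are illustrative assumptions for exposition, not the dataset's released schema.

```python
# Illustrative sketch (assumed field names, not the released schema): one
# distortion annotation carrying the five attributes plus its bounding box.
from dataclasses import dataclass

@dataclass
class DistortionAnnotation:
    distortion_type: str      # e.g., "motion blur", "overexposure"
    bbox: tuple               # (x, y, w, h) in pixels
    position: str             # textual position, e.g., "upper-left region"
    severity: str             # e.g., "mild", "moderate", "severe"
    impact: str               # effect on perceived quality
    significance: str         # how much it drives the overall score

    def is_global(self, img_w: int, img_h: int, threshold: float = 0.7) -> bool:
        """Label as 'global' when the box covers more than 70% of the image."""
        _, _, w, h = self.bbox
        return (w * h) / (img_w * img_h) > threshold

# Example usage with toy values:
ann = DistortionAnnotation(
    distortion_type="motion blur",
    bbox=(0, 0, 1800, 1200),
    position="entire frame",
    severity="severe",
    impact="obscures facial details",
    significance="primary driver of the low score",
)
print(ann.is_global(img_w=1920, img_h=1280))  # True: area ratio ~= 0.88
```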
2. Distortion-Oriented Construction and Chain-of-Thought Framework
Central to ViDA-UGC’s methodology is its instruction-tuning pipeline utilizing Chain-of-Thought (CoT) reasoning:
- Prompt Engineering: Unlike direct prompts, CoT prompts decompose quality assessment into granular substeps: general impression, distortion detection, detailed analysis, selection of key distortions, causal reasoning, and final score assignment.
- Integration with GPT-4o: Human ground-truth marks and attributes are encoded as “set-of-mark” visual tokens and incorporated into the LLM prompt context.
- Causal Analysis: The annotated descriptions are not superficial; each quality judgment is explicitly linked to specific distortions through linguistic reasoning, mirroring how a human expert would explain the score.
CoT logic flow (LaTeX), following the substeps listed above:

$$\text{General impression} \rightarrow \text{Distortion detection} \rightarrow \text{Detailed analysis} \rightarrow \text{Key distortion selection} \rightarrow \text{Causal reasoning} \rightarrow \text{Final score}$$
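As a minimal illustration of how a distortion-aware, set-of-mark CoT prompt might be assembled, the sketch below composes annotated regions and the reasoning substeps into a single prompt string. The `build_cot_prompt` helper and its wording are assumptions for exposition, not the authors' exact prompt.

```python
# Hypothetical sketch of assembling a set-of-mark CoT prompt from human
# annotations; the template wording is illustrative, not the paper's prompt.
def build_cot_prompt(marked_regions):
    """marked_regions: list of dicts with mark id, distortion type, severity."""
    region_lines = [
        f"Region [{r['mark']}]: {r['type']} (severity: {r['severity']})"
        for r in marked_regions
    ]
    steps = [
        "1. Give a general impression of the image.",
        "2. Detect and list the visible distortions.",
        "3. Analyze each marked region in detail.",
        "4. Select the key distortions that dominate perception.",
        "5. Reason causally about how they degrade quality.",
        "6. Assign a final quality score.",
    ]
    return (
        "You are assessing the quality of a user-generated image.\n"
        "Annotated regions (set-of-mark):\n" + "\n".join(region_lines) + "\n\n"
        "Follow these steps:\n" + "\n".join(steps)
    )

prompt = build_cot_prompt([
    {"mark": 1, "type": "underexposure", "severity": "moderate"},
    {"mark": 2, "type": "motion blur", "severity": "severe"},
])
print(prompt)
```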
3. Benchmark Structure and Evaluation Tasks
The ViDA-UGC-Bench is derived from the main dataset and comprises 476 images with 6,149 QA pairs. Its composition is:
- Quality Analysis Instances: Each image has one overall description task demanding step-wise quality reasoning.
- Multi-Choice Perception Tasks: 2,567 QA pairs covering local attributes (e.g., “What is the type of distortion in region 2?”).
- Grounding Tasks: 3,106 region grounding instances require models to both localize and describe distortions.
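A hypothetical example of what a perception QA pair and a grounding instance might look like is given below; the field names and values are assumptions for illustration, not the released file format.

```python
# Illustrative (assumed) record formats for perception and grounding tasks.
perception_item = {
    "image": "ugc_001.jpg",
    "question": "What is the type of distortion in region 2?",
    "choices": ["motion blur", "overexposure", "compression artifacts", "noise"],
    "answer": "motion blur",
}

grounding_item = {
    "image": "ugc_001.jpg",
    "instruction": "Locate the most severe distortion and describe it.",
    "bbox": [412, 108, 233, 190],      # (x, y, w, h) in pixels
    "description": "Severe motion blur over the subject's face.",
}
```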
Distinctive features include expert validation of all GPT-generated annotations, ensuring high accuracy and reducing bias compared to prior automated labeling benchmarks.
4. Technical Sampling, Annotation, and Quality Scoring
The MILP-based sampling ensures diverse feature coverage without over-representation:
- Sampling Objective: Select images such that the distribution of low-level image attributes is as uniform as possible under a fixed sample budget; the MILP formulation balances per-attribute feature distributions rather than relying on random sampling (a sketch follows this list).
- Region Categorization: A distortion is labeled "global" when its bounding-box area ratio exceeds 0.7, i.e., $\frac{A_{\text{box}}}{A_{\text{image}}} > 0.7$; otherwise it is labeled "local", and NMS removes redundant overlapping local boxes.
- Quality Scores: MOS aggregation is only performed where inter-annotator agreement is adequate.
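A minimal sketch of how such MILP-based uniform sampling could be formulated, assuming binned low-level attributes and the PuLP solver, is shown below; the binning scheme, attribute names, and objective are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of MILP-based uniform sampling (assumed formulation): pick N
# images so that, for each low-level attribute, per-bin counts stay close to a
# uniform target. Requires: pip install pulp
import random
import pulp

random.seed(0)

NUM_CANDIDATES = 200          # candidate pool size (toy numbers)
N_SELECT = 50                 # images to select
ATTRIBUTES = ["brightness", "contrast", "blur"]   # assumed attributes
NUM_BINS = 5
TARGET = N_SELECT / NUM_BINS  # uniform per-bin target

# Toy attribute bins: bin index per attribute for each candidate image.
bins = {
    i: {a: random.randrange(NUM_BINS) for a in ATTRIBUTES}
    for i in range(NUM_CANDIDATES)
}

prob = pulp.LpProblem("uniform_sampling", pulp.LpMinimize)

# Binary selection variable per candidate image.
x = pulp.LpVariable.dicts("x", range(NUM_CANDIDATES), cat="Binary")

# Deviation variables: |per-bin count - target| for each (attribute, bin).
dev = pulp.LpVariable.dicts(
    "dev", [(a, b) for a in ATTRIBUTES for b in range(NUM_BINS)], lowBound=0
)

# Exactly N images are selected.
prob += pulp.lpSum(x[i] for i in range(NUM_CANDIDATES)) == N_SELECT

# Linearized absolute-deviation constraints per attribute bin.
for a in ATTRIBUTES:
    for b in range(NUM_BINS):
        count = pulp.lpSum(x[i] for i in range(NUM_CANDIDATES) if bins[i][a] == b)
        prob += dev[(a, b)] >= count - TARGET
        prob += dev[(a, b)] >= TARGET - count

# Objective: total deviation from uniform coverage across all attribute bins.
prob += pulp.lpSum(dev.values())

prob.solve(pulp.PULP_CBC_CMD(msg=False))
selected = [i for i in range(NUM_CANDIDATES) if x[i].value() == 1]
print(len(selected), "images selected")
```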
5. Applications and Implications
ViDA-UGC supports a diversity of quality analysis tasks, directly benefiting multiple domains:
- Quality Control and Monitoring: Enables precise localization and identification of diverse artifacts, automating quality management pipelines.
- Image Restoration Guidance: Grounded cause-and-effect analysis allows for targeted restoration strategies, informed by which distortions most affect perceived quality.
- Explainability in IQA: The dataset’s causal structure and low-level attribute details allow for interpretable outcomes, enhancing trust and usability of automatic IQA systems.
Experimental evidence confirms that instruction tuning MLLMs on ViDA-UGC can yield performance on ViDA-UGC-Bench surpassing even GPT-4o, underscoring the benchmark's utility for advancing explainable and high-performing IQA.
6. Limitations, Expert Validation, and Future Directions
- Expert-Based Corrections: All GPT outputs in the benchmark are reviewed by a team of image-processing professionals to mitigate LLM-induced errors or hallucinations.
- Bias and Consistency: Despite strong performance, potential bias remains due to the reliance on GPT-4o for large-scale generation; expert verification partly addresses this concern.
- Extensions: The framework naturally generalizes to other modalities (e.g., video, cross-modal IQA) and can incorporate additional distortion types and finer attribute distinctions in future releases.
- Research Opportunities: Fine-grained, reasoning-based QA tasks present new challenges for next-generation MLLMs, supporting improvements in restoration, optimization, and end-user explainability.
7. Comparative Context and Significance
Compared with prior benchmarks (e.g., Q-Bench), ViDA-UGC and ViDA-UGC-Bench are unique in their requirement for causal reasoning, fine-grained distortion marking, and statistically uniform sample distribution. This positions ViDA-UGC as the most challenging and comprehensive benchmark for instruction-tuned MLLMs in the UGC domain, representing a paradigm shift in the field’s approach to image quality analysis and restoration guidance.
In conclusion, the ViDA-UGC Benchmark establishes a new standard for explainable, fine-grained IQA in UGC, combining robust statistical design, expert-labeled ground truth, and advanced linguistic reasoning. Its deployment enables broad advancements in automated image quality analysis, restoration, and explainable decision support across real-world user-generated content contexts (Liao et al., 18 Aug 2025).