NTIRE 2026 The 3rd Restore Any Image Model (RAIM) Challenge: Professional Image Quality Assessment (Track 1)

Published 14 Apr 2026 in cs.CV and cs.AI | (2604.12512v1)

Abstract: In this paper, we present an overview of the NTIRE 2026 challenge on the 3rd Restore Any Image Model in the Wild, specifically focusing on Track 1: Professional Image Quality Assessment. Conventional Image Quality Assessment (IQA) typically relies on scalar scores. By compressing complex visual characteristics into a single number, these methods fundamentally struggle to distinguish subtle differences among uniformly high-quality images. Furthermore, they fail to articulate why one image is superior, lacking the reasoning capabilities required to provide guidance for vision tasks. To bridge this gap, recent advancements in Multimodal LLMs (MLLMs) offer a promising paradigm. Inspired by this potential, our challenge establishes a novel benchmark exploring the ability of MLLMs to mimic human expert cognition in evaluating high-quality image pairs. Participants were tasked with overcoming critical bottlenecks in professional scenarios, centering on two primary objectives: (1) Comparative Quality Selection: reliably identifying the visually superior image within a high-quality pair; and (2) Interpretative Reasoning: generating grounded, expert-level explanations that detail the rationale behind the selection. In total, the challenge attracted nearly 200 registrations and over 2,500 submissions. The top-performing methods significantly advanced the state of the art in professional IQA. The challenge dataset is available at https://github.com/narthchin/RAIM-PIQA, and the official homepage is accessible at https://www.codabench.org/competitions/12789/.


Authors (53)


Summary

The paper reports on a challenge that benchmarks MLLMs on pairwise comparative IQA and expert-level reasoning over high-quality image pairs.
Top entries combine dual-branch architectures, ensemble prediction, and policy optimization to assess image quality along several professional dimensions.
The challenge drew over 2,500 submissions; leading teams exceeded 0.92 accuracy in Phase 2, while reasoning quality remained the main bottleneck.


Challenge Overview and Motivation

The NTIRE 2026 3rd RAIM Professional Image Quality Assessment (PIQA) challenge shifts the focus of IQA toward pairwise comparative assessment and interpretative reasoning. Traditional scalar-based IQA methods, which regress complex image characteristics into a single numerical score (e.g., MOS), are fundamentally limited in high-end, professional scenarios where subtle differences among uniformly high-quality images matter. They also cannot explain why one image is preferable, so they lack the actionable feedback that practitioners and vision system developers need.

Recognizing the limitations of scalar regression and the necessity for nuanced comparative analysis, the challenge leverages Multimodal LLMs (MLLMs) for PIQA. The track evaluates both the ability to reliably select the visually superior image from expert-curated pairs and the generation of grounded, expert-level reasoning. This benchmark bridges the gap between automated IQA and professional human judgment, providing industry-relevant guidance for real-world photography and downstream image restoration.

Dataset Curation and Annotation Protocol

The organizers curated a new reasoning-based dataset in which each sample is a pair of high-quality images (portrait or landscape) captured by different devices. Expert annotators (imaging designers) provided a binary multiple-choice selection (Image A or Image B) together with a detailed written rationale grounded in professional photographic dimensions: sharpness, texture, authenticity, and noise. Localized image crops highlight the regions that most influenced the expert decisions.

The training dataset contains 100 pairs (75 portraits, 25 landscapes) with extensive diversity across subjects, environments, and lighting. The validation and test splits follow a zero-overlap policy: they contain only subjects and scenes that do not appear in the training data, which prevents data leakage and allows a reliable assessment of generalization.
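A zero-overlap policy of this kind can be checked mechanically. The sketch below is illustrative only: it assumes each split can be described by a list of subject or scene identifiers, which is a hypothetical representation and not the dataset's documented format, and the function name `find_split_overlap` is likewise made up for this example.

```python
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def find_split_overlap(
    splits: Dict[str, Iterable[str]]
) -> Dict[Tuple[str, str], Set[str]]:
    """Return identifiers shared by each pair of splits.

    `splits` maps a split name to subject/scene identifiers (a hypothetical
    representation). An empty result means the splits are disjoint.
    """
    as_sets = {name: set(ids) for name, ids in splits.items()}
    overlaps: Dict[Tuple[str, str], Set[str]] = {}
    # Compare every unordered pair of splits, in sorted name order.
    for first, second in combinations(sorted(as_sets), 2):
        shared = as_sets[first] & as_sets[second]
        if shared:
            overlaps[(first, second)] = shared
    return overlaps


if __name__ == "__main__":
    leaks = find_split_overlap(
        {"train": ["s1", "s2"], "val": ["s3"], "test": ["s2", "s4"]}
    )
    print(leaks)
```

An empty dictionary indicates that no identifier is shared between any two splits.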

Evaluation Framework

The evaluation protocol decomposes model performance into:

Comparative Accuracy (Acc): The proportion of image pair comparisons where the model correctly selects the expert-preferred image.
Reasoning Quality: Evaluated in two ways: conditional text-based NLG metrics (BLEU-4 and ROUGE-L F-score), computed only for samples where the decision is correct, and an LLM-as-a-Judge protocol that quantifies semantic alignment with the expert annotation, normalized as $S_{LLM}$.
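The two metric families above can be sketched as follows. This is a minimal illustration rather than the official scoring script: it uses whitespace tokenization, implements only ROUGE-L (not BLEU-4), and assumes that a wrong selection contributes a reasoning score of zero to an average over all samples. That last convention is an assumption; the summary only states that the NLG metrics are computed for correct decisions.

```python
from typing import Dict, List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two token sequences."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f(candidate: str, reference: str) -> float:
    """ROUGE-L F1 on lowercased whitespace tokens (simplified tokenization)."""
    cand, ref = candidate.lower().split(), reference.lower().split()
    if not cand or not ref:
        return 0.0
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def evaluate(
    preds: List[str], golds: List[str],
    pred_texts: List[str], gold_texts: List[str],
) -> Dict[str, float]:
    """Comparative accuracy plus ROUGE-L gated on a correct selection."""
    n = len(golds)
    if n == 0 or not (len(preds) == len(pred_texts) == len(gold_texts) == n):
        raise ValueError("inputs must be non-empty and of equal length")
    correct = [p == g for p, g in zip(preds, golds)]
    # Assumption: an incorrect selection scores 0 for reasoning.
    gated = [
        rouge_l_f(pt, gt) if ok else 0.0
        for ok, pt, gt in zip(correct, pred_texts, gold_texts)
    ]
    return {"acc": sum(correct) / n, "rouge_l_f_cond": sum(gated) / n}
```

Gating the text metric on the selection prevents a fluent rationale from earning credit when it argues for the wrong image.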

Final rankings aggregate these metrics across two challenge phases using weighted multiplicative fusion, balancing decision accuracy and reasoning quality. In Phase 3, the LLM-as-a-Judge semantic evaluation additionally rewards models whose rationales align with the experts' professional logic.
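The exact fusion formula and its weights are not given in this summary. Purely as an illustration of what a weighted multiplicative fusion could look like, the sketch below multiplies accuracy by a weighted mean of reasoning metrics. The function name `fuse`, the metric keys, and the weights are all hypothetical and should not be read as the official scoring rule.

```python
from typing import Dict


def fuse(acc: float, reasoning: Dict[str, float], weights: Dict[str, float]) -> float:
    """Hypothetical fusion: accuracy times a weighted mean of reasoning metrics.

    This is one plausible reading of "weighted multiplicative fusion",
    not the challenge's published formula.
    """
    if set(reasoning) != set(weights):
        raise ValueError("reasoning metrics and weights must share the same keys")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted_mean = sum(weights[k] * reasoning[k] for k in weights) / total
    # Multiplying by accuracy means a model cannot rank well on reasoning alone.
    return acc * weighted_mean


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the challenge.
    print(fuse(0.9, {"bleu4": 0.2, "rouge_l": 0.4}, {"bleu4": 1.0, "rouge_l": 1.0}))
```

A multiplicative form, unlike a plain weighted sum, drives the final score toward zero when either accuracy or reasoning quality is poor.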

Baseline and Proposed Methods

All top-performing teams adopted Qwen3-VL (2B/8B-Instruct) as backbone MLLMs, exploiting LoRA fine-tuning, ensemble prediction strategies, and extensive prompt engineering. Notable methodological trends include:

Dual-branch and Ensemble Frameworks: Separate Answer (binary selection) and Thinking (rationale generation) modules, sometimes branched by image domain (person, scene), with multi-checkpoint ensembles for robust inference (IH-VQA, LZ, I2. Group).
Dimension-specific Comparative Tools: VCIP Pi Group constructed an agentic system with nine Qwen3-VL-2B-Instruct mini-agents for expert dimensions (sharpness, texture, artifacts, realism), orchestrated by a central planner/executor/summarizer workflow.
Policy Optimization: Group Relative Policy Optimization (GRPO) was widely used for post-SFT fine-tuning, with reward functions tailored to binary accuracy, format compliance, tool invocation precision, and reasoning attribute coverage.
Data Augmentation: Image pair swapping, localized patch extraction, and domain-based splitting mitigated positional bias and improved model attention to fine-grained attributes.
Reasoning Masking and Selective Backpropagation: Some teams (ongaku) masked intermediate reasoning tokens during loss computation, focusing optimization on final decisions.
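Three of the techniques listed above are simple enough to sketch in a few lines: pair swapping, ensemble voting, and the group-relative advantage at the core of GRPO. These are generic illustrations, not any team's implementation. The sample keys (`image_a`, `image_b`, `label`) are hypothetical, and the use of the population standard deviation is a choice made here, not something the summary specifies.

```python
from collections import Counter
from statistics import mean, pstdev
from typing import Dict, List


def swap_pair(sample: Dict[str, str]) -> Dict[str, str]:
    """Return a copy with images A and B exchanged and the label flipped.

    Any rationale text naming "Image A" or "Image B" would need the same
    exchange; that step is omitted in this sketch.
    """
    flipped = {"A": "B", "B": "A"}[sample["label"]]
    return {
        **sample,
        "image_a": sample["image_b"],
        "image_b": sample["image_a"],
        "label": flipped,
    }


def majority_vote(votes: List[str]) -> str:
    """Most frequent answer across checkpoints; ties go to the first seen."""
    if not votes:
        raise ValueError("votes must be non-empty")
    return Counter(votes).most_common(1)[0][0]


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each reward against its sampling group, as in GRPO.

    A_i = (r_i - mean(r)) / (std(r) + eps), computed over responses sampled
    for the same prompt. Population std is an assumption of this sketch.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    print(swap_pair({"image_a": "x.png", "image_b": "y.png", "label": "A"}))
    print(majority_vote(["A", "B", "A"]))
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
```

Training on both the original and the swapped pair exposes the model to each image in both positions, which is how swapping counteracts positional bias.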

Quantitative and Qualitative Results

The challenge attracted 192 registered teams and more than 2,500 submissions. The top-performing teams achieved high comparative accuracy and improved interpretative reasoning:

IH-VQA: Secured the highest final award score (0.7305) with strong consistency across both accuracy and reasoning evaluation.
VCIP Pi Group: Achieved the highest Phase 3 score (0.7679), driven by superior semantic reasoning alignment ($S_{LLM} = 0.5535$).
I2. Group and LZ: Employed multi-stage training and ensemble voting, achieving competitive accuracy and reasoning scores.

Phase 2 accuracy exceeded 0.92 for the leading teams, but the conditional NLG scores for reasoning remained lower, indicating that semantic grounding in reasoning is the primary bottleneck. The LLM-as-a-Judge evaluation separated teams further: some generated rationales, while structurally plausible, lacked fidelity to nuanced expert dimensions (e.g., hair sharpness, local noise).

Implications and Future Directions

The challenge substantiates MLLMs as viable surrogates for professional IQA, particularly in comparative scenarios requiring nuanced, interpretable judgments. Practical implications include:

Transition from Scalar to Descriptive IQA: Automated systems can now provide actionable, domain-specific explanations, guiding both algorithmic improvement and practitioner feedback loops.
Generalization Across Domains: Zero-overlap test splits and domain-branching frameworks illustrate robust generalization to unseen subjects and environments, critical for industrial deployment.
Agentic Systems and Multi-Dimensional Reasoning: The agentic architecture by VCIP Pi Group exemplifies extensibility to arbitrary expert dimensions, promising for future modular evaluation systems and interactive feedback agents.
Reinforcement-driven Optimization: Group Relative Policy Optimization and customized reward design facilitate scalable adaptation to new reasoning and selection tasks, relevant for other multimodal comparative benchmarks.

Theoretical implications concern the alignment of MLLMs with cognitive processes of expert practitioners, setting a foundation for research in explainable vision-language intelligence. Evaluating and improving semantic reasoning quality, particularly in alignment with human expert logic, remains an open challenge. Future work may focus on meta-learning approaches for reasoning attributes, auto-curation of datasets with hard negatives, and adaptive prompt engineering based on context.

Conclusion

NTIRE 2026 RAIM Track 1 advances the field from scalar, regression-based IQA toward professional, comparative, and interpretable assessment using MLLMs. The challenge framework, dataset, and multi-tiered evaluation metrics test both selection competence and reasoning fidelity. The methods of the leading teams demonstrate scalable architectures for robust, agentic IQA systems. This work may help catalyze the development of vision-language models that deliver actionable, expert-level evaluation for industrial photography, vision model deployment, and beyond.
