OmniQuality-R: Unified Quality Framework
- The paper introduces a unified framework that integrates technical quality, aesthetic assessment, and text–image alignment into a single evaluative model.
- It employs a plan-then-reason approach that generates explicit chain-of-thought rationales, enhancing interpretability of reward signals.
- By leveraging continuous Gaussian reward functions and robust reinforcement learning techniques, OmniQuality-R achieves stable policy optimization and strong generalization across multiple benchmarks.
OmniQuality-R is a unified reward modeling framework that advances visual quality assessment by integrating multi-task reasoning, interpretable chain-of-thought (CoT) rationales, and continuous, stable reward signals suitable for modern policy optimization. It incorporates technical quality, aesthetic assessment, and text–image alignment within a single architecture, providing a scalable basis for optimizing generative models and evaluation pipelines that require multi-dimensional, interpretable judgments.
1. Unified Multi-Task Quality Modeling
OmniQuality-R is characterized by its integration of three assessment dimensions—technical quality (low-level degradations such as loss of sharpness or artifacts), aesthetic quality (subjective visual appeal and composition), and text–image alignment (semantic consistency between paired inputs). Conventional evaluation models tend to specialize in only one of these axes; OmniQuality-R treats multi-task evaluation as a first-class objective. The framework is designed to output not simply a scalar quality value, but an interpretable reward signal computed through structured, multi-step quality reasoning, analogous to how expert assessors follow explicit rating plans during subjective studies.
Tasks are formulated as prompts that request ratings for specific properties of an image (or image–text pair). Each task involves both the automated generation of an analytic plan and the production of chained explanatory reasoning—mirroring how human judges establish guidelines and then analyze images step by step before delivering a score.
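To make the task formulation concrete, a minimal sketch of how such prompts might be constructed is given below; the templates, task names, and 1–5 rating scale are illustrative assumptions rather than the paper's exact wording.

```python
# Hypothetical prompt templates for the three assessment tasks; the task names,
# wording, and 1-5 scale are illustrative assumptions, not the paper's templates.
TASK_TEMPLATES = {
    "technical": "Rate the technical quality of this image (sharpness, noise, artifacts) on a 1-5 scale.",
    "aesthetic": "Rate the aesthetic quality of this image (composition, color, visual appeal) on a 1-5 scale.",
    "alignment": "Rate how well this image matches the caption '{caption}' on a 1-5 scale.",
}

def build_task_prompt(task, caption=None):
    """Return the text prompt paired with the input image for the given assessment task."""
    template = TASK_TEMPLATES[task]
    return template.format(caption=caption) if caption is not None else template

if __name__ == "__main__":
    print(build_task_prompt("alignment", caption="a red bicycle leaning against a brick wall"))
```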
2. Reasoning-Enhanced Reward Modeling with Plan–Reason Trajectories
Central to the OmniQuality-R methodology is a plan-then-reason data generation protocol. For each assessment task, a plan is generated that lists the relevant evaluation criteria (e.g., “check for sharpness, inspect for artifacts, assess composition”). This plan is then used in combination with the task prompt and the input image to produce a chain-of-thought reasoning sequence via a multimodal LLM (MLLM). The chain-of-thought provides not only a score but also an explicit rationale underlying the quality judgment, increasing both the transparency and the informativeness of the supervision signal.
To ensure only informative and discriminative plan–reason samples populate the training set, rejection sampling is applied: plan–reason pairs that are too convergent (trivial), too divergent (dubious), or otherwise uninformative are filtered. The curated plan–reason dataset is then used for supervised fine-tuning so the model learns to produce both score and rationale; notably, during inference only the prompt and image are needed—the explicit planning step is omitted because it has been internalized during training.
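A simplified sketch of such a rejection-sampling filter is shown below. It reads the “too convergent / too divergent” criterion as agreement between the predicted score and the ground truth, which is one possible interpretation; the thresholds and field names are placeholders, not the paper's exact rules.

```python
from dataclasses import dataclass

@dataclass
class PlanReasonSample:
    plan: str               # generated list of evaluation criteria
    reasoning: str          # chain-of-thought rationale produced by the MLLM
    predicted_score: float  # score extracted from the rationale
    gt_score: float         # human-labeled ground truth (e.g., a mean opinion score)

def keep_sample(s, min_err=0.05, max_err=1.0, min_reason_len=30):
    """Keep only informative plan-reason pairs; thresholds here are placeholders."""
    err = abs(s.predicted_score - s.gt_score)
    if err < min_err:                      # essentially trivial: adds little training signal
        return False
    if err > max_err:                      # far from ground truth: rationale is likely unreliable
        return False
    if len(s.reasoning) < min_reason_len:  # too short to be an informative rationale
        return False
    return True

# Example: filter a list of candidate samples before supervised fine-tuning.
# curated = [s for s in candidates if keep_sample(s)]
```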
3. Policy Optimization: Continuous Reward and Group Relative Policy Optimization
OmniQuality-R refines reward model performance through a specialized reinforcement learning protocol. It adopts Group Relative Policy Optimization (GRPO), extending beyond binary or discretized reward signals. Instead, the framework uses a Gaussian-based continuous reward function, which provides a smooth gradient for policy optimization:
$$r = \exp\!\left(-\frac{(\hat{s} - s_{\mathrm{gt}})^2}{2\sigma^2}\right)$$

where $\hat{s}$ is the model-predicted score, $s_{\mathrm{gt}}$ is the human-labeled ground-truth score, and $\sigma$ controls the sharpness of reward decay.
Multiple responses are sampled for each query, forming a group. Intra-group reward variance is computed, and only groups with sufficiently high standard deviation are retained for policy updates. This ensures that the policy gradient benefits from samples with informative advantage estimates—suppressing updates derived from overly consistent or trivial response sets.
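The following sketch illustrates the continuous Gaussian reward and a group-relative advantage computed by standardizing rewards within one group of sampled responses, as in standard GRPO; the value of sigma and the group size are placeholder assumptions.

```python
import numpy as np

def gaussian_reward(pred, gt, sigma=0.5):
    """Continuous Gaussian reward: peaks at the ground-truth score and decays smoothly."""
    pred = np.asarray(pred, dtype=np.float64)
    return np.exp(-((pred - gt) ** 2) / (2.0 * sigma ** 2))

def group_relative_advantages(pred_scores, gt, sigma=0.5):
    """GRPO-style advantages: rewards standardized within one group of sampled responses."""
    rewards = gaussian_reward(pred_scores, gt, sigma)
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

if __name__ == "__main__":
    group = [3.1, 2.4, 4.0, 3.6]   # scores parsed from one group of sampled responses
    print(group_relative_advantages(group, gt=3.5))
```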
4. Stabilization: Standard Deviation Filtering and Entropy Gating
To improve training stability and prevent mode collapse, two key methods are introduced:
- Standard Deviation (STD) Filtering: Within each group, the standard deviation $\sigma_{\text{group}}$ of the Gaussian rewards is measured. If $\sigma_{\text{group}} < \tau_{\text{std}}$ (where $\tau_{\text{std}}$ is a preset threshold), the entire group is discarded from the update. This ensures gradient steps only leverage samples providing meaningful reward differences, preventing ineffectual or misleading updates.
- Entropy Gating: During policy optimization, the token-level output entropy $H_t$ is computed at each time step. Policy gradients are applied only where $H_t \geq \tau_H$ (again, for some preset threshold $\tau_H$), focusing updates on output regions where the model is uncertain and exploratory, and skipping confidently predicted (and likely already learned) regions. This sustains exploration and guards against output-space collapse (premature convergence).
The combined effect is improved robustness and generalizability in downstream optimization tasks.
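A minimal sketch of both stabilizers is given below; the thresholds tau_std and tau_h are assumed tunable hyperparameters, and the token-entropy computation from per-token log-probabilities is a generic formulation rather than the paper's exact implementation.

```python
import numpy as np

def passes_std_filter(rewards, tau_std=0.05):
    """STD filtering: keep a group only if its Gaussian rewards are sufficiently spread out."""
    return float(np.std(rewards)) >= tau_std

def entropy_gate(token_logprobs, tau_h=0.5):
    """Entropy gating: 0/1 mask enabling gradients only on high-entropy (uncertain) tokens.

    token_logprobs: array of shape (T, V), per-token log-probabilities over the vocabulary.
    """
    probs = np.exp(token_logprobs)
    entropy = -np.sum(probs * token_logprobs, axis=-1)   # per-token entropy H_t
    return (entropy >= tau_h).astype(np.float32)
```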
5. Evaluation Protocol and Empirical Results
OmniQuality-R is validated across three primary image quality assessment domains:
- Technical Quality Assessment is evaluated using real-world datasets such as KonIQ, LIVE-C, SPAQ, as well as synthetic benchmarks (KADID-10k, PIPAL), demonstrating competitive Pearson (PLCC) and Spearman (SRCC) correlations with human opinion across a wide range of distortions and image scenarios.
- Aesthetic Quality Assessment is performed on both in-domain (AVA) and out-of-domain (TAD66K) benchmarks, showing strong generalization and performance parity or improvement over state-of-the-art baselines.
- Text–Image Alignment is measured using paired caption-image datasets (EvalMuse, EvalMi, GenAI-Bench, T2I-CompBench). OmniQuality-R achieves high rank correlation and score accuracy, even in few-shot settings, and is effective in guiding text-to-image generation for improved semantic alignment and compositional coherence without additional generator retraining.
A key finding is that this unified approach enables strong and interpretable generalization—encompassing low-level, high-level, and cross-modal assessment properties within a single model.
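For reference, PLCC and SRCC are standard correlation metrics and can be computed, for example, with SciPy; the scores below are dummy values for illustration only.

```python
from scipy.stats import pearsonr, spearmanr

def plcc_srcc(predicted_scores, human_scores):
    """Pearson (PLCC) and Spearman (SRCC) correlations between predictions and human opinion scores."""
    plcc, _ = pearsonr(predicted_scores, human_scores)
    srcc, _ = spearmanr(predicted_scores, human_scores)
    return plcc, srcc

if __name__ == "__main__":
    preds = [3.2, 4.1, 2.5, 3.8, 4.6]   # model-predicted quality scores (dummy values)
    mos   = [3.0, 4.3, 2.2, 3.9, 4.5]   # human mean opinion scores (dummy values)
    print(plcc_srcc(preds, mos))
```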
6. Key Mathematical Formulations
OmniQuality-R relies on the following optimization and reward computation formulas:
- Gaussian Reward for Continuous Score Prediction:

  $$r = \exp\!\left(-\frac{(\hat{s} - s_{\mathrm{gt}})^2}{2\sigma^2}\right)$$

  where $\hat{s}$ is the predicted score, $s_{\mathrm{gt}}$ is the ground truth, and $\sigma$ determines the reward decay.
- Policy Gradient Loss with Entropy Gating:

  $$\mathcal{L}(\theta) = -\,\mathbb{E}\!\left[\sum_t m_t \,\min\!\Big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\big(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\,\hat{A}_t\Big)\right]$$

  where $r_t(\theta)$ is the policy ratio, $\hat{A}_t$ is the advantage estimate, $m_t$ is the entropy gating mask, and $\epsilon$ is the trust region clip parameter.
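A PyTorch sketch of an entropy-gated, clipped surrogate loss consistent with the formulation above is shown below; the clip range, the per-token gating, and the normalization over gated tokens are assumptions rather than the paper's exact objective.

```python
import torch

def gated_clipped_loss(logp_new, logp_old, advantages, entropy_mask, clip_eps=0.2):
    """Entropy-gated clipped surrogate loss over token positions.

    All tensors have shape (T,): per-token log-probs under the new/old policy,
    per-token advantages (broadcast from the sequence-level advantage), and a 0/1 gate m_t.
    """
    ratio = torch.exp(logp_new - logp_old)                                  # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    surrogate = torch.minimum(unclipped, clipped)
    gated = entropy_mask * surrogate                                        # zero out low-entropy tokens
    return -gated.sum() / entropy_mask.sum().clamp_min(1.0)                 # average over gated tokens
```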
7. Significance and Applications
The OmniQuality-R framework marks a shift from legacy single-score, single-task evaluation toward unified, interpretable, and multi-dimensional quality assessment for generative systems and evaluation pipelines. The incorporation of plan–reason trajectories, continuous reward modeling, and robust RL stabilization yields performance gains not only in standard IQA metrics but also in practical deployment: for instance, OmniQuality-R can serve as a reliable, interpretable reward signal for test-time guidance of text-to-image generators, improving semantic alignment and overall content quality.
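As one hypothetical usage pattern, such a reward model can drive best-of-N test-time selection for a text-to-image generator; `generate_image` and `score_alignment` below are placeholder callables standing in for the generator and the reward model, not APIs defined by the paper.

```python
def best_of_n(prompt, generate_image, score_alignment, n=8):
    """Generate n candidates and keep the one the reward model rates highest for the prompt."""
    candidates = [generate_image(prompt) for _ in range(n)]
    scores = [score_alignment(image=c, prompt=prompt) for c in candidates]
    best_idx = max(range(n), key=lambda i: scores[i])
    return candidates[best_idx], scores[best_idx]
```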
Such a unified, stable, and interpretable assessment architecture is positioned to become foundational in multi-modal generative model evaluation, human-in-the-loop RL applications, and next-generation LLM-based image understanding systems.