OmniQuality-R: Unified Quality Framework

Updated 19 October 2025
  • The paper introduces a unified framework that integrates technical quality, aesthetic assessment, and text–image alignment into a single evaluative model.
  • It employs a plan-then-reason approach that generates explicit chain-of-thought rationales, enhancing interpretability of reward signals.
  • By leveraging continuous Gaussian reward functions and robust reinforcement learning techniques, OmniQuality-R achieves stable policy optimization and strong generalization across multiple benchmarks.

OmniQuality-R is a unified reward modeling framework that advances visual quality assessment by integrating multi-task reasoning, interpretable chain-of-thought (CoT) rationales, and continuous, stable reward signals suitable for modern policy optimization. It incorporates technical quality, aesthetic assessment, and text–image alignment tasks within a single architecture, providing a scalable basis for optimizing generative models and evaluation systems that require multi-dimensional, interpretable judgments.

1. Unified Multi-Task Quality Modeling

OmniQuality-R integrates three assessment dimensions: technical quality, which addresses low-level degradations such as loss of sharpness or visible artifacts; aesthetic quality, covering subjective visual appeal and composition; and text–image alignment, the semantic consistency between paired inputs. Conventional evaluation models tend to specialize in only one of these axes, whereas OmniQuality-R treats multi-task evaluation as a first-class objective. The framework is designed to output not simply a scalar quality value, but an interpretable reward signal computed through structured multi-step quality reasoning, analogous to how expert assessors follow explicit rating plans during subjective studies.

Tasks are formulated as prompts that request ratings for specific properties of an image (or image–text pair). Each task involves both the automated generation of an analytic plan and the production of chained explanatory reasoning—mirroring how human judges establish guidelines and then analyze images step by step before delivering a score.
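
As a minimal illustration, the snippet below shows how such a task might be packaged as a prompt. The template wording, task names, and 1-5 rating scale are assumptions for the sketch, not details taken from the paper.

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative task templates; the actual prompt wording and rating scale
# used by OmniQuality-R are assumptions here.
TASK_TEMPLATES = {
    "technical": "Rate the technical quality of this image (sharpness, noise, artifacts) on a 1-5 scale.",
    "aesthetic": "Rate the aesthetic quality of this image (composition, color, visual appeal) on a 1-5 scale.",
    "alignment": "Rate how well this image matches the caption '{caption}' on a 1-5 scale.",
}

@dataclass
class QualityTask:
    task_type: str                  # "technical", "aesthetic", or "alignment"
    image_path: str
    caption: Optional[str] = None   # only used for alignment tasks

    def to_prompt(self) -> str:
        return TASK_TEMPLATES[self.task_type].format(caption=self.caption or "")

# Example: an alignment query for a generated image.
task = QualityTask("alignment", "samples/cat_on_moon.png", caption="a cat standing on the moon")
print(task.to_prompt())
```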

2. Reasoning-Enhanced Reward Modeling with Plan–Reason Trajectories

Central to the OmniQuality-R methodology is a plan-then-reason data generation protocol. For each assessment task, a plan is generated that lists the relevant evaluation criteria (e.g., “check for sharpness, inspect for artifacts, assess composition”). This plan is then combined with the task prompt and the input image to produce a chain-of-thought reasoning sequence via a multimodal LLM (MLLM). The chain of thought provides not only a score but also an explicit rationale underlying the quality judgment, increasing both the transparency and the informativeness of the supervision signal.
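
A minimal sketch of this protocol is given below, assuming a hypothetical `query_mllm` helper as a stand-in for whatever multimodal LLM client is actually used; the prompt wording is illustrative.

```python
from typing import Optional

def query_mllm(prompt: str, image_path: Optional[str] = None) -> str:
    """Hypothetical wrapper around the multimodal LLM used for data generation."""
    raise NotImplementedError("plug in your MLLM client here")

def generate_plan_reason_sample(task_prompt: str, image_path: str) -> dict:
    # Step 1: ask for an explicit evaluation plan (the criteria to check).
    plan = query_mllm(
        "List the evaluation criteria you would use for this task:\n" + task_prompt
    )
    # Step 2: condition on the plan, the task prompt, and the image to obtain
    # a chain-of-thought rationale that ends in a final score.
    rationale = query_mllm(
        f"Task: {task_prompt}\nPlan: {plan}\n"
        "Follow the plan step by step, explain your reasoning, then give a final score.",
        image_path=image_path,
    )
    return {"prompt": task_prompt, "plan": plan, "rationale": rationale}
```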

To ensure only informative and discriminative plan–reason samples populate the training set, rejection sampling is applied: plan–reason pairs that are too convergent (trivial), too divergent (dubious), or otherwise uninformative are filtered. The curated plan–reason dataset is then used for supervised fine-tuning so the model learns to produce both score and rationale; notably, during inference only the prompt and image are needed—the explicit planning step is omitted because it has been internalized during training.
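
The filtering criteria above are paraphrased rather than fully specified, so the sketch below assumes one plausible reading: the rationale's final score is compared against the human label, and samples with negligible or implausibly large error are dropped. Thresholds and names are illustrative.

```python
def keep_sample(predicted_score: float,
                human_score: float,
                min_err: float = 0.05,
                max_err: float = 1.5) -> bool:
    """Keep a plan-reason candidate only if its final score is informative.

    Candidates whose score matches the label almost exactly add little signal,
    and candidates far from the label are treated as unreliable reasoning.
    The thresholds are illustrative, not values from the paper.
    """
    err = abs(predicted_score - human_score)
    return min_err <= err <= max_err

candidates = [
    {"score": 3.98, "label": 4.0},  # near-exact agreement -> dropped as trivial
    {"score": 3.2,  "label": 4.0},  # informative disagreement -> kept
    {"score": 1.0,  "label": 4.5},  # implausibly divergent -> dropped as dubious
]
curated = [c for c in candidates if keep_sample(c["score"], c["label"])]
```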

3. Policy Optimization: Continuous Reward and Group Relative Policy Optimization

OmniQuality-R refines reward model performance through a specialized reinforcement learning protocol. It adopts Group Relative Policy Optimization (GRPO), extending beyond binary or discretized reward signals. Instead, the framework uses a Gaussian-based continuous reward function, which provides a smooth gradient for policy optimization:

R = \exp\left( - \frac{(\hat{s} - s^*)^2}{2\sigma^2} \right)

where \hat{s} is the model-predicted score, s^* is the human-labeled ground-truth score, and \sigma controls the sharpness of the reward decay.
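
A direct implementation of this reward is straightforward; the sigma default below is illustrative rather than the paper's setting.

```python
import math

def gaussian_reward(pred_score: float, gt_score: float, sigma: float = 0.5) -> float:
    """Continuous Gaussian reward from the formula above.

    Returns 1.0 when the prediction matches the label exactly and decays
    smoothly toward 0 as the error grows; sigma controls how fast it decays
    (the default here is illustrative, not the paper's setting).
    """
    return math.exp(-((pred_score - gt_score) ** 2) / (2.0 * sigma ** 2))

# Example: a prediction of 3.6 against a label of 4.0 with sigma = 0.5
# gives exp(-0.16 / 0.5) = exp(-0.32), roughly 0.726.
print(gaussian_reward(3.6, 4.0))
```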

Multiple responses are sampled for each query, forming a group. Intra-group reward variance is computed, and only groups with sufficiently high standard deviation are retained for policy updates. This ensures that the policy gradient benefits from samples with informative advantage estimates—suppressing updates derived from overly consistent or trivial response sets.
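
A sketch of this group-level filtering step follows, using the standard group-relative advantage normalization associated with GRPO (the surrounding text implies but does not spell this out); the threshold and group size are illustrative.

```python
import statistics

def keep_group(rewards: list, std_threshold: float = 0.05) -> bool:
    """Retain a sampled group only if its rewards are sufficiently spread out.

    Groups whose rewards are nearly identical carry almost no learning signal
    under group-relative normalization. The threshold value is illustrative.
    """
    return statistics.pstdev(rewards) >= std_threshold

# Example: Gaussian rewards for eight sampled responses to one query.
rewards = [0.91, 0.40, 0.73, 0.55, 0.88, 0.12, 0.67, 0.79]
if keep_group(rewards):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # Standard GRPO-style group-relative advantages (one per sampled response).
    advantages = [(r - mean) / std for r in rewards]
```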

4. Stabilization: Standard Deviation Filtering and Entropy Gating

To improve training stability and prevent mode collapse, two key methods are introduced:

  • Standard Deviation (STD) Filtering: Within each group, the standard deviation of the Gaussian rewards is measured. If \sigma^{(i)} < \tau (where \tau is a preset threshold), the entire group is discarded from the update. This ensures gradient steps only leverage samples providing meaningful reward differences, preventing ineffectual or misleading updates.
  • Entropy Gating: During policy optimization, the token-level output entropy H_t^i is computed at each time step. Policy gradients are applied only where H_t^i \ge \tau (again, for some preset threshold \tau), focusing updates on output regions where the model is uncertain and exploratory, and skipping confidently predicted (and likely already learned) regions. This sustains exploration and guards against premature collapse of the output distribution; a sketch of the gate follows this list.
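
A sketch of the entropy gate, assuming the policy exposes per-token logits; the threshold value is illustrative.

```python
import torch
import torch.nn.functional as F

def entropy_gate_mask(logits: torch.Tensor, tau: float = 1.0) -> torch.Tensor:
    """Token-level entropy gate: 1 where the policy is uncertain, 0 elsewhere.

    logits: [batch, seq_len, vocab] pre-softmax scores from the policy.
    tau:    entropy threshold in nats (the value here is illustrative).
    """
    log_probs = F.log_softmax(logits, dim=-1)
    token_entropy = -(log_probs.exp() * log_probs).sum(dim=-1)   # [batch, seq_len]
    return (token_entropy >= tau).float()

# The resulting mask multiplies the per-token policy-gradient terms, so
# confidently predicted tokens contribute no gradient while uncertain ones do.
```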

The combined effect is improved robustness and generalizability in downstream optimization tasks.

5. Evaluation Protocol and Empirical Results

OmniQuality-R is validated across three primary image quality assessment domains:

  • Technical Quality Assessment is evaluated on real-world datasets such as KonIQ, LIVE-C, and SPAQ, as well as on synthetic benchmarks (KADID-10k, PIPAL), demonstrating competitive Pearson (PLCC) and Spearman (SRCC) correlations with human opinion across a wide range of distortions and image scenarios (both correlation metrics are illustrated in the sketch after this list).
  • Aesthetic Quality Assessment is performed on both in-domain (AVA) and out-of-domain (TAD66K) benchmarks, showing strong generalization and performance parity or improvement over state-of-the-art baselines.
  • Text–Image Alignment is measured using paired caption-image datasets (EvalMuse, EvalMi, GenAI-Bench, T2I-CompBench). OmniQuality-R achieves high rank correlation and score accuracy, even in few-shot settings, and is effective in guiding text-to-image generation for improved semantic alignment and compositional coherence without additional generator retraining.
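
For reference, the two correlation metrics reported throughout are the standard Pearson and Spearman coefficients; the sketch below computes them with SciPy on made-up placeholder scores, not numbers from the paper.

```python
from scipy.stats import pearsonr, spearmanr

# Placeholder scores: these are made-up numbers, not results from the paper.
model_scores = [3.1, 4.5, 2.2, 3.8, 4.9]
human_scores = [3.0, 4.7, 2.5, 3.5, 5.0]

plcc, _ = pearsonr(model_scores, human_scores)    # linear correlation (PLCC)
srcc, _ = spearmanr(model_scores, human_scores)   # rank correlation (SRCC)
print(f"PLCC={plcc:.3f}, SRCC={srcc:.3f}")
```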

A key finding is that this unified approach enables strong and interpretable generalization—encompassing low-level, high-level, and cross-modal assessment properties within a single model.

6. Key Mathematical Formulations

OmniQuality-R relies on the following optimization and reward computation formulas:

  • Gaussian Reward for Continuous Score Prediction:

R = \exp\left( - \frac{(\hat{s} - s^*)^2}{2\sigma^2} \right)

where \hat{s} is the predicted score, s^* is the ground-truth score, and \sigma determines the reward decay.

  • Policy Gradient Loss with Entropy Gating:

J^B(\theta) = \mathbb{E}_B \left[ \frac{1}{\sum_{i=1}^{G} |o^i|} \sum_{i=1}^{G} \sum_{t=1}^{|o^i|} I\!\left(H_t^i \geq \tau\right) \min \left( r_t^i(\theta)\, \widehat{A}_t^i,\ \operatorname{clip}\!\left( r_t^i(\theta),\, 1-\varepsilon,\, 1+\varepsilon \right) \widehat{A}_t^i \right) \right]

where r_t^i(\theta) is the policy ratio, \widehat{A}_t^i is the advantage estimate, I(H_t^i \geq \tau) is the entropy gating mask, and \varepsilon is the trust-region clip parameter.
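
A compact sketch of this objective for a single group is shown below, assuming equal-length responses without padding; the threshold and clip values are illustrative.

```python
import torch

def entropy_gated_grpo_objective(ratio: torch.Tensor,
                                 advantage: torch.Tensor,
                                 entropy: torch.Tensor,
                                 tau: float = 1.0,
                                 eps: float = 0.2) -> torch.Tensor:
    """Value of the gated, clipped objective J^B(theta) above for one group.

    ratio:     r_t^i(theta), per-token policy probability ratios   [G, T]
    advantage: group-relative advantage estimates                  [G, T]
    entropy:   token-level output entropies H_t^i                  [G, T]
    tau, eps:  entropy threshold and clip range (illustrative values).

    Assumes all G responses have the same length T with no padding, so the
    total token count equals the sum of |o^i| in the formula.
    """
    gate = (entropy >= tau).float()                          # I(H_t^i >= tau)
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantage
    per_token = gate * torch.minimum(unclipped, clipped)     # gated PPO-style min
    return per_token.sum() / per_token.numel()
```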

7. Significance and Applications

The OmniQuality-R framework marks a shift from legacy single-score, single-task evaluation toward unified, interpretable, and multi-dimensional quality assessment for generative systems and evaluation pipelines. The incorporation of plan–reason trajectories, continuous reward modeling, and robust RL stabilization yields performance gains not only in standard IQA metrics but also in practical deployment: for instance, OmniQuality-R can serve as a reliable, interpretable reward signal for test-time guidance of text-to-image generators, improving semantic alignment and overall content quality.

Such a unified, stable, and interpretable assessment architecture is positioned to become foundational in multi-modal generative model evaluation, human-in-the-loop RL applications, and next-generation LLM-based image understanding systems.
