Trust Your Critic: Robust Reward Modeling and Reinforcement Learning for Faithful Image Editing and Generation

Published 12 Mar 2026 in cs.CV | (2603.12247v1)

Abstract: Reinforcement learning (RL) has emerged as a promising paradigm for enhancing image editing and text-to-image (T2I) generation. However, current reward models, which act as critics during RL, often suffer from hallucinations and assign noisy scores, inherently misguiding the optimization process. In this paper, we present FIRM (Faithful Image Reward Modeling), a comprehensive framework that develops robust reward models to provide accurate and reliable guidance for faithful image generation and editing. First, we design tailored data curation pipelines to construct high-quality scoring datasets. Specifically, we evaluate editing using both execution and consistency, while generation is primarily assessed via instruction following. Using these pipelines, we collect the FIRM-Edit-370K and FIRM-Gen-293K datasets, and train specialized reward models (FIRM-Edit-8B and FIRM-Gen-8B) that accurately reflect these criteria. Second, we introduce FIRM-Bench, a comprehensive benchmark specifically designed for editing and generation critics. Evaluations demonstrate that our models achieve superior alignment with human judgment compared to existing metrics. Furthermore, to seamlessly integrate these critics into the RL pipeline, we formulate a novel "Base-and-Bonus" reward strategy that balances competing objectives: Consistency-Modulated Execution (CME) for editing and Quality-Modulated Alignment (QMA) for generation. Empowered by this framework, our resulting models FIRM-Qwen-Edit and FIRM-SD3.5 achieve substantial performance breakthroughs. Comprehensive experiments demonstrate that FIRM mitigates hallucinations, establishing a new standard for fidelity and instruction adherence over existing general models. All of our datasets, models, and code have been publicly available at https://firm-reward.github.io.

Abstract PDF Upgrade to Chat

Authors (10)

Summary

The paper presents the FIRM framework, which leverages tailored data pipelines and reward shaping (CME and QMA) to achieve high human-aligned image editing and generation.
It introduces innovative pipelines—'difference-first' for editing and 'plan-then-score' for generation—that significantly reduce hallucinations and reward hacking.
Experimental results on FIRM-Bench demonstrate that FIRM models outperform both proprietary and open-source models in accuracy and efficiency.

Faithful Image Reward Modeling: Bridging Reliability Gaps in Critic-Guided Image Editing and Generation

Introduction and Motivation

The paradigm shift toward Reinforcement Learning (RL) in image editing and text-to-image (T2I) generative modeling has surfaced a critical challenge: the reliability of reward models that serve as critics during the optimization process. Existing Multimodal LLMs (MLLMs), while excelling at general vision-language comprehension, routinely exhibit hallucinations, ignore fine-grained spatial or attribute directives, and emit noisy reward distributions when deployed zero-shot in image editing or generation tasks. This unreliability induces optimization failures, including reward hacking, stagnation, or degeneration to trivial solutions, critically bottlenecking progress in faithful image generation or intricate editing.

The FIRM (Faithful Image Reward Modeling) framework (2603.12247) comprehensively addresses these issues by constructing specialized, highly reliable reward models for both image editing (FIRM-Edit-8B) and T2I generation (FIRM-Gen-8B), orchestrated through meticulously engineered data curation pipelines and advanced reward shaping strategies. This work establishes new performance benchmarks in both subfields, with reward models demonstrably outperforming both proprietary and open-source incumbents in their alignment with human preference.

Data Curation and Reward Modeling Pipelines

A core innovation underlying FIRM is its bespoke reward data construction methodology, tailored for the unique challenges posed by image editing versus generation.

For image editing, the "difference-first" pipeline mitigates the evaluation mismatch observed in MLLMs: while models perform poorly as direct critics, they excel at describing visual differences between source and edited images. The approach leverages strong MLLMs to generate comprehensive difference reports, then conditions reward assessment on these, isolating task-execution (execution score) from context preservation (consistency score), with expert prompt design to elicit fine-grained supervision.

For T2I generation, the "plan-then-score" pipeline decomposes complex prompt requirements into a checklist using a large LLM as a planner. This plan is then supplied to a powerful MLLM evaluator, which systematically audits each dimension and aggregates performance, sharply reducing the attention dilution and hallucination observed in single-shot MLLM scoring.

Figure 1: FIRM's data curation architecture: "difference-first" for editing, and "plan-then-score" for generation reward dataset construction.

These pipelines synthesize two large-scale, high-quality datasets: FIRM-Edit-370K (editing) and FIRM-Gen-293K (generation), enabling task-specialized reward model pretraining/fine-tuning.

FIRM-Bench: A Rigorous Human-Aligned Evaluation Standard

To validate the faithfulness and human alignment of the learned critics, FIRM introduces FIRM-Bench, a human-expert-scored benchmark spanning both editing (execution, consistency) and generation (instruction following). Prompts are sourced from diverse, challenging datasets and images from heterogeneous models, with a uniform ground-truth score distribution.

Empirical results reveal that both FIRM-Edit-8B and FIRM-Gen-8B surpass leading proprietary (e.g., GPT-5, Gemini-3-Pro) and open-source (Qwen3-VL, InternVL3.5) reward models in mean absolute error (MAE) with respect to human annotation, even at lower parameter budgets.

Figure 2: FIRM reward models provide judgments more consistent with human assessment than strong baselines, as shown by qualitative FIRM-Bench examples.

Reward Formulation and RL Integration

Simple linear or fixed weighted reward formulations in RL pipelines are susceptible to reward hacking: e.g., optimizing for high consistency but not executing edits, or producing degenerate T2I outputs that trivially maximize instruction following while sacrificing visual quality.

To counteract this, FIRM introduces:

Consistency-Modulated Execution (CME): For editing, the reward is $R_{CME} = \text{Execution} \cdot (w_1 + w_2 \cdot \text{Consistency})$ , enforcing execution as a primary reward with consistency as a multiplicative gain, sharply suppressing strategies that hack the reward surface.
Quality-Modulated Alignment (QMA): For T2I, $R_{QMA} = \text{InsFollowing} \cdot (w_1 + w_2 \cdot \text{Quality})$ , ensuring that purely instruction-satisfying but low-quality generations are penalized, especially for simple prompts.

Ablation studies demonstrate the necessity of these formulations: naive alternatives result in significant performance collapse or misalignment.

Figure 3: Reward trajectories illustrate how raw weighted sum rewards suffer from credit assignment failures, whereas CME provides stable optimization without reward hacking.

Experimental Results

Reward Model Alignment

On FIRM-Bench, FIRM-Edit-8B achieves the lowest MAE for execution (0.53) and a strong consistency MAE (0.73), outperforming even much larger proprietary models (e.g., GPT-5, Gemini-3-Pro) as well as all open-source baselines. FIRM-Gen-8B exhibits similar strengths, especially on hard T2I examples.

RL-Guided Image Editing

When FIRM-Edit-8B is deployed as a critic in RL fine-tuning, FIRM-Qwen-Edit establishes new SOTA on GEditBench (G_Overall: 7.84) and highly competitive performance on ImgEdit. Notably, these gains require far fewer RL samples than prior state-of-the-art models—demonstrating efficient credit assignment and data efficiency.

Figure 4: Qualitative image editing comparisons validate the perceptible fidelity gain of FIRM-guided RL over competitive baselines.

Figure 5: RL reward curves: FIRM-based critics provide sharper discrimination, avoid optimistic bias, and support stable long-horizon policy improvement.

RL-Guided Text-to-Image Generation

Analogous performance gains are realized in T2I. FIRM-SD3.5 (SD3.5-Medium + RL with FIRM-Gen-8B) exceeds baseline SD3.5 and Qwen3-VL reward model RL across all major benchmarks (GenEval, DPG-Bench, TIIF, UniGenBench++), particularly as prompt complexity increases—highlighting FIRM’s robustness in complex compositional scenes.

Figure 6: Representative T2I generations via FIRM reward RL show visible improvements in instruction following and visual quality.

Theoretical and Practical Implications

This work empirically and methodologically redefines the standard for critic-guided RL in generative vision. It demonstrates:

General-purpose MLLMs are insufficient as high-fidelity critics for fine-grained editing or compositional generation, regardless of scaling.
Careful data pipeline engineering and specialized reward decoupling (difference-first and plan-then-score) significantly magnify critic accuracy relative to human evaluation.
Multiplicative reward shaping (CME, QMA) is essential for circumventing reward hacking and fostering alignment with target objectives.
Data, not just model size, constrains critic performance—a result generalizable to other domains (e.g., video, 3D, controllable synthesis).

Practically, these insights inform the design of robust RL alignment for any generative model requiring semantic, controllable, or human-aligned outputs. With the release of all FIRM datasets, benchmarks, and code, FIRM provides both methodology and infrastructure for scalable, reliable reward modeling.

Conclusion

FIRM sets a comprehensive new direction for reward modeling in RL-driven image editing and generation. By coupling tailored data curation, specialized reward model pretraining, rigorous benchmarks, and principled reward shaping, it enables demonstrably more faithful alignment with human intent and visual fidelity. This framework is likely to become foundational for subsequent work on RL-aligned vision-language foundation models, reward-guided creative pipelines, and scalable critic-centric generative systems.

Markdown Report Issue