
RewardDance: Reward Scaling in Visual Generation (2509.08826v1)

Published 10 Sep 2025 in cs.CV

Abstract: Reward Models (RMs) are critical for improving generation models via Reinforcement Learning (RL), yet the RM scaling paradigm in visual generation remains largely unexplored. This is primarily due to fundamental limitations in existing approaches: CLIP-based RMs suffer from architectural and input-modality constraints, while prevalent Bradley-Terry losses are fundamentally misaligned with the next-token prediction mechanism of Vision-Language Models (VLMs), hindering effective scaling. More critically, the RLHF optimization process is plagued by reward hacking, where models exploit flaws in the reward signal without improving true quality. To address these challenges, we introduce RewardDance, a scalable reward modeling framework that overcomes these barriers through a novel generative reward paradigm. By reformulating the reward score as the model's probability of predicting a "yes" token, indicating that the generated image outperforms a reference image according to specific criteria, RewardDance intrinsically aligns reward objectives with VLM architectures. This alignment unlocks scaling across two dimensions: (1) Model Scaling: systematic scaling of RMs up to 26 billion parameters; (2) Context Scaling: integration of task-specific instructions, reference examples, and chain-of-thought (CoT) reasoning. Extensive experiments demonstrate that RewardDance significantly surpasses state-of-the-art methods in text-to-image, text-to-video, and image-to-video generation. Crucially, we resolve the persistent challenge of reward hacking: our large-scale RMs exhibit and maintain high reward variance during RL fine-tuning, proving their resistance to hacking and their ability to produce diverse, high-quality outputs. This greatly relieves the mode collapse problem that plagues smaller models.


Summary

  • The paper introduces a generative reward modeling paradigm that reframes reward prediction as token generation aligned with VLM architectures.
  • It systematically scales reward models by increasing parameters up to 26B and integrating task instructions, reference examples, and chain-of-thought reasoning.
  • Enhanced resistance to reward hacking and mode collapse is demonstrated across text-to-image, text-to-video, and image-to-video tasks with state-of-the-art results.

RewardDance: A Scalable Generative Reward Modeling Paradigm for Visual Generation

Introduction

Reward models (RMs) are central to aligning generative models with human preferences, particularly in the context of reinforcement learning from human feedback (RLHF) for visual generation. However, the scaling of RMs in visual domains has been hindered by architectural and methodological mismatches. CLIP-based RMs are limited by their dual-encoder structure and unimodal design, while regressive RMs with appended regression heads are fundamentally misaligned with the autoregressive, next-token prediction mechanism of modern vision-language models (VLMs). These limitations restrict both the scalability and the effectiveness of reward modeling, leading to issues such as reward hacking and mode collapse during RL fine-tuning.

RewardDance introduces a generative reward modeling paradigm that reframes reward prediction as a token generation task, aligning the reward objective with the VLM's native architecture. This approach enables systematic scaling along two axes: model size (up to 26B parameters) and context richness (task-aware instructions, reference examples, and chain-of-thought (CoT) reasoning). The framework demonstrates robust improvements in text-to-image, text-to-video, and image-to-video generation, with strong empirical evidence for enhanced resistance to reward hacking and improved output diversity.

Generative Reward Modeling Paradigm

Paradigm Shift: From Regression to Generation

Traditional regressive RMs predict scalar rewards using a regression head, optimized with Bradley-Terry loss on preference pairs. This approach is misaligned with the autoregressive, token-based nature of VLMs, resulting in suboptimal utilization of pre-trained knowledge and limited scalability. RewardDance addresses this by modeling the reward as the probability of generating a "yes" token in response to a comparative evaluation prompt:

$$r_e(x_1, x_2, y, i) = P_e(\text{"yes"} \mid x_1, x_2, y, i)$$

where $x_1$ and $x_2$ are image tokens, $y$ is the prompt, and $i$ is the task instruction. This generative formulation aligns natively with the VLM's next-token prediction, facilitating both model and context scaling.
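
As a concrete illustration, the sketch below shows one way such a generative reward could be computed from a causal VLM's next-token logits, assuming a Hugging-Face-style model and tokenizer interface; the prompt construction and token names are placeholders rather than the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def generative_reward(model, tokenizer, input_ids):
    """Reward = P("yes") at the next-token position, renormalized over {"yes", "no"}.

    `input_ids` is assumed to already encode the image tokens x1 and x2,
    the prompt y, and the task instruction i; interfaces are illustrative.
    """
    yes_id = tokenizer.convert_tokens_to_ids("yes")
    no_id = tokenizer.convert_tokens_to_ids("no")

    with torch.no_grad():
        logits = model(input_ids=input_ids).logits[:, -1, :]  # next-token logits

    # Restrict to the two answer tokens and renormalize, giving a score in [0, 1].
    pair_logits = logits[:, [yes_id, no_id]]
    return F.softmax(pair_logits, dim=-1)[:, 0]  # P("yes" | x1, x2, y, i)
```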

Context Scaling

RewardDance extends the input context beyond simple image-prompt pairs by incorporating:

  • Task-aware instructions: Explicit criteria for evaluation (e.g., text-image alignment, style consistency).
  • Reference examples: Comparative judgments between candidate images.
  • Chain-of-Thought (CoT) reasoning: The model generates not only a "yes/no" decision but also a rationale, improving interpretability and reward signal quality.

This enriched context enables more precise, robust, and interpretable reward judgments, directly benefiting downstream RLHF and inference-time optimization.
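
A minimal sketch of how such an enriched evaluation prompt might be assembled is shown below; the wording, placeholder image tokens, and field names are assumptions for illustration, not the template used in the paper.

```python
def build_reward_prompt(task_instruction: str, text_prompt: str, use_cot: bool = True) -> str:
    """Assemble a comparative evaluation prompt for the reward VLM.

    <image_1> and <image_2> stand in for the candidate and reference image
    token placeholders; all wording here is illustrative.
    """
    parts = [
        f"Evaluation criteria: {task_instruction}",
        f"Text prompt: {text_prompt}",
        "Candidate image: <image_1>",
        "Reference image: <image_2>",
    ]
    if use_cot:
        parts.append("Explain your reasoning step by step before answering.")
    parts.append('Does the candidate outperform the reference? Answer "yes" or "no".')
    return "\n".join(parts)
```

For example, `build_reward_prompt("text-image alignment", "a red bicycle leaning against a brick wall")` would produce a CoT-enabled comparative prompt for that criterion.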

Model Scaling

RewardDance systematically scales the RM using InternVL-based architectures from 1B to 26B parameters. Empirical results demonstrate a strong positive correlation between RM size and both reward evaluation performance and final generation quality. Larger RMs exhibit higher reward variance during RL training, indicating greater exploration and resistance to reward hacking.

Training and Alignment Pipeline

Reward Model Training

RewardDance is trained on preference data with both pairwise and pointwise generative variants. The pairwise variant uses reference images for comparative evaluation, while the pointwise variant evaluates single images against task instructions. Training employs a combination of Bradley-Terry loss and weighted cross-entropy to facilitate convergence.
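
The loss combination described above might look roughly like the following sketch, where the pairing of a Bradley-Terry term with a weighted cross-entropy term on the "yes" prediction, the weighting, and the interpolation coefficient are all assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def reward_training_loss(yes_logit_chosen, yes_logit_rejected, alpha=0.5, pos_weight=1.0):
    """Hypothetical training objective for the generative RM.

    yes_logit_chosen / yes_logit_rejected: scalar "yes"-token logits for the
    preferred and dispreferred image of each preference pair, shape (batch,).
    """
    # Bradley-Terry term: the preferred image should receive a higher "yes" logit.
    bt_loss = -F.logsigmoid(yes_logit_chosen - yes_logit_rejected).mean()

    # Weighted cross-entropy term on the generative "yes"/"no" prediction:
    # label the preferred image 1 ("yes") and the dispreferred image 0 ("no").
    logits = torch.cat([yes_logit_chosen, yes_logit_rejected])
    labels = torch.cat(
        [torch.ones_like(yes_logit_chosen), torch.zeros_like(yes_logit_rejected)]
    )
    ce_loss = F.binary_cross_entropy_with_logits(
        logits, labels, pos_weight=torch.tensor(pos_weight)
    )
    return alpha * bt_loss + (1 - alpha) * ce_loss
```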

RLHF and Inference-Time Scaling

  • RL Fine-tuning: The ReFL algorithm is used, with RewardDance providing preference signals. Best-of-N (BoN) sampling identifies high-quality reference images for each prompt, which are used in subsequent RL steps (a minimal BoN sketch follows this list).
  • Inference-Time Scaling: A Search over Paths strategy prunes multiple generation trajectories using a lightweight, pointwise RewardDance verifier, enhancing output quality without retraining.
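
The BoN reference selection mentioned in the first bullet can be sketched as follows; `generator` and `reward_fn` are hypothetical callables standing in for the generation model and a pointwise RewardDance scorer.

```python
def best_of_n_reference(generator, reward_fn, prompt, n=8):
    """Sample N candidates for a prompt and keep the highest-scoring one
    as the reference image for subsequent RL steps (illustrative only)."""
    candidates = [generator(prompt) for _ in range(n)]
    scores = [reward_fn(image, prompt) for image in candidates]
    best_index = max(range(n), key=lambda k: scores[k])
    return candidates[best_index]
```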

Empirical Results

Reward Model Scaling

  • Text-to-Image: Scaling the RM from 1B to 26B parameters yields substantial improvements in alignment scores (e.g., Seedream-3.0: 74.1 → 84.8).
  • Text-to-Video and Image-to-Video: GSB scores improve by up to +49% (T2V) and +47% (I2V) over supervised fine-tuning (SFT) baselines with the 26B RM.
  • Generalization: Out-of-distribution (OOD) accuracy of the RM is a stronger predictor of RL performance than in-distribution (ID) accuracy, highlighting the importance of generalization in reward modeling.

Comparison with SOTA

RewardDance-optimized models achieve state-of-the-art results on GenEval, Bench-240, and SeedVideoBench-1.0, outperforming both academic and commercial baselines in text-to-image and video generation tasks.

Ablation Studies

  • Paradigm Shift: Transitioning from regressive to generative reward modeling consistently improves performance.
  • Reference Quality: Higher-quality BoN reference images yield incremental gains.
  • CoT Reasoning: Incorporating CoT data provides substantial improvements in alignment scores.
  • Scaling Laws: Larger generative models derive greater benefits from larger RMs, with performance gains more pronounced for high-capacity architectures.

Reward Hacking and Mode Collapse

Larger RMs maintain higher reward variance during RL training, indicating sustained exploration and robustness against reward hacking. Smaller RMs converge prematurely, leading to mode collapse and reduced output diversity.
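
As a simple diagnostic in the spirit of this observation, one could track per-batch reward variance during RL fine-tuning and treat a collapse toward zero as a warning sign of reward hacking or mode collapse; the interfaces below are placeholders, not the paper's tooling.

```python
import torch

def batch_reward_variance(reward_fn, images, prompts):
    """Variance of reward scores over a batch of rollouts.

    A variance that stays high suggests the policy is still exploring diverse
    outputs; a variance collapsing toward zero is the symptom described above.
    `reward_fn` is a hypothetical pointwise scorer returning a scalar tensor.
    """
    with torch.no_grad():
        rewards = torch.stack([reward_fn(img, p) for img, p in zip(images, prompts)])
    return rewards.var(unbiased=False).item()
```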

Implications and Future Directions

RewardDance establishes scalability—both in model size and context richness—as a foundational principle for visual reward modeling. The generative paradigm resolves the architectural mismatch of prior approaches, enabling more effective alignment of generative models with human preferences. The demonstrated resistance to reward hacking and mode collapse is particularly significant for stable RLHF in high-capacity diffusion models.

Future research directions include:

  • Scaling RMs beyond 26B parameters (e.g., 70B/100B) to further exploit scaling laws.
  • Extending capability dimensions to motion modeling, aesthetics, and unified understanding-generation tasks.
  • Advancing cross-modal reward signal scaling for many-to-vision tasks (e.g., audio/video-to-video).
  • Integrating richer context, reflection, and in-context learning mechanisms for even more robust reward modeling.

Conclusion

RewardDance introduces a scalable, generative reward modeling framework that aligns reward prediction with the autoregressive architecture of VLMs. By scaling both model size and context, RewardDance delivers consistent improvements in visual generation quality, robust resistance to reward hacking, and state-of-the-art performance across image and video generation tasks. This work establishes a new standard for reward modeling in visual generative systems and provides a clear path for future advancements in scalable, human-aligned visual generation.
