InternLM-XComposer2.5-Reward: A Simple Yet Effective Multi-Modal Reward Model (2501.12368v1)

Published 21 Jan 2025 in cs.CV and cs.CL

Abstract: Despite the promising performance of Large Vision Language Models (LVLMs) in visual understanding, they occasionally generate incorrect outputs. While reward models (RMs) with reinforcement learning or test-time scaling offer the potential for improving generation quality, a critical gap remains: publicly available multi-modal RMs for LVLMs are scarce, and the implementation details of proprietary models are often unclear. We bridge this gap with InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a simple yet effective multi-modal reward model that aligns LVLMs with human preferences. To ensure the robustness and versatility of IXC-2.5-Reward, we set up a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, such as instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. IXC-2.5-Reward achieves excellent results on the latest multi-modal reward model benchmark and shows competitive performance on text-only reward model benchmarks. We further demonstrate three key applications of IXC-2.5-Reward: (1) Providing a supervisory signal for RL training. Integrating IXC-2.5-Reward with Proximal Policy Optimization (PPO) yields IXC-2.5-Chat, which shows consistent improvements in instruction following and multi-modal open-ended dialogue; (2) Selecting the best response from candidate responses for test-time scaling; and (3) Filtering outlier or noisy samples from existing image and video instruction tuning training data. To ensure reproducibility and facilitate further research, we have open-sourced all model weights and training recipes at https://github.com/InternLM/InternLM-XComposer

This paper introduces InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a multi-modal reward model designed to align Large Vision Language Models (LVLMs) with human preferences. The paper addresses the scarcity of publicly available multi-modal reward models and the lack of clarity surrounding the implementation details of proprietary models.

To ensure IXC-2.5-Reward's robustness and versatility, the authors constructed a high-quality multi-modal preference corpus. This corpus spans text, image, and video inputs across diverse domains, including instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. The model achieves 70.0% accuracy on the VL-RewardBench benchmark, surpassing previous generative RMs such as Gemini-1.5-Pro (62.5%) and GPT-4o (62.4%). Even on uni-modal (text) RM benchmarks, IXC-2.5-Reward remains competitive, with an average score of 88.6% on Reward-Bench and 68.8% on RM-Bench.

Key aspects of the paper include:

  • Data Preparation: The authors collected a multi-modal preference dataset that combines existing high-quality datasets with newly collected data. The collection pipeline selects prompts across diverse domains for text, image, and video inputs, generates corresponding responses, and then uses GPT-4o or verifiers to perform preference judgments. The open-source pairwise data focuses on instruction following, safety, and general knowledge, while the newly collected data covers text-rich document understanding, mathematical reasoning, and video understanding. The authors prompted the supervised fine-tuning (SFT) model, InternLM-XComposer-2.5 (IXC-2.5), to generate multiple outputs for each prompt to obtain rejected responses (a minimal sketch of this pairwise construction appears after this list).
  • Model Architecture: IXC-2.5-Reward is built upon the SFT model (IXC-2.5). The pre-trained weights of IXC-2.5-Chat are used for the visual encoder and the MLP (Multi-Layer Perceptron) projector. The final linear layer of IXC-2.5 is replaced with a score head $f$ that predicts the reward score. Given an input prompt $x$ and response $y$, the score head $f$ transforms the averaged hidden-state features of all tokens into a scalar $r(x, y)$, which serves as the predicted reward score for the inputs.
    • $r(x, y)$: Predicted reward score for prompt $x$ and response $y$
  • Loss Function: The reward model is trained with the pairwise ranking loss below (a PyTorch-style sketch of the score head and this loss appears after this list):

    $\mathcal{L}_{\text{RM}} = -\mathbb{E}\big[\log \sigma\big(r(x, y_{w}) - r(x, y_{l})\big)\big]$

    • $\mathcal{L}_{\text{RM}}$: Reward model loss
    • $\mathbb{E}$: Expectation
    • $\sigma$: Sigmoid function
    • $r(x, y_{w})$: Reward score assigned to prompt $x$ with the chosen response $y_{w}$
    • $r(x, y_{l})$: Reward score assigned to prompt $x$ with the rejected response $y_{l}$
  • Training Strategy: The vision encoder and MLP projector are initialized from IXC-2.5 and kept frozen; only the LLM (InternLM) and the score head are trained.
  • Length Constraints: Data pairs are removed when the chosen response $y_{w}$ is significantly longer than the rejected response $y_{l}$, preventing the reward model from learning to associate response length with quality.
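The pairwise construction described in the Data Preparation item can be summarized with the hedged sketch below. The function names (`generate_candidates`, `judge`) are hypothetical placeholders rather than the authors' released code: in the paper, candidates come from the SFT model (IXC-2.5) and preference judgments come from GPT-4o or verifiers.

```python
from typing import Callable, List, Tuple

def build_preference_pairs(
    prompts: List[str],
    generate_candidates: Callable[[str, int], List[str]],  # placeholder: sample from the SFT model
    judge: Callable[[str, str, str], int],                  # placeholder: GPT-4o/verifier, returns 0 or 1
    num_candidates: int = 4,
) -> List[Tuple[str, str, str]]:
    """Build (prompt, chosen, rejected) triples by judging sampled candidate pairs."""
    pairs = []
    for prompt in prompts:
        candidates = generate_candidates(prompt, num_candidates)
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                preferred = judge(prompt, candidates[i], candidates[j])
                chosen, rejected = (
                    (candidates[i], candidates[j]) if preferred == 0
                    else (candidates[j], candidates[i])
                )
                pairs.append((prompt, chosen, rejected))
    return pairs
```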

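A minimal PyTorch-style sketch of the score head, the pairwise loss, and the length filter described above. It assumes a generic backbone that returns per-token hidden states of size `hidden_size`; it illustrates the ranking objective and is not the released IXC-2.5-Reward implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Backbone plus a scalar score head, per the architecture description (illustrative)."""
    def __init__(self, backbone: nn.Module, hidden_size: int):
        super().__init__()
        self.backbone = backbone                      # LVLM backbone (vision encoder + projector + LLM)
        self.score_head = nn.Linear(hidden_size, 1)   # replaces the final linear (LM) layer

    def forward(self, input_ids, attention_mask):
        hidden = self.backbone(input_ids, attention_mask)        # assumed to return (B, T, H) hidden states
        mask = attention_mask.unsqueeze(-1).float()
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # average hidden state over all tokens
        return self.score_head(pooled).squeeze(-1)               # scalar r(x, y) per sample

def reward_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # L_RM = -E[log sigma(r(x, y_w) - r(x, y_l))]
    return -F.logsigmoid(r_chosen - r_rejected).mean()

def keep_pair(len_chosen: int, len_rejected: int, max_ratio: float = 2.0) -> bool:
    # Length constraint: drop pairs whose chosen response is much longer than the rejected one,
    # so the model does not learn to equate length with quality (the 2.0 threshold is illustrative).
    return len_chosen <= max_ratio * len_rejected
```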
The paper demonstrates three key applications of IXC-2.5-Reward:

  1. RL Training: IXC-2.5-Reward provides a supervisory signal for reinforcement learning. The authors integrate IXC-2.5-Reward with Proximal Policy Optimization (PPO) to produce IXC-2.5-Chat, which shows improvements in instruction following and multi-modal open-ended dialogue. PPO training samples a prompt from a prompt set; the policy model $\pi_{\theta}$ generates responses, and the reward model computes the reward score $r_{t}$ at each state $s_{t}$ at time-step $t$. The temporal difference error $\delta_{t}$, the Generalized Advantage Estimation (GAE) $A_{t}$, and the returns $R_{t}$ are computed as follows (a PPO/GAE sketch in code appears after this list):

    • $\delta_{t} = r_{t} + \gamma \cdot V(s_{t+1}) - V(s_{t})$
    • $A_{t} = \delta_{t} + \gamma \cdot \beta \cdot A_{t+1}$
    • $R_{t} = A_{t} + V(s_{t})$
    • $r_{t}$: Reward score at state $s_{t}$ at time-step $t$
    • $V$: Critic model
    • $\gamma$: Discount factor
    • $\beta$: Parameter controlling the trade-off between bias and variance in advantage estimation

    Based on the advantage $A$, the policy gradient loss $\mathcal{L}_{\text{PG}}$ is computed to update the policy model $\pi_{\theta}$:

    $\mathcal{L}_{\text{PG}} = \min\left(\frac{\pi_{\theta}}{\pi_{\text{ref}}} \cdot A, \ \text{clip}\left(\frac{\pi_{\theta}}{\pi_{\text{ref}}}, 1.0 - \epsilon, 1.0 + \epsilon\right) \cdot A\right)$

    • $\mathcal{L}_{\text{PG}}$: Policy gradient loss
    • $\frac{\pi_{\theta}}{\pi_{\text{ref}}}$: Probability ratio between the policy model $\pi_{\theta}$ and the reference model $\pi_{\text{ref}}$ (computed from their log-probabilities)
    • $\epsilon$: Hyper-parameter that controls the clipping range

    The critic model is updated via the Mean Squared Error (MSE) loss:

    $\mathcal{L}_{\text{critic}} = \sum_{t} \text{MSE}\left( V(s_{t}), R_{t} \right)$

    • $\mathcal{L}_{\text{critic}}$: Critic loss

  2. Test-Time Scaling: IXC-2.5-Reward selects the best response from candidate responses for test-time scaling. The authors use Best-of-$N$ sampling with IXC-2.5-Reward, which yields further performance gains over the RL (Reinforcement Learning)-trained IXC-2.5-Chat (a minimal selection sketch appears below).
  3. Data Cleaning: IXC-2.5-Reward filters outlier or noisy samples from existing image and video instruction-tuning training data. The authors observe a correlation between low IXC-2.5-Reward scores and problematic samples, such as those exhibiting hallucinations or mismatched image/video and question/answer content (also covered in the sketch below).
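The advantage, return, and loss computations listed under application (1) can be written as a compact sketch. This is a generic PPO/GAE illustration following the formulas above, not the authors' training code; the probability ratio is computed from log-probabilities, and the policy objective is negated so it can be minimized as a loss.

```python
import torch
import torch.nn.functional as F

def gae_and_returns(rewards, values, gamma=0.99, beta=0.95):
    """Compute A_t and R_t via the recursions above.
    rewards: (T,) per-step rewards; values: (T+1,) critic estimates incl. a bootstrap value."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # delta_t (TD error)
        next_adv = delta + gamma * beta * next_adv               # A_t = delta_t + gamma * beta * A_{t+1}
        advantages[t] = next_adv
    returns = advantages + values[:-1]                           # R_t = A_t + V(s_t)
    return advantages, returns

def ppo_losses(logp_policy, logp_ref, advantages, values_pred, returns, eps=0.2):
    ratio = torch.exp(logp_policy - logp_ref)                    # pi_theta / pi_ref
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()          # clipped objective, negated as a loss
    critic_loss = F.mse_loss(values_pred, returns)               # L_critic
    return policy_loss, critic_loss
```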

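Applications (2) and (3) both reduce to scoring (prompt, response) pairs with the reward model. The sketch below uses an assumed `score(prompt, response)` callable as the scoring interface and an illustrative threshold; neither is taken from the released API.

```python
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              candidates: List[str],
              score: Callable[[str, str], float]) -> str:
    # Test-time scaling: keep the candidate response with the highest reward score.
    return max(candidates, key=lambda response: score(prompt, response))

def filter_noisy_samples(samples: List[Tuple[str, str]],
                         score: Callable[[str, str], float],
                         threshold: float = 0.0) -> List[Tuple[str, str]]:
    # Data cleaning: drop instruction-tuning samples whose reward score falls below a threshold.
    return [(p, r) for (p, r) in samples if score(p, r) >= threshold]
```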
The experimental results demonstrate that IXC-2.5-Reward achieves state-of-the-art performance on multi-modal reward model benchmarks and shows competitive performance on text-only reward model benchmarks. The authors also present visualization examples of IXC-2.5-Chat on a series of topics, such as instruction following and open-ended questions. These figures reveal that IXC-2.5-Chat demonstrates several key advantages, including superior organization and presentation, more comprehensive and in-depth answers, and more detailed explanations.

Authors (13)
  1. Yuhang Zang (54 papers)
  2. Xiaoyi Dong (73 papers)
  3. Pan Zhang (153 papers)
  4. Yuhang Cao (41 papers)
  5. Ziyu Liu (47 papers)
  6. Shengyuan Ding (5 papers)
  7. Shenxi Wu (4 papers)
  8. Yubo Ma (22 papers)
  9. Haodong Duan (55 papers)
  10. Wenwei Zhang (77 papers)
  11. Kai Chen (512 papers)
  12. Dahua Lin (336 papers)
  13. Jiaqi Wang (218 papers)