This paper introduces InternLM-XComposer2.5-Reward (IXC-2.5-Reward), a multi-modal reward model designed to align Large Vision Language Models (LVLMs) with human preferences. The paper addresses the scarcity of publicly available multi-modal reward models and the lack of clarity surrounding the implementation details of proprietary ones.
To ensure IXC-2.5-Reward's robustness and versatility, the authors constructed a high-quality multi-modal preference corpus spanning text, image, and video inputs across diverse domains, including instruction following, general understanding, text-rich documents, mathematical reasoning, and video understanding. The model achieves 70.0% accuracy on the VL-RewardBench benchmark, surpassing previous generative RMs such as Gemini-1.5-Pro (62.5%) and GPT-4o (62.4%). Even on uni-modal (text) RM benchmarks, IXC-2.5-Reward remains competitive, with an average score of 88.6% on Reward-Bench and 68.8% on RM-Bench.
Key aspects of the paper include:
- Data Preparation: The authors collected a multi-modal preference dataset that combines existing high-quality datasets with newly collected data. The pipeline selects prompts across diverse domains for text, image, and video inputs, generates corresponding responses, and then uses GPT-4o or verifiers to perform preference judgments. The open-source pairwise data focuses on instruction following, safety, and general knowledge, while the newly collected data covers text-rich document understanding, math reasoning, and video understanding. The authors prompted the supervised fine-tuning (SFT) model, InternLM-XComposer-2.5 (IXC-2.5), to generate multiple outputs for each prompt in order to obtain rejected responses (a hedged sketch of this pairing step follows below).
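As a rough illustration of this pairing step, the following Python sketch assumes hypothetical helpers `generate_responses` (sampling from the SFT model) and `judge` (a GPT-4o-style or verifier-based scorer); the best-vs-worst pairing rule is an assumption, not the authors' exact pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    prompt: str      # the prompt may reference an accompanying image or video
    chosen: str
    rejected: str

def build_preference_pairs(prompts: List[str],
                           generate_responses: Callable,  # hypothetical: samples from the SFT model (IXC-2.5)
                           judge: Callable,                # hypothetical: GPT-4o-style or verifier-based scorer
                           n_samples: int = 4) -> List[PreferencePair]:
    """Sample several responses per prompt and keep the best/worst as chosen/rejected."""
    pairs = []
    for prompt in prompts:
        candidates = generate_responses(prompt, n=n_samples)
        ranked = sorted(candidates, key=lambda resp: judge(prompt, resp), reverse=True)
        if len(ranked) >= 2:
            pairs.append(PreferencePair(prompt, chosen=ranked[0], rejected=ranked[-1]))
    return pairs
```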
- Model Architecture: IXC-2.5-Reward is built upon the SFT model (IXC-2.5). The pre-trained weights of IXC-2.5-Chat are used for the visual encoder and the MLP (Multi-Layer Perceptron) projector. The final linear layer of IXC-2.5 is replaced with a score head that predicts the reward score. Given an input prompt $x$ and response $y$, the score head transforms the averaged hidden-state features of all tokens into a scalar $r(x, y)$, which serves as the predicted reward score for the inputs (a minimal sketch of this score head is given after the definition below).
- $r(x, y)$: Predicted reward score for prompt $x$ and response $y$
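A minimal PyTorch sketch of such a score head, assuming mean pooling over non-padding tokens; the pooling details and module names are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RewardScoreHead(nn.Module):
    """Maps mean-pooled token hidden states to a scalar reward r(x, y)."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)  # replaces the original LM output layer

    def forward(self, hidden_states: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden); attention_mask: (batch, seq_len)
        mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
        pooled = (hidden_states * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
        return self.score(pooled).squeeze(-1)   # (batch,) predicted reward scores
```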
- Loss Function: The reward model is trained via the following pairwise ranking loss (a minimal sketch appears after the symbol definitions below):
$\mathcal{L}_{\text{RM}} = -\mathbb{E}\left[\log \sigma\big(r(x, y_{w}) - r(x, y_{l})\big)\right]$
- $\mathcal{L}_{\text{RM}}$: Reward model loss
- $\mathbb{E}$: Expectation
- $\sigma$: Sigmoid function
- $r(x, y_{w})$: Reward score assigned to the prompt $x$ with the chosen response $y_{w}$
- $r(x, y_{l})$: Reward score assigned to the prompt $x$ with the rejected response $y_{l}$
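In code, this loss reduces to a log-sigmoid of the score margin between chosen and rejected responses; the sketch below is a minimal illustration, not the authors' training code.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """L_RM = -E[log sigmoid(r(x, y_w) - r(x, y_l))], averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```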
- Training Strategy: The vision encoder and projector are initialized from IXC-2.5 and kept frozen; only the LLM (InternLM) and the score head are trained (see the sketch below).
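A short sketch of this freezing scheme; the attribute names (`vision_encoder`, `projector`, `llm`, `score_head`) are placeholders for the corresponding IXC-2.5 modules.

```python
def set_trainable_parts(model) -> None:
    """Freeze the vision encoder and projector; train only the LLM and score head."""
    for module in (model.vision_encoder, model.projector):   # placeholders, kept frozen
        for p in module.parameters():
            p.requires_grad_(False)
    for module in (model.llm, model.score_head):             # placeholders, updated during training
        for p in module.parameters():
            p.requires_grad_(True)
```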
- Length Constraints: Data pairs in which the chosen response is significantly longer than the rejected response are removed, to prevent the reward model from learning to associate length with quality (see the sketch below).
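One simple way to apply such a constraint is a length-ratio filter; the 1.5x threshold below is an assumption, since the paper does not state the exact cutoff. `pairs` reuses the `PreferencePair` structure from the earlier sketch.

```python
def filter_length_bias(pairs, max_ratio: float = 1.5):
    """Drop pairs whose chosen response is much longer than the rejected one."""
    return [p for p in pairs
            if len(p.chosen.split()) <= max_ratio * len(p.rejected.split())]
```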
The paper demonstrates three key applications of IXC-2.5-Reward:
- RL Training: IXC-2.5-Reward provides the supervisory signal for reinforcement learning. The authors integrated IXC-2.5-Reward with Proximal Policy Optimization (PPO) to obtain IXC-2.5-Chat, which shows improvements in instruction following and multi-modal open-ended dialogue. PPO training samples a prompt from a prompt set; the policy model generates responses, and the reward model computes the reward score $r_{t}$ at each state $s_{t}$ at time-step $t$. The temporal difference error $\delta_{t}$, the Generalized Advantage Estimation (GAE) advantage $A_{t}$, and the returns $R_{t}$ are computed as (a minimal sketch of these computations follows the symbol definitions below):
$\delta_{t} = r_{t} + \gamma V(s_{t+1}) - V(s_{t}), \quad A_{t} = \sum_{l=0}^{\infty} (\gamma \lambda)^{l} \delta_{t+l}, \quad R_{t} = A_{t} + V(s_{t})$
- $r_{t}$: Reward score at state $s_{t}$ at time-step $t$
- $V$: Critic model
- $\gamma$: Discount factor
- $\lambda$: Parameter controlling the trade-off between bias and variance in advantage estimation.
Based on the advantage $A_{t}$, the policy gradient loss $\mathcal{L}_{\text{PG}}$ is computed to update the policy model $\pi_{\theta}$:
$\mathcal{L}_{\text{PG}} = \min\left(\frac{\pi_{\theta}}{\pi_{\text{ref}}} \cdot A, \; \text{clip}\left(\frac{\pi_{\theta}}{\pi_{\text{ref}}}, 1 - \epsilon, 1 + \epsilon\right) \cdot A\right)$
- $\mathcal{L}_{\text{PG}}$: Policy gradient loss
- $\frac{\pi_{\theta}}{\pi_{\text{ref}}}$: Probability ratio between the policy model and the reference model (computed in practice as the exponential of the log-probability difference)
- $\epsilon$: Hyper-parameter that controls the clip range.
The critic model is updated via the Mean Squared Error (MSE) loss:
$\mathcal{L}_{\text{critic}} = \sum_{t} \text{MSE}\big(V(s_{t}), R_{t}\big)$
- $\mathcal{L}_{\text{critic}}$: Critic loss
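A minimal sketch of the GAE recursion and the clipped PPO update for a single trajectory. The tensor shapes, the recursive form of $A_{t}$, and the sign convention (negating the clipped objective for gradient descent) are standard choices, not details taken from the paper.

```python
import torch
import torch.nn.functional as F

def gae_and_returns(rewards: torch.Tensor, values: torch.Tensor,
                    gamma: float = 0.99, lam: float = 0.95):
    """rewards: (T,); values: (T + 1,), including a bootstrap value for the final state."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # temporal difference error
        gae = delta + gamma * lam * gae                          # GAE advantage A_t (recursive form)
        advantages[t] = gae
    returns = advantages + values[:T]                            # R_t = A_t + V(s_t)
    return advantages, returns

def ppo_losses(logprobs, ref_logprobs, advantages, values, returns, eps: float = 0.2):
    """Clipped policy-gradient loss (negated for minimization) and MSE critic loss."""
    ratio = torch.exp(logprobs - ref_logprobs)                   # pi_theta / pi_ref
    pg_loss = -torch.min(ratio * advantages,
                         torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages).mean()
    critic_loss = F.mse_loss(values, returns)
    return pg_loss, critic_loss
```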
- Test-Time Scaling: IXC-2.5-Reward selects the best response from a set of candidate responses at test time. The authors use Best-of-$N$ sampling with IXC-2.5-Reward, which yields further performance gains over the RL-trained IXC-2.5-Chat (see the sketch below).
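A sketch of Best-of-N selection with a reward model; `policy_generate` and `reward_fn` are placeholder callables standing in for the chat model's sampler and IXC-2.5-Reward.

```python
import torch

@torch.no_grad()
def best_of_n(prompt: str, policy_generate, reward_fn, n: int = 8) -> str:
    """Sample N candidate responses and return the one with the highest reward score."""
    candidates = [policy_generate(prompt) for _ in range(n)]
    scores = torch.tensor([reward_fn(prompt, c) for c in candidates])
    return candidates[scores.argmax().item()]
```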
- Data Cleaning: IXC-2.5-Reward filters outlier or noisy samples from existing image and video instruction-tuning data. The authors observe that low IXC-2.5-Reward scores correlate with problematic samples, such as those exhibiting hallucinations or mismatched image/video and question/answer content (see the sketch below).
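A sketch of reward-based filtering of instruction-tuning data; the threshold is an assumption (for example, a low percentile of the score distribution) rather than a value reported in the paper.

```python
def clean_sft_samples(samples, reward_fn, threshold: float):
    """Keep (prompt, answer) samples scoring at or above the threshold; flag the rest for review."""
    kept, flagged = [], []
    for prompt, answer in samples:
        (kept if reward_fn(prompt, answer) >= threshold else flagged).append((prompt, answer))
    return kept, flagged
```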
The experimental results show that IXC-2.5-Reward achieves state-of-the-art performance on multi-modal reward model benchmarks and competitive performance on text-only reward model benchmarks. The authors also present qualitative examples of IXC-2.5-Chat on topics such as instruction following and open-ended questions, which illustrate several advantages: better organization and presentation, more comprehensive and in-depth answers, and more detailed explanations.