
Kimi k1.5: Scaling Reinforcement Learning with LLMs (2501.12599v4)

Published 22 Jan 2025 in cs.AI and cs.LG

Abstract: LLM pretraining with next token prediction has proved effective for scaling compute but is limited to the amount of available training data. Scaling reinforcement learning (RL) unlocks a new axis for the continued improvement of artificial intelligence, with the promise that LLMs can scale their training data by learning to explore with rewards. However, prior published work has not produced competitive results. In light of this, we report on the training practice of Kimi k1.5, our latest multi-modal LLM trained with RL, including its RL training techniques, multi-modal data recipes, and infrastructure optimization. Long context scaling and improved policy optimization methods are key ingredients of our approach, which establishes a simplistic, effective RL framework without relying on more complex techniques such as Monte Carlo tree search, value functions, and process reward models. Notably, our system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities -- e.g., 77.5 on AIME, 96.2 on MATH 500, 94-th percentile on Codeforces, 74.9 on MathVista -- matching OpenAI's o1. Moreover, we present effective long2short methods that use long-CoT techniques to improve short-CoT models, yielding state-of-the-art short-CoT reasoning results -- e.g., 60.8 on AIME, 94.6 on MATH500, 47.3 on LiveCodeBench -- outperforming existing short-CoT models such as GPT-4o and Claude Sonnet 3.5 by a large margin (up to +550%).

Summary

  • The paper introduces Kimi k1.5, a multimodal LLM trained via reinforcement learning with long-context scaling and dynamic policy optimization.
  • It employs innovative techniques such as partial rollouts, length penalties, and curriculum sampling to enhance reasoning and improve training efficiency.
  • The model achieves state-of-the-art performance in reasoning, math, and coding benchmarks while enabling efficient knowledge transfer from long to short reasoning chains.

This paper introduces Kimi k1.5, a new multimodal LLM that is trained using reinforcement learning (RL). The primary goal of this research is to explore how RL can be used to scale LLMs beyond the limitations of traditional pretraining methods, which rely on static datasets. The paper emphasizes that, unlike pretraining, RL allows models to learn from interactions and rewards, enabling them to explore and improve dynamically. The authors highlight the significance of long context scaling and improved policy optimization techniques as key components of their approach.

Background and Motivation

The prevailing approach to training LLMs involves pretraining with next token prediction. While this method has proven effective in scaling with computational resources, it is inherently limited by the amount of high-quality training data available. Reinforcement learning offers a potential solution by enabling models to learn through interaction with an environment and from rewards, thereby reducing reliance on pre-existing static datasets. However, previous attempts at using RL for LLMs have not yielded competitive results. This paper describes the methods used to train Kimi k1.5 with RL to achieve state-of-the-art reasoning capabilities.

Key Components of Kimi k1.5

The development of Kimi k1.5 involved several key strategies:

  • Long Context Scaling: The model is trained with a context window of 128,000 tokens, significantly larger than many other models. To improve training efficiency, the authors use "partial rollouts," in which large portions of previous trajectories are reused rather than regenerated from scratch. This method also allows for longer reasoning chains and more thorough exploration of the solution space.
  • Improved Policy Optimization: The paper introduces a new approach to policy optimization that is based on online mirror descent, which is enhanced by a sampling strategy, a length penalty, and an optimization of the data recipe. This method is designed to ensure robust policy updates during training.
  • Simple, Effective Framework: The approach combines long context scaling and improved policy optimization, establishing a straightforward RL framework. By scaling the context length, the model can perform planning, reflection, and correction, mimicking a more human-like approach to problem-solving. This eliminates the need for complex techniques such as Monte Carlo tree search, value functions, and process reward models.
  • Multimodality: The model is trained jointly on both text and vision data to enable reasoning across multiple modalities.

Training Methodology

The training of Kimi k1.5 follows several stages:

  1. Pretraining: The base model is initially trained using a large corpus of text and image data.
  2. Vanilla Supervised Fine-Tuning (SFT): The pretrained model is then fine-tuned on general tasks using human-annotated and rule-based data.
  3. Long-Chain-of-Thought (CoT) SFT: The model is further fine-tuned using a small high-quality dataset of long reasoning chains to prime the model to internalize complex reasoning.
  4. Reinforcement Learning (RL): The core focus of the paper is using RL with the model to enhance its reasoning and problem-solving capabilities.

The RL phase is further broken down into these steps:

  • RL Prompt Set Curation: The quality of the RL prompt set is critical for effective training. The authors ensure the prompt set is diverse, balanced in difficulty, and accurately evaluable by verifiers. This is done using automatic filters, a tagging system to categorize questions by domain, and a difficulty assessment that leverages the model's own performance to gauge the complexity of a given prompt (a sketch of this assessment follows the list). To avoid reward hacking, the prompts are designed so that both the reasoning process and the final answer can be accurately verified. Questions that can be easily guessed are removed.
  • Long-CoT Supervised Fine-Tuning: A small, high-quality dataset is constructed to warm up the model, using prompt engineering to create long reasoning paths for both text and image inputs. This dataset emphasizes key cognitive processes, such as planning, evaluation, reflection, and exploration, which helps the model to generate detailed and coherent responses.
  • Reinforcement Learning: The goal of the RL phase is to train the model to generate intermediate steps (thoughts) that bridge the gap between a problem and its solution. Instead of constructing a search tree of thoughts, they train the model to approximate this process directly.
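To make the difficulty assessment from the prompt-curation step concrete, below is a minimal sketch. It assumes hypothetical `sample_answers` and `is_correct` helpers (not from the paper) and uses the model's own pass rate over repeated samples as a proxy for difficulty:

```python
def estimate_difficulty(prompt, ground_truth, sample_answers, is_correct,
                        n_samples=10, temperature=1.0):
    """Estimate prompt difficulty from the model's own pass rate.

    `sample_answers` queries the current model n_samples times at the given
    temperature; `is_correct` is the rule-based verifier used for RL rewards.
    Both are hypothetical stand-ins for illustration.
    """
    answers = sample_answers(prompt, n=n_samples, temperature=temperature)
    pass_rate = sum(is_correct(a, ground_truth) for a in answers) / n_samples
    # Lower pass rate -> harder prompt. Prompts the model always solves add
    # little learning signal and are natural candidates for filtering.
    return 1.0 - pass_rate
```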

Policy Optimization

The model is trained using a variant of online policy mirror descent. This involves iteratively updating the model's policy using a relative entropy regularized objective. The approach can be summarized as follows:

  1. The model samples responses using the current policy.
  2. A reward model evaluates the correctness of the response.
  3. The policy is updated using the following surrogate loss:

    $L(\theta) = \mathbb{E}_{(x, y^*) \sim \mathcal{D}}\left[ \mathbb{E}_{(y, z) \sim \pi_{\theta_i}}\left[ \left( r(x, y, y^*) - \tau \log Z - \tau \log \frac{\pi_\theta(y, z \mid x)}{\pi_{\theta_i}(y, z \mid x)} \right)^2 \right]\right]$

    where:

    • $L(\theta)$ is the loss function being minimized.
    • $\mathbb{E}$ denotes the expected value.
    • $(x, y^*)$ is a problem and its ground-truth answer, sampled from the dataset $\mathcal{D}$.
    • $(y, z)$ is a sampled response and its reasoning steps, sampled from the policy $\pi_{\theta_i}$.
    • $r(x, y, y^*)$ is the reward for the response $y$ given the problem $x$ and ground-truth answer $y^*$.
    • $\tau$ is a temperature parameter that controls the level of regularization.
    • $\log Z$ is a normalization factor that controls the policy update.
    • $\pi_\theta(y, z \mid x)$ is the new policy being optimized.
    • $\pi_{\theta_i}(y, z \mid x)$ is the current reference policy.

The normalization term $\tau \log Z$ is approximated using samples, and the gradient of the loss is used to update the model parameters. This process resembles a standard policy gradient method, but it uses off-policy data and includes an $l_2$ regularization term.
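
As a rough illustration, here is a minimal PyTorch sketch of this squared surrogate loss. It assumes per-response log-probabilities under the new policy and the frozen reference policy have already been computed (tensor names are illustrative, not from the paper), and it approximates $\tau \log Z$ with the empirical mean of the sampled rewards, the simpler of the approximations mentioned in the text:

```python
import torch

def surrogate_loss(rewards, logp_new, logp_ref, tau=1.0):
    """Squared surrogate loss for one problem x with k sampled responses.

    rewards  : (k,) tensor of rewards r(x, y_j, y*).
    logp_new : (k,) tensor of log pi_theta(y_j, z_j | x) for the policy being
               optimized (carries gradients).
    logp_ref : (k,) tensor of log pi_theta_i(y_j, z_j | x) under the frozen
               reference policy (no gradients).
    tau      : temperature controlling the strength of regularization.
    """
    # tau * log Z is approximated from the same k samples; here we use the
    # empirical mean of the rewards as a simple stand-in.
    tau_log_z = rewards.mean().detach()
    residual = rewards - tau_log_z - tau * (logp_new - logp_ref.detach())
    return residual.pow(2).mean()
```

Because only `logp_new` carries gradients, minimizing this loss yields the policy-gradient-like update with an $l_2$ term described above.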

Length Penalty and Sampling Strategies

  • Length Penalty: To prevent the model from generating excessively long responses, a length penalty is introduced. This penalty reduces the reward for longer responses and promotes more concise and efficient reasoning (a code sketch follows this list). The length reward for the $i$-th sampled response is

    $\text{len\_reward}(i) = \begin{cases} \lambda & \text{if } r(x, y_i, y^*) = 1 \\ \min(0, \lambda) & \text{if } r(x, y_i, y^*) = 0 \end{cases}, \qquad \lambda = 0.5 - \frac{\text{len}(i) - \text{min\_len}}{\text{max\_len} - \text{min\_len}}$

    where:
    • $r(x, y_i, y^*)$ is the correctness of the answer (1 if correct, 0 if incorrect).
    • $\text{len}(i)$ is the length of the $i$-th sampled response.
    • $\text{min\_len}$ and $\text{max\_len}$ are the minimum and maximum lengths among the sampled responses.
    • $\lambda$ is the resulting term that rewards short responses and penalizes long ones.
  • Curriculum Sampling: The training starts with easier tasks and gradually progresses to more complex ones, allowing the model to build a solid foundation of problem-solving skills.
  • Prioritized Sampling: Problems on which the model underperforms are sampled more frequently during training to focus on areas where the model needs more improvement.
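
The length reward above can be sketched as follows; the guard against a zero-length range is an implementation detail added here, not from the paper:

```python
def length_rewards(lengths, correct):
    """Compute the length reward for each of k sampled responses.

    lengths : list of response lengths len(i).
    correct : list of 0/1 correctness flags r(x, y_i, y*).
    """
    min_len, max_len = min(lengths), max(lengths)
    span = max(max_len - min_len, 1)  # avoid division by zero if all lengths match
    rewards = []
    for length, ok in zip(lengths, correct):
        lam = 0.5 - (length - min_len) / span
        # Correct answers receive the signed bonus/penalty directly;
        # incorrect answers are never rewarded for merely being short.
        rewards.append(lam if ok else min(0.0, lam))
    return rewards
```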

Long2Short Methods

The paper also explores methods to compress the knowledge gained by long-CoT models into short-CoT models, which use fewer computational resources at inference time. These include:

  • Model Merging: This approach combines the weights of a long-CoT model and a short-CoT model by averaging them (a sketch of this and of shortest rejection sampling follows the list).
  • Shortest Rejection Sampling: The model generates multiple responses to the same problem, and the shortest correct response is selected for fine-tuning.
  • Direct Preference Optimization (DPO): Using pairwise preference data, this method trains the short-CoT model by treating the shortest correct solution as a positive sample and longer responses as negative samples.
  • Long2short RL: After the standard RL phase, a model is chosen that balances performance and token efficiency, and another RL training phase is conducted to further penalize responses that exceed the desired length.
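
As a minimal sketch of the first two long2short methods above (model merging by weight averaging and shortest rejection sampling), assuming PyTorch-style state dicts and hypothetical `generate` / `is_correct` helpers:

```python
def merge_models(long_cot_state, short_cot_state, alpha=0.5):
    """Model merging: element-wise weighted average of two compatible state dicts.

    The paper describes simple averaging; the `alpha` knob is an illustrative
    generalization (alpha=0.5 recovers plain averaging).
    """
    return {name: alpha * long_cot_state[name] + (1 - alpha) * short_cot_state[name]
            for name in long_cot_state}

def shortest_correct_response(prompt, ground_truth, generate, is_correct, n=8):
    """Shortest rejection sampling: sample n responses, keep the shortest correct one."""
    responses = [generate(prompt) for _ in range(n)]
    correct = [r for r in responses if is_correct(r, ground_truth)]
    return min(correct, key=len) if correct else None
```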

Experimental Results

The Kimi k1.5 model achieves state-of-the-art performance across several benchmarks, including those for reasoning, math, and coding. The results are detailed in tables in the paper. Notably, the long-CoT model shows significant improvements in complex reasoning tasks, while the short-CoT model, enhanced by the long2short method, outperforms other short-CoT models. The paper includes a variety of ablations, including comparisons of different optimization algorithms, sampling strategies, and model sizes, and shows that long context training and proper penalty functions are critical to effective RL training.

RL Infrastructure

The RL training system is designed for scalability and efficiency. It employs an iterative synchronous framework with:

  • Partial Rollouts: To handle long trajectories, unfinished trajectories are saved and continued in the next training iteration, which also reduces computational overhead (see the sketch after this list).
  • Hybrid Deployment: The training and inference processes are collocated on the same GPU resources using Kubernetes. This allows resources to be efficiently shared between training and inference and the system can dynamically scale the number of inference nodes when needed.
  • Code Sandbox: A secure code-execution environment is used to automatically generate test cases for coding problems; executing a model's code against these tests provides the reward signal for the RL algorithm.
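
A minimal sketch of the partial-rollout bookkeeping referenced above is shown below; the `generate_tokens` and `is_finished` helpers are hypothetical stand-ins, and the continuation policy is simplified relative to the actual system:

```python
def partial_rollout_step(pending, prompts, generate_tokens, is_finished,
                         token_budget=8192):
    """Advance each trajectory by at most `token_budget` tokens per iteration.

    pending : dict mapping prompt id -> partially generated trajectory (token list).
    prompts : dict mapping prompt id -> prompt.
    Finished trajectories are returned for training; unfinished ones are saved
    and continued in the next iteration instead of being regenerated from scratch.
    """
    finished = {}
    for pid, prompt in prompts.items():
        prefix = pending.pop(pid, [])
        new_tokens = generate_tokens(prompt, prefix, max_new_tokens=token_budget)
        trajectory = prefix + new_tokens
        if is_finished(trajectory):
            finished[pid] = trajectory
        else:
            pending[pid] = trajectory  # resume from here next iteration
    return finished, pending
```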

Conclusion

This paper presents Kimi k1.5, a new multimodal LLM trained with RL, emphasizing the importance of context length scaling for continued improvements in LLMs. The authors describe their RL framework, policy optimization techniques, and the use of various sampling strategies and length penalties to improve the model's performance. By scaling the context length and optimizing the learning algorithm, they demonstrate how to achieve strong performance without relying on more complex methods. The paper also shows that the knowledge gained by long context models can be transferred to short context models, highlighting the potential for more efficient deployment of advanced AI models.
