
Kimi k1.5: Advanced Multimodal RL Model

Updated 16 August 2025
  • Kimi k1.5 is a large multimodal language model that leverages reinforcement learning to improve chain-of-thought reasoning, long-context planning, and cross-modal integration.
  • Its exceptional context scaling up to 128,000 tokens enables coherent multi-hop reasoning and efficient processing of extended stepwise plans.
  • Innovative short chain-of-thought techniques compress detailed reasoning into concise outputs, yielding state-of-the-art results in math, coding, and vision-language benchmarks.

Kimi k1.5 is a multimodal large language model (LLM) that leverages reinforcement learning (RL) at large scale to enhance stepwise reasoning, planning, and cross-domain performance. It introduces a streamlined RL pipeline, advanced long-context handling, and a suite of reasoning-compression techniques, establishing new competitive baselines on benchmarks for mathematical, coding, and vision-language reasoning.

1. Reinforcement Learning Framework and Training Paradigm

Kimi k1.5 departs from conventional next-token prediction objectives, introducing a multi-stage RL-centric training pipeline. The overall regimen includes:

  • Pretraining: Diverse, high-quality multimodal corpora comprising English, Chinese, code, mathematical reasoning, real and synthetic visual inputs, and text-rendered images.
  • Supervised Fine-Tuning (SFT): Focused on long chain-of-thought (CoT) reasoning, i.e., sequential problem-solving demonstrations.
  • RL Phase: Iterative synchronous RL, in which the model samples intermediate CoT and final answers, maximizing an expected reward (correctness of final answer) regularized by keeping the updated policy near the current one.

Each RL policy update can be formalized as an online mirror descent step, whose closed-form target is:

$$\pi^*(y, z \mid x) = \pi_{\text{current}}(y, z \mid x)\, \exp\!\left( \frac{r(x, y, y^*)}{\tau} \right) / Z$$

where $r(x, y, y^*)$ is the reward (typically the correctness of the final answer $y$ against the ground truth $y^*$), $z$ is the sampled chain of thought, $\tau$ controls the strength of the regularization, and $Z$ is a normalization constant. The policy is updated by maximizing a surrogate loss based on empirical mean rewards. A length penalty is incorporated to encourage succinct stepwise solutions by penalizing excessive or redundant chains.
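The sketch below is a minimal, hypothetical illustration of such a surrogate update: a REINFORCE-style loss that uses the empirical mean reward over sampled responses as its baseline (in place of a learned value function) and folds in a simple length penalty. The function name, the linear penalty schedule, and the tensor shapes are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a REINFORCE-style surrogate loss with
# an empirical mean-reward baseline and a length penalty, in the spirit of the
# relative-entropy-regularized update described above.
import torch

def surrogate_loss(logprobs, rewards, lengths, max_len, length_weight=0.1):
    """
    logprobs: (K,) summed log-probabilities of K sampled (CoT, answer) pairs for one prompt
    rewards:  (K,) correctness-based rewards r(x, y, y*)
    lengths:  (K,) token counts of the sampled responses
    """
    # The empirical mean reward over the K samples serves as the baseline,
    # standing in for the learned value function that Kimi k1.5 omits.
    baseline = rewards.mean()

    # Penalize long responses to encourage succinct stepwise solutions.
    # The paper's actual penalty schedule differs; this linear form is a stand-in.
    length_penalty = length_weight * (lengths.float() / max_len)

    advantages = rewards - baseline - length_penalty

    # REINFORCE-style surrogate: gradients flow only through the log-probabilities.
    return -(advantages.detach() * logprobs).mean()
```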

Notably, standard RL ingredients such as value functions, process reward models, and Monte Carlo tree search are omitted. Instead, policy updates directly maximize reward under relative-entropy (KL) regularization toward the current policy.

Infrastructure-wise, innovations such as “partial rollouts” split overlong trajectories across RL iterations, saving unfinished segments for later continuation and thereby improving computational throughput. A hybrid Megatron and vLLM deployment strategy, together with a replay buffer and a checkpoint engine, supports both training and inference efficiency.
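A rough sketch of the partial-rollout idea follows; the data structures and the `generate_fn` interface are assumptions for illustration, not the actual Megatron/vLLM infrastructure. Each trajectory receives a fixed token budget per iteration, and unfinished trajectories are carried over to be resumed later:

```python
# Illustrative sketch of partial rollouts: cap the tokens generated per trajectory
# per RL iteration and carry unfinished trajectories over to a later iteration.
# This is a simplification of the system described in the paper.
from collections import deque

def run_rollout_iteration(generate_fn, prompts, carryover, max_new_tokens=4096):
    """generate_fn(prefix, budget) -> (new_text, finished) is an assumed wrapper
    around the inference engine; `carryover` holds trajectories truncated earlier."""
    finished, next_carryover = [], []
    work = deque(carryover)
    work.extend({"prompt": p, "generated": ""} for p in prompts)
    while work:
        traj = work.popleft()
        new_text, done = generate_fn(traj["prompt"] + traj["generated"],
                                     budget=max_new_tokens)
        traj["generated"] += new_text
        if done:
            finished.append(traj)        # complete; eligible for reward computation
        else:
            next_carryover.append(traj)  # resume from this prefix in a later iteration
    return finished, next_carryover
```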

2. Long Context Scaling and Advanced Policy Optimization

Kimi k1.5 is distinguished by its exceptional context window scaling—up to 128,000 tokens during RL—enabling the model to represent and act on long stepwise plans, revisit previous steps, and perform complex cross-referencing. This capability is vital for tasks that require extended multi-hop reasoning or coherent document-level interaction.

Policy optimization is performed via a relative-entropy-regularized objective, updated through online mirror descent. This approach balances reward maximization against excessive deviation from the sampled (“current”) policy; the regularization ensures stability and practical convergence even in the presence of long sequences and sparse rewards. The framework is compatible with off-policy data and supports reuse of previously generated samples.
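One compact way to state this objective (consistent with the closed-form update in Section 1) is

$$\max_{\pi}\; \mathbb{E}_{(y, z) \sim \pi(\cdot \mid x)}\big[ r(x, y, y^*) \big] \;-\; \tau\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\text{current}}(\cdot \mid x) \big),$$

whose exact maximizer is the Gibbs-form policy $\pi^*$ given earlier; each mirror-descent step re-solves this problem with $\pi_{\text{current}}$ set to the latest policy.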

These enhancements mitigate many of the challenges encountered in prior RL-trained LLMs, such as sample inefficiency and brittle convergence caused by unstable reward landscapes.

3. Short Chain-of-Thought (Long2Short) Methods

To address inference efficiency and practical deployment, Kimi k1.5 introduces several “long2short” techniques to distill long-form step-by-step reasoning into concise outputs without loss of accuracy. Key strategies include:

  • Model Merging: Weight-averaging a long-CoT model with a short-CoT model.
  • Shortest Rejection Sampling: Sampling each prompt multiple times and selecting the shortest correct generation for further fine-tuning (see the sketch after this list).
  • Direct Preference Optimization (DPO): Preference learning using pairs of long and short correct responses, with the shorter response preferred, to encourage concise solutions during training.
  • Long2Short Reinforcement Learning: A final RL phase with aggressive length penalties and a reduced maximum rollout length, so that short, correct solutions are disproportionately favored.
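A minimal sketch of the shortest rejection sampling step referenced above follows; `generate` and `is_correct` are assumed stand-ins for the sampling and answer-verification components rather than any published API.

```python
# Sketch of shortest rejection sampling: sample a prompt k times, keep the
# generations judged correct, and use the shortest one as a fine-tuning target.
# `generate` and `is_correct` are illustrative stand-ins, not a real API.
def shortest_correct_sample(prompt, reference_answer, generate, is_correct, k=8):
    candidates = [generate(prompt) for _ in range(k)]
    correct = [c for c in candidates if is_correct(c, reference_answer)]
    if not correct:
        return None  # no correct sample; skip this prompt for fine-tuning
    return min(correct, key=len)  # shortest correct response becomes the target
```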

These methods collectively compress the stepwise reasoning process and enable high performance at much lower token/inference cost—a critical property for production-scale LLMs.

4. Cross-Modal Reasoning and Multimodal Integration

Kimi k1.5 is trained multimodally, receiving both text and image data in pretraining, fine-tuning, and RL stages. The data mixture includes:

  • Textual reasoning in English, Chinese, code, and mathematical expressions.
  • Image-based reasoning, using both real-world photographs and synthetic images tailored to visual problem solving.
  • Text-rendered images, allowing the model to bridge visual and symbolic representations.

This diversity builds robust cross-modal representations and supports strong performance on vision-language benchmarks such as MathVista.

5. Benchmark Evaluations and Performance Characteristics

The performance of Kimi k1.5 is quantified in both long and short chain-of-thought regimes:

Regime      AIME    MATH500   Codeforces (percentile)   MathVista   LiveCodeBench
Long CoT    77.5    96.2      94                        74.9        –
Short CoT   60.8    94.6      –                         –           47.3

In long-CoT mode, performance matches leading models such as OpenAI's o1. In short-CoT mode, Kimi k1.5 outperforms GPT-4o and Claude 3.5 Sonnet, with relative improvements of up to +550% on some benchmarks.

This demonstrates that the long2short transfer and RL optimization result in both state-of-the-art accuracy and token efficiency.

6. Chain-of-Thought Reflection Mechanism and Robustness

Kimi k1.5 incorporates a reflection mechanism where the model verifies intermediate steps (self-reflection, self-correction). According to MME-CoT benchmark analyses (Jiang et al., 13 Feb 2025), this mechanism yields the highest CoT reasoning precision scores across math and logic domains, outperforming GPT-4o.

Reflection improves reasoning quality but introduces inefficiency: over 25% of reflection steps are either redundant or contain non-informative/distracting content, increasing completion times and sometimes decreasing answer focus—especially in perception-heavy tasks. The benefit is thus most pronounced in deep, stepwise multimodal reasoning rather than perception or simple retrieval.

7. Reasoning Process Characteristics and Design Trade-Offs

Analyses of Kimi k1.5's internal "thinking" patterns (Liu et al., 20 Jun 2025) reveal:

  • Extremely deep stepwise reflection (a high Total Reflection Count, TRC) in coding, but much less in domains such as medicine or finance, where the model often favors direct recall.
  • Consistency with high-quality reference models (e.g., OpenAI o1) is domain-dependent: strong in structured tasks (e.g., riddles), sporadic or rigid in others.
  • The model tends to use fixed reflection cues (e.g., repetitive "let me check"), sacrificing adaptability and variance in open-ended or abstract domains.

This strategy serves computational efficiency and robustness where explicit derivation is critical (e.g., programming, math), but limits generalization and nuance.

Recommendations for future improvements include enhancing the diversity of reflective cues, increasing training diversity in underperforming domains, and adopting multi-objective RL to jointly maximize depth, breadth, and output consistency.


Kimi k1.5 synthesizes advanced RL training, large-scale multimodal representation, long-context planning, and efficient reasoning distillation, setting new standards in compositional reasoning and practical LLM deployment. The trade-offs between stepwise robustness, computational efficiency, and adaptability underscore the design tensions at the frontier of RL-based LLM research (Team et al., 22 Jan 2025, Jiang et al., 13 Feb 2025, Liu et al., 20 Jun 2025).