
Kimi-VL Technical Report (2504.07491v3)

Published 10 Apr 2025 in cs.CV

Abstract: We present Kimi-VL, an efficient open-source Mixture-of-Experts (MoE) vision-LLM (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities - all while activating only 2.8B parameters in its language decoder (Kimi-VL-A3B). Kimi-VL demonstrates strong performance across challenging domains: as a general-purpose VLM, Kimi-VL excels in multi-turn agent tasks (e.g., OSWorld), matching flagship models. Furthermore, it exhibits remarkable capabilities across diverse challenging vision language tasks, including college-level image and video comprehension, OCR, mathematical reasoning, and multi-image understanding. In comparative evaluations, it effectively competes with cutting-edge efficient VLMs such as GPT-4o-mini, Qwen2.5-VL-7B, and Gemma-3-12B-IT, while surpassing GPT-4o in several key domains. Kimi-VL also advances in processing long contexts and perceiving clearly. With a 128K extended context window, Kimi-VL can process diverse long inputs, achieving impressive scores of 64.5 on LongVideoBench and 35.1 on MMLongBench-Doc. Its native-resolution vision encoder, MoonViT, further allows it to see and understand ultra-high-resolution visual inputs, achieving 83.2 on InfoVQA and 34.5 on ScreenSpot-Pro, while maintaining lower computational cost for common tasks. Building upon Kimi-VL, we introduce an advanced long-thinking variant: Kimi-VL-Thinking-2506. Developed through long chain-of-thought (CoT) supervised fine-tuning (SFT) and reinforcement learning (RL), the latest model exhibits strong long-horizon reasoning capabilities (64.0 on MMMU, 46.3 on MMMU-Pro, 56.9 on MathVision, 80.1 on MathVista, 65.2 on VideoMMMU) while obtaining robust general abilities. Code and models are publicly accessible at https://github.com/MoonshotAI/Kimi-VL.

Summary

  • The paper introduces a Mixture-of-Experts architecture that activates only 2.8B parameters in its language decoder to achieve efficiency.
  • It employs a specialized pipeline combining MoonViT for native resolution image encoding, an MLP projector, and an MoE language model for robust multimodal reasoning.
  • It uses multi-stage training, extended long-context activation, and supervised fine-tuning to deliver strong performance across OCR, math, agent tasks, and video understanding.

This technical report introduces Kimi-VL, an open-source vision-language model (VLM) built on a Mixture-of-Experts (MoE) architecture. The goal is to provide a model that is computationally efficient (activating only 2.8B parameters in its language decoder for the Kimi-VL-A3B variant) while delivering strong performance in multimodal reasoning, long-context understanding, and agent capabilities. It aims to close the gap in open-source VLMs, particularly in scalability, efficiency, and advanced reasoning, relative to proprietary models and language-only counterparts.

Model Architecture

Kimi-VL's architecture comprises three main components:

  1. MoonViT Vision Encoder: A 400M parameter Vision Transformer designed to process images at their native resolutions without requiring complex sub-image splitting. It uses patch packing (similar to NaViT (2307.06304)) to handle variable resolutions efficiently. Positional information is encoded using interpolated absolute embeddings (from SigLIP initialization) combined with 2D Rotary Positional Embeddings (RoPE (2104.09864)) for better fine-grained detail, especially in high-resolution images. This allows processing images of different sizes within the same batch efficiently using mechanisms like FlashAttention (2205.14135).
  2. MLP Projector: A two-layer MLP bridges MoonViT and the LLM. It first applies pixel shuffle (2x2 spatial downsampling, expanding the channel dimension) to compress image features before projecting them to the LLM's embedding dimension (a minimal sketch of this step follows the diagram below).
  3. MoE LLM: Based on the Moonlight MoE LLM (2405.13901), it has 16B total parameters but only activates 2.8B per token. It was initialized from a Moonlight checkpoint pre-trained on 5.2T text tokens with an 8K context length.

+-----------------+      +-----------------+      +----------------------+
| Input Image(s)  | ---> | MoonViT Encoder | ---> | MLP Projector        |
| (Native Res)    |      | (Patch Packing, |      | (Pixel Shuffle + MLP)|
+-----------------+      | 2D RoPE)        |      +----------------------+
                         +-----------------+               |
                                                           |
                                   +-----------------------+
                                   |
+-----------------+      +---------v---------+      +-----------------+
| Input Text      | ---> | MoE LLM           | ---> | Output Text     |
| (Tokenized)     |      | (Moonlight Base)  |      | (Generated)     |
+-----------------+      | (2.8B Activated)  |      +-----------------+
                         +-------------------+
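
To make the projector step concrete, below is a minimal PyTorch sketch of the pixel-shuffle compression plus two-layer MLP. The hidden sizes, the GELU activation, and the module name `MLPProjector` are illustrative assumptions; the report only specifies 2x2 spatial downsampling with channel expansion followed by a two-layer MLP into the LLM embedding space.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Pixel-shuffle compression followed by a two-layer MLP (illustrative sketch)."""

    def __init__(self, vit_dim: int = 1152, llm_dim: int = 2048, merge: int = 2):
        super().__init__()
        self.merge = merge                      # 2x2 spatial downsampling
        in_dim = vit_dim * merge * merge        # channel dimension grows 4x
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, llm_dim),
            nn.GELU(),                          # activation choice is an assumption
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (batch, h * w, vit_dim) patch features from the vision encoder
        b, _, c = x.shape
        m = self.merge
        x = x.view(b, h, w, c)
        # Merge each m x m block of patches into one token with m*m*c channels
        x = x.view(b, h // m, m, w // m, m, c)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // m) * (w // m), m * m * c)
        return self.mlp(x)                      # (batch, h*w/4, llm_dim)

# Example: a 28x28 grid of patch features compressed to 14x14 LLM tokens
feats = torch.randn(1, 28 * 28, 1152)
tokens = MLPProjector()(feats, h=28, w=28)      # -> (1, 196, 2048)
```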

Training Process

The training involves several stages, optimized using an enhanced Muon optimizer (2405.13901, 2406.07402) with a distributed ZeRO-1 (2002.08880) implementation.

Pre-Training (4.4T tokens in total, following the text-only LLM pre-training):

  1. ViT Training (2T + 0.1T tokens): MoonViT is trained standalone on image-text pairs (alt text, synthetic captions, grounding, OCR) using SigLIP contrastive loss (2301.07520) and caption generation loss (similar to CoCa (2205.01917)). It starts from SigLIP weights and uses progressive resolution sampling. An alignment stage (0.1T tokens) fine-tunes only MoonViT and the projector to align with the frozen LLM, reducing initial perplexity for joint training.
  2. Joint Pre-training (1.4T tokens, 8K context): The full model (ViT, projector, LLM) is trained on a mix of text-only data (from Moonlight's distribution) and multimodal data (caption, interleaving, OCR, knowledge, video, agent). The proportion of multimodal data increases gradually to preserve language capabilities.
  3. Joint Cooldown (0.6T tokens, 8K context): Training continues on higher-quality data. Text data includes high-fidelity subsets and synthetic QA for math, knowledge, and code (using rejection sampling). Multimodal data includes QA synthesis, high-quality subset replay, and reformatted academic visual data. QA data ratio is kept low to avoid overfitting patterns.
  4. Joint Long-context Activation (0.3T tokens, 8K -> 128K context): Context length is extended in two 4x steps (8K->32K, 32K->128K) by resetting RoPE frequency. Data includes long text, long interleaved documents, long videos, and long documents. Long data constitutes 25%, with 75% replaying shorter data. NIAH tests show good recall up to 128K for both text and video (Table 2).
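
The "resetting RoPE frequency" step can be illustrated with a small sketch: recomputing the rotary inverse frequencies with a larger base so that positions far beyond the original 8K window remain distinguishable. The base values and head dimension below are illustrative assumptions, not the report's exact settings.

```python
import torch

def rope_angles(seq_len: int, head_dim: int, base: float) -> torch.Tensor:
    """RoPE angle table of shape (seq_len, head_dim // 2) used to rotate Q/K pairs."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2, dtype=torch.float32) / head_dim))
    positions = torch.arange(seq_len, dtype=torch.float32)
    return torch.outer(positions, inv_freq)

# 8K pre-training context with a small base, then a larger base for the
# 4x extension stages (8K -> 32K -> 128K); the numbers are illustrative only.
short_ctx = rope_angles(seq_len=8_192, head_dim=128, base=50_000.0)
long_ctx = rope_angles(seq_len=131_072, head_dim=128, base=800_000.0)

# With the larger base, low-frequency dimensions rotate slowly enough that
# distant positions do not alias within the 128K window.
print(short_ctx.shape, long_ctx.shape)  # (8192, 64), (131072, 64)
```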

Post-Training:

  1. Joint Supervised Fine-tuning (SFT): The base model is fine-tuned for instruction following and dialogue using the ChatML format on a mix of text-only and multimodal SFT data. Only answers and special tokens are supervised. Training involves two epochs: first at 32K sequence length, then at 128K. Multiple examples are packed into sequences.
  2. Long-CoT SFT (for Kimi-VL-Thinking): A lightweight SFT stage uses a small, high-quality dataset of long Chain-of-Thought reasoning paths (generated via prompt engineering and filtering, resembling rejection sampling) covering planning, evaluation, reflection, and exploration. This primes the model for better reasoning.
  3. Reinforcement Learning (RL) (for Kimi-VL-Thinking): A variant of online policy mirror descent is used, optimizing the policy $\pi_\theta$ against a reference policy $\pi_{\theta_i}$ with KL regularization:

     $$\max_\theta \; \mathbb{E}_{(x, y^*) \sim \mathcal{D}}\left[ \mathbb{E}_{(y, z) \sim \pi_\theta} \left[ r(x, y, y^*) \right] - \tau\, \mathrm{KL}\big( \pi_{\theta}(x) \,\|\, \pi_{\theta_i}(x) \big) \right]$$

     where $r$ is a binary reward against the ground truth $y^*$. The recipe incorporates a length penalty against overthinking, curriculum sampling, and prioritized sampling based on difficulty and success rates. This aims to internalize complex reasoning procedures.
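
The objective can be read as "maximize the expected binary reward while staying close to the reference policy". Below is a minimal sketch of estimating that regularized objective from sampled responses, under simplifying assumptions (sequence-level log-probabilities and a Monte Carlo KL estimate); it is not the report's actual policy mirror descent implementation.

```python
import torch

def regularized_objective(
    rewards: torch.Tensor,      # r(x, y, y*) in {0, 1} for each sampled response y
    logp_policy: torch.Tensor,  # log pi_theta(y | x), summed over response tokens
    logp_ref: torch.Tensor,     # log pi_theta_i(y | x) under the frozen reference
    tau: float = 0.1,           # KL strength; the value here is illustrative
) -> torch.Tensor:
    """Monte Carlo estimate of E[r] - tau * KL(pi_theta || pi_theta_i).

    Simplified sketch: the KL term is estimated from samples of pi_theta as the
    mean of log pi_theta(y|x) - log pi_theta_i(y|x); practical implementations
    typically work token-wise and fold this into the policy update.
    """
    kl_estimate = (logp_policy - logp_ref).mean()
    return rewards.float().mean() - tau * kl_estimate

# Toy example with four sampled responses for one prompt
rewards = torch.tensor([1, 0, 1, 1])
logp_policy = torch.tensor([-12.3, -15.1, -11.8, -13.0])
logp_ref = torch.tensor([-12.9, -14.7, -12.5, -13.2])
print(regularized_objective(rewards, logp_policy, logp_ref))
```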

Data Construction

High-quality data across modalities is crucial. Key data types include:

  • Caption: Open-source (LAION (2210.08401), DataComp (2401.17742)) and in-house data. Quality control, resolution variation, limited synthetic data.
  • Interleaving: Multi-image understanding, detailed knowledge, language ability maintenance. Sources: Open-source (MMC4 (2402.05106), OBELICS (2402.10513)), in-house (textbooks, webpages), synthesis. Data reordering ensures correctness.
  • OCR: Diverse sources (open, in-house), single/multi-page, clean/augmented, multilingual, various types (figures, tables, geometry, handwritten). Extensive augmentation. Multi-page data enables long document understanding.
  • Knowledge: Multimodal knowledge from textbooks, papers, web. Layout parser and OCR used. Additional text extraction pipeline for infographics. Standardized taxonomy ensures diversity.
  • Agent: Grounding and planning data. Public datasets plus in-house data from virtual machines (screenshots, actions). Action space designed for Desktop/Mobile/Web. Icon data. Human trajectories with synthesized CoT (Aguvis (2406.14833)).
  • Video: Long-context understanding and fine-grained spatio-temporal perception. Open-source and web-scale in-house data. Various durations, scenes, tasks (description, grounding). Dense captions generated for long videos (limited synthesis).
  • Text: Moonlight corpus covering English, Chinese, Code, Math/Reasoning, Knowledge. Rigorous filtering and quality control. Empirically determined sampling strategy, upsampling high-value subsets while maintaining diversity.
  • Instruction Data: Mix of text and vision-language data (approx. 1:1 token ratio). Non-reasoning data via human annotation seed -> model generation -> human ranking/refinement. Reasoning data via rejection sampling.
  • Reasoning Data: Used for Long-CoT SFT and RL. Generated by sampling reasoning trajectories from Kimi k1.5 (2407.15925) on QA pairs, then filtering using reward models and rules (rejection sampling).
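
Several of the stages above (synthetic QA in the cooldown, instruction reasoning data, and the long-CoT corpus) rely on rejection sampling: draw multiple candidate trajectories, then keep only those that pass a reward model or rule-based check. A minimal sketch with placeholder interfaces (`generate` and `is_correct` are assumptions, not APIs from the report):

```python
import random
from typing import Callable, List

def rejection_sample(
    question: str,
    generate: Callable[[str], str],          # placeholder: samples one candidate CoT + answer
    is_correct: Callable[[str, str], bool],  # placeholder: reward model / rule-based filter
    num_samples: int = 8,
    max_keep: int = 1,
) -> List[str]:
    """Keep up to `max_keep` sampled trajectories that pass the filter."""
    kept: List[str] = []
    for _ in range(num_samples):
        candidate = generate(question)
        if is_correct(question, candidate):
            kept.append(candidate)
            if len(kept) >= max_keep:
                break
    return kept

# Toy usage with stand-in functions; a real pipeline would call the model and
# a reward model / answer checker instead.
toy_generate = lambda q: f"Reasoning... answer={random.choice([41, 42, 43])}"
toy_check = lambda q, c: c.endswith("answer=42")
print(rejection_sample("What is 6 * 7?", toy_generate, toy_check))
```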

Infrastructure and Implementation

  • Storage & Data Loading: S3-compatible storage. Custom data loading system supports on-the-fly operations (shuffling, mixing, tokenization, packing), random augmentation (preserving coordinates), reproducibility, and high performance via caching. Centralized data management platform.
  • Parallelism: A 4D strategy (Data (2007.13221), Expert (2201.11992), Pipeline (1901.02959, 2104.04473), Context (2309.16039, 2306.15537)) is used. Pipeline stages are balanced. Context Parallelism splits sequences for long-context training. ZeRO-1 (2002.08880) and Selective Activation Checkpointing (1604.06174, 2205.05198) optimize memory. Achieves ~60% higher throughput than a comparable 7B dense VLM.
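
As one concrete piece of the memory budget above, selective activation checkpointing stores activations for only some sub-modules and recomputes the rest during the backward pass. A minimal PyTorch sketch (the block structure and the choice to checkpoint only the MLP are illustrative assumptions):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class Block(nn.Module):
    """Toy transformer-style block that checkpoints only its MLP sub-module."""

    def __init__(self, dim: int = 1024, checkpoint_mlp: bool = True):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.checkpoint_mlp = checkpoint_mlp

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        if self.checkpoint_mlp and self.training:
            # Activations inside self.mlp are recomputed during backward,
            # trading extra compute for lower peak memory.
            h = checkpoint(self.mlp, h, use_reentrant=False)
        else:
            h = self.mlp(h)
        return x + h

block = Block()
block.train()
out = block(torch.randn(2, 512, 1024, requires_grad=True))
out.sum().backward()
```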

Evaluation Highlights

Kimi-VL-A3B (2.8B activated LLM + 0.4B ViT) is compared against efficient VLMs (GPT-4o-mini, Qwen2.5-VL-7B, Gemma-3-12B-IT, DeepSeek-VL2) and reference models (GPT-4o).

  • Efficiency: Outperforms DeepSeek-VL2 (4.1B+0.4B activated) on most benchmarks despite fewer parameters. Outperforms Qwen2.5-VL-7B (7.6B+0.7B) on 19/24 benchmarks.
  • Strengths: Shows strong or SOTA performance (among efficient models) in:
    • OCR: InfoVQA (83.2), OCRBench (867).
    • Math: MathVista (68.7, outperforms GPT-4o).
    • Agent: ScreenSpot-Pro (34.5), OSWorld (8.22, > GPT-4o), WindowsAgentArena (10.4, > GPT-4o).
    • Long Document/Video: MMLongBench-Doc (35.1), MLVU (74.2, SOTA), LongVideoBench (64.5).
    • Video Perception: EgoSchema (78.5, > GPT-4o), VSI-Bench (37.4, > GPT-4o).
    • General/Multi-image: MMBench-EN (83.1, = GPT-4o), AI2D (84.9, > GPT-4o), BLINK (57.3).
  • Kimi-VL-Thinking: The long-CoT SFT and RL enhanced version shows significant gains on reasoning benchmarks (e.g., +15.4 on MathVision). It competes well with much larger models like QVQ-72B-Preview and even Kimi k1.5 on MathVision (36.8) and MathVista (71.3). Performance scales with increased thinking token length at inference time (Figure 10).

Limitations and Future Work

  • Size: The current size (2.8B activated) limits performance on highly specialized or complex language-dependent tasks.
  • Reasoning: While strong, reasoning capability hasn't reached theoretical limits for highly intricate tasks.
  • Long Context: Despite the 128K window, attention layer parameter limits (comparable to a 3B dense model) may restrict performance on extremely long sequences.

Future work includes scaling up model size, expanding pre-training data, and enhancing post-training algorithms (including test-time scaling for the thinking model).

Practical Implications

  • Efficiency: The MoE architecture offers a path to high performance with lower inference cost (fewer activated parameters) compared to dense models of similar total size or capability.
  • Native Resolution: MoonViT's ability to handle native resolutions simplifies preprocessing and can improve performance on tasks requiring fine visual detail, without the sub-image splitting overhead of approaches like LLaVA-OneVision (2403.18221).
  • Long Context: The 128K context window, combined with joint text/multimodal long-context training, makes it suitable for analyzing long documents (PDFs) and videos.
  • Agent Capabilities: Strong performance on OSWorld and WindowsAgentArena suggests potential for UI automation and interaction tasks.
  • Training Infrastructure: The 4D parallelism and custom data loading system demonstrate sophisticated engineering required for training such models at scale.
  • Reasoning Enhancement: The Long-CoT SFT and RL stages show a practical method to boost reasoning abilities specifically, offering a trade-off between inference time (longer thinking) and performance.
  • Open Source: The release of code and models facilitates research and application development by the community.