Kwai Keye-VL Technical Report (2507.01949v1)

Published 2 Jul 2025 in cs.CV

Abstract: While Multimodal LLMs (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce Kwai Keye-VL, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the KC-MMBench, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.

Summary

  • The paper introduces an 8B-parameter multimodal model that advances short-video understanding through innovative dynamic resolution handling and a multi-stage training pipeline.
  • It leverages a modular architecture combining a Vision Transformer, a language decoder, and dynamic token allocation to preserve fine-grained visual and temporal details.
  • Evaluation on benchmarks such as Video-MMMU and KC-MMBench demonstrates state-of-the-art accuracy and robust multimodal reasoning capabilities.

Kwai Keye-VL: A Technical Analysis of a Multimodal Foundation Model for Short-Video Understanding

Kwai Keye-VL represents a significant advancement in the design and training of Multimodal LLMs (MLLMs) with a focus on dynamic, information-dense short-form video understanding. The model is built upon an 8B-parameter architecture, leveraging a large-scale, high-quality dataset and a multi-stage training pipeline to achieve state-of-the-art performance on both video-centric and general vision-language tasks.

Model Architecture

Keye-VL adopts a modular architecture comprising a Vision Transformer (ViT) initialized from SigLIP-400M-384-14, a randomly initialized MLP projector, and a Qwen3-8B language decoder. A notable architectural innovation is the support for native dynamic resolution in the vision encoder, achieved by interpolating fixed-length position embeddings and integrating 2D Rotary Position Embeddings (RoPE). This enables the model to process images and videos at their original aspect ratios and resolutions, preserving fine-grained visual details and temporal information. For video, a dynamic token allocation strategy is employed, balancing the number of frames and visual tokens to optimize both breadth and depth of perception.
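To make the native dynamic-resolution mechanism concrete, the sketch below shows one common way to adapt a fixed-length learned position embedding to an arbitrary patch grid via bicubic interpolation. The grid sizes, embedding width, and function name are illustrative assumptions rather than the report's exact implementation, and the 2D RoPE component is omitted.

```python
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
    """Resize a fixed-length ViT position embedding to an arbitrary patch grid.

    pos_embed: (1, N, D) learned embeddings for an N = S*S square patch grid
    (e.g. S = 27 for a SigLIP-style 384px encoder with 14px patches).
    Returns a (1, grid_h * grid_w, D) embedding for a native-resolution input.
    """
    n, dim = pos_embed.shape[1], pos_embed.shape[2]
    side = int(n ** 0.5)                                              # original square grid size
    grid = pos_embed.reshape(1, side, side, dim).permute(0, 3, 1, 2)  # (1, D, S, S)
    grid = F.interpolate(grid, size=(grid_h, grid_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, grid_h * grid_w, dim)

# Example: adapt a 27x27 grid to a 36x20 grid for a tall image (illustrative sizes).
pos = torch.randn(1, 27 * 27, 1152)
print(interpolate_pos_embed(pos, 36, 20).shape)  # torch.Size([1, 720, 1152])
```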

Data Pipeline and Pre-Training

The training corpus exceeds 600B tokens, with a strong emphasis on high-quality video data. The data pipeline incorporates rigorous filtering, re-captioning using advanced MLLMs, and frame-level annotation. Six primary data categories are included: image captioning, OCR/VQA, grounding/counting, interleaved text-image, video understanding, and pure text. Data de-duplication is performed using pHash and minHash techniques, with additional CLIP-based filtering to prevent benchmark contamination.
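As a concrete illustration of the perceptual-hash de-duplication step, the following sketch keeps one representative per near-duplicate cluster using the imagehash library; the hash variant and Hamming-distance threshold are assumptions, not the paper's settings, and the minHash and CLIP-based stages are not shown.

```python
# Minimal near-duplicate image filter in the spirit of the pHash de-duplication step.
from PIL import Image
import imagehash

def dedup_by_phash(image_paths, max_hamming_distance=4):
    """Keep one representative per perceptual-hash cluster."""
    kept, kept_hashes = [], []
    for path in image_paths:
        h = imagehash.phash(Image.open(path))          # 64-bit perceptual hash
        # imagehash overloads subtraction to return the Hamming distance
        if all(h - other > max_hamming_distance for other in kept_hashes):
            kept.append(path)
            kept_hashes.append(h)
    return kept
```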

Pre-training follows a four-stage progressive strategy:

  1. Vision Encoder Adaptation: Continued pre-training of the ViT on native-resolution data using the SigLIP loss, with dynamic resolution and 2D RoPE.
  2. Cross-Modal Alignment: Freezing vision and LLM parameters, optimizing only the projection MLP for robust feature alignment.
  3. Multi-Task Pre-Training: End-to-end optimization on diverse vision-language tasks, enhancing fundamental visual understanding.
  4. Annealing and Model Merging: Fine-tuning on high-quality data and merging models trained on different data mixtures to reduce bias and improve robustness.
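A minimal sketch of how such a progressive schedule can be expressed, assuming a PyTorch-style model with hypothetical vision_encoder, projector, and llm submodules; the stage keys, the merging helper, and the uniform merge weights are illustrative, not the report's exact configuration.

```python
import torch.nn as nn

# Which submodules receive gradients in each pre-training stage (illustrative).
STAGE_TRAINABLE = {
    "vision_encoder_adaptation": {"vision_encoder"},
    "cross_modal_alignment":     {"projector"},                           # ViT and LLM frozen
    "multi_task_pretraining":    {"vision_encoder", "projector", "llm"},  # end-to-end
    "annealing":                 {"vision_encoder", "projector", "llm"},  # high-quality data
}

def configure_stage(model: nn.Module, stage: str) -> None:
    """Freeze every top-level submodule except those trainable in the given stage."""
    trainable = STAGE_TRAINABLE[stage]
    for name, module in model.named_children():
        flag = name in trainable
        for p in module.parameters():
            p.requires_grad = flag

def merge_checkpoints(state_dicts, weights=None):
    """Average homologous checkpoints trained on different data mixtures (stage 4)."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    return {k: sum(w * sd[k] for w, sd in zip(weights, state_dicts))
            for k in state_dicts[0]}
```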

Post-Training: Reasoning and Alignment

Post-training is divided into two phases:

  • Foundational Capability Enhancement: Supervised fine-tuning (SFT) and Mixed Preference Optimization (MPO) on a large, diverse set of multimodal QA and preference data, including human-annotated samples.
  • Advanced Reasoning Stimulation: Introduction of a five-mode "cold-start" data mixture—comprising thinking, non-thinking, auto-think, think-with-image (agentic), and high-quality video data—to teach the model when and how to reason. This is followed by Mix-Mode RL (using GRPO) and iterative alignment with rejection sampling and hybrid scoring (rule-based and model-based).
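For context on the Mix-Mode RL step, the sketch below shows the group-relative advantage computation that characterizes GRPO-style training, together with a toy hybrid reward that blends a rule-based check with a model-based judge score; the reward values and 50/50 weighting are illustrative, not the paper's.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (G,) scores for G responses sampled from the same prompt.

    Each response's advantage is its reward normalized by the group's mean and
    standard deviation, so no learned value function is required.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hybrid scoring: combine a rule-based check (e.g. exact-match answer)
# with a model-based judge score before computing advantages.
rule_scores  = torch.tensor([1.0, 0.0, 1.0, 0.0])
judge_scores = torch.tensor([0.8, 0.3, 0.6, 0.5])
rewards = 0.5 * rule_scores + 0.5 * judge_scores
print(grpo_advantages(rewards))
```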

The agentic mode enables the model to generate code for image manipulation and computation, validated in an external sandbox, thus supporting tool-augmented reasoning.
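The snippet below is a deliberately simplified illustration of this loop: the model emits Python that manipulates the image, and that code is executed outside the model with only the image in scope before the result is fed back. The function name and exec-based isolation are assumptions; a production sandbox would run untrusted code in a separate, resource-limited process.

```python
from PIL import Image

def run_in_sandbox(generated_code: str, image: Image.Image):
    """Run model-emitted Python with only PIL and the input image in scope (illustrative only)."""
    scope = {"Image": Image, "image": image, "result": None}
    exec(generated_code, {"__builtins__": {}}, scope)   # real systems isolate this in another process
    return scope["result"]

# e.g. the model asks to zoom into a region before answering:
code = "result = image.crop((100, 50, 400, 300)).resize((600, 500))"
dummy = Image.new("RGB", (640, 480))
print(run_in_sandbox(code, dummy).size)  # (600, 500)
```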

Training Infrastructure

Keye-VL's training infrastructure is optimized for large-scale, heterogeneous multimodal data:

  • Hybrid Parallelism: Data and sequence parallelism with ZeRO optimizer for memory efficiency and communication overlap.
  • Dynamic Load Balancing: FLOP-based global greedy assignment to address computational imbalance from variable input sizes.
  • Sample-Level Auto-Resume: Joint checkpointing of training and data I/O state for robust fault tolerance.
  • vLLM Integration: Customized for rapid sampling and video input compatibility during post-training.
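As an illustration of the FLOP-based global greedy assignment listed above, the sketch below sorts samples by estimated cost and repeatedly gives the heaviest remaining sample to the least-loaded worker; the cost estimates and data layout are assumptions.

```python
import heapq

def balance_by_flops(sample_costs, num_workers):
    """sample_costs: list of (sample_id, estimated_flops). Returns worker -> sample_ids."""
    assignment = {w: [] for w in range(num_workers)}
    heap = [(0.0, w) for w in range(num_workers)]        # (accumulated FLOPs, worker)
    heapq.heapify(heap)
    for sample_id, flops in sorted(sample_costs, key=lambda x: -x[1]):  # heaviest first
        load, worker = heapq.heappop(heap)
        assignment[worker].append(sample_id)
        heapq.heappush(heap, (load + flops, worker))
    return assignment

# Example: four variable-size samples spread across two workers.
print(balance_by_flops([("a", 9e12), ("b", 2e12), ("c", 7e12), ("d", 3e12)], 2))
```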

Evaluation and Results

Keye-VL is extensively evaluated on public and proprietary benchmarks:

  • General Vision-Language Tasks: Achieves SOTA or near-SOTA on MMMU, AI2D, ZeroBench, MMVP, and HallusionBench, with strong performance in mathematical reasoning and low hallucination rates.
  • Video Understanding: Outperforms all open-source models on Video-MMMU, LongVideoBench, and MMVU, with a notable 8.7% absolute improvement on Video-MMMU in thinking mode.
  • Short-Video Scenarios: On the newly introduced KC-MMBench (open-sourced), Keye-VL surpasses the next-best model by over 10% in average accuracy, demonstrating clear commercial application value.
  • Internal Human Evaluation: On a fine-grained, multi-dimensional benchmark (covering correctness, comprehensiveness, relevance, fluency, creativity), Keye-VL leads in video and image tasks, particularly in comprehensiveness and creative narration.

Numerical Highlights

  • MMMU (val): 71.4% (Keye-VL) vs. 66.8% (next best)
  • KC-MMBench: 68.03% (Keye-VL) vs. 57.62% (MiMo-VL 7B-RL)
  • Video-MMMU: 57.6% (Keye-VL) vs. 48.9% (InternVL3-8B)
  • MathVista (MINI): 80.7% (Keye-VL) vs. 70.7% (InternVL3-8B)

Analysis and Implications

Keye-VL demonstrates that large-scale, high-quality video data and a carefully staged training pipeline are critical for robust short-video understanding. The integration of mix-mode reasoning, agentic tool use, and RL-based alignment enables the model to flexibly adapt its reasoning depth to task complexity, mitigating over-thinking and improving user experience.

The model's architecture and training strategies are directly applicable to commercial video platforms, content moderation, e-commerce attribute extraction, and video-centric recommendation systems. The open-sourcing of KC-MMBench provides a valuable resource for benchmarking real-world short-video understanding.

Limitations

  • Visual Perception: The model still exhibits errors in dense OCR (especially Chinese), fine-grained recognition, and scene completeness.
  • Temporal Understanding: Challenges remain in coherent temporal action description, cinematic language perception, and precise event localization.
  • Higher-Order Reasoning: Reliability decreases on tasks requiring rigorous logical chains or specialized domain knowledge.
  • Reward Modeling: Reliance on external MLLMs for reward signals introduces cost and reliability concerns.

Future Directions

  • Video Encoder Optimization: Architectural improvements and more efficient video encoding strategies are needed.
  • Enhanced Perceptual Capabilities: Further work is required to close the gap with SOTA models in fine-grained perception and "think with image" abilities.
  • Reward Model Development: More reliable and efficient reward modeling strategies are an open research question.
  • Generalization to Other Languages and Domains: While the model is strong in Chinese and video-centric tasks, broader multilingual and domain adaptation remains an area for exploration.

Conclusion

Kwai Keye-VL sets a new standard for multimodal foundation models in the video era, combining architectural innovations, a massive and diverse data pipeline, and a sophisticated training regimen. Its demonstrated performance on both public and proprietary benchmarks, especially in short-video understanding, positions it as a leading solution for real-world, video-centric AI applications. The technical strategies and evaluation methodologies detailed in this work provide a blueprint for future MLLM development targeting dynamic, multimodal content.
