Kwai Keye-VL: 8B-Param Multimodal LLM
- Kwai Keye-VL is an 8-billion-parameter multimodal LLM that excels in short-video understanding while integrating general vision-language capabilities.
- It features a modular architecture with a Vision Transformer encoder, MLP projector, and a Qwen3-8B language decoder enhanced by 3D RoPE for temporal encoding.
- The model is trained via a multi-stage pipeline and post-training reinforcement learning, achieving superior performance on video-centric and image-based benchmarks.
Kwai Keye-VL is an 8-billion-parameter multimodal LLM (MLLM) developed specifically for leading-edge short-video understanding, while maintaining robust general-purpose vision-language abilities. Designed to address the limitations of existing MLLMs on dynamic, information-dense short-form video, Keye-VL combines a massive, video-focused dataset with a multi-stage training and post-training methodology that emphasizes both foundational and advanced reasoning skills (2507.01949).
1. Model Architecture
Kwai Keye-VL employs a modular architecture that integrates three principal components:
- Vision Transformer (ViT) Encoder: The vision module is initialized from the open-source SigLIP-400M-384-14 backbone. To accommodate images and videos at their native resolutions, the fixed-length position embeddings are interpolated and augmented with 2D Rotary Position Embeddings (RoPE), which preserves spatial arrangement; extending to 3D RoPE additionally encodes temporal relations across video frames.
- MLP Projector: A randomly initialized multi-layer perceptron (MLP) projects visual features from the vision encoder into the latent space of the language decoder, facilitating unified cross-modal processing.
- Language Decoder: The Qwen3-8B model serves as the language head, bringing strong natural language generation, comprehension, and external world knowledge to the system.
The architecture supports both image and video input. Video frames are divided into patch grids (e.g., 14×14-pixel patches), processed as token sequences, and passed through the MLP projector, which aligns and integrates them temporally before decoding into natural language. The 3D RoPE mechanism links token positions to absolute temporal information to support complex temporal reasoning in videos (e.g., "frame +1" corresponding to "+0.5 seconds").
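As a concrete illustration of this temporal indexing, the following minimal PyTorch sketch builds (temporal, height, width) position indices for video patch tokens, tying the temporal coordinate to each frame's absolute timestamp. The function name, the 0.5-second frame interval, and the tick granularity are illustrative assumptions, not the released implementation.

```python
import torch

def build_3d_position_ids(num_frames: int, h_patches: int, w_patches: int,
                          seconds_per_frame: float = 0.5,
                          ticks_per_second: float = 2.0) -> torch.Tensor:
    """Illustrative 3D (temporal, height, width) position ids for video patch tokens.

    The temporal coordinate is tied to each frame's absolute timestamp, so moving
    to "frame +1" advances time by seconds_per_frame (0.5 s here), matching the
    temporal convention described in the text.
    """
    frame_idx = torch.arange(num_frames, dtype=torch.float32)
    t = frame_idx * seconds_per_frame * ticks_per_second            # temporal ticks per frame
    t = t.view(-1, 1, 1).expand(num_frames, h_patches, w_patches)
    h = torch.arange(h_patches, dtype=torch.float32).view(1, -1, 1).expand_as(t)
    w = torch.arange(w_patches, dtype=torch.float32).view(1, 1, -1).expand_as(t)
    return torch.stack([t, h, w]).reshape(3, -1)                    # (3, num_tokens)

pos_ids = build_3d_position_ids(num_frames=4, h_patches=27, w_patches=27)
print(pos_ids.shape)  # torch.Size([3, 2916])
```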
2. Pre-Training and Data Regimen
Keye-VL is trained using over 600 billion tokens, with a distinctive focus on high-quality video content alongside large volumes of image-caption, OCR, grounding, VQA, and interleaved multimodal tasks.
The pre-training follows a four-stage sequential pipeline:
- Cross-Modal Alignment: The vision and language modules are initialized and frozen, and only the projection MLP is trained to align visual features with the textual embedding space, providing a stable common representation for subsequent fused learning (a minimal sketch of this stage appears below).
- End-to-End Multi-Task Training: All modules are unfrozen. The model is exposed to diverse tasks (captioning, OCR, VQA, grounding, interleaved tasks), which establishes robust perception and reasoning abilities across modalities.
- Annealing with High-Quality Data: Training transitions to curated, high-quality datasets, which serve to refine capabilities, particularly for reasoning and visual grounding, and to suppress negative effects from earlier low-quality data exposure.
- Model Merging: To counteract distributional bias, weights from models trained with different data mixture ratios are merged (homogeneous-heterogeneous model merging), yielding a robust, diversified parameterization.
This multi-stage recipe ensures gradual acquisition of general visual-linguistic skills before specialized refinement and debiasing.
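To make the staging concrete, the following minimal PyTorch sketch shows how the first stage can be configured: the vision encoder and language decoder are frozen, and only the MLP projector receives gradients. The module names and dimensions are placeholders for the SigLIP ViT and Qwen3-8B components, not Keye-VL's released code.

```python
import torch
import torch.nn as nn

class KeyeVLStub(nn.Module):
    """Toy stand-in for the ViT encoder + MLP projector + LM decoder stack."""
    def __init__(self, vision_dim: int = 1152, lm_dim: int = 4096):
        super().__init__()
        self.vision_encoder = nn.Linear(vision_dim, vision_dim)      # placeholder for the SigLIP ViT
        self.projector = nn.Sequential(                              # the trainable MLP bridge
            nn.Linear(vision_dim, lm_dim), nn.GELU(), nn.Linear(lm_dim, lm_dim))
        self.language_model = nn.Linear(lm_dim, lm_dim)              # placeholder for Qwen3-8B

def configure_stage1(model: KeyeVLStub) -> list:
    """Stage 1 (cross-modal alignment): freeze vision encoder and LM, train only the projector."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False
    for p in model.language_model.parameters():
        p.requires_grad = False
    return list(model.projector.parameters())

model = KeyeVLStub()
optimizer = torch.optim.AdamW(configure_stage1(model), lr=1e-3)
```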
3. Post-Training: Two-Phase Reasoning Enhancement
Post-training is divided into two phases, each tailored to a different aspect of user interaction and reasoning:
- Phase I: Non-Reasoning Supervised Fine-Tuning (SFT) & Mixed Preference Optimization (MPO)
- Over 5 million curated QA samples from 70,000 unique tasks are used to develop stable instruction-following and response generation via the TaskGalaxy framework.
- Mixed Preference Optimization leverages human preferences, paired data, self-improvement, and text-only instruction feedback to broaden alignment and reduce superficial errors (a minimal preference-loss sketch appears after this list).
- Phase II: Advanced Reasoning with Five-Mode “Cold-Start” Mixture, Reinforcement Learning (RL), and Alignment
- A “cold-start” blend of five data modes—traditional QA, long chain-of-thought (CoT), auto-reasoning, “think with image”, and high-quality video data—is used to instruct the model when to invoke deep, multi-step reasoning versus direct answering.
- RL through the GRPO algorithm rewards both answer accuracy and the internal consistency of the reasoning process (the group-relative advantage computation is sketched below).
- Iterative alignment steps involving rejection sampling and further preference optimization correct residual output defects such as repetitive or illogical responses.
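The exact MPO objective is not reproduced here; one common instantiation combines a DPO-style pairwise preference term (measured against a frozen reference model) with a standard supervised generation term. The sketch below shows only that assumed combination; the weighting and hyperparameters are illustrative.

```python
import torch
import torch.nn.functional as F

def mixed_preference_loss(logp_chosen, logp_rejected,
                          ref_logp_chosen, ref_logp_rejected,
                          sft_nll, beta: float = 0.1, sft_weight: float = 1.0):
    """Illustrative mixed objective: DPO-style preference term + generation (SFT) term.

    Inputs are per-sample summed sequence log-probabilities (and NLL for sft_nll).
    """
    # Preference term: prefer the chosen response over the rejected one,
    # relative to a frozen reference model.
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    pref_loss = -F.logsigmoid(beta * margin).mean()
    # Generation term keeps plain supervised quality on the chosen responses.
    return pref_loss + sft_weight * sft_nll.mean()
```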
The five-mode data mixture is pivotal, directly teaching the model to adapt output depth and reasoning style autonomously, including “thinking” on demand and employing code-based reasoning for image (or video) tasks.
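GRPO scores a group of responses sampled for the same prompt and normalizes each reward against the group's statistics, avoiding a separate value network. The sketch below illustrates that group-relative advantage computation together with a PPO-style clipped surrogate; the reward values and hyperparameters are purely illustrative.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages: normalize each response's reward against
    the mean/std of its group (responses sampled for the same prompt).
    rewards: (num_groups, group_size)
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)

def clipped_surrogate(logp_new, logp_old, advantages, clip_eps: float = 0.2):
    """PPO-style clipped objective applied with group-relative advantages."""
    ratio = (logp_new - logp_old).exp()
    unclipped = ratio * advantages
    clipped = ratio.clamp(1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()

# Example: 2 prompts, 4 sampled responses each; rewards could mix accuracy
# and reasoning-consistency checks as described above.
rewards = torch.tensor([[1.0, 0.0, 1.0, 0.5], [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)
```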
4. Technical Innovations
Several technical advances underpin the efficacy of Keye-VL:
- Native-Resolution and 3D RoPE: By interpolating positional embeddings and leveraging 2D/3D RoPE, Keye-VL can consume arbitrary-resolution video and images, and explicitly encode fine temporal relations for accurate short-video comprehension.
- Homogeneous-Heterogeneous Model Merging: Fusing weights from models pre-trained on different data mixtures increases parameter diversity and generalization, mitigating the overfitting risk of any single fixed mixture (a merging sketch follows this list).
- Task-Adaptive Reasoning: The model employs explicit modes for choosing whether to engage in chain-of-thought or quick-answer strategies, as trained by the five-mode “cold-start” process.
- Reinforcement Learning for Reasoning Path Quality: The RL stage, with mix-mode preference signals, ensures not only correct answers but also the logical soundness and variability (non-repetition) of responses.
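The precise merging rule is not given here; the sketch below assumes simple (optionally weighted) parameter averaging over checkpoints that share an identical architecture, which is one straightforward way to realize such weight fusion.

```python
import torch

def merge_checkpoints(state_dicts, weights=None):
    """Average parameters across checkpoints trained with different data-mixture ratios.

    state_dicts: list of plain model state_dicts with identical keys and shapes.
    weights: optional per-checkpoint mixing coefficients (default: uniform).
    """
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged

# Usage (assuming each file stores a plain state_dict):
# merged = merge_checkpoints([torch.load(p, map_location="cpu") for p in ckpt_paths])
```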
Sample LaTeX formulas, such as the Pythagorean relationship $a^2 + b^2 = c^2$ and the semicircle area $\frac{1}{2}\pi r^2$, appear in evaluation examples, demonstrating the model's facility with mathematical and geometric reasoning.
5. Empirical Evaluation
Kwai Keye-VL achieves state-of-the-art performance on a broad array of public and custom benchmarks, particularly excelling in short-form video understanding:
- Video-MMMU, TempCompass, LongVideoBench: Keye-VL either matches or surpasses competing models in video-centric reasoning and temporal comprehension.
- KC-MMBench: On this internally created Kuaishou Community Multimodal Benchmark, designed for real-world short-video scenarios, Keye-VL achieves an average accuracy of 68.03%, exceeding MiMo-VL 7B-RL’s 57.62%.
- General Vision-Language Tasks: The model remains highly competitive on mainstream image-based tasks (MMMU, AI2D, ZeroBench, OCR, math, grounding), as illustrated in the comprehensive results tables.
- Human Evaluation: Keye-VL maintains high scores for correctness, relevance, comprehensiveness, fluency, and creativity in static and dynamic tasks, confirming its breadth and depth of capability.
These results underscore the model’s ability to handle both temporally dense video and generic multimodal tasks without performance compromise.
6. Impact and Future Directions
Kwai Keye-VL establishes a technical standard for video-centric MLLMs by:
- Providing a scalable blueprint for training large models on massive, video-rich datasets.
- Validating staged pre-training, multi-mode post-training, and advanced RL-based alignment as effective strategies for flexible, context-sensitive multimodal reasoning.
- Demonstrating that video-specific architecture enhancements (such as 3D RoPE and dynamic patching) directly translate to better temporal and contextual understanding in practice.
A plausible implication is that future video-oriented foundation models will adopt similar staged training curricula, “reasoning mode” selection strategies, and data-efficient merging and alignment techniques. The development of new benchmarks, e.g., KC-MMBench, further catalyzes the creation and evaluation of such systems.
In summary, Kwai Keye-VL represents a comprehensive advance in foundation model design for dynamic short-form video understanding, with an architecture and training regime engineered for both domain-specific excellence and general-purpose versatility (2507.01949).