- The paper introduces an 8B-parameter multimodal model that advances short-video understanding through innovative dynamic resolution handling and a multi-stage training pipeline.
- It combines a Vision Transformer and a language decoder through an MLP projector, using dynamic token allocation to preserve fine-grained visual and temporal detail.
- Evaluation on benchmarks such as Video-MMMU and KC-MMBench demonstrates state-of-the-art accuracy and robust multimodal reasoning capabilities.
Kwai Keye-VL: A Technical Analysis of a Multimodal Foundation Model for Short-Video Understanding
Kwai Keye-VL represents a significant advancement in the design and training of Multimodal Large Language Models (MLLMs), with a focus on dynamic, information-dense short-form video understanding. The model is built on an 8B-parameter architecture and leverages a large-scale, high-quality dataset together with a multi-stage training pipeline to achieve state-of-the-art performance on both video-centric and general vision-language tasks.
Model Architecture
Keye-VL adopts a modular architecture comprising a Vision Transformer (ViT) initialized from SigLIP-400M-384-14, a randomly initialized MLP projector, and a Qwen3-8B language decoder. A notable architectural innovation is the support for native dynamic resolution in the vision encoder, achieved by interpolating fixed-length position embeddings and integrating 2D Rotary Position Embeddings (RoPE). This enables the model to process images and videos at their original aspect ratios and resolutions, preserving fine-grained visual details and temporal information. For video, a dynamic token allocation strategy is employed, balancing the number of frames and visual tokens to optimize both breadth and depth of perception.
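To make the dynamic-resolution mechanism concrete, the sketch below resizes a fixed-grid ViT position embedding to an arbitrary patch grid so that images keep their native aspect ratio. This is a minimal illustration, not the released implementation: the function name, the 27x27 base grid (the 384-resolution, patch-14 SigLIP layout), and the 1152-dim embeddings are assumptions, and the real encoder additionally applies 2D RoPE.

```python
# Minimal sketch: bicubic interpolation of a learned, fixed-length ViT
# position embedding to an arbitrary (h, w) patch grid. All sizes below
# are illustrative assumptions, not values taken from the released model.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """pos_embed: (1, N, D) embeddings learned for a square base grid."""
    n, dim = pos_embed.shape[1], pos_embed.shape[2]
    base = int(n ** 0.5)                                   # e.g. 27 for a 384/14 SigLIP grid
    grid = pos_embed.reshape(1, base, base, dim).permute(0, 3, 1, 2)   # (1, D, base, base)
    grid = F.interpolate(grid, size=(h, w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(1, h * w, dim)             # (1, h*w, D)

# Example: a 448x252 image with 14-px patches gives a 32x18 token grid.
base_pe = torch.randn(1, 27 * 27, 1152)
resized_pe = interpolate_pos_embed(base_pe, h=32, w=18)    # shape (1, 576, 1152)
```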
Data Pipeline and Pre-Training
The training corpus exceeds 600B tokens, with a strong emphasis on high-quality video data. The data pipeline incorporates rigorous filtering, re-captioning using advanced MLLMs, and frame-level annotation. Six primary data categories are included: image captioning, OCR/VQA, grounding/counting, interleaved text-image, video understanding, and pure text. Data de-duplication is performed using pHash and minHash techniques, with additional CLIP-based filtering to prevent benchmark contamination.
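As an illustration of the perceptual-hash de-duplication step, the sketch below relies on the third-party `imagehash` package as a stand-in for whatever implementation the pipeline actually uses; the Hamming-distance threshold and the brute-force pairwise comparison are choices made for readability, not details from the paper.

```python
# Hedged illustration of pHash-based image de-duplication. The 6-bit
# Hamming threshold is an assumption for the example; large-scale pipelines
# would bucket hashes rather than compare every pair as done here.
from PIL import Image
import imagehash

def dedupe_by_phash(paths, max_hamming: int = 6):
    """Keep the first occurrence of each perceptually-near-duplicate image."""
    kept, seen = [], []
    for path in paths:
        h = imagehash.phash(Image.open(path))
        if all(h - prev > max_hamming for prev in seen):   # `-` is Hamming distance
            kept.append(path)
            seen.append(h)
    return kept
```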
Pre-training follows a four-stage progressive strategy, sketched in code after the list:
- Vision Encoder Adaptation: Continued pre-training of the ViT on native-resolution data using the SigLIP loss, with dynamic resolution and 2D RoPE.
- Cross-Modal Alignment: Freezing vision and LLM parameters, optimizing only the projection MLP for robust feature alignment.
- Multi-Task Pre-Training: End-to-end optimization on diverse vision-language tasks, enhancing fundamental visual understanding.
- Annealing and Model Merging: Fine-tuning on high-quality data and merging models trained on different data mixtures to reduce bias and improve robustness.
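A minimal sketch, assuming PyTorch-style submodules named `vit`, `projector`, and `llm`, of the staged freeze/unfreeze schedule and the final checkpoint-averaging merge. The actual merge recipe is not specified beyond averaging models trained on different data mixtures.

```python
# Sketch of the four-stage freeze schedule plus a uniform checkpoint merge.
# Module names and the uniform averaging weights are assumptions.
import copy

def set_stage(model, stage: str):
    trainable = {
        "vision_adaptation":   {"vit"},                      # Stage 1: ViT only (SigLIP loss)
        "cross_modal_align":   {"projector"},                # Stage 2: MLP projector only
        "multi_task_pretrain": {"vit", "projector", "llm"},  # Stage 3: end-to-end
        "annealing":           {"vit", "projector", "llm"},  # Stage 4: high-quality data
    }[stage]
    for name in ("vit", "projector", "llm"):
        for p in getattr(model, name).parameters():
            p.requires_grad = name in trainable

def merge_checkpoints(state_dicts, weights=None):
    """Weighted (default: uniform) average of homologous checkpoints."""
    weights = weights or [1.0 / len(state_dicts)] * len(state_dicts)
    merged = copy.deepcopy(state_dicts[0])
    for key in merged:
        merged[key] = sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
    return merged
```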
Post-Training: Reasoning and Alignment
Post-training is divided into two phases:
- Foundational Capability Enhancement: Supervised fine-tuning (SFT) and Mixed Preference Optimization (MPO) on a large, diverse set of multimodal QA and preference data, including human-annotated samples.
- Advanced Reasoning Stimulation: Introduction of a five-mode "cold-start" data mixture (thinking, non-thinking, auto-think, think-with-image/agentic, and high-quality video data) to teach the model when and how to reason. This is followed by Mix-Mode RL using GRPO and by iterative alignment with rejection sampling and hybrid scoring (rule-based and model-based); a sketch of the GRPO advantage computation follows this list.
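For reference, the sketch below shows the group-relative advantage at the heart of GRPO: each prompt gets a group of sampled responses, and every response's advantage is its reward normalized by its group's mean and standard deviation, so no learned value function is needed. The binary rewards in the example stand in for the paper's hybrid rule-based and model-based scoring.

```python
# Group-relative advantage as used in GRPO; rewards are placeholders.
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (num_prompts, group_size) scores for sampled responses."""
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + eps)        # per-group normalization

# Example: 8 sampled answers to one prompt, scored 0/1 by a rule-based checker.
adv = grpo_advantages(torch.tensor([[1., 0., 0., 1., 1., 0., 0., 0.]]))
```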
The agentic mode enables the model to generate code for image manipulation and computation, validated in an external sandbox, thus supporting tool-augmented reasoning.
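The sketch below is a simplified stand-in for such a sandbox: model-emitted Python runs in a separate process with a timeout, and its output is fed back into the conversation. The real system presumably adds isolation (containers, resource limits, restricted imports) that this example omits.

```python
# Simplified sandbox stand-in: run model-generated code in a subprocess
# with a timeout and capture its stdout/stderr for the next model turn.
import subprocess
import sys

def run_in_sandbox(code: str, timeout_s: float = 10.0) -> str:
    try:
        result = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout_s,
        )
        return result.stdout if result.returncode == 0 else f"ERROR: {result.stderr}"
    except subprocess.TimeoutExpired:
        return "ERROR: execution timed out"

# e.g. the model emits a short computation and reads back the printed result
print(run_in_sandbox("print(224 * 224 // (14 * 14))"))   # -> 256
```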
Training Infrastructure
Keye-VL's training infrastructure is optimized for large-scale, heterogeneous multimodal data:
- Hybrid Parallelism: Data and sequence parallelism with ZeRO optimizer for memory efficiency and communication overlap.
- Dynamic Load Balancing: FLOP-based global greedy assignment to address computational imbalance from variable input sizes (a minimal sketch follows this list).
- Sample-Level Auto-Resume: Joint checkpointing of training and data I/O state for robust fault tolerance.
- vLLM Integration: Customized for rapid sampling and video input compatibility during post-training.
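To illustrate the FLOP-based greedy assignment from the list above: samples are sorted by estimated cost and each is placed on the currently least-loaded data-parallel rank (the classic longest-processing-time heuristic). Treating visual-token count as the cost proxy is an assumption for the example.

```python
# Greedy load balancing sketch: assign the most expensive samples first,
# always to the rank with the smallest accumulated load.
import heapq

def greedy_balance(sample_costs, num_ranks):
    """sample_costs: list of (sample_id, estimated_flops). Returns rank -> ids."""
    heap = [(0.0, rank) for rank in range(num_ranks)]      # (accumulated load, rank)
    heapq.heapify(heap)
    assignment = {rank: [] for rank in range(num_ranks)}
    for sample_id, cost in sorted(sample_costs, key=lambda x: -x[1]):
        load, rank = heapq.heappop(heap)                   # least-loaded rank
        assignment[rank].append(sample_id)
        heapq.heappush(heap, (load + cost, rank))
    return assignment

# Example: mixed image/video samples with very different visual-token counts.
print(greedy_balance([("img_a", 1.0), ("vid_b", 8.0), ("vid_c", 5.0), ("img_d", 2.0)], num_ranks=2))
```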
Evaluation and Results
Keye-VL is extensively evaluated on public and proprietary benchmarks:
- General Vision-Language Tasks: Achieves SOTA or near-SOTA on MMMU, AI2D, ZeroBench, MMVP, and HallusionBench, with strong performance in mathematical reasoning and low hallucination rates.
- Video Understanding: Outperforms all open-source models on Video-MMMU, LongVideoBench, and MMVU, with a notable 8.7% absolute improvement on Video-MMMU in thinking mode.
- Short-Video Scenarios: On the newly introduced KC-MMBench (open-sourced), Keye-VL surpasses the next-best model by over 10% in average accuracy, demonstrating clear commercial application value.
- Internal Human Evaluation: On a fine-grained, multi-dimensional benchmark (covering correctness, comprehensiveness, relevance, fluency, creativity), Keye-VL leads in video and image tasks, particularly in comprehensiveness and creative narration.
Numerical Highlights
- MMMU (val): 71.4% (Keye-VL) vs. 66.8% (next best)
- KC-MMBench: 68.03% (Keye-VL) vs. 57.62% (MiMo-VL 7B-RL)
- Video-MMMU: 57.6% (Keye-VL) vs. 48.9% (InternVL3-8B)
- MathVista (MINI): 80.7% (Keye-VL) vs. 70.7% (InternVL3-8B)
Analysis and Implications
Keye-VL demonstrates that large-scale, high-quality video data and a carefully staged training pipeline are critical for robust short-video understanding. The integration of mix-mode reasoning, agentic tool use, and RL-based alignment enables the model to flexibly adapt its reasoning depth to task complexity, mitigating over-thinking and improving user experience.
The model's architecture and training strategies are directly applicable to commercial video platforms, content moderation, e-commerce attribute extraction, and video-centric recommendation systems. The open-sourcing of KC-MMBench provides a valuable resource for benchmarking real-world short-video understanding.
Limitations
- Visual Perception: The model still exhibits errors in dense OCR (especially Chinese), fine-grained recognition, and scene completeness.
- Temporal Understanding: Challenges remain in coherent temporal action description, cinematic language perception, and precise event localization.
- Higher-Order Reasoning: Reliability decreases on tasks requiring rigorous logical chains or specialized domain knowledge.
- Reward Modeling: Reliance on external MLLMs for reward signals introduces cost and reliability concerns.
Future Directions
- Video Encoder Optimization: Architectural improvements and more efficient video encoding strategies are needed.
- Enhanced Perceptual Capabilities: Further work is required to close the gap with SOTA models in fine-grained perception and "think with image" abilities.
- Reward Model Development: More reliable and efficient reward modeling strategies are an open research question.
- Generalization to Other Languages and Domains: While the model is strong in Chinese and video-centric tasks, broader multilingual and domain adaptation remains an area for exploration.
Conclusion
Kwai Keye-VL sets a new standard for multimodal foundation models in the video era, combining architectural innovations, a massive and diverse data pipeline, and a sophisticated training regimen. Its demonstrated performance on both public and proprietary benchmarks, especially in short-video understanding, positions it as a leading solution for real-world, video-centric AI applications. The technical strategies and evaluation methodologies detailed in this work provide a blueprint for future MLLM development targeting dynamic, multimodal content.