- The paper introduces an 8B-parameter multimodal language model featuring a novel SlowFast video encoding strategy, achieving state-of-the-art performance on video benchmarks.
- It employs a progressive pre-training pipeline that extends context to 128K tokens and integrates rigorous post-training with reasoning and human-preference alignment.
- Empirical evaluations reveal significant improvements in temporal reasoning and precise object localization, validating the model's robust multimodal capabilities.
Kwai Keye-VL-1.5: Technical Overview and Empirical Analysis
Introduction
Kwai Keye-VL-1.5 is an 8B-parameter Multimodal LLM (MLLM) designed to address the persistent challenges in video understanding, including the trade-off between spatial resolution and temporal coverage, long-context processing, and robust multimodal reasoning. The model introduces three principal innovations: (1) a SlowFast video encoding strategy for adaptive frame selection and resource allocation, (2) a progressive pre-training pipeline extending the context length to 128K tokens, and (3) a comprehensive post-training regime focused on reasoning and human preference alignment. Empirical results demonstrate state-of-the-art performance on video-centric benchmarks and competitive results on general multimodal tasks.
Figure 1: Keye-VL-1.5-8B establishes new state-of-the-art performance among models of similar scale, demonstrating superior results on video-centric benchmarks while maintaining competitive performance on general multimodal and reasoning tasks.
Model Architecture
Keye-VL-1.5 is built upon the Qwen3-8B LLM and leverages a vision encoder initialized from SigLIP-400M-384-14. The architecture supports native dynamic resolution, representing each image as a sequence of 14x14 patches while preserving its original aspect ratio. A simple MLP layer merges visual tokens, and 3D Rotary Position Embedding (RoPE) is employed for unified processing of text, image, and video modalities.
Figure 2: The Kwai Keye-VL-1.5 model architecture integrates Qwen3-8B and SigLIP, supporting SlowFast video encoding and native dynamic resolution with 3D RoPE for unified multimodal processing.
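To make the native dynamic-resolution pathway concrete, the sketch below splits an image of arbitrary size into 14x14 patches and merges visual tokens with a small MLP projector. This is a minimal illustration, not the released implementation: the 2x2 merge factor, hidden sizes, and module names are assumptions.

```python
# Minimal sketch of native dynamic-resolution patchification and token merging.
# Assumptions: 2x2 spatial token merge, SigLIP hidden size 1152, Qwen3-8B hidden size 4096.
import torch
import torch.nn as nn

PATCH = 14          # ViT patch size
MERGE = 2           # assumed spatial merge factor (2x2 visual tokens -> 1 LLM token)
VIT_DIM = 1152      # assumed SigLIP hidden size
LLM_DIM = 4096      # assumed Qwen3-8B hidden size

def patchify(image: torch.Tensor) -> torch.Tensor:
    """Split a (C, H, W) image into flattened 14x14 patches; only rounds H and W
    down to multiples of 14, so the native aspect ratio is preserved."""
    c, h, w = image.shape
    image = image[:, : h - h % PATCH, : w - w % PATCH]
    p = image.unfold(1, PATCH, PATCH).unfold(2, PATCH, PATCH)   # (C, Hp, Wp, 14, 14)
    return p.permute(1, 2, 0, 3, 4).reshape(-1, c * PATCH * PATCH)

class TokenMerger(nn.Module):
    """MLP projector that merges each MERGE x MERGE block of visual tokens."""
    def __init__(self):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(VIT_DIM * MERGE * MERGE, LLM_DIM),
            nn.GELU(),
            nn.Linear(LLM_DIM, LLM_DIM),
        )

    def forward(self, vit_tokens: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        x = vit_tokens.reshape(grid_h // MERGE, MERGE, grid_w // MERGE, MERGE, VIT_DIM)
        x = x.permute(0, 2, 1, 3, 4).reshape(-1, MERGE * MERGE * VIT_DIM)
        return self.mlp(x)

img = torch.rand(3, 336, 252)                    # arbitrary aspect ratio
print(patchify(img).shape)                       # torch.Size([432, 588]) = (24*18, 3*14*14)
vit_out = torch.randn(24 * 18, VIT_DIM)          # stand-in for encoder outputs
print(TokenMerger()(vit_out, 24, 18).shape)      # torch.Size([108, 4096])
```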
Vision Encoder
The vision encoder is adapted for native-resolution processing, using interpolation to extend fixed-length position embeddings and integrating 2D RoPE for improved extrapolation. FlashAttention and NaViT packing are utilized for efficient training across varying image resolutions. The encoder is pre-trained on 500B tokens from DataComp, LAION, CC12M, PD12M, COCO, and proprietary datasets.
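The position-embedding extension can be illustrated with the standard ViT resampling trick: the learned embedding grid is interpolated to whatever patch grid a native-resolution input produces. The sketch below assumes a square learned grid and bicubic resampling; the paper's exact scheme may differ.

```python
# Hedged sketch: resample fixed-length learned position embeddings to a new patch grid.
import torch
import torch.nn.functional as F

def interpolate_pos_embed(pos_embed: torch.Tensor, new_h: int, new_w: int) -> torch.Tensor:
    """pos_embed: (N, D) embeddings learned for a sqrt(N) x sqrt(N) patch grid.
    Returns (new_h * new_w, D) embeddings resampled for the target grid."""
    n, d = pos_embed.shape
    old = int(n ** 0.5)
    grid = pos_embed.reshape(1, old, old, d).permute(0, 3, 1, 2)     # (1, D, old, old)
    grid = F.interpolate(grid, size=(new_h, new_w), mode="bicubic", align_corners=False)
    return grid.permute(0, 2, 3, 1).reshape(new_h * new_w, d)

# Example: stretch the 27x27 grid of SigLIP-384-14 (384 / 14 ~= 27) to 40x22 for a tall image.
pe = torch.randn(27 * 27, 1152)
print(interpolate_pos_embed(pe, 40, 22).shape)    # torch.Size([880, 1152])
```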
SlowFast Video Encoding
The SlowFast strategy dynamically allocates computational resources based on inter-frame similarity. Key frames with significant changes are processed at higher resolution (Slow pathway), while static frames are handled at lower resolution but higher temporal coverage (Fast pathway). Patch-based similarity functions identify frame types, and token budgets are allocated adaptively.
Figure 3: SlowFast video encoding: Slow processes fewer frames at higher resolution, Fast handles more frames at lower resolution.
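The routing logic behind SlowFast encoding can be sketched as follows: each frame is compared patch-wise against the most recent key frame, and frames that fall below a similarity threshold start a new Slow (high-resolution) segment while near-duplicates go to the Fast pathway. The cosine-similarity measure and the 0.9 threshold are illustrative assumptions rather than the paper's exact criterion.

```python
# Hedged sketch of SlowFast frame routing via patch-based inter-frame similarity.
import torch
import torch.nn.functional as F

def patch_similarity(a: torch.Tensor, b: torch.Tensor, patch: int = 14) -> float:
    """Mean cosine similarity between corresponding 14x14 patches of two (C, H, W) frames."""
    def to_patches(x: torch.Tensor) -> torch.Tensor:
        c, h, w = x.shape
        x = x[:, : h - h % patch, : w - w % patch]
        p = x.unfold(1, patch, patch).unfold(2, patch, patch)
        return p.permute(1, 2, 0, 3, 4).reshape(-1, c * patch * patch)
    return F.cosine_similarity(to_patches(a), to_patches(b), dim=-1).mean().item()

def route_frames(frames, threshold: float = 0.9):
    """Label each frame 'slow' (key frame: high resolution, larger token budget)
    or 'fast' (static frame: low resolution, smaller token budget)."""
    labels, key = [], None
    for frame in frames:
        if key is None or patch_similarity(frame, key) < threshold:
            labels.append("slow")
            key = frame                      # new reference key frame
        else:
            labels.append("fast")
    return labels

frames = [torch.rand(3, 224, 224) for _ in range(8)]
print(route_frames(frames))
```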
Pre-Training Pipeline
The pre-training pipeline consists of four progressive stages:
- Image-Text Matching: Cross-modal alignment with frozen vision and LLM parameters, optimizing the projection MLP.
- ViT-LLM Alignment: End-to-end optimization with multi-task data, enhancing fundamental visual understanding.
- Multi-task Pre-training: Exposure to diverse tasks (captioning, OCR, grounding, VQA, interleaved data).
- Annealing and Model Merging: Fine-tuning on high-quality data, extending context length from 8K to 128K tokens, and merging models trained on different data mixtures.
Figure 4: The four-stage progressive pre-training pipeline, culminating in model merging for robustness and bias reduction.
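The stage-wise schedule amounts to a trainability switch over three sub-modules. The sketch below is an illustration under assumed module names and stage keys (not the released training code): stage 1 trains only the projection MLP, while later stages unfreeze the vision encoder and the LLM.

```python
# Hedged sketch of the per-stage freezing schedule; module names are placeholders.
import torch.nn as nn

STAGE_TRAINABLE = {
    "image_text_matching": {"projector"},                           # stage 1
    "vit_llm_alignment":   {"projector", "vision_encoder", "llm"},  # stage 2
    "multi_task":          {"projector", "vision_encoder", "llm"},  # stage 3
    "annealing":           {"projector", "vision_encoder", "llm"},  # stage 4
}

def set_stage(model: nn.Module, stage: str) -> None:
    trainable = STAGE_TRAINABLE[stage]
    for name, module in model.named_children():
        for p in module.parameters():
            p.requires_grad = name in trainable

# Toy container with the three assumed sub-modules:
toy = nn.Module()
toy.vision_encoder = nn.Linear(8, 8)
toy.projector = nn.Linear(8, 8)
toy.llm = nn.Linear(8, 8)
set_stage(toy, "image_text_matching")
print([n for n, p in toy.named_parameters() if p.requires_grad])  # only projector.*
```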
Post-Training: Reasoning and Alignment
Post-training is divided into non-reasoning (SFT + MPO) and reasoning stages. SFT leverages a 7.5M-sample pool, with TaskGalaxy providing diverse multimodal tasks; MPO then optimizes the model on paired preference data. The reward model is trained on Keye-VL-Preview outputs and supports both think and no-think evaluation modes.
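The exact MPO objective is not reproduced here; as a rough illustration of how paired preference data drives optimization, the following is a generic DPO-style pairwise preference loss over summed log-probabilities of chosen and rejected responses, with beta and the example values as assumptions.

```python
# Hedged sketch: generic pairwise preference loss (DPO-style), not the paper's exact MPO loss.
import torch
import torch.nn.functional as F

def pairwise_preference_loss(policy_chosen_logp, policy_rejected_logp,
                             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """All inputs are summed token log-probabilities with shape (batch,)."""
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()

loss = pairwise_preference_loss(torch.tensor([-12.0]), torch.tensor([-20.0]),
                                torch.tensor([-13.0]), torch.tensor([-18.0]))
print(loss.item())   # ~0.55 for this toy batch
```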
LongCoT Cold-Start and Data Generation
A five-step automated pipeline generates high-quality Long Chain-of-Thought (LongCoT) data, integrating automated MLLM generation, dual-level quality assessment, human-in-the-loop refinement, and dynamic quality scoring.
Figure 5: Five-step LongCoT data generation pipeline, ensuring scalability and quality control for reasoning data.
Model Merging
Domain-specific expert models (e.g., for OCR and mathematical reasoning) are trained via SFT and then merged with the general model to address concentrated weaknesses.
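At its simplest, merging can be viewed as parameter-space averaging of the general model with the domain experts. The sketch below assumes uniform weights and architecturally identical checkpoints, which may differ from the paper's actual merging recipe.

```python
# Hedged sketch: uniform parameter averaging of a general model and domain-expert models.
import torch

def merge_state_dicts(state_dicts, weights=None):
    """Average a list of compatible state_dicts parameter-by-parameter."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    return {key: sum(w * sd[key].float() for w, sd in zip(weights, state_dicts))
            for key in state_dicts[0]}

# Hypothetical usage:
# merged = merge_state_dicts([general.state_dict(), ocr_expert.state_dict(),
#                             math_expert.state_dict()])
# general.load_state_dict(merged)
```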
Iterative RL and Progressive Hint Sampling
General RL is performed using GSPO (Group Sequence Policy Optimization), with progressive hint sampling for hard cases. Hints are provided at five levels, from a conceptual nudge to a complete solution, so that each hard case is solved with the minimal intervention required for successful reasoning.
Figure 6: Rejection sampling in RL phase: iterative RL and SFT-RL strategies yield significant performance improvements.
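A minimal sketch of progressive hint sampling: rollouts are attempted with increasingly explicit hints and the first correct trajectory is kept, so each hard case is recorded with the weakest hint that suffices. The intermediate hint-level names and the rollout/verification interfaces are assumptions.

```python
# Hedged sketch of progressive hint sampling for hard RL cases.
import random

HINT_LEVELS = ["conceptual", "key observation", "partial derivation",
               "detailed outline", "complete solution"]     # assumed level names

def progressive_hint_sample(problem, hints, rollout, is_correct, samples_per_level=4):
    """hints: dict mapping hint level -> hint text. Tries no hint first, then escalates."""
    for level in [None] + HINT_LEVELS:
        prompt = problem if level is None else f"{problem}\nHint ({level}): {hints[level]}"
        for _ in range(samples_per_level):
            answer = rollout(prompt)
            if is_correct(problem, answer):
                return {"hint_level": level, "prompt": prompt, "answer": answer}
    return None   # unsolved even with the complete solution as a hint

# Toy usage with stub rollout and verifier:
hints = {lvl: f"<{lvl} hint>" for lvl in HINT_LEVELS}
result = progressive_hint_sample("2 + 2 = ?", hints,
                                 rollout=lambda p: random.choice(["3", "4"]),
                                 is_correct=lambda prob, ans: ans == "4")
print(result["hint_level"] if result else "unsolved")
```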
Training Infrastructure
A heterogeneous hybrid parallel strategy is employed: data parallelism for the ViT, and pipeline/tensor/data parallelism for the LLM. Dynamic load balancing and a scalable dataloader address computational and I/O bottlenecks, supporting training on 128K-token sequences.
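Dynamic load balancing over variable-length multimodal samples can be approximated with a greedy longest-first assignment to the least-loaded data-parallel rank, so per-rank token counts stay roughly even. This is an illustration of the idea, not the training system's actual scheduler.

```python
# Hedged sketch: greedy token-count balancing across data-parallel ranks.
import heapq

def balance_by_tokens(sample_lengths, num_ranks):
    """sample_lengths: per-sample token counts. Returns one list of sample indices per rank."""
    heap = [(0, rank) for rank in range(num_ranks)]        # (current token load, rank id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_ranks)]
    for idx in sorted(range(len(sample_lengths)),
                      key=lambda i: sample_lengths[i], reverse=True):
        load, rank = heapq.heappop(heap)
        assignment[rank].append(idx)
        heapq.heappush(heap, (load + sample_lengths[idx], rank))
    return assignment

print(balance_by_tokens([131072, 8192, 65536, 4096, 32768, 16384], num_ranks=2))
```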
Empirical Evaluation
Benchmark Results
Keye-VL-1.5 achieves SOTA or near-SOTA results on general vision-language, video, and mathematical reasoning benchmarks. On OpenCompass, MMMU, and AI2D, it outperforms open-source models of comparable scale. In video-centric tasks, it demonstrates a 6.5% absolute improvement on Video-MMMU over comparable models.

Figure 7: SlowFast and 2D convolution video encoding strategies compared on VideoMME; Keye-VL-1.5-Base exhibits strong visual understanding across diverse frame numbers and FPS.
Internal Benchmarks
A proprietary evaluation suite assesses correctness, completeness, relevance, fluency, creativity, and robustness. Keye-VL-1.5-8B achieves a composite score of 3.53, outperforming MiMoVL-7B-RL-2508 (+0.13) and Keye-VL-Preview (+0.51), with pronounced gains in reasoning and temporal understanding.
Ablation Studies
Increasing SFT and MPO data volume consistently improves performance. LongCoT cold-start and RL phases yield substantial gains, especially in mathematical reasoning. Model merging with expert models enhances specialized capabilities (e.g., OCR: 84.51% vs. MiMoVL's 81.41%).
Case Studies
- Video Grounding: Keye-VL-1.5 pinpoints object appearance with 0.1s precision.
- Content Understanding: Accurately correlates video actions and captions for causal inference.
- Content Description: Provides detailed scene descriptions, though rare phenomena (e.g., hail) may be misclassified.
Figure 8: Keye-VL-1.5 accurately pinpoints the handbag's appearance in a 26s video, with 0.1s precision.
Figure 9: Keye-VL-1.5 correlates video content and caption, inferring the purpose of the big dog's action.
Figure 10: Keye-VL-1.5 describes a forest scene, correctly identifying precipitation but misclassifying hail.
Implementation Considerations
- Resource Requirements: 8B parameters, 128K context length, heterogeneous parallelism, and dynamic load balancing necessitate substantial GPU resources and optimized infrastructure.
- Scaling: Context extension and SlowFast encoding enable efficient scaling to longer videos and higher-resolution images.
- Limitations: Trade-offs exist between specialization and integration in model merging; language generation fluency lags behind factual accuracy in some domains.
- Deployment: Modular architecture and expert model merging facilitate domain adaptation and incremental updates.
Implications and Future Directions
Keye-VL-1.5 demonstrates that adaptive video encoding, long-context processing, and iterative RL with progressive hinting can substantially advance multimodal reasoning and video understanding. The model's architecture and training pipeline provide a blueprint for scalable, domain-adaptive MLLMs. Future work may focus on further improving language generation fluency, expanding expert model integration, and refining reward modeling for nuanced human preference alignment.
Conclusion
Kwai Keye-VL-1.5 establishes a new standard for video-centric multimodal models, combining architectural innovation, progressive training, and robust post-training methodologies. Its empirical results validate the effectiveness of SlowFast encoding, long-context extension, and iterative RL, offering practical solutions for next-generation multimodal AI systems. The model's modularity and adaptability position it as a strong foundation for future research in scalable, domain-specialized multimodal reasoning.