Kwai Keye-VL-1.5: Multimodal Video Model
- Kwai Keye-VL-1.5 is a state-of-the-art multimodal model that integrates adaptive slow–fast video encoding and extends context up to 128K tokens for detailed video understanding.
- It employs a four-stage pre-training methodology and a comprehensive post-training pipeline, including chain-of-thought data and reinforcement learning, to boost reasoning and human alignment.
- The model demonstrates competitive performance on both general vision-language tasks and video-specific benchmarks through innovative frame sampling and efficient token allocation.
Kwai Keye-VL-1.5 is a multimodal LLM (MLLM) engineered for state-of-the-art video understanding and robust multimodal reasoning. Building upon the original Keye-VL system, version 1.5 integrates several architectural and training innovations to overcome persistent challenges in comprehending dynamic, information-rich video content. The design leverages adaptive video encoding, context extension to 128K tokens, and an extensive post-training pipeline focused on reasoning enhancement and human alignment. Comprehensive evaluations validate substantial improvements in both video-centric and general multimodal benchmarks (Yang et al., 1 Sep 2025).
1. Adaptive Slow–Fast Video Encoding
Keye-VL-1.5 introduces a Slow–Fast video encoding strategy to mitigate the trade-off between spatial resolution and temporal coverage, an enduring limitation in MLLM video processing. The method segments input video frames into two distinct pathways determined by inter-frame visual similarity:
- Slow Pathway: Frames exhibiting “significant” visual change are retained at higher spatial resolution and lower temporal density, preserving critical fine-grained details. Sampling is intentionally sparse, but each selected frame is allocated a larger token budget.
- Fast Pathway: Frames that are highly similar to the most recent slow frame (i.e., >95% patch-based similarity) are treated as temporally dense but spatially coarse representations. Each "fast" frame is encoded at reduced resolution and receives roughly 30% of the token budget of a slow frame.
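The pathway assignment can be pictured with a minimal sketch, assuming per-frame patch embeddings and a mean patch-wise cosine-similarity test against the 95% threshold above; the function name, inputs, and exact criterion are illustrative rather than the released implementation.

```python
import numpy as np

def partition_slow_fast(frame_patches, similarity_threshold=0.95):
    """Route sampled frames to the slow or fast pathway by patch similarity.

    frame_patches: list of (num_patches, dim) arrays, one per sampled frame.
    Returns two lists of frame indices: slow-pathway and fast-pathway frames.
    """
    slow_idx, fast_idx = [0], []          # the first frame always starts a slow segment
    last_slow = frame_patches[0]
    for i, patches in enumerate(frame_patches[1:], start=1):
        # Mean cosine similarity between corresponding patches of this frame
        # and the most recent slow frame.
        num = (patches * last_slow).sum(axis=-1)
        den = np.linalg.norm(patches, axis=-1) * np.linalg.norm(last_slow, axis=-1)
        sim = float(np.mean(num / np.clip(den, 1e-8, None)))
        if sim > similarity_threshold:
            fast_idx.append(i)            # visually redundant -> fast pathway
        else:
            slow_idx.append(i)            # significant change -> new slow frame
            last_slow = patches
    return slow_idx, fast_idx
```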
A binary search algorithm is adopted to optimally distribute a fixed token budget (e.g., 75,000 visual tokens) across both pathways, enhancing both efficiency and representational fidelity. Special tokens, including explicit absolute timestamps, are injected to denote slow–fast boundaries and temporal segmentation, enabling effective downstream conditioning on temporal structure.
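As a hedged illustration of that allocation step, the sketch below binary-searches the per-slow-frame token allotment so the combined count stays within the budget; the 75,000-token budget and the 30% fast-frame ratio come from the description above, while the per-frame cap and helper names are assumptions.

```python
def fit_token_budget(num_slow, num_fast, total_budget=75_000,
                     fast_ratio=0.30, max_tokens_per_slow=2_048):
    """Find the largest per-slow-frame token allotment that fits the budget.

    Fast frames receive `fast_ratio` of whatever a slow frame receives.
    """
    def total_tokens(tokens_per_slow):
        return (num_slow * tokens_per_slow
                + num_fast * int(tokens_per_slow * fast_ratio))

    lo, hi = 1, max_tokens_per_slow
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if total_tokens(mid) <= total_budget:
            lo = mid          # still within budget: try a richer allotment
        else:
            hi = mid - 1      # over budget: shrink the allotment
    return lo, int(lo * fast_ratio)   # tokens per slow frame, per fast frame
```

The resulting allotments map to per-frame resolutions, with the timestamp and boundary tokens described above interleaved when the visual tokens are packed into the input sequence.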
This adaptive allocation allows Keye-VL-1.5 to process longer, more complex video streams without exceeding available computational resources, providing expanded temporal awareness while preserving key spatial features.
2. Staged Pre-Training with Long-Context Extension
The model’s training regimen employs a progressive four-stage pre-training methodology, specifically designed to bootstrap robust cross-modal alignment while gradually scaling the model's context capacity from 8K to 128K tokens:
- Cross-Modal Alignment: The LLM (Qwen3-8B) and vision encoder (SigLIP-400M-384-14) are frozen, and only the projection MLP is trained. Training on large-scale image–text data establishes initial visual–linguistic alignment.
- Multi-Task Pre-Training: With all parameters unfrozen, the model is trained on a wide variety of multimodal tasks (e.g., image captioning, OCR, grounding, visual QA) using context windows up to approximately 8K tokens. Data parallelism and ZeRO-2 optimization are employed for efficiency.
- Annealing on High-Quality Data: A fine-tuning phase addresses noise from earlier broad-spectrum data by focusing training on thoroughly curated, high-quality datasets.
- Context Window Extension: During annealing, the model’s context window is extended to 128K tokens. To achieve this, the Rotary Position Embedding (RoPE) base frequency is increased from 1,000,000 to 8,000,000 (see the sketch below), and memory management shifts to ZeRO-1 with context and pipeline parallelism, supporting the extended sequence length without prohibitive memory costs.
This progressive extension strategy prevents performance degradation associated with abrupt increases in context size and enables effective modeling of long-range dependencies in video and text sequences.
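For intuition on the context extension, the sketch below computes generic RoPE inverse frequencies under the two base values quoted above; the head dimension is an assumption, and the code is a textbook RoPE frequency computation rather than anything from the Keye-VL-1.5 release.

```python
import numpy as np

def rope_inv_freq(head_dim, base):
    """Inverse rotation frequencies used by rotary position embeddings."""
    return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

# Raising the base from 1e6 to 8e6 slows the lowest-frequency rotations by
# nearly 8x, keeping relative positions distinguishable out to ~128K tokens.
short_ctx = rope_inv_freq(head_dim=128, base=1_000_000)   # pre-training setting
long_ctx = rope_inv_freq(head_dim=128, base=8_000_000)    # 128K-context setting
print(long_ctx[-1] / short_ctx[-1])  # ≈ 0.13, i.e. ~8x slower rotation
```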
3. Comprehensive Post-Training and Reasoning Enhancement
Post-training in Keye-VL-1.5 is designed to increase both reasoning capability and alignment with human preferences through a multi-component pipeline:
A. Non-Reasoning Stage
- Supervised Fine-Tuning (SFT): Over 7.5 million multimodal QA samples are used for SFT. Data augmentation (e.g., multiple QA pairs, “trap” questions) discourages the development of trivial caption-only behaviors.
- Mixed Preference Optimization (MPO): The model is further refined on a paired-sample dataset of hundreds of thousands of human-annotated and open-source QA pairs, optimizing it to prefer accurate and complete responses.
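The report does not detail the exact MPO formulation; as a rough sketch under that caveat, a DPO-style preference term over paired (chosen, rejected) responses could look as follows, with all names illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_preference_loss(policy_chosen_logp, policy_rejected_logp,
                        ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO-style preference loss over a batch of (chosen, rejected) pairs.

    Each argument holds summed token log-probabilities of a full response
    under the trained policy or the frozen reference model, shape (batch,).
    """
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Widen the chosen-vs-rejected margin relative to the reference model.
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```

In MPO-style training this preference term is typically mixed with quality and generation (SFT) losses on the preferred responses.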
B. Reasoning Enhancement Stage
- Five-Step Chain-of-Thought (CoT) Data Construction: The “LongCoT” dataset is created via a five-step generation and quality-control process that combines MLLM generations with a two-level human-in-the-loop assessment, yielding stratified quality labels (A, B, C). Samples near the quality margin are further refined through targeted human review.
- GSPO-Based Reinforcement Learning: The Group Sequence Policy Optimization (GSPO) algorithm is implemented, optimizing the clipped sequence-level objective

  $$\mathcal{J}_{\mathrm{GSPO}}(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\Big(s_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\big(s_i(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_i\Big)\right],$$

  where $s_i(\theta) = \big(\pi_\theta(y_i \mid x)/\pi_{\theta_{\mathrm{old}}}(y_i \mid x)\big)^{1/|y_i|}$ is the sequence-level importance ratio and $\hat{A}_i$ is the rollout-specific advantage. Progressive prompt hinting is applied: for failure cases, the minimal level of hint (conceptual cue, strategic advice, tool reference, procedural step, or full solution) that elicits a correct response is recorded and used as the training signal (see the sketch after this list).
- Alignment RL: Final alignment employs reinforcement learning with rewards from three sources—rule-based, generative (distance to gold standard), and model-based (preferences learned from a dedicated preference model)—ensuring conformity in formatting, completeness, and human-preferred style.
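The clipped sequence-level objective above can be sketched as follows, assuming that summed token log-probabilities and scalar rewards for one group of rollouts are already available; the clip range shown is a generic placeholder, not the paper's value.

```python
import torch

def gspo_loss(policy_logp, old_logp, rewards, seq_lens, eps=0.2):
    """GSPO objective for one group of G rollouts of the same prompt.

    policy_logp, old_logp: summed token log-probs of each full response under
        the current policy and the rollout (old) policy, shape (G,).
    rewards: scalar reward per rollout, shape (G,).
    seq_lens: response lengths |y_i|, shape (G,), as floats.
    eps: clip range (illustrative value, not the reported setting).
    """
    # Rollout-specific, group-normalized advantage.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Length-normalized, sequence-level importance ratio s_i(theta).
    s = torch.exp((policy_logp - old_logp) / seq_lens)
    unclipped = s * adv
    clipped = torch.clamp(s, 1.0 - eps, 1.0 + eps) * adv
    # Maximize the clipped objective -> minimize its negative group mean.
    return -torch.min(unclipped, clipped).mean()
```

Under the progressive-hinting scheme, the recorded hint level would plausibly enter this loop only through the prompts used to generate rollouts, not through the loss itself.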
4. Evaluation Protocols and Empirical Outcomes
Keye-VL-1.5 has been assessed across a spectrum of public and proprietary benchmarks:
- General Vision-Language Tasks: On standard benchmarks (OpenCompass, MMMU, AI2D), Keye-VL-1.5 achieves top or near-top scores among comparable 8-billion-parameter models.
- Video Understanding: On video-specific leaderboards (e.g., VideoMME), the Slow–Fast strategy allows the model to absorb longer sequences and demonstrate sustained improvements in accuracy as additional frames are processed, outperforming traditional 2D CNN-based or uniform-token allocation approaches.
- Internal Assessments: Using criteria such as temporal localization (e.g., timestamp inference with 0.1-second precision), recognition of complex behavioral cues, and robust scene description even under ambiguous input, the model exhibits advanced reasoning and perceptual abilities.
The following table summarizes the evaluation settings and key outcomes:
| Task Type | Benchmark / Setting | Key Outcome |
|---|---|---|
| General Multimodal | OpenCompass, MMMU, AI2D | State-of-the-art or competitive scores |
| Video Understanding | VideoMME | Sustained accuracy with longer videos |
| Internal Case Studies | Real-world video, QA, CoT | High temporal and reasoning fidelity |
This profile indicates the model’s advances in both breadth (generalization) and depth (efficacy in video-centric reasoning).
5. Innovations, Limitations, and Significance
Keye-VL-1.5's principal technical contributions include:
- Slow–Fast Video Encoding: Dynamic partitioning of video frames by inter-frame similarity and resolution, optimizing representational efficiency.
- Long-Context Handling: Context extension up to 128K tokens, achieved through staged annealing and RoPE adaptation, enabling the processing of full-length videos and lengthy multi-turn interactions.
- Comprehensive Post-Training: Multi-stage SFT, MPO, and RL pipeline incorporating chain-of-thought and hinting, yielding advanced compositional reasoning and human-centric output formatting.
Central limitations remain in calibrating the slow–fast similarity threshold for generic video domains and in managing computational resource requirements; further work on auto-tuning the threshold and improving memory efficiency is suggested. The alignment pipeline also depends on the quality and coverage of curated hinting and preference data, and scaling such pipelines to larger models or domain-specific settings may pose new challenges.
A plausible implication is that these innovations provide a pathway for effective modeling of long, information-dense video streams in practical MLLM applications, bridging gaps in both temporal and reasoning depth.
6. Relationship to Prior Work and Future Research Trajectories
Keye-VL-1.5 builds directly on the foundational Keye-VL model (Team et al., 2 Jul 2025), extending architectural elements (SigLIP-based vision backbone, Qwen3-8B decoder with 2D/3D RoPE) and training recipe (four-stage pre-training, two-phase post-training with cold-start data mixtures, mix-mode RL). Compared to previous approaches, notably ISLMs and generic MLLMs, Keye-VL-1.5 demonstrates significant improvement in handling temporal granularity and extended context.
Research directions suggested include:
- Extending adaptive video encoding (e.g., multi-scale or content-dependent token allocation).
- Further auto-tuning of long-context optimization strategies for efficiency on next-generation hardware.
- Scaling alignment and hinting datasets for generalized CoT reasoning across more domains.
- Investigating integration with attention-aware quantization techniques (such as AKVQ-VL (Su et al., 25 Jan 2025)) to further mitigate inference memory bottlenecks for long-context, multi-modal tasks.
Keye-VL-1.5 delineates a substantial advance in video-centric multimodal modeling, combining architectural adaptivity, scalable sequence modeling, and nuanced alignment to address practical and benchmarked needs in modern MLLM applications.