
Keye-VL-1.5: Advanced Multimodal Video Model

Updated 3 September 2025
  • Keye-VL-1.5 is a multimodal large language model designed for advanced video understanding with dynamic slow–fast encoding and ultra-long context processing.
  • The model leverages a progressive four-stage pre-training and post-training pipeline, incorporating techniques like chain-of-thought reasoning and reinforcement learning for superior spatial and temporal precision.
  • It efficiently allocates computational resources and surpasses benchmarks by balancing high-resolution spatial processing and granular temporal modeling, making it state-of-the-art in video analytics.

Keye-VL-1.5 is a multimodal LLM specifically designed for advanced video understanding tasks, featuring architectural, pre-training, and post-training innovations that address limitations in spatial-temporal reasoning and context length seen in prior MLLMs. Built upon the Qwen3-8B backbone, Keye-VL-1.5 introduces dynamic video encoding, progressive context extension to 128K tokens, and sophisticated reasoning and human alignment strategies. The model demonstrates significant improvements in handling information-dense, dynamic video inputs relative to previous Keye-VL variants and contemporary competitors.

1. Dynamic Slow–Fast Video Encoding Strategy

Keye-VL-1.5 adopts a dual-pathway "Slow–Fast" video encoding scheme to efficiently balance spatial resolution and temporal coverage. The mechanism operates as follows:

  • Frame Selection and Pathways: The encoding pipeline analyzes inter-frame similarity using a patch-based similarity metric, assigning each frame to either a "slow" or "fast" pathway.
    • The first frame is always marked as "slow."
    • For each subsequent frame, its similarity (computed patch-wise) with the last slow frame is assessed: frames showing >95% similarity are categorized as "fast," others as "slow."
    • Slow frames are processed at high spatial resolution with a higher token budget; fast frames are low-resolution with a token budget approximately 30% that of slow frames.
  • Token Budget Allocation: To meet global computational limits—e.g., a 75,000-token cap for video input—a binary search allocates individual token budgets, ensuring slow frames are adequately resourced.
  • Specialized Encoding: Additional special tokens and absolute timestamp position encodings distinguish boundaries and facilitate temporal reasoning across slow and fast segments.

This architecture improves over fixed-frame sampling schemes by flexibly adapting computational focus. Slow frames, representing significant visual changes, receive enhanced spatial attention while fast frames ensure granular temporal modeling.
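The slow–fast assignment and token budget allocation described above can be sketched as follows. This is a minimal illustrative re-implementation, not the paper's code: the patch similarity metric and `tol` value are assumptions; only the 95% threshold, the ~30% fast-frame budget ratio, and the 75,000-token cap come from the text.

```python
import numpy as np

def patch_similarity(a, b, patch=16, tol=0.05):
    """Fraction of aligned patches whose mean absolute pixel difference
    (frames scaled to [0, 1]) falls below `tol`. The exact metric used
    by Keye-VL-1.5 is not specified here; this is an illustrative stand-in."""
    h = (a.shape[0] // patch) * patch
    w = (a.shape[1] // patch) * patch
    diff = np.abs(a[:h, :w].astype(float) - b[:h, :w].astype(float))
    per_patch = diff.reshape(h // patch, patch, w // patch, patch, -1).mean(axis=(1, 3, 4))
    return (per_patch < tol).mean()

def assign_pathways(frames, threshold=0.95):
    """First frame is always slow; each later frame is 'fast' if >95% similar
    to the most recent slow frame, otherwise it becomes the new slow frame."""
    labels, last_slow = [], None
    for f in frames:
        if last_slow is None or patch_similarity(f, last_slow) <= threshold:
            labels.append("slow")
            last_slow = f
        else:
            labels.append("fast")
    return labels

def allocate_budget(labels, total_cap=75_000, fast_ratio=0.3):
    """Binary-search the largest per-slow-frame token budget such that
    slow frames plus ~30%-budget fast frames fit under the global cap."""
    n_slow, n_fast = labels.count("slow"), labels.count("fast")
    lo, hi = 1, total_cap
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if n_slow * mid + n_fast * int(mid * fast_ratio) <= total_cap:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The binary search makes the per-frame budget adaptive: a video with few scene changes (mostly fast frames) leaves more tokens for each slow frame than a rapidly cutting one.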

2. Progressive Four-Stage Pre-Training with Ultra-Long Context

Training protocol for Keye-VL-1.5 utilizes a staged methodology that systematically increases the model's context length:

  • Stage 1: Cross-Modal Alignment. The Qwen3-8B base is frozen; only the projection MLP is optimized using large-scale image–text data, establishing the initial vision-language cross-modal mapping.
  • Stage 2: Multi-Task Pre-Training. All model parameters are unfrozen for end-to-end training. Tasks include image captioning, optical character recognition, grounding, VQA, and multi-modal sequence processing, supporting broad perceptual coverage and semantic alignment.
  • Stage 3: Annealing on High-Quality Data. The model is fine-tuned at lower learning rates on curated high-quality samples, refining predictive precision and mitigating overfitting from the large-scale corpus.
  • Stage 4: Sequence Length Extension. In the final phase, the context window is increased from 8,192 to 131,072 tokens, and the RoPE (Rotary Position Embedding) base frequency is raised from 1,000,000 to 8,000,000 so that rotary wavelengths span the longer sequences. Training leverages context parallelism and pipeline parallelism, enabling efficient scaling on longer and more information-dense samples.
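The effect of the base-frequency reset follows from the standard RoPE inverse-frequency formula. The sketch below assumes a head dimension of 128 for illustration; only the 1,000,000 and 8,000,000 base values come from the text.

```python
import numpy as np

def rope_inv_freq(dim, base):
    """Standard RoPE inverse frequencies: inv_freq[j] = base^(-2j/dim)
    for j = 0, 1, ..., dim/2 - 1."""
    return base ** (-np.arange(0, dim, 2) / dim)

# Raising the base lowers every non-trivial frequency, stretching the
# rotary wavelengths so positions remain distinguishable out to 131,072
# tokens. The slowest-rotating dimension pair slows by nearly 8x.
inv_freq_short = rope_inv_freq(128, 1_000_000.0)
inv_freq_long = rope_inv_freq(128, 8_000_000.0)
```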

Empirical token mixture studies suggest an optimal allocation of context tokens—approximately 24% videos, 50% images, 26% text—for balanced multimodal learning.

3. Post-Training: Reasoning Enhancement and Human Preference Alignment

Following pre-training, Keye-VL-1.5 undergoes a dedicated post-training pipeline comprising supervised fine-tuning, preference optimization, chain-of-thought data augmentation, and reinforcement learning:

  • Non-Reasoning Stage (SFT + MPO):

The model is fine-tuned on over 7.5 million multimodal QA samples covering dialogue, grounding, counting, and interface understanding. Mixed preference optimization (MPO) is used to reinforce preferred responses by leveraging reward signals on paired high- and low-quality outputs.
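The pairwise preference component that MPO-style training builds on can be sketched as a DPO-like loss over paired outputs. This is an illustrative stand-in: the summary does not specify MPO's exact loss mixture, and `beta` is an assumed hyperparameter.

```python
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Penalize the policy when it does not prefer the chosen (high-quality)
    response over the rejected (low-quality) one, measured relative to a
    frozen reference model. Lower loss means a stronger correct preference."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))
```

When the policy and reference agree exactly, the margin is zero and the loss sits at log 2; widening the gap in favor of the chosen response drives the loss toward zero.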

  • Five-Step Chain-of-Thought (CoT) Generation:

An automated pipeline yields high-quality LongCoT reasoning data:
  1. Automated sampling of diverse prompts and candidate solution paths via MLLMs
  2. Step-wise evaluation by an MLLM judge assessing final answers and reasoning chains
  3. Quality tiering (A: high, B: moderate, C: discard), with human-in-the-loop refinement for ambiguous cases
  4. Dynamic quality scoring (1–5 scale) to emphasize superior samples
  5. Incorporation of high-scoring LongCoT examples in post-training
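The tiering and scoring steps can be expressed as a small filtering routine. This is a hypothetical sketch: the tier labels and 1–5 scale come from the pipeline described above, while the `min_score` threshold and the ordering policy are assumptions.

```python
from dataclasses import dataclass

@dataclass
class LongCoTSample:
    prompt: str
    reasoning: str
    answer: str
    tier: str     # 'A' = high, 'B' = moderate, 'C' = discard
    score: float  # dynamic quality score on the 1-5 scale

def select_for_post_training(samples, min_score=4.0):
    """Drop tier-C samples, keep A/B samples at or above the score
    threshold, and order them A-before-B, highest score first."""
    kept = [s for s in samples if s.tier in ("A", "B") and s.score >= min_score]
    return sorted(kept, key=lambda s: (s.tier, -s.score))
```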

  • Reinforcement Learning with GSPO:

The Group Sequence Policy Optimization (GSPO) algorithm is applied at the sequence level. The simplified objective:

$J_{\text{GSPO}}(\theta) = \mathbb{E}_x \left[ \frac{1}{G} \sum_{i=1}^{G} \min\left( s_i(\theta)\, \hat{A}_i,\; \text{clip}\left(s_i(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right) \hat{A}_i \right) \right]$

where $s_i(\theta)$ is the sequence-level likelihood ratio between the current and behavior policies, and $\hat{A}_i$ is a group-based advantage estimate derived from reward signals. For challenging examples, hints escalate through five levels, from conceptual cues to procedural detail, so that model reliance on external assistance decreases over time.
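The objective above can be evaluated for a single group of G responses as follows. This is a sketch: the group-relative advantage standardization and length-normalized sequence ratio follow the GSPO formulation, but the `eps` default here is illustrative rather than taken from the paper.

```python
import numpy as np

def gspo_objective(logp_new, logp_old, rewards, eps=0.2):
    """Sequence-level clipped objective for one group of G responses.

    logp_new / logp_old: (length-normalized) sequence log-likelihoods under
    the current and behavior policies; rewards: one scalar per response.
    Note that GSPO clips the whole sequence ratio, not per-token ratios."""
    rewards = np.asarray(rewards, dtype=float)
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)    # \hat{A}_i
    s = np.exp(np.asarray(logp_new) - np.asarray(logp_old))      # s_i(theta)
    clipped = np.clip(s, 1.0 - eps, 1.0 + eps)
    return float(np.mean(np.minimum(s * adv, clipped * adv)))    # J_GSPO
```

Because advantages are standardized within each group, responses are rewarded only for outperforming their siblings sampled from the same prompt, which keeps the update signal well-scaled across prompts of varying difficulty.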

  • Alignment RL Phase:

An alignment RL stage incorporates rule-based, generative, and model-based rewards to ensure outputs conform to instruction-following and stylistic preferences.

4. Benchmarking, Ablations, and Evaluation Protocols

Validation of Keye-VL-1.5 spans both public benchmarks and internal human assessments:

  • Public Benchmarking:

The model is assessed on vision-language and video-centric benchmarks such as OpenCompass, MMMU, AI2D, MMBench, Video-MMMU, TempCompass, and LongVideoBench. Keye-VL-1.5 delivers state-of-the-art video comprehension, surpassing competitors in scenario-specific accuracy and maintaining competitive performance on static-image tasks.

  • Internal Human Evaluation:

Comprehensive scoring considers correctness, completeness, relevance, fluency, and creativity. Composite scores indicate a +0.5 point improvement over prior Keye-VL-Preview and an approximately +1 point gain for fine-grained reasoning. Temporal reasoning and object grounding achieve high precision, frequently accurate to within 0.1 seconds.

  • Ablations and Case Studies:

Targeted ablations isolate the contributions of progressive training, LongCoT data augmentation, and reinforcement learning (both the general and alignment phases). Case analyses confirm advantages in precise temporal localization and interpretive scene understanding.

5. Comparative Position and Implications

Keye-VL-1.5 distinguishes itself among contemporary MLLMs via the following:

  • Superior Video Reasoning:

Performance continues to improve at higher frame counts (a later inflection point than competing models), and the visual token budget adapts more efficiently to variable frame counts, giving Keye-VL-1.5 an edge in video-specific tasks.

  • Flexible Computational Allocation:

Slow–Fast encoding improves efficiency in handling long, information-dense video streams, a notable enhancement over fixed frame-rate strategies.

  • Extended Context Processing:

The capacity for 128K-token context windows enables the model to process substantially longer and more complex multimodal documents and videos.

  • Enhanced Long-Form Reasoning and Alignment:

Extended chain-of-thought generation, GSPO-based RL, and dynamic hinting strategies contribute to robust interpretative and instruction-following abilities.

A plausible implication is that Keye-VL-1.5’s design will facilitate further research in applications requiring temporal precision, video-centric multimodal reasoning, and scalable long-context document processing.

6. Significance and Future Directions

Keye-VL-1.5’s architecture and training pipeline represent notable advancements in both video understanding and multimodal model scalability. Its chain-of-thought reasoning protocol, with minimal intervention and dynamic hinting, may serve as a template for future human-aligned LLM post-training. Context window extensions and token adaptation schemes provide a technical basis for further scaling of MLLMs in fields such as video analytics, temporally anchored retrieval, and large-document comprehension.

Continued evaluation on emerging datasets and competitive benchmarks will be needed to quantify the generalizability of Keye-VL-1.5’s innovations. There remain open questions regarding efficient deployment for real-time or resource-constrained settings, optimal token mixture ratios, and the boundaries of context length scaling. The approach paves a pathway for next-generation multimodal models targeting nuanced, temporally complex tasks in research and industry.