HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models (2505.20444v1)

Published 26 May 2025 in cs.LG and cs.CV

Abstract: Vision-Language Models (VLMs) have made significant progress in multimodal tasks. However, their performance often deteriorates in long-context scenarios, particularly long videos. While Rotary Position Embedding (RoPE) has been widely adopted for length generalization in LLMs, extending vanilla RoPE to capture the intricate spatial-temporal dependencies in videos remains an unsolved challenge. Existing methods typically allocate different frequencies within RoPE to encode 3D positional information. However, these allocation strategies mainly rely on heuristics, lacking in-depth theoretical analysis. In this paper, we first study how different allocation strategies impact the long-context capabilities of VLMs. Our analysis reveals that current multimodal RoPEs fail to reliably capture semantic similarities over extended contexts. To address this issue, we propose HoPE, a Hybrid of Position Embedding designed to improve the long-context capabilities of VLMs. HoPE introduces a hybrid frequency allocation strategy for reliable semantic modeling over arbitrarily long context, and a dynamic temporal scaling mechanism to facilitate robust learning and flexible inference across diverse context lengths. Extensive experiments across four video benchmarks on long video understanding and retrieval tasks demonstrate that HoPE consistently outperforms existing methods, confirming its effectiveness. Code is available at https://github.com/hrlics/HoPE.

Summary

HoPE: Hybrid of Position Embedding for Length Generalization in Vision-Language Models

This paper introduces HoPE (Hybrid of Position Embedding), a method aimed at enhancing the performance of Vision-Language Models (VLMs) in long-context scenarios, particularly multimodal tasks involving lengthy videos. VLMs have shown notable success in a range of applications but suffer considerable performance degradation once the input length exceeds their pre-trained limits. The paper addresses this limitation by improving how positional information is embedded.

Positional encoding methods like Rotary Position Embedding (RoPE) have proved effective for text-based LLMs. However, adapting RoPE for VLMs, especially to model spatial-temporal dependencies in videos, presents several challenges. The authors point out that existing strategies for extending RoPE to video inputs are largely heuristic and lack rigorous theoretical grounding. Typically, these methods allocate different bands of RoPE frequencies to encode spatial and temporal positions in video data, yet they fail to reliably capture semantic similarities over extended contexts.
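To make the frequency-allocation idea concrete, here is a minimal sketch of how multimodal RoPE variants carve the per-head rotary spectrum into temporal and spatial bands. The split proportions and the helper names (`rope_frequencies`, `split_multimodal_frequencies`) are illustrative assumptions for this summary, not the implementation of any particular model.

```python
import torch

def rope_frequencies(head_dim: int, base: float = 10000.0) -> torch.Tensor:
    # Standard RoPE inverse frequencies for one attention head (head_dim even),
    # ordered from highest to lowest frequency.
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def split_multimodal_frequencies(head_dim: int):
    # Illustrative split of the rotary spectrum across (t, x, y) positions.
    # The proportions below are assumptions for exposition; deciding which
    # band goes to the temporal axis is exactly the design choice the
    # paper analyzes.
    freqs = rope_frequencies(head_dim)
    n = freqs.shape[0]
    t_freqs = freqs[: n // 2]             # one band for frame indices (temporal)
    x_freqs = freqs[n // 2 : 3 * n // 4]  # remaining bands for spatial axes
    y_freqs = freqs[3 * n // 4 :]
    return t_freqs, x_freqs, y_freqs
```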

HoPE introduces a hybrid frequency allocation strategy for positional embeddings to improve semantic modeling over long contexts. The key idea is to set the lowest frequencies used for temporal modeling to zero, which the paper argues guarantees reliable semantic representation over arbitrarily long contexts. This allocation strategy is supported by a theoretical analysis that exposes flaws in the frequency allocation of previous RoPE variants. In addition, HoPE incorporates a dynamic temporal scaling mechanism that, in effect, lets the model train and infer at multiple video speeds, improving learning robustness and enabling flexible inference across varying video lengths.
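The sketch below illustrates these two mechanisms under stated assumptions: the function names (`hope_style_frequencies`, `scale_temporal_positions`), the `zero_fraction` value, and the scale set are hypothetical choices for exposition, not the exact configuration used in the paper; the repository at https://github.com/hrlics/HoPE defines the actual allocation and scaling schedule.

```python
import torch

def hope_style_frequencies(head_dim: int, zero_fraction: float = 0.25,
                           base: float = 10000.0) -> torch.Tensor:
    # Hybrid allocation sketch: start from standard RoPE frequencies and set
    # the lowest temporal frequencies to zero, so those rotary dimensions
    # apply no rotation and remain position-independent at any distance.
    freqs = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    n_zero = int(zero_fraction * freqs.shape[0])  # assumed fraction
    if n_zero > 0:
        freqs[-n_zero:] = 0.0                     # lowest frequencies -> zero
    return freqs

def scale_temporal_positions(frame_idx: torch.Tensor,
                             scales=(0.5, 1.0, 2.0)) -> torch.Tensor:
    # Dynamic temporal scaling sketch: multiply frame indices by a randomly
    # sampled factor during training so the model sees several effective
    # video speeds; at inference a fixed factor can be chosen per context
    # length. The scale set here is an assumed example.
    s = scales[torch.randint(len(scales), (1,)).item()]
    return frame_idx.float() * s
```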

The paper presents extensive empirical evaluations, demonstrating HoPE’s superiority over existing multimodal RoPE variants. In benchmarks involving long video understanding tasks, HoPE consistently outperformed existing models across varying context lengths. Notably, improvements of up to 22.23% were reported in long video retrieval tasks. These results affirm the proposed method's efficacy in addressing critical limitations faced by current VLMs when engaged in long-context video tasks.

The paper's theoretical contributions and comprehensive empirical evaluations suggest significant implications for future research in AI. The hybrid frequency allocation strategy not only provides a blueprint for improving positional embeddings in VLMs but also serves as a foundation for developing new methods for multimodal integration. These enhancements have the potential to broaden the application of VLMs to real-world scenarios where input data varies significantly in length and complexity.

In conclusion, HoPE offers a promising approach to length generalization in Vision-Language Models by enhancing spatial-temporal modeling capabilities. Future developments may explore incorporating these techniques into larger-scale models, extending their applicability across even more extensive datasets and complex tasks. As VLMs continue to advance, solutions like HoPE are poised to play a crucial role in overcoming the challenges associated with multimodal long-context information processing.
