CoMemo: LVLMs Need Image Context with Image Memory (2506.06279v1)

Published 6 Jun 2025 in cs.CV

Abstract: Recent advancements in Large Vision-Language Models (LVLMs) built upon Large Language Models (LLMs) have established aligning visual features with LLM representations as the dominant paradigm. However, inherited LLM architectural designs introduce suboptimal characteristics for multimodal processing. First, LVLMs exhibit a bimodal distribution in attention allocation, leading to the progressive neglect of middle visual content as context expands. Second, conventional positional encoding schemes fail to preserve vital 2D structural relationships when processing dynamic high-resolution images. To address these limitations, we propose CoMemo - a dual-path architecture that combines a Context image path with an image Memory path for visual processing, effectively alleviating visual information neglect. Additionally, we introduce RoPE-DHR, a novel positional encoding mechanism that employs thumbnail-based positional aggregation to maintain 2D spatial awareness while mitigating remote decay in extended sequences. Evaluations across seven benchmarks, including long-context comprehension, multi-image reasoning, and visual question answering, demonstrate CoMemo's superior performance compared to conventional LVLM architectures. Project page is available at https://lalbj.github.io/projects/CoMemo/.

Summary

  • The paper introduces a dual-path architecture that preserves image context and improves positional encoding in LVLMs.
  • Its RoPE-DHR encoding maintains 2D spatial relationships and mitigates remote decay, yielding gains on long-context tasks.
  • Evaluations across seven benchmarks show significant gains in captioning, long-generation, and multi-image reasoning.

CoMemo: Addressing Multimodal Challenges in Large Vision-Language Models

In the evolving landscape of artificial intelligence, Large Vision-Language Models (LVLMs) have become a central approach to integrating visual and linguistic data. This paper introduces CoMemo, a novel architecture designed to overcome two limitations of existing LVLM paradigms: the progressive neglect of visual content in long contexts and inadequate positional encoding for high-resolution images.

Overview and Objectives

The paper begins by acknowledging the dominant trend of aligning visual features with LLM representations for multimodal processing. However, it points out that current LVLM architectures inherit suboptimal characteristics from their underlying LLM designs. Two critical issues are identified:

  1. A bimodal attention distribution that progressively overlooks middle visual content as context expands.
  2. Ineffective preservation of vital 2D structural relationships in conventional positional encoding schemes when dealing with high-resolution images.

The primary objective of CoMemo is to address these issues with a framework that keeps visual information accessible as context grows and preserves 2D spatial structure, thereby improving LVLM performance across a range of benchmarks.

Architectural Innovations

CoMemo introduces several key innovations:

  • Dual-Path Architecture: CoMemo splits visual processing into a "context path" and an "image memory path," alleviating the neglect of visual information by letting the model revisit image features directly in addition to processing them as ordinary context alongside text (a sketch of one way to realize this follows the list).
  • RoPE-DHR Encoding: A novel positional encoding mechanism, RoPE-DHR, uses thumbnail-based positional aggregation to maintain 2D spatial relationships among high-resolution image tiles. This mitigates the decay of positional influence over long sequences, a significant weakness of conventional encoding schemes (a second sketch follows the list).
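
The dual-path idea can be made concrete with a minimal sketch. Below, a single decoder block keeps image tokens in ordinary self-attention context (the context path) while a gated cross-attention branch re-reads cached image features (the memory path). This assumes a Flamingo-style gated cross-attention memory branch as one plausible realization; the class and parameter names are hypothetical, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class DualPathBlock(nn.Module):
    """One decoder block with a context path (self-attention over the
    interleaved text-and-image token sequence) and an image-memory path
    (gated cross-attention over cached image features). Hypothetical
    sketch; names and the gating scheme are assumptions."""

    def __init__(self, dim: int, n_heads: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm_ctx = nn.LayerNorm(dim)
        self.norm_mem = nn.LayerNorm(dim)
        # Zero-initialized gate: the memory path starts as a no-op, so the
        # pretrained LLM's behavior is undisturbed early in training.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden, image_memory, attn_mask=None):
        # Context path: image tokens already sit in `hidden` as ordinary
        # context, exactly as in a conventional LVLM.
        h = self.norm_ctx(hidden)
        ctx_out, _ = self.self_attn(h, h, h, attn_mask=attn_mask)
        hidden = hidden + ctx_out
        # Memory path: every position can re-read the full set of image
        # features directly, however far away they are in the sequence,
        # which counters the neglect of middle visual content.
        h = self.norm_mem(hidden)
        mem_out, _ = self.cross_attn(h, image_memory, image_memory)
        return hidden + torch.tanh(self.gate) * mem_out
```

Because tanh(0) = 0, the block initially behaves like a plain decoder layer; how much image memory to mix back in is learned during training, a common design choice when grafting a new branch onto a pretrained model.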
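
RoPE-DHR's thumbnail-based aggregation can likewise be sketched as a position-ID assignment. In the sketch below, each high-resolution tile token reuses the position ID of the thumbnail token covering the same 2D region, so an entire image consumes only a thumbnail's worth of distinct positions. This is one plausible reading of the mechanism; the function name, the layout convention (one thumbnail followed by a grid of tiles), and the grid sizes are assumptions for illustration.

```python
def rope_dhr_position_ids(text_len_before, thumb_hw, tile_grid, tile_hw):
    """Assign RoPE position IDs so each high-res tile token reuses the ID
    of the thumbnail token covering the same 2D region. Illustrative
    sketch, not the paper's implementation.

    thumb_hw:  (h, w) of the thumbnail token grid, e.g. (16, 16)
    tile_grid: (rows, cols) of high-resolution tiles, e.g. (2, 2)
    tile_hw:   (h, w) of each tile's token grid, e.g. (16, 16)
    """
    th, tw = thumb_hw
    rows, cols = tile_grid
    ph, pw = tile_hw

    # Thumbnail tokens get consecutive IDs right after the preceding text.
    thumb_ids = [text_len_before + i for i in range(th * tw)]

    # Each high-res tile token maps back to the thumbnail token whose 2D
    # region contains it, so the image occupies only th*tw distinct
    # positions instead of rows*cols*ph*pw, curbing remote decay while
    # preserving 2D adjacency.
    tile_ids = []
    for r in range(rows):
        for c in range(cols):
            for y in range(ph):
                for x in range(pw):
                    # Patch coordinates in the full high-res grid, scaled
                    # down to thumbnail resolution.
                    gy = (r * ph + y) * th // (rows * ph)
                    gx = (c * pw + x) * tw // (cols * pw)
                    tile_ids.append(text_len_before + gy * tw + gx)
    return thumb_ids, tile_ids
```

With a 16×16 thumbnail and a 2×2 grid of 16×16 tiles, for example, the 1,280 image tokens would share only 256 distinct position IDs, so subsequent text stays positionally close to all of the visual content while spatially adjacent patches keep nearby IDs.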

Methodological Approach

The researchers evaluated CoMemo across seven benchmarks spanning long-context comprehension, multi-image reasoning, and visual question answering, where it outperformed conventional LVLM architectures. Specifically, it showed gains of 17.2% in captioning, 7.0% in long-generation, and 5.6% in long-context tasks.

Implications and Future Directions

The implications of this research are twofold:

  1. Practical Impact: CoMemo's architecture can be employed in applications requiring sophisticated vision-language integration, such as autonomous driving systems, surveillance, and AI-driven content creation.
  2. Theoretical Advancement: By addressing foundational issues in LVLM architectures, CoMemo sets a precedent for future research focused on improving multimodal interactions and information retention.

Looking forward, CoMemo paves the way for further exploration into balancing dual-path strategies and refining positional encoding techniques. As AI continues to evolve, such advancements will be crucial in developing systems capable of handling increasingly complex and high-dimensional data.

Conclusion

CoMemo represents a pivotal step towards resolving intrinsic challenges in LVLM architectures, offering promising solutions for enhanced multimodal understanding. Its success across varied benchmarks underscores the potential for architectural innovations to drive significant improvements in AI applications, reinforcing the importance of continuous exploration and development in the field of multimodal models.