- The paper introduces DOVE, a dynamic vision encoder that produces variable-length image tokens using an adaptive transformer-based approach to enhance reconstruction quality.
- The paper demonstrates that DOVE achieves high-fidelity image reconstructions with fewer tokens and outperforms static tokenization methods on classification tasks.
- The paper extends its methodology with Q-DOVE, which leverages query-conditioned tokenization to improve semantic extraction for vision-language applications.
Overview of DOVE: A Dynamic Vision Encoder for Variable-Length Image Tokenization
The paper "Images are Worth Variable Length of Representations," presents a novel approach to improving the efficiency of visual tokenization by introducing DOVE, a dynamic vision encoder that adapts the representation length according to image complexity. Unlike traditional vision encoders that use a fixed sequence length for image tokenization, DOVE generates a variable number of visual tokens, offering a more flexible and information-efficient representation of images.
Core Concept and Methodology
DOVE uses a transformer-based dynamic token generator that can emit an end-of-sequence (EOS) token at any position, signaling that tokenization is complete. This adaptive approach rests on the premise that different images carry different amounts of visual information: a detailed, complex image may require more tokens than a simple image with little content.
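To make the EOS-terminated generation concrete, the following is a minimal sketch of how such a loop could look, not the paper's actual implementation. The class name, dimensions, the sigmoid stopping head, and the assumption of batch size one are all illustrative choices.

```python
import torch
import torch.nn as nn

class DynamicTokenGenerator(nn.Module):
    """Illustrative EOS-terminated token generation (sketch, not the paper's code)."""

    def __init__(self, d_model=512, max_tokens=256):
        super().__init__()
        self.max_tokens = max_tokens
        self.decoder = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.eos_head = nn.Linear(d_model, 1)  # predicts a stop probability per token
        self.start_query = nn.Parameter(torch.randn(1, 1, d_model))

    def forward(self, image_features, eos_threshold=0.5):
        # image_features: (1, N, d_model) patch embeddings from the image encoder
        # (batch size 1 is assumed to keep the sketch simple)
        tokens = self.start_query.expand(image_features.size(0), -1, -1)
        for _ in range(self.max_tokens):
            out = self.decoder(tokens, image_features)
            # stop once the model signals end-of-sequence for the newest token
            if torch.sigmoid(self.eos_head(out[:, -1])).item() > eos_threshold:
                break
            tokens = torch.cat([tokens, out[:, -1:]], dim=1)
        return tokens  # variable-length sequence, length depends on the image
```

The key point the sketch illustrates is that sequence length is decided per image at inference time, rather than fixed by the architecture.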
The model architecture consists of the following primary components:
- VQGAN Encoder and Decoder: Utilized for image tokenization and reconstruction.
- Transformer-Based Dynamic Token Generator: Allows for variable-length token sequence generation ending with an EOS token.
- Transformer-Based Token Decoder: Enables decoding of the generated tokens for image reconstruction.
To train the model, DOVE employs a joint loss function combining mean squared error (MSE), perceptual, and adversarial (GAN) losses, which together yield high-quality image reconstruction while keeping token sequence lengths in check.
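A weighted combination of these three terms might look like the sketch below. The helper names (`perceptual_net`, `discriminator`) and the loss weights are assumptions for illustration; the paper's exact weighting and perceptual backbone are not specified here.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(x, x_hat, perceptual_net, discriminator,
                        w_perc=1.0, w_gan=0.1):
    """Illustrative weighted sum of MSE, perceptual, and adversarial terms."""
    # pixel-level reconstruction error
    mse = F.mse_loss(x_hat, x)
    # perceptual loss: distance in the feature space of a frozen network (e.g. a VGG-style net)
    perc = F.mse_loss(perceptual_net(x_hat), perceptual_net(x))
    # non-saturating generator loss: encourage reconstructions the discriminator scores as real
    gan = F.softplus(-discriminator(x_hat)).mean()
    return mse + w_perc * perc + w_gan * gan
```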
Experimental Evaluations
Reconstruction Quality
The paper demonstrates superior reconstruction capabilities of DOVE compared to static tokenization methods like VQGAN and other dynamic tokenizers such as ALIT. The results show that DOVE achieves high fidelity with significantly fewer tokens, thus underscoring its computational efficiency. The model's flexibility allows it to dynamically allocate more tokens to complex images, maintaining low reconstruction loss across diverse image types.
Classification Tasks
DOVE's representations also exhibit strong performance in linear probing for image classification across multiple datasets, including CIFAR-100 and STL-10. The model consistently outperforms other state-of-the-art tokenization methods, especially at lower token counts, indicating its capability to capture richer semantic features during representation learning.
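For readers unfamiliar with the evaluation protocol, linear probing trains only a single linear classifier on top of frozen encoder features. The sketch below shows a generic version of this setup; the mean-pooling of the variable-length token sequence, the feature dimension, and the optimizer settings are assumptions, not details from the paper.

```python
import torch
import torch.nn as nn

def linear_probe(encoder, loader, num_classes, d_model=512, epochs=10, device="cuda"):
    """Train a single linear layer on frozen encoder features (standard probing protocol)."""
    probe = nn.Linear(d_model, num_classes).to(device)
    opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
    encoder.eval()
    for _ in range(epochs):
        for images, labels in loader:
            with torch.no_grad():
                # mean-pool the variable-length token sequence into one feature vector
                feats = encoder(images.to(device)).mean(dim=1)
            loss = nn.functional.cross_entropy(probe(feats), labels.to(device))
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```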
Advancements with Query-Conditioning
The researchers further extend DOVE with query-conditioned tokenization, yielding Q-DOVE. This variant generates only the tokens relevant to a given textual query, and the results show even larger efficiency gains alongside robust semantic extraction. By concentrating tokens on task-relevant regions, Q-DOVE achieves notable improvements on vision-language tasks such as visual question answering (VQA).
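One plausible way to realize query conditioning, sketched below under stated assumptions, is to inject text-query embeddings into the context that the dynamic generator attends over, so that the EOS token can be emitted earlier when the query only needs part of the image. The function names, the `text_encoder`, and the concatenation scheme are illustrative and not taken from the paper.

```python
import torch

def query_conditioned_tokens(generator, text_encoder, image_features, query_text):
    """Sketch: condition dynamic token generation on a textual query by concatenating
    query embeddings to the visual features used as cross-attention memory.
    `generator` is the DynamicTokenGenerator sketched earlier; all names are assumptions."""
    query_emb = text_encoder(query_text)               # (1, L, d_model) text embeddings
    memory = torch.cat([query_emb, image_features], dim=1)
    # with query context available, the generator can stop sooner for narrow queries
    return generator(memory)
```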
Implications and Future Work
DOVE presents a significant shift from the conventional fixed-length tokenization paradigm, emphasizing the need for adaptable representation strategies in computer vision. Its dynamic tokenization approach offers promising implications for efficient multimodal learning, specifically in tasks requiring fine-grained semantic interpretation and image context compression.
Looking forward, the methodology established in DOVE could be extended to further explore dynamic token compression techniques and improved quantization strategies for discrete representation spaces. Additionally, its integration into larger vision-language frameworks hints at potential advancements in areas where computational efficiency is paramount.
Overall, the paper contributes valuable insights into the domain of vision representation learning, paving the way for next-generation models capable of adaptive token management.