Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better (2506.09040v1)

Published 10 Jun 2025 in cs.CV, cs.AI, and cs.CL

Abstract: Typical large vision-language models (LVLMs) apply autoregressive supervision solely to textual sequences, without fully incorporating the visual modality into the learning process. This results in three key limitations: (1) an inability to utilize images without accompanying captions, (2) the risk that captions omit critical visual details, and (3) the challenge that certain vision-centric content cannot be adequately conveyed through text. As a result, current LVLMs often prioritize vision-to-language alignment while potentially overlooking fine-grained visual information. While some prior works have explored autoregressive image generation, effectively leveraging autoregressive visual supervision to enhance image understanding remains an open challenge. In this paper, we introduce Autoregressive Semantic Visual Reconstruction (ASVR), which enables joint learning of visual and textual modalities within a unified autoregressive framework. We show that autoregressively reconstructing the raw visual appearance of images does not enhance and may even impair multimodal understanding. In contrast, autoregressively reconstructing the semantic representation of images consistently improves comprehension. Notably, we find that even when models are given continuous image features as input, they can effectively reconstruct discrete semantic tokens, resulting in stable and consistent improvements across a wide range of multimodal understanding benchmarks. Our approach delivers significant performance gains across varying data scales (556k-2M) and types of LLM backbones. Specifically, ASVR improves LLaVA-1.5 by 5% in average scores across 14 multimodal benchmarks. The code is available at https://github.com/AlenjandroWang/ASVR.

Summary

  • The paper introduces ASVR, a framework that autoregressively reconstructs semantic visual tokens to improve multimodal comprehension.
  • It demonstrates that ASVR consistently boosts performance across 14 benchmarks, improving LLaVA-1.5’s average scores by 5%.
  • It highlights that semantic reconstruction significantly outperforms raw appearance supervision, offering a scalable training strategy for LVLMs.

Autoregressive Semantic Visual Reconstruction: Enhancing Multimodal Understanding in Vision-Language Models

The paper "Autoregressive Semantic Visual Reconstruction Helps VLMs Understand Better" introduces Autoregressive Semantic Visual Reconstruction (ASVR), a method aimed at improving multimodal understanding in large vision-LLMs (LVLMs). The main innovation of ASVR lies in its integration of semantic visual supervision within a unified autoregressive framework, facilitating joint learning of visual and textual modalities. This approach addresses limitations in current LVLMs by teaching the models to reconstruct and perceive high-level semantic visual information, thus enhancing their multimodal understanding capabilities.

Key Contributions and Findings

  1. Autoregressive Semantic Visual Supervision: ASVR jointly supervises visual and textual inputs by autoregressively reconstructing semantic visual tokens rather than raw visual appearances. Using a pretrained semantic visual tokenizer to provide discrete targets, the paper demonstrates that semantic reconstruction consistently enhances comprehension across a range of multimodal benchmarks, encouraging the model to attend to semantically meaningful aspects of images. A minimal sketch of this joint objective follows the list below.
  2. Evaluating Effectiveness Across Model Configurations: In controlled experiments, ASVR consistently improved performance across 14 multimodal understanding benchmarks, including general visual question answering (VQA), OCR-based tasks, knowledge-based question answering, visual-centric tasks, and hallucination robustness tests. Notably, ASVR improved LLaVA-1.5's average scores by 5% across these tasks. This validates the robustness and generalizability of ASVR across different data scales, LLM backbone capacities, and input resolutions.
  3. Comparison With Visual Appearance Supervision: The paper highlights that reconstructions based on semantic tokens outperform those based on raw appearance tokens. Appearance-based supervision was found to degrade performance, emphasizing the importance of high-level semantic representation in enhancing multimodal understanding.
  4. Training Recipe for Integrated Supervision: ASVR applies semantic visual supervision during both the pre-training and instruction-tuning stages, allowing models to build a coherent visual representation foundation that carries through to vision-language interactions. Maintaining visual supervision across both stages further strengthens alignment between the visual and textual semantic spaces and scales well with data and backbone size.
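
To make the joint objective referenced in item 1 concrete, below is a minimal PyTorch-style sketch of how autoregressive semantic visual reconstruction could be combined with standard next-text-token prediction. It assumes a frozen semantic tokenizer has already converted each image into discrete token ids that serve as targets, while the model still consumes continuous image features as input. The names (`text_head`, `visual_head`, `visual_loss_weight`) and the label-masking convention are illustrative assumptions, not the paper's actual implementation; refer to the linked repository for the official code.

```python
import torch
import torch.nn.functional as F

def asvr_joint_loss(
    llm_hidden_states,       # (B, T, D) decoder hidden states over the multimodal sequence
    text_head,               # linear head projecting hidden states to the text vocabulary
    visual_head,             # separate head projecting hidden states to the semantic codebook
    text_labels,             # (B, T) token ids; -100 at positions that are not text targets
    visual_labels,           # (B, T) discrete semantic-token ids from a frozen semantic
                             # tokenizer; -100 at positions that are not visual targets
    visual_loss_weight=1.0,  # relative weight of the visual term (illustrative value)
):
    """Sketch of a joint autoregressive objective: next-text-token cross-entropy
    plus cross-entropy on discrete semantic visual tokens. Image inputs remain
    continuous features; only the targets are discrete."""
    text_logits = text_head(llm_hidden_states)      # (B, T, V_text)
    visual_logits = visual_head(llm_hidden_states)  # (B, T, V_visual)

    # Shift by one so each position predicts the next token, as in causal LM training.
    text_loss = F.cross_entropy(
        text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
        text_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    visual_loss = F.cross_entropy(
        visual_logits[:, :-1].reshape(-1, visual_logits.size(-1)),
        visual_labels[:, 1:].reshape(-1),
        ignore_index=-100,
    )
    return text_loss + visual_loss_weight * visual_loss
```

The design choice this sketch illustrates is that only the targets are discrete: the LVLM's visual inputs remain continuous encoder features, while a separate head predicts the next semantic token id at image positions and the usual language-modeling head handles text positions.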

Practical and Theoretical Implications

The implications of ASVR are both practical and theoretical. Practically, ASVR provides a scalable training strategy that improves LVLM performance across a wide range of multimodal tasks and benchmarks. This can substantially benefit applications such as virtual assistance, automated content analysis, and interactive AI systems, where balanced visual and linguistic understanding is crucial.

Theoretically, ASVR offers a new perspective on autoregressive modeling for multimodal systems, suggesting that supervising semantic visual representations is a more effective route to vision-language comprehension than supervising raw appearance. Its results underscore the importance of integrating high-level semantic visual information into multimodal training, contributing to ongoing research on more perceptive and cognitively capable AI systems.

Speculations on Future Developments

Looking ahead, ASVR opens pathways for integrating generative capabilities alongside semantic reconstruction within LVLMs. Future work could combine semantic visual reconstruction with image generation, moving toward unified models that transition smoothly between understanding and generating visual content. Such a duality could foster more interactive and responsive multimodal AI systems.

In conclusion, ASVR represents a significant step forward in improving multimodal comprehension for LVLMs by implementing semantic visual supervision. Its findings provide valuable insights and methodologies that can inform future advancements in AI-driven vision-language understanding and generation techniques.
