Overview of Mamba-based Traversal of Rationales in LLVMs
The paper under review introduces a novel large language and vision model (LLVM), referred to as Mamba-based traversal of rationales (Meteor), which seeks to enhance the vision-language performance of LLVMs through the integration of multifaceted rationales. The method relies neither on scaling up model size nor on adding extra vision encoders or computer vision models at inference time. The approach is particularly noteworthy for its traversal of rationale mechanism, which uses the Mamba architecture to handle lengthy rationales with time complexity linear in sequence length.
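The linear-time claim follows from the recurrent form of state-space models such as Mamba: the hidden state is updated once per token, so the cost of embedding a rationale grows linearly with its length rather than quadratically as with Transformer self-attention. The snippet below is a minimal, non-selective toy recurrence, not the authors' implementation; the function name ssm_scan and the matrices A, B, C are illustrative assumptions.

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Toy linear state-space recurrence: O(T) work for a length-T sequence.

    x: (T, d_in) token embeddings of a (possibly very long) rationale
    A: (d_state, d_state) state transition
    B: (d_state, d_in) input map, C: (d_out, d_state) readout
    """
    T = x.shape[0]
    h = np.zeros(A.shape[0])               # hidden state carried across tokens
    y = np.empty((T, C.shape[0]))
    for t in range(T):                     # one constant-cost update per token
        h = A @ h + B @ x[t]
        y[t] = C @ h
    return y

# Example: a 4,096-token rationale with small toy dimensions.
rng = np.random.default_rng(0)
x = rng.normal(size=(4096, 16))
A = 0.9 * np.eye(8)
B = rng.normal(size=(8, 16))
C = rng.normal(size=(4, 8))
print(ssm_scan(x, A, B, C).shape)          # (4096, 4)
```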
Methodology
1. Data Curation and Rationale Generation:
The authors compiled a dataset comprising 2.1 million question-answer pairs, derived from various visual instruction tuning datasets. To generate detailed rationales, these question-answer pairs were processed using the Claude Haiku API, and subsequently refined with human review assisted by GPT-4V. This curation resulted in 1.1 million question-rationale-answer triples, which span a wide range of tasks including fundamental image understanding, common-sense knowledge, and complex problem-solving procedures.
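As a rough illustration of the curation loop described above, the sketch below turns raw question-answer pairs into question-rationale-answer triples and discards those that fail review. The helpers generate_rationale and passes_review are hypothetical stand-ins for the Claude Haiku call and the GPT-4V-assisted human review; this is not the authors' pipeline.

```python
from dataclasses import dataclass

@dataclass
class Triple:
    question: str
    rationale: str
    answer: str

def curate_triples(qa_pairs, generate_rationale, passes_review):
    """Convert (question, answer) pairs into question-rationale-answer triples.

    generate_rationale(question, answer) -> str  # stand-in for the Claude Haiku call
    passes_review(triple) -> bool                # stand-in for GPT-4V-assisted human review
    Pairs whose rationale fails review are dropped, which is why the curated
    set (1.1M triples) is smaller than the raw set (2.1M pairs).
    """
    curated = []
    for question, answer in qa_pairs:
        rationale = generate_rationale(question, answer)
        triple = Triple(question, rationale, answer)
        if passes_review(triple):
            curated.append(triple)
    return curated
```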
2. Model Architecture:
Meteor integrates several key components:
- A vision encoder based on the CLIP-L/14 model for extracting visual features.
- The Mamba architecture, referred to as Meteor-Mamba, designed for embedding lengthy rationales efficiently.
- A backbone multimodal LLM (Meteor-MLM) built upon the InternLM2-7B, which leverages the embedded rationales for answer generation.
- A vision projector and a tor (traversal of rationale) projector that adapt feature dimensions between the vision encoder, Meteor-Mamba, and Meteor-MLM components (a minimal wiring sketch of these components follows the list).
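The following schematic shows one plausible way these components connect in a forward pass. The module names (vision_encoder, vision_proj, mamba, tor_proj, mlm) and the keyword interface of the backbone are assumptions made for illustration; they are not the released Meteor code.

```python
import torch.nn as nn

class MeteorSketch(nn.Module):
    """Schematic wiring: vision features and embedded rationales feed the backbone MLM."""

    def __init__(self, vision_encoder, vision_proj, mamba, tor_proj, mlm):
        super().__init__()
        self.vision_encoder = vision_encoder   # e.g. CLIP-L/14 feature extractor
        self.vision_proj = vision_proj         # maps vision features to the MLM dimension
        self.mamba = mamba                     # Meteor-Mamba: embeds lengthy rationales
        self.tor_proj = tor_proj               # maps Mamba outputs to the MLM dimension
        self.mlm = mlm                         # backbone multimodal LLM (InternLM2-7B)

    def forward(self, image, rationale_tokens, question_tokens):
        v = self.vision_proj(self.vision_encoder(image))    # projected visual tokens
        r = self.tor_proj(self.mamba(rationale_tokens))     # projected rationale embedding
        # The backbone consumes visual tokens, rationale embeddings, and the question,
        # and generates the answer autoregressively.
        return self.mlm(vision=v, rationale=r, text=question_tokens)
```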
Training and Inference
The training process is divided into two principal steps:
- Embedding Rationales: Meteor-Mamba is trained to embed the lengthy rationales in an autoregressive manner, using a concept termed traversal of rationale. This method inserts special <tor> tokens that segment the rationales and ensure effective information passage into Meteor-MLM (see the sketch after this list).
- Vision-Language Training: Subsequently, the entire Meteor architecture is trained on the question-answer pairs, enabling the model to generate answers supported by the embedded rationales without requiring explicit rationale text at inference time.
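The sketch below illustrates how a long rationale might be segmented with <tor> tokens before Meteor-Mamba embeds it autoregressively. The fixed segment length and the exact token string are assumptions for illustration; the paper defines the actual segmentation scheme.

```python
TOR_TOKEN = "<tor>"  # special token assumed here to delimit rationale segments

def traverse_rationale(rationale_tokens, segment_len=128):
    """Interleave <tor> tokens into a long rationale at fixed-size segment boundaries.

    During the first training step, Meteor-Mamba would embed this interleaved
    sequence autoregressively; each <tor> position marks where accumulated
    rationale information is handed over to Meteor-MLM. The segment length of
    128 is an illustrative choice, not a value taken from the paper.
    """
    traversed = []
    for start in range(0, len(rationale_tokens), segment_len):
        traversed.extend(rationale_tokens[start:start + segment_len])
        traversed.append(TOR_TOKEN)
    return traversed

# Example usage with toy word-level tokens.
tokens = "the chart shows revenue rising in each quarter because the caption says so".split()
print(traverse_rationale(tokens, segment_len=4))
```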
Results and Evaluation
Meteor demonstrates substantial improvements across multiple benchmarks, including MME, AI2D, MathVista, and MM-Vet. In evaluations against a range of existing open-source LLVMs, Meteor consistently outperforms them, showcasing its ability to handle diverse tasks that require intricate understanding and reasoning.
For instance, results on the MME benchmark, which involves multifaceted image understanding tasks, highlight Meteor's superior performance, with significantly higher scores than models such as LLaVA-Next-7B and InternLM-XC-7B. Moreover, in the challenging evaluations of Table 2(a) and Table 2(b), Meteor surpasses other state-of-the-art models on complex benchmarks such as MMStar and MathVerse, further underscoring the efficacy of embedding multifaceted rationales.
Implications and Future Directions
The results indicate that embedding multifaceted rationales considerably enhances a model's ability to handle complex vision-language tasks, making the Meteor architecture a valuable alternative to increasing model size or employing additional encoders. The approach also mitigates hallucination in generative models, as evidenced by its performance on the POPE and HallusionBench benchmarks.
Future Developments
While Meteor has demonstrated impressive results with a 7B model, there is potential to adapt similar methodologies to even smaller models (in the 1-3B parameter range) by leveraging techniques such as mixture of depths and layer analysis. This could pave the way toward further democratizing access to highly capable LLVMs with minimal computational resources. The integration of rationale embedding offers a promising avenue for enhancing the interpretability and robustness of generative AI systems, and could extend to more domains requiring nuanced understanding and reasoning.