Perception Tokens for Enhanced Visual Reasoning in Multimodal LLMs
The paper "Perception Tokens Enhance Visual Reasoning in Multimodal LLMs" presents an innovative approach to augmenting Multimodal LLMs (MLMs) by integrating perception tokens that improve their visual reasoning capabilities. This approach is significant in addressing the persistent challenges MLMs face, particularly in tasks that require nuanced visual perception and reasoning, such as depth estimation and object detection.
Concept of Perception Tokens
The primary contribution of this research is the introduction of Perception Tokens: intrinsic image representations that MLMs can use as intermediate reasoning tokens. These tokens let MLMs perform visual tasks beyond their traditional reliance on purely linguistic representations. The authors draw an analogy to chain-of-thought prompting in LLMs: Perception Tokens support intermediate reasoning steps for which language alone is insufficient.
In practice, Perception Tokens encode depth maps or bounding box representations that the model generates as intermediate outputs during visual reasoning. Producing these intermediate visual estimates improves performance on several benchmarks: the proposed method achieves notable gains on counting tasks, with +10.8% on BLINK, +11.3% on CVBench, and +8.3% on SEED-Bench compared to traditional fine-tuning approaches.
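To make the idea concrete, the sketch below shows what a perception-token-augmented reasoning trace might look like for a relative-depth question. The token names (<DEPTH_START>, <DEPTH_12>, and so on) and the trace format are illustrative assumptions, not the exact vocabulary or output format used in the paper.

```python
# Minimal sketch of a perception-token-augmented reasoning trace.
# The token names (<DEPTH_START>, <DEPTH_12>, ...) and the trace format are
# illustrative placeholders, not the paper's exact vocabulary or output.

question = (
    "Which marked point is closer to the camera, A or B? "
    "Reason with an intermediate depth map before answering."
)

# The model first emits a coarse, tokenized depth map as intermediate
# "perception" output, then continues reasoning over it in language.
model_response = (
    "<DEPTH_START> "
    "<DEPTH_12> <DEPTH_14> <DEPTH_90> <DEPTH_88> ... "  # quantized depth codes
    "<DEPTH_END> "
    "Point A lies in a region with small depth codes (near the camera), "
    "while point B lies in a region with large depth codes (far). Answer: A."
)

print(question)
print(model_response)
```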
AURORA Framework
To implement Perception Tokens, the paper introduces a training framework named AURORA. The framework employs a vector-quantized variational autoencoder (VQVAE) to convert intermediate image representations, such as depth maps, into a tokenized format. Integrating these tokens into a multi-task training setup lets the model draw on both depth estimation and bounding box prediction, improving visual reasoning on tasks like relative depth estimation and object counting. AURORA also promotes generalization across datasets, demonstrated by a 6.4% improvement in relative depth estimation on the BLINK dataset.
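The quantization step at the heart of this pipeline can be sketched as follows: a small encoder maps an intermediate depth map to a grid of latent vectors, and each latent is snapped to its nearest codebook entry, yielding discrete IDs that can serve as perception tokens. The architecture, codebook size, and tensor shapes below are assumptions for illustration, not the paper's exact VQVAE configuration.

```python
# Sketch of VQVAE-style quantization that turns an intermediate depth map
# into discrete token IDs. Layer choices, codebook size, and shapes are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class DepthTokenizer(nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 64):
        super().__init__()
        # Small CNN encoder: 1-channel depth map -> grid of latent vectors.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, dim, kernel_size=4, stride=4),
            nn.ReLU(),
            nn.Conv2d(dim, dim, kernel_size=4, stride=4),
        )
        # Learned codebook; each row corresponds to one discrete perception token.
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, depth_map: torch.Tensor) -> torch.Tensor:
        # depth_map: (B, 1, H, W) -> latents: (B, dim, H/16, W/16)
        z = self.encoder(depth_map)
        b, d, h, w = z.shape
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, d)        # (B*h*w, dim)
        # Vector quantization: nearest codebook entry per latent vector.
        dists = torch.cdist(z_flat, self.codebook.weight)    # (B*h*w, K)
        token_ids = dists.argmin(dim=-1).reshape(b, h * w)   # (B, h*w)
        return token_ids  # discrete IDs the MLM can emit as perception tokens

# Example: a 256x256 depth map becomes a 16x16 = 256-token sequence.
tokenizer = DepthTokenizer()
ids = tokenizer(torch.rand(1, 1, 256, 256))
print(ids.shape)  # torch.Size([1, 256])
```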
Technical Implementation and Evaluation
The method fine-tunes the MLM after adding the new tokens to its vocabulary, expanding the tokenization space to include spatial representations. Training follows a curriculum that moves from predicting individual perception tokens to complex multi-step reasoning tasks. The findings show that Perception Tokens allow MLMs to capture spatial relationships and thereby solve tasks that purely language-based reasoning cannot adequately address.
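A minimal sketch of that vocabulary-expansion step is shown below, using the Hugging Face transformers API. The base model, token strings, and codebook size are placeholders for illustration; the paper's actual backbone and vocabulary layout are not assumed here.

```python
# Sketch of expanding a language model's vocabulary with perception tokens
# before fine-tuning. The base model and token names are placeholders.
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # small placeholder backbone, not the paper's MLM
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# New discrete tokens: depth codebook entries plus boundary markers
# (hypothetical names; codebook size of 1024 is an assumption).
perception_tokens = (
    ["<DEPTH_START>", "<DEPTH_END>"]
    + [f"<DEPTH_{i}>" for i in range(1024)]
)
num_added = tokenizer.add_tokens(perception_tokens, special_tokens=True)

# Grow the embedding matrix so the new token IDs have trainable rows.
model.resize_token_embeddings(len(tokenizer))
print(f"Added {num_added} perception tokens; vocab size is now {len(tokenizer)}.")
```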
The authors validate their approach with evaluations across several benchmarks, reporting improvements over both closed-source and open-source models. Notably, their method outperforms fine-tuned baselines and standard MLM architectures across tasks of varying complexity, indicating that the Perception Token integration is both effective and robust.
Broader Implications and Future Directions
The implications of this research are substantial for the field of AI, particularly in advancing the capability of MLMs to reason over visual inputs with greater efficacy. By enabling MLMs to utilize Perception Tokens, this approach bridges a critical gap in multimodal reasoning, offering a scalable and efficient method to enhance model performance without excessive computational overhead.
Looking forward, the framework set forth by this paper provides a robust foundation for further research into integrating additional forms of perceptual information into MLM architectures. Future research could explore expanding the tokenization vocabulary further to include other types of visual features, potentially unlocking new avenues for complex visual reasoning tasks across a broad range of applications.
In conclusion, this paper makes a compelling contribution to multimodal learning by offering a method to enhance visual reasoning in multimodal LLMs. The approaches detailed here pave the way for future investigations and practical implementations that leverage perception tokens for more sophisticated visual tasks, ultimately expanding what multimodal LLMs can achieve.