LISA: Reasoning Segmentation via Large Language Model (2308.00692v3)

Published 1 Aug 2023 in cs.CV

Abstract: Although perception systems have made remarkable advancements in recent years, they still rely on explicit human instruction or pre-defined categories to identify the target objects before executing visual recognition tasks. Such systems cannot actively reason and comprehend implicit user intention. In this work, we propose a new segmentation task -- reasoning segmentation. The task is designed to output a segmentation mask given a complex and implicit query text. Furthermore, we establish a benchmark comprising over one thousand image-instruction-mask data samples, incorporating intricate reasoning and world knowledge for evaluation purposes. Finally, we present LISA: large Language Instructed Segmentation Assistant, which inherits the language generation capabilities of multimodal LLMs while also possessing the ability to produce segmentation masks. We expand the original vocabulary with a <SEG> token and propose the embedding-as-mask paradigm to unlock the segmentation capability. Remarkably, LISA can handle cases involving complex reasoning and world knowledge. Also, it demonstrates robust zero-shot capability when trained exclusively on reasoning-free datasets. In addition, fine-tuning the model with merely 239 reasoning segmentation data samples results in further performance enhancement. Both quantitative and qualitative experiments show our method effectively unlocks new reasoning segmentation capabilities for multimodal LLMs. Code, models, and data are available at https://github.com/dvlab-research/LISA.

Analysis of "LISA: Reasoning Segmentation via Large Language Model"

This paper presents a framework for segmentation tasks that necessitate intricate reasoning and comprehension beyond explicit human instructions. The authors identify a significant limitation in current perception systems, which typically require explicit user input or pre-defined categories to identify target objects. In response, they propose the novel task of reasoning segmentation: generating a binary segmentation mask from an implicit and complex query text.

Key Contributions

The paper's significant contributions can be summarized as follows:

  1. Introduction of Reasoning Segmentation Task: The authors delineate a new segmentation task that requires generating a segmentation mask based on implicit and complex query texts. This task is fundamentally different from traditional referring segmentation tasks, which rely on explicit descriptions.
  2. ReasonSeg Benchmark: To validate their approach, the authors construct the ReasonSeg benchmark, comprising over one thousand image-instruction-mask samples that involve intricate reasoning and world knowledge. This benchmark serves as a comprehensive evaluation suite and promotes further research in the domain (a hypothetical sample layout is sketched after this list).
  3. LISA Model: They develop the Large Language Instructed Segmentation Assistant (LISA), integrating the reasoning capabilities of multi-modal LLMs with segmentation abilities. Central to this is the embedding-as-mask paradigm, which enables the LLM to produce segmentation masks.
  4. Zero-shot and Fine-tuning Capabilities: LISA exhibits strong zero-shot performance on reasoning segmentation tasks when trained on standard datasets without explicit reasoning segmentation samples. Further performance enhancements are observed with minimal fine-tuning on a small subset of reasoning segmentation pairs.
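
To make the task format concrete, below is a purely hypothetical ReasonSeg-style sample layout (the benchmark's actual schema may differ); the key point is that each sample pairs an image and a target mask with a query that never names the object directly:

```python
# Hypothetical ReasonSeg-style sample (illustrative only; not the
# benchmark's actual on-disk schema).
sample = {
    "image": "kitchen_001.jpg",
    # Implicit query: the target is never named, so the model must
    # reason about function and world knowledge to find it.
    "instruction": "the appliance that keeps food cold",
    # Binary ground-truth mask, same spatial size as the image.
    "mask": "kitchen_001_mask.png",
}
```

A referring segmentation query, by contrast, would name the target explicitly (e.g., "the white refrigerator on the left").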

Methodological Innovations

The proposed LISA model builds on a robust multi-modal LLM (specifically, LLaVA) and incorporates a vision backbone capable of segmentation, such as SAM. A new token, <SEG>, added to the LLM's vocabulary enables it to request segmentation masks. When the model generates the <SEG> token, the corresponding last-layer hidden embedding is projected through an MLP layer and decoded into a segmentation mask by a dedicated mask decoder. This embedding-as-mask technique integrates segmentation capabilities seamlessly into the LLM framework.
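
The following is a minimal PyTorch sketch of this mechanism, assuming hypothetical names (`EmbeddingAsMaskHead`, `mask_decoder`) rather than the authors' actual implementation:

```python
import torch
import torch.nn as nn

class EmbeddingAsMaskHead(nn.Module):
    """Minimal sketch of the embedding-as-mask idea: project the <SEG>
    token's hidden state and use it to prompt a mask decoder (e.g., SAM's).
    Module and argument names are illustrative, not the authors' code."""

    def __init__(self, llm_dim: int = 4096, decoder_dim: int = 256):
        super().__init__()
        # MLP projection from the LLM hidden size to the decoder prompt size.
        self.proj = nn.Sequential(
            nn.Linear(llm_dim, llm_dim),
            nn.ReLU(),
            nn.Linear(llm_dim, decoder_dim),
        )

    def forward(self, hidden_states, seg_positions, image_features, mask_decoder):
        # hidden_states: (batch, seq_len, llm_dim) last-layer LLM states.
        # seg_positions: (batch,) index of the generated <SEG> token per sample.
        batch = hidden_states.size(0)
        seg_embed = hidden_states[torch.arange(batch), seg_positions]  # (batch, llm_dim)
        prompt = self.proj(seg_embed)                                  # (batch, decoder_dim)
        # The decoder combines the prompt embedding with the vision
        # backbone's image features to produce a mask.
        return mask_decoder(image_features, prompt)
```

Because the mask is derived from an embedding rather than emitted as text, the LLM's autoregressive decoding loop is left untouched; only the projection head and the mask decoder are added on top.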

Training LISA draws on standard semantic segmentation datasets, referring segmentation datasets, and visual question answering datasets; notably, none of these contain reasoning segmentation samples. This diversified training regime makes the model robust and effective across tasks and underpins its ability to generalize to complex, implicit instructions.
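
End-to-end training then optimizes the text stream and the mask jointly. The sketch below shows a plausible form of such a combined objective, with an autoregressive cross-entropy term plus per-pixel BCE and DICE terms on the mask; the weights shown are illustrative, not the paper's exact settings:

```python
import torch.nn.functional as F

def dice_loss(pred_logits, target, eps=1.0):
    """Soft DICE loss over per-pixel mask logits; shapes (B, H, W)."""
    pred = pred_logits.sigmoid().flatten(1)
    target = target.flatten(1)
    inter = (pred * target).sum(-1)
    union = pred.sum(-1) + target.sum(-1)
    return (1 - (2 * inter + eps) / (union + eps)).mean()

def combined_loss(text_logits, text_labels, mask_logits, mask_targets,
                  w_txt=1.0, w_bce=2.0, w_dice=0.5):
    """Text-generation loss plus mask loss; weights are illustrative."""
    # Next-token cross-entropy on the language stream
    # (text_logits: (B, T, V); text_labels: (B, T), -100 = ignored).
    l_txt = F.cross_entropy(text_logits.flatten(0, 1), text_labels.flatten(),
                            ignore_index=-100)
    # Per-pixel BCE plus DICE on the predicted mask (float targets in {0, 1}).
    l_bce = F.binary_cross_entropy_with_logits(mask_logits, mask_targets)
    l_dice = dice_loss(mask_logits, mask_targets)
    return w_txt * l_txt + w_bce * l_bce + w_dice * l_dice
```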

Experimental Results

Reasoning Segmentation

LISA significantly outperforms previous methods on the reasoning segmentation benchmark. For instance, LISA-13B achieves an overall gIoU of 44.8%, which rises to 51.7% after fine-tuning on a small set of reasoning segmentation samples. These results underscore the model's capability to handle complex queries involving reasoning and world knowledge.
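
For reference, the two metrics are typically computed as follows: gIoU averages per-image IoUs, while cIoU accumulates intersections and unions over the whole evaluation set. A short sketch:

```python
import numpy as np

def giou_ciou(pred_masks, gt_masks):
    """gIoU: mean of per-image IoUs. cIoU: cumulative intersection over
    cumulative union across the whole evaluation set."""
    ious, inter_sum, union_sum = [], 0, 0
    for pred, gt in zip(pred_masks, gt_masks):   # boolean arrays (H, W)
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union > 0 else 1.0)
        inter_sum += inter
        union_sum += union
    giou = float(np.mean(ious))
    ciou = float(inter_sum / union_sum) if union_sum > 0 else 1.0
    return giou, ciou
```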

Vanilla Referring Segmentation

In traditional referring segmentation tasks, LISA also demonstrates superior performance. For example, LISA-7B achieves top scores on the refCOCO, refCOCO+, and refCOCOg datasets, outperforming state-of-the-art methods.

Ablation Studies

The extensive ablation studies reveal several critical insights:

  • Vision Backbone: SAM, pre-trained on a vast dataset, provides superior results compared to Mask2Former. However, the framework's flexibility allows for various vision backbones.
  • Projection Layer: Employing an MLP for the projection layer is beneficial, slightly outperforming a linear projection layer.
  • Training Data Diversity: Including multi-class labels and different dataset types during training significantly enhances performance. The combination of semantic segmentation, referring segmentation, and VQA datasets ensures comprehensive model training.

Implications and Future Work

The innovative embedding-as-mask paradigm presents a robust framework for integrating segmentation capabilities into multi-modal LLMs, paving the way for future research in developing intelligent perception systems capable of handling complex tasks. Such systems have significant potential applications in robotics, where nuanced understanding and execution of tasks based on implicit human instructions are paramount.

Future work could explore extending LISA's capabilities to handle more diverse and complex tasks, potentially involving dynamic and real-time environments. Additionally, enhancing the efficiency of fine-tuning strategies and reducing the computational overhead would be beneficial for broader applications.

In summary, this paper marks a notable advancement in the intersection of LLMs and visual perception tasks, providing a comprehensive framework for reasoning segmentation and setting a benchmark for future research in this domain.

Authors (7)
  1. Xin Lai (24 papers)
  2. Zhuotao Tian (38 papers)
  3. Yukang Chen (43 papers)
  4. Yanwei Li (36 papers)
  5. Yuhui Yuan (42 papers)
  6. Shu Liu (146 papers)
  7. Jiaya Jia (162 papers)
Citations (263)