Empowering Segmentation Ability to Multi-modal Large Language Models (2403.14141v1)
Abstract: Multi-modal LLMs (MLLMs) can understand image-language prompts and demonstrate impressive reasoning ability. In this paper, we extend MLLMs' output by empowering them with segmentation ability: the extended MLLMs both generate language responses to image-language prompts and segment the regions that a complex question or query in the prompt focuses on. Toward this goal, the existing work LISA enlarges the original word embeddings with an additional segmentation token and fine-tunes dialogue generation and query-focused segmentation jointly, using the feature of the segmentation token to prompt the Segment Anything Model. Although LISA achieves strong segmentation performance, we observe that its dialogue ability decreases by a large margin compared to the original MLLM. To preserve the original dialogue ability, we propose a novel MLLM framework, coined LLaVASeg, which leverages a chain-of-thought prompting strategy to instruct the MLLM to segment the target region queried by the user. The MLLM is first prompted to reason out a simple description of the target region from the complicated user query, and then to extract the visual attributes of that region according to its understanding of the image. These visual attributes, such as color and relative location, are used to prompt the downstream segmentation model. Experiments show that the proposed method keeps the original dialogue ability while equipping the MLLM with strong reasoning-segmentation ability. The code is available at https://github.com/YuqiYang213/LLaVASeg.
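To make the abstract's pipeline concrete, here is a minimal Python sketch of the three-stage chain-of-thought prompting strategy it describes. The names `mllm_generate` and `segment_with_text_prompt`, and the prompt wording, are illustrative assumptions rather than the paper's actual interface; see the linked repository for the real implementation.

```python
# A minimal sketch of the LLaVASeg-style pipeline from the abstract.
# `mllm_generate(image, prompt) -> str` stands in for an MLLM (e.g., LLaVA),
# and `segment_with_text_prompt(image, text) -> mask` for a text-promptable
# segmentation model. Both are hypothetical stand-ins.

def reasoning_segmentation(image, user_query, mllm_generate, segment_with_text_prompt):
    """Segment the region a complex user query refers to via staged prompting."""
    # Stage 1: reason from the complex query to a simple description
    # of the target region.
    target = mllm_generate(
        image,
        f"Question: {user_query}\n"
        "Which object or region in the image does this question refer to? "
        "Answer with a short noun phrase.",
    )

    # Stage 2: extract visual attributes of the target (e.g., color,
    # relative location) from the MLLM's understanding of the image.
    attributes = mllm_generate(
        image,
        f"Describe the visual attributes of '{target}' in this image, "
        "such as its color and its location relative to other objects.",
    )

    # Stage 3: prompt the downstream segmentation model with the
    # target description and its attributes.
    return segment_with_text_prompt(image, f"{target}. {attributes}")
```

In this sketch the MLLM is used purely through prompting, which reflects the abstract's contrast with LISA's joint fine-tuning: attribute text, rather than a learned segmentation-token feature, is what prompts the downstream segmentation model, so the MLLM's original dialogue behavior is left intact.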
- Flamingo: A visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022.
- COCO-Stuff: Thing and stuff classes in context. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1209–1218, 2018.
- Shikra: Unleashing multimodal LLM's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023.
- LLaVA-Interactive: An all-in-one demo for image chat, segmentation, generation and editing. arXiv preprint arXiv:2311.00571, 2023.
- Detect what you can: Detecting and representing objects using holistic models and body parts. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1971–1978, 2014.
- Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2023.
- PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
- LMFlow: An extensible toolkit for finetuning and inference of large foundation models. arXiv preprint arXiv:2306.12420, 2023.
- Vision-language transformer and query generation for referring segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16321–16330, 2021.
- Encoder fusion network with co-attention embedding for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15506–15515, 2021.
- Complexity-based prompting for multi-step reasoning. arXiv preprint arXiv:2210.00720, 2022.
- ReferItGame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798, 2014.
- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, pages 4171–4186, 2019.
- Decomposed prompting: A modular approach for solving complex tasks. arXiv preprint arXiv:2210.02406, 2022.
- Segment anything. arXiv preprint arXiv:2304.02643, 2023.
- Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35:22199–22213, 2022.
- LISA: Reasoning segmentation via large language model. arXiv preprint arXiv:2308.00692, 2023.
- BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597, 2023.
- Open-vocabulary semantic segmentation with mask-adapted CLIP. In CVPR, 2023.
- Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
- GRES: Generalized referring expression segmentation. In CVPR, 2023.
- Visual instruction tuning. arXiv preprint arXiv:2304.08485, 2023.
- LLaVA-Plus: Learning to use tools for creating multimodal agents. arXiv preprint arXiv:2311.05437, 2023.
- InternGPT: Solving vision-centric tasks by interacting with ChatGPT beyond language. arXiv preprint arXiv:2305.05662, 2023.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
- Multi-task collaborative network for joint referring expression comprehension and segmentation. In CVPR, 2020.
- Generation and comprehension of unambiguous object descriptions. In CVPR, 2016.
- OpenAI. ChatGPT. https://openai.com/blog/chatgpt/. Accessed: 2023-09-27.
- OpenAI. GPT-4V(ision) system card. https://cdn.openai.com/papers/GPTV_System_Card.pdf. Accessed: 2023-10-09.
- Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
- DetGPT: Detect what you need via reasoning. arXiv preprint arXiv:2305.14167, 2023.
- Measuring and narrowing the compositionality gap in language models. arXiv preprint arXiv:2210.03350, 2022.
- PACO: Parts and attributes of common objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7141–7151, 2023.
- CoTDet: Affordance knowledge prompting for task driven object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3068–3078, 2023.
- LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- CIDEr: Consensus-based image description evaluation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4566–4575, 2015.
- VisionLLM: Large language model is also an open-ended decoder for vision-centric tasks. arXiv preprint arXiv:2305.11175, 2023.
- Self-consistency improves chain of thought reasoning in language models. arXiv preprint arXiv:2203.11171, 2022.
- CRIS: CLIP-driven referring image segmentation. In CVPR, 2022.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- LAVT: Language-aware vision transformer for referring image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18155–18165, 2022.
- Multimodal chain-of-thought reasoning in language models. arXiv preprint arXiv:2302.00923, 2023.
- DDCoT: Duty-distinct chain-of-thought prompting for multimodal reasoning in language models. arXiv preprint arXiv:2310.16436, 2023.
- Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
- Least-to-most prompting enables complex reasoning in large language models. arXiv preprint arXiv:2205.10625, 2022.
- MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
- LLaFS: When large language models meet few-shot segmentation. arXiv preprint arXiv:2311.16926, 2023.
- Generalized decoding for pixel, image, and language. In CVPR, 2023.
- Segment everything everywhere all at once. arXiv preprint arXiv:2304.06718, 2023.
Authors: Yuqi Yang, Peng-Tao Jiang, Jing Wang, Hao Zhang, Kai Zhao, Jinwei Chen, Bo Li