Overview of GLaMM: Pixel Grounding Large Multimodal Model
The paper introduces the Grounding Large Multimodal Model (GLaMM), a Large Multimodal Model (LMM) designed to generate natural language responses seamlessly intertwined with corresponding object segmentation masks. This addresses a core limitation of current LMMs, which typically produce text-only outputs or ground their responses only at the bounding-box level. GLaMM accepts both textual and optional visual prompts (regions of interest), enabling flexible interaction at the scene, region, and pixel level.
Key Contributions and Novel Task
- Grounded Conversation Generation (GCG) Task: The paper introduces the task of Grounded Conversation Generation (GCG), which requires the model to produce natural language descriptions interleaved with segmentation masks corresponding to the described objects or scene elements (a minimal sketch of one possible interleaved output format follows this list). The task is distinctive in that it unites typically isolated capabilities such as referring expression segmentation, region-level captioning, and vision-language conversation into a single pixel-grounded conversational setting.
- GranD Dataset: To meet the annotation demands of the GCG task, the authors curate the Grounding-anything Dataset (GranD), a large-scale annotated dataset comprising 7.5 million unique concepts grounded in a total of 810 million regions with segmentation masks. GranD serves both as a training resource and as a benchmark for evaluating visually grounded conversation.
- Flexible Architecture: GLaMM's architecture combines scene-level, region-level, and pixel-level understanding. Vision encoders, an LLM, and projection layers work together to generate responses with detailed pixel-level grounding, bridging the gap between language and vision tasks.
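To make the GCG output format concrete, the sketch below shows one plausible way a grounded response can be represented: phrases are wrapped in `<p>...</p>` tags and each grounded phrase is followed by a `[SEG]` token that is later decoded into a mask. The exact token names and the `parse_gcg_response` helper are illustrative assumptions rather than the paper's verbatim interface.

```python
import re

# Hypothetical interleaved GCG response: phrases are wrapped in <p>...</p>
# tags and each grounded phrase is followed by a [SEG] token whose hidden
# state would be decoded into a segmentation mask by the pixel decoder.
example_response = (
    "<p>A man</p> [SEG] is sitting on <p>a wooden bench</p> [SEG] "
    "next to <p>a golden retriever</p> [SEG]."
)

def parse_gcg_response(response: str) -> list[str]:
    """Extract the grounded phrases that should each receive a mask.

    The i-th returned phrase corresponds to the i-th [SEG] token
    in the response.
    """
    return re.findall(r"<p>(.*?)</p>\s*\[SEG\]", response)

print(parse_gcg_response(example_response))
# ['A man', 'a wooden bench', 'a golden retriever']
```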
Detailed Methodology
The methodology builds visual understanding hierarchically: a global image encoder provides scene-level context, a region encoder handles user-specified regions of interest, and a SAM-based grounding encoder together with a pixel decoder produces segmentation masks, with projection layers linking these vision components to the LLM. The model is trained end-to-end, allowing it to interpret user-provided visual prompts and generate responses grounded by segmentation masks; a structural sketch of this wiring follows.
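Below is a minimal sketch, in PyTorch-style Python, of how these components could be wired together. The module names, feature dimensions, and the placeholder `nn.Identity()` submodules are assumptions for illustration; the actual model uses pretrained vision backbones and a pretrained LLM rather than these stand-ins.

```python
import torch
import torch.nn as nn

class GLaMMSketch(nn.Module):
    """Illustrative wiring of the components described in the paper: a global
    image encoder, a region encoder for visual prompts, an LLM, a SAM-style
    grounding encoder, and a pixel decoder. Placeholders stand in for the
    real pretrained networks."""

    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        self.global_image_encoder = nn.Identity()        # e.g. a CLIP-style ViT
        self.region_encoder = nn.Identity()              # pools features for region prompts
        self.grounding_image_encoder = nn.Identity()     # SAM-style encoder for mask features
        self.vision_to_llm = nn.Linear(vision_dim, llm_dim)   # projection into LLM space
        self.llm = nn.Identity()                         # autoregressive language model
        self.seg_projection = nn.Linear(llm_dim, vision_dim)  # maps [SEG] states back
        self.pixel_decoder = nn.Identity()               # SAM-style mask decoder

    def forward(self, image_feats, text_embeds):
        # Scene-level features are projected into the LLM embedding space and
        # concatenated with the text token embeddings.
        vis_tokens = self.vision_to_llm(image_feats)
        hidden = self.llm(torch.cat([vis_tokens, text_embeds], dim=1))
        # Hidden states at [SEG] positions (here: the last token, as a stand-in)
        # are projected and passed to the pixel decoder to produce masks.
        seg_embeds = self.seg_projection(hidden[:, -1:])
        masks = self.pixel_decoder(seg_embeds)
        return hidden, masks

# Usage with dummy tensors just to show the data flow.
model = GLaMMSketch()
img = torch.randn(1, 256, 1024)   # 256 visual tokens of width 1024
txt = torch.randn(1, 32, 4096)    # 32 text token embeddings
hidden, masks = model(img, txt)
```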
Comprehensive Evaluation Protocol
To evaluate models on the GCG task, the authors establish a rigorous protocol. GLaMM's generated responses are assessed on the following metrics:
- Caption Quality: Using METEOR and CIDEr metrics to assess the textual output's coherence and informativeness.
- Mask-to-Phrase Correspondence Accuracy: Evaluating how well segmented outputs align with referenced phrases.
- Segmentation Quality: Utilizing mask IoU and class-agnostic mask AP.
- Grounding Capability: Mask recall measures how reliably the model grounds phrases to the correct regions (mask IoU and mask recall are illustrated in the sketch after this list).
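As a concrete illustration of the mask-level metrics, the sketch below computes mask IoU and a simple greedy mask recall at an IoU threshold of 0.5. The threshold and the greedy one-to-one matching are assumptions for illustration and may differ from the paper's exact protocol.

```python
import numpy as np

def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """IoU between two boolean masks of the same shape."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return float(inter) / float(union) if union > 0 else 0.0

def mask_recall(pred_masks, gt_masks, iou_thresh=0.5) -> float:
    """Fraction of ground-truth masks matched by some prediction with
    IoU above the threshold (greedy, each prediction used at most once)."""
    matched_preds = set()
    hits = 0
    for gt in gt_masks:
        best_iou, best_j = 0.0, None
        for j, pred in enumerate(pred_masks):
            if j in matched_preds:
                continue
            iou = mask_iou(pred, gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j is not None and best_iou >= iou_thresh:
            matched_preds.add(best_j)
            hits += 1
    return hits / len(gt_masks) if gt_masks else 0.0

# Toy example: two ground-truth masks, one of which is correctly predicted.
gt = [np.zeros((4, 4), bool), np.zeros((4, 4), bool)]
gt[0][:2, :2] = True
gt[1][2:, 2:] = True
pred = [gt[0].copy()]
print(mask_recall(pred, gt))  # 0.5
```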
Comparison and Performance
In comparative evaluations, GLaMM outperforms existing grounded models such as Kosmos-2, which grounds responses only at the bounding-box level, and LISA, which focuses on producing a segmentation mask for a single query, particularly on tasks requiring multi-turn conversation interleaved with pixel-level grounded outputs. These results highlight the model's suitability as a foundation for comprehensive multimodal interaction.
Future Implications and Developments
The authors anticipate applications in domains that require spatial awareness, such as interactive embodied agents and localization tasks, and suggest that the framework could extend beyond images to additional modalities such as video and 3D.
In summary, the paper presents a framework that addresses the demands of grounded multimodal interaction through a new task (GCG), a large-scale dataset (GranD), a flexible architecture, and a dedicated evaluation protocol, setting a foundational benchmark at the intersection of vision and language processing.