What You Perceive Is What You Conceive: A Cognition-Inspired Framework for Open Vocabulary Image Segmentation

Published 26 May 2025 in cs.CV | (2505.19569v1)

Abstract: Open vocabulary image segmentation tackles the challenge of recognizing dynamically adjustable, predefined novel categories at inference time by leveraging vision-language alignment. However, existing paradigms typically perform class-agnostic region segmentation followed by category matching, which deviates from the human visual system's process of recognizing objects based on semantic concepts, leading to poor alignment between region segmentation and target concepts. To bridge this gap, we propose a novel Cognition-Inspired Framework for open vocabulary image segmentation that emulates the human visual recognition process: first forming a conceptual understanding of an object, then perceiving its spatial extent. The framework consists of three core components: (1) A Generative Vision-Language Model (G-VLM) that mimics human cognition by generating object concepts to provide semantic guidance for region segmentation. (2) A Concept-Aware Visual Enhancer Module that fuses textual concept features with global visual representations, enabling adaptive visual perception based on target concepts. (3) A Cognition-Inspired Decoder that integrates local instance features with G-VLM-provided semantic cues, allowing selective classification over a subset of relevant categories. Extensive experiments demonstrate that our framework achieves significant improvements, reaching $27.2$ PQ, $17.0$ mAP, and $35.3$ mIoU on A-150. It further attains $56.2$, $28.2$, $15.4$, $59.2$, $18.7$, and $95.8$ mIoU on Cityscapes, Mapillary Vistas, A-847, PC-59, PC-459, and PAS-20, respectively. In addition, our framework supports vocabulary-free segmentation, offering enhanced flexibility in recognizing unseen categories. Code will be public.


Authors (7)

Summary

The paper introduces a cognition-inspired segmentation framework that decomposes recognition into conceptual understanding followed by spatial perception.
It uses a generative vision-language model and concept-aware enhancer to fuse semantic features with visual cues for robust segmentation.
Experimental results show significant gains across datasets, highlighting its versatility in both vocabulary-free and open vocabulary segmentation.

Cognition-Inspired Open Vocabulary Image Segmentation Framework

The paper introduces a novel Cognition-Inspired Framework designed for open vocabulary image segmentation, drawing inspiration from the human visual recognition process. The core idea is to mimic how humans first develop a conceptual understanding of an object and then perceive its spatial extent. This framework addresses the limitations of existing methods that often struggle with aligning region segmentation and target concepts, especially when dealing with dynamically adjustable, predefined novel categories at inference time. The proposed framework demonstrates enhanced flexibility in recognizing unseen categories and achieves significant improvements in segmentation performance across various datasets.

Key Components of the Framework

The Cognition-Inspired Framework emulates the human visual recognition process, encapsulated as "What You Perceive Is What You Conceive," and comprises three core components:

Generative Vision-Language Model (G-VLM): This component simulates human cognition by generating object concepts, thereby providing semantic guidance for region segmentation. With the G-VLM, the framework narrows recognition to a subset of candidate categories, leveraging prior knowledge to identify the global visual concepts present in the image.
Concept-Aware Visual Enhancer Module: This module fuses textual concept features with global visual representations, enabling adaptive visual perception based on the identified target concepts. It leverages cross-modal fusion techniques to refine visual features based on semantic understanding.
Cognition-Inspired Decoder: This decoder integrates local instance features with semantic cues provided by the G-VLM, allowing for selective classification over a subset of relevant categories. By sharing attention weights across modalities, the decoder promotes modality-invariant feature learning, ensuring that mask generation is grounded in conceptual understanding, even for novel categories.
Figure 1: Overview of the Cognition-Inspired Framework, which emulates human visual recognition by first conceiving a conceptual understanding of an object and then perceiving its spatial extent.
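
As a rough illustration of how textual concept features might be fused with visual features, the sketch below applies single-head cross-attention in plain Python: each visual token attends over the concept embeddings and adds the attended result back as a residual. The summary does not specify the fusion operator, so the choice of cross-attention, the residual connection, the function name `concept_aware_enhance`, and the toy vectors are all assumptions made for illustration, not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def concept_aware_enhance(visual_tokens, concept_embeddings):
    """Hypothetical fusion step: visual tokens act as queries, concept
    embeddings act as keys and values, and the attended concept vector
    is added to each visual token as a residual."""
    dim = len(concept_embeddings[0])
    scale = 1.0 / math.sqrt(dim)
    enhanced, attention = [], []
    for token in visual_tokens:
        weights = softmax([dot(token, c) * scale for c in concept_embeddings])
        attended = [
            sum(w * c[i] for w, c in zip(weights, concept_embeddings))
            for i in range(dim)
        ]
        enhanced.append([t + a for t, a in zip(token, attended)])
        attention.append(weights)
    return enhanced, attention

# Toy inputs: two visual tokens and two concept embeddings (made-up values).
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
concepts = [[2.0, 0.0], [0.0, 2.0]]
enhanced, attention = concept_aware_enhance(visual_tokens, concepts)
```

In this toy setup each visual token puts most of its attention on the concept it is most similar to, which is the intuition behind perceiving the image "adaptively" with respect to the target concepts.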

Implementation Details

The implementation details of the Cognition-Inspired Framework are as follows:

The model uses a frozen ConvNeXt-Large visual backbone from OpenCLIP and is trained for 50 epochs on the COCO Panoptic dataset.
The Concept-Aware Visual Enhancer consists of $N=6$ stacked layers, while the Cognition-Inspired Mask Decoder consists of $M=9$ stacked layers.
Training involves a combination of binary cross-entropy loss ($\mathcal{L}_{pixel}$), Dice loss ($\mathcal{L}_{dice}$), and cross-entropy loss ($\mathcal{L}_{cls}$), with hyperparameters $\lambda_1 = 2.0$, $\lambda_2 = 5.0$, and $\lambda_3 = 5.0$ to balance the different loss terms.
The model is optimized using AdamW with a learning rate of $1 \times 10^{-4}$ and a weight decay of $0.05$.
During inference, the framework supports two modes: Vocabulary-Free Mode, which relies solely on concepts generated by the G-VLM, and Open Vocabulary Mode, which reweights category predictions using G-VLM-generated concepts.
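
The training objective described above can be pictured as a weighted sum of the three loss terms. The sketch below is a minimal plain-Python version on toy values. It assumes that the weights pair with the losses in the order they are listed ($\lambda_1$ with $\mathcal{L}_{pixel}$, $\lambda_2$ with $\mathcal{L}_{dice}$, $\lambda_3$ with $\mathcal{L}_{cls}$), which the summary does not state explicitly; the Dice smoothing constant and all function names are likewise assumptions.

```python
import math

# Weights from the text; pairing them with the losses in listing order is an assumption.
LAMBDA_PIXEL, LAMBDA_DICE, LAMBDA_CLS = 2.0, 5.0, 5.0

def bce_loss(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over per-pixel mask probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= t * math.log(p) + (1 - t) * math.log(1.0 - p)
    return total / len(probs)

def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss; the smoothing constant is an assumed value."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    return 1.0 - (2.0 * intersection + smooth) / (sum(probs) + sum(targets) + smooth)

def cls_loss(class_probs, target_index, eps=1e-7):
    """Cross-entropy for one predicted class distribution."""
    return -math.log(max(class_probs[target_index], eps))

def total_loss(mask_probs, mask_targets, class_probs, target_index):
    """Weighted sum of the pixel, Dice, and classification terms."""
    return (LAMBDA_PIXEL * bce_loss(mask_probs, mask_targets)
            + LAMBDA_DICE * dice_loss(mask_probs, mask_targets)
            + LAMBDA_CLS * cls_loss(class_probs, target_index))

# Toy example: a reasonable prediction versus a poor one for the same target.
good = total_loss([0.9, 0.1, 0.8], [1, 0, 1], [0.7, 0.2, 0.1], 0)
bad = total_loss([0.4, 0.6, 0.3], [1, 0, 1], [0.2, 0.7, 0.1], 0)
```

The better prediction yields the smaller combined loss, and a perfect mask with a perfect class prediction drives all three terms toward zero.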

Experimental Results

The paper presents extensive experimental results that validate the effectiveness of the proposed framework.

On the A-150 dataset, the framework achieves $27.2$ PQ, $17.0$ mAP, and $35.3$ mIoU using only COCO Panoptic training data.
On Cityscapes, the framework attains $44.1$ PQ, $26.5$ mAP, and $56.2$ mIoU.
On Mapillary Vistas, the framework achieves $18.2$ PQ and $28.2$ mIoU. It also attains $15.4$, $59.2$, $18.7$, and $95.8$ mIoU on A-847, PC-59, PC-459, and PAS-20, respectively, and is additionally evaluated on PAS-21.
In vocabulary-free segmentation mode, the framework achieves $11.6$, $10.5$, $32.7$, $48.6$, and $95.3$ mIoU on A-847, PC-459, A-150, PC-59, and PAS-20, respectively.
Ablation studies confirm the complementary benefits of each module in improving both semantic and instance-level segmentation.
Figure 2: Precision and recall comparison of different vision-language models across various datasets.
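
For readers less familiar with the mIoU figures quoted above, the following sketch computes mean intersection-over-union from flattened label lists. It is a simplified, single-image version written for illustration: benchmark evaluations typically accumulate intersections and unions over the whole dataset before averaging, and the `ignore_label` value of 255 is a common convention assumed here rather than something taken from the paper.

```python
def mean_iou(pred, target, num_classes, ignore_label=255):
    """Mean IoU over the classes that appear in the prediction or the target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_label:
            continue  # skip pixels without a valid ground-truth label
        if p == t:
            intersection[t] += 1
            union[t] += 1
        else:
            union[t] += 1
            if 0 <= p < num_classes:
                union[p] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

# Toy example with two classes and four pixels.
score = mean_iou(pred=[0, 0, 1, 1], target=[0, 1, 1, 1], num_classes=2)
```

Here class 0 has IoU 1/2 and class 1 has IoU 2/3, so `score` is their average, 7/12.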

Vocabulary-Free Segmentation

One notable aspect of the Cognition-Inspired Framework is its support for vocabulary-free segmentation, which eliminates the need for manually defined categories. The framework leverages the G-VLM to generate relevant concepts, enabling segmentation without predefined vocabularies. The results demonstrate that the framework can generalize effectively across diverse scenarios, highlighting its versatility and adaptability.
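
The difference between this mode and the Open Vocabulary Mode described under Implementation Details can be sketched as follows. In the vocabulary-free case the candidate label set is simply whatever the G-VLM generates; in the open vocabulary case a user-provided label set is kept and its scores are reweighted toward the generated concepts. The summary does not give the reweighting rule, so the multiplicative `boost` factor, the renormalization, and both function names are hypothetical choices made only to make the idea concrete.

```python
def vocabulary_free_candidates(generated_concepts):
    """Vocabulary-free mode: the label set is the de-duplicated list of
    generated concepts (lower-cased, original order preserved)."""
    seen, candidates = set(), []
    for concept in generated_concepts:
        key = concept.strip().lower()
        if key and key not in seen:
            seen.add(key)
            candidates.append(key)
    return candidates

def reweight_open_vocabulary(category_probs, generated_concepts, boost=2.0):
    """Open vocabulary mode (assumed rule): multiply the score of every
    category that matches a generated concept by `boost`, then renormalize."""
    concepts = set(vocabulary_free_candidates(generated_concepts))
    weighted = {
        name: prob * (boost if name.lower() in concepts else 1.0)
        for name, prob in category_probs.items()
    }
    total = sum(weighted.values())
    return {name: value / total for name, value in weighted.items()}

# Toy example: the generated concepts break a tie between "cat" and "dog".
labels = vocabulary_free_candidates(["Dog", "dog", "tree"])
reweighted = reweight_open_vocabulary(
    {"cat": 0.4, "dog": 0.4, "car": 0.2}, ["Dog", "tree"]
)
```

In the toy example `labels` becomes `["dog", "tree"]`, and after reweighting "dog" is the top-scoring category because it is the only predefined label that matches a generated concept.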

Analysis of Visual Features

The paper includes a qualitative analysis of the visual features learned by the framework. By visualizing K-means clustering results, the authors demonstrate that the Concept-Aware Visual Enhancer effectively captures contextual cues and maintains semantic consistency, even in cluttered or low-contrast scenarios. The enhanced visual features exhibit stronger spatial structure aggregation and more semantically coherent representations compared to the original global visual features.

Figure 3: Visualization of K-means clustering of visual features with and without the Concept-Aware Visual Enhancer.
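
The kind of visualization shown in Figure 3 can be approximated by clustering per-pixel feature vectors and viewing the cluster indices as a label map. The sketch below is a minimal plain-Python K-means on a toy 2x2 feature grid. The deterministic initialization from the first `k` points, the iteration count, and the feature values are assumptions for illustration; the summary does not describe the authors' clustering settings.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Plain K-means (Lloyd's algorithm) with the first k points as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy 2x2 feature map flattened row by row; each pixel has a 2-D feature (made-up values).
height, width = 2, 2
features = [[0.0, 0.1], [5.0, 5.1], [0.2, 0.0], [5.1, 4.9]]
labels, centroids = kmeans(features, k=2)

# Reshape the flat labels back into an image-shaped cluster map.
cluster_map = [labels[row * width:(row + 1) * width] for row in range(height)]
```

Pixels with similar features end up in the same cluster, so a more semantically coherent feature map produces cleaner, more contiguous regions in `cluster_map`.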

Conclusion

The Cognition-Inspired Framework advances open vocabulary image segmentation by mimicking the human sequence of conceiving an object before perceiving its extent. This addresses the poor alignment between region segmentation and target concepts in conventional pipelines, and the paper reports significant improvements across the evaluated benchmarks. The framework's ability to perform vocabulary-free segmentation further enhances its adaptability in real-world scenarios, making it a promising approach for future research and applications in computer vision.
