CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

Published 21 Apr 2026 in cs.CV and cs.AI | (2604.19648v1)

Abstract: SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.


Authors (4)

Summary

The paper introduces CoCo-SAM3, a framework that decouples multi-class inference using synonym aggregation and semantic evidence calibration to address inter- and intra-class conflicts.
It reports average mIoU gains of 6.8 points over vanilla SAM3 and 10.7 points over CorrCLIP across eight diverse benchmarks.
The approach enables stable segmentation with minimal computational overhead by leveraging frozen backbones and prompt-invariant linguistic cues.

CoCo-SAM3: Harnessing Concept Conflict for Stable Open-Vocabulary Semantic Segmentation

Motivation and Limitations of SAM3 for Open-Vocabulary Multi-Class Inference

Open-vocabulary semantic segmentation (OVSS) presents a unique challenge: assigning pixel-level semantic labels from a potentially unbounded and dynamically specified set of linguistic concepts, without laborious re-annotation or retraining. The prompt-driven mask generation paradigm of SAM3 directly generates masks from textual prompts, offering expressive open-set capabilities. However, two key deficiencies emerge when this system is applied to multi-class OVSS:

Absence of a Unified Inter-Class Evidence Scale: Independently generated masks for each prompt are not calibrated on a uniform, comparable scale, leading to regions of inter-class overlap and unstable competition for label assignment.
Intra-Class Inconsistency from Naming Diversity: Synonymous descriptions of a single concept activate distinct semantic/spatial patterns, resulting in intra-class evidence drift that further destabilizes inter-class boundaries.

These issues fundamentally arise from the prompt-conditioned inference mechanism of SAM3, which was trained with one-to-one associations between queries and masks. The consequence is pronounced multi-class ambiguity, with masks that both overlap and are sensitive to prompt wording.
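
The overlap problem can be illustrated with a toy sketch. The per-class probabilities below are made-up values for four pixels, and the 0.5 threshold and the helper names `independent_masks` and `claim_counts` are assumptions for illustration only, not part of SAM3's interface:

```python
# Toy per-pixel mask probabilities, one list per class prompt (made-up values).
probs = {
    "cat": [0.92, 0.81, 0.30, 0.10],
    "dog": [0.15, 0.77, 0.35, 0.05],
}

def independent_masks(probs, threshold=0.5):
    """Binarize each class mask on its own, as if prompts were queried separately."""
    return {c: [p > threshold for p in ps] for c, ps in probs.items()}

def claim_counts(masks):
    """For each pixel, count how many classes claim it."""
    return [sum(col) for col in zip(*masks.values())]

counts = claim_counts(independent_masks(probs))
overlapping = [i for i, n in enumerate(counts) if n > 1]
unclaimed = [i for i, n in enumerate(counts) if n == 0]
print(overlapping, unclaimed)  # [1] [2, 3]
```

Pixel 1 is claimed by both classes, and nothing in the two independently produced scores says which claim should win; this is the gap a shared evidence scale is meant to close.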

Figure 1: Inter-class mask conflicts in vanilla SAM3 (left), and quantitative analysis of controlled competition (right), demonstrating mIoU reductions as competition becomes more ambiguous.

CoCo-SAM3: Methodological Framework

CoCo-SAM3 explicitly decouples multi-class inference into intra-class enhancement and inter-class competition, introducing two principal mechanisms:

Synonym Aggregation: For each concept, an LLM-augmented synonym set expands linguistic representations. Semantic evidence for a concept is computed by matching each synonym with intermediate perception encoder (PE) features and aggregating responses using log-sum-exp, yielding robust, wording-insensitive intra-class scores.
Semantic Evidence Calibration (SEC): Dense PE features from a mid-level layer are L2-normalized and compared with each class's text embeddings via dot product, producing pixelwise similarity maps. A cross-class softmax then normalizes this evidence into a pixelwise distribution $\pi_c(x)$ over candidate classes, which captures relative affinity on a shared, unified scale. This prior is combined with SAM3's structural evidence (mask probabilities converted to logits) for joint decision-making.
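
The two mechanisms above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `temperature` parameter, the normalization of the text embeddings, the function name `semantic_prior`, and the toy 2-D embeddings are all hypothetical. Each synonym is scored against a pixel feature by dot product, scores are aggregated per class with log-sum-exp, and a softmax across classes yields $\pi_c(x)$.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v]

def logsumexp(values):
    """Numerically stable log(sum(exp(v)))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def semantic_prior(pixel_feats, synonym_embeds, temperature=0.1):
    """Return, for each pixel, a dict {class: pi_c(x)} that sums to 1.

    pixel_feats: list of dense feature vectors, one per pixel.
    synonym_embeds: {class: [text embedding for each synonym]}.
    temperature: hypothetical scaling factor (an assumption, not from the source).
    """
    # Assumption: text embeddings are unit-normalized as well.
    texts = {c: [l2_normalize(t) for t in ts] for c, ts in synonym_embeds.items()}
    priors = []
    for feat in pixel_feats:
        f = l2_normalize(feat)
        # Intra-class step: one score per class, aggregated over its synonyms.
        evidence = {
            c: logsumexp([sum(a * b for a, b in zip(f, t)) / temperature for t in ts])
            for c, ts in texts.items()
        }
        # Inter-class step: softmax across classes puts evidence on a shared scale.
        log_z = logsumexp(list(evidence.values()))
        priors.append({c: math.exp(e - log_z) for c, e in evidence.items()})
    return priors

# Toy 2-D embeddings (made-up values).
synonyms = {
    "cat": [[1.0, 0.1], [0.9, 0.2]],  # e.g. "cat", "kitten"
    "dog": [[0.1, 1.0], [0.2, 0.9]],  # e.g. "dog", "puppy"
}
pixels = [[2.0, 0.3], [0.2, 1.5]]
prior = semantic_prior(pixels, synonyms)
```

Log-sum-exp acts as a soft maximum: for K synonyms it lies between the largest score and that score plus log K, so a single well-matching synonym is enough to produce strong class evidence without the other phrasings dragging it down.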

The final scoring function for each class at each pixel is:

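$S_c(x) = \log\left(\frac{P^{\mathrm{sam}}_c(x)}{1 - P^{\mathrm{sam}}_c(x)}\right) + \lambda_{\mathrm{prior}}\log\pi_c(x) + z_c,$

where $P^{\mathrm{sam}}_c(x)$ is the SAM3 mask probability for class $c$ at pixel $x$, $\pi_c(x)$ is the calibrated semantic prior, $\lambda_{\mathrm{prior}}$ weights the prior term, and $z_c$ is the image-level presence logit from SAM3.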
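
As a worked illustration of the scoring function above, the sketch below evaluates it for a single pixel and picks the highest-scoring class. The argmax decision rule, the `eps` clamp, the default `lambda_prior=1.0`, and all numeric inputs are assumptions chosen for the example; the source does not specify them.

```python
import math

def logit(p, eps=1e-6):
    """Map a probability to a logit, clamping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fused_scores(p_sam, prior, presence_logit, lambda_prior=1.0, eps=1e-6):
    """S_c(x) = logit(P_sam_c(x)) + lambda_prior * log(pi_c(x)) + z_c for one pixel."""
    return {
        c: logit(p_sam[c], eps)
        + lambda_prior * math.log(max(prior[c], eps))
        + presence_logit[c]
        for c in p_sam
    }

# Made-up values: both masks are confident, but the semantic prior favors "dog".
p_sam = {"cat": 0.90, "dog": 0.88}
prior = {"cat": 0.20, "dog": 0.80}
presence = {"cat": 0.0, "dog": 0.0}

scores = fused_scores(p_sam, prior, presence)
label = max(scores, key=scores.get)  # assumed argmax assignment
print(label)  # dog
```

With `lambda_prior=0.0` the same pixel would go to "cat" on the strength of its slightly higher mask probability, which shows how the cross-class prior breaks near-ties between overlapping masks.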

Figure 2: System overview. Intra-class synonym aggregation provides robust semantic priors, and unified cross-class calibration enables stable competition for pixelwise assignment.

This framework ensures both strong intra-class coherence and calibrated competition, substantially reducing mask overlap and edge ambiguity.

Empirical Results and Quantitative Analysis

CoCo-SAM3's performance was systematically evaluated on eight OVSS benchmarks, spanning Pascal VOC, Pascal Context, COCO-Stuff, ADE20K, and Cityscapes. Key findings include:

Superior mIoU Across All Benchmarks: CoCo-SAM3 outperforms prior state-of-the-art methods, both training-free and training-based, including vanilla SAM3, CorrCLIP, and ReME, with an average mIoU improvement of 6.8 points over vanilla SAM3 and 10.7 points over CorrCLIP.
Scalable to Strong Inter-Class Competition: Gains are particularly pronounced under no-background protocols, where inter-class competition is maximal.
Minimal Inference Overhead: The computational cost remains only marginally above that of SAM3, with no need for additional models or training.

Figure 3: Qualitative comparison with CorrCLIP and vanilla SAM3, showing CoCo-SAM3’s robust semantic delineation and reduced inter-class confusion.
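
For reference, mIoU, the metric behind these comparisons, averages the per-class ratio of correctly labeled pixels to the union of predicted and ground-truth pixels. The sketch below shows the standard definition on flat label lists; it is not the paper's evaluation code. The `ignore_index=255` convention and the choice to average only over classes that occur are assumptions, and benchmark implementations typically accumulate counts over the whole dataset before averaging.

```python
def mean_iou(pred, target, num_classes, ignore_index=255):
    """Mean intersection-over-union over classes that appear in pred or target."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixel: excluded from the metric
        if p == t:
            inter[t] += 1
            union[t] += 1
        else:
            union[p] += 1  # false positive for the predicted class
            union[t] += 1  # false negative for the true class
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)

# Toy labels for six pixels (made-up values); the last pixel is ignored.
pred = [0, 0, 1, 1, 2, 2]
target = [0, 1, 1, 1, 2, 255]
print(round(mean_iou(pred, target, num_classes=3), 3))  # 0.722
```

Here the per-class IoUs are 1/2, 2/3, and 1, so the mean is about 0.722.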

Ablation studies validated the contribution of each component:

SEC alone adds up to 7.1 mIoU over the SAM3 baseline.
Synonym Aggregation further improves robustness, particularly for classes with high linguistic diversity.
Choice of PE Layer: Mid-level features strike an optimal balance for semantic-structural alignment; layer 18 outperforms both lower and higher layers.

Figure 4: mIoU varies with the PE layer selected for semantic evidence; a mid-level feature (layer 18) yields optimal performance.

Additional Qualitative Evidence

Extended evaluations across a variety of datasets further confirm the generalization and robustness of CoCo-SAM3’s semantic prior and fusion strategy.

Figure 5: Additional qualitative results on PC59 demonstrate improved coherence and object delineation.

Figure 6: On COCO-S, CoCo-SAM3 yields reduced fragmentation and correct category assignments for closely related classes.

Figure 7: Results on VOC21 indicate suppression of inter-class mask overlap and improved scene semantics.

Figure 8: Qualitative results on Cityscapes showcase stable segmentation across complex urban scenes.

Figure 9: For ADE20K, the method produces accurate, consistent region labeling even under diverse class sets.

Discussion and Implications

The introduction of a unified-scale semantic prior, coupled with prompt-invariant intra-class aggregation, addresses a core impediment in prompt-driven OVSS from foundation models: cross-class comparability and intra-class stability. By disentangling these inference axes, CoCo-SAM3 enables robust multi-class reasoning without retraining or extensive pre/postprocessing. The methodology:

Generalizes to Diverse Class Sets: The system leverages foundation model flexibility for open-world deployment, while mitigating the combinatorial ambiguity inherent to open-vocabulary tasks.
Enables Efficient, Training-Free Deployment: With frozen backbones and manageable inference overhead, it is well-suited for scenarios where rapid adaptation to new categories is required.

Potential future extensions include integration with more general LLM-driven prompt expansions, further fusion with visual priors beyond the perception encoder, or adaptation to video and temporal segmentation challenges.

Conclusion

CoCo-SAM3 advances the promptable segmentation paradigm by resolving concept conflict in open-vocabulary multi-class settings via two orthogonal axes: intra-class evidence aggregation and inter-class competition calibration. Achieving leading segmentation accuracy across benchmarks without retraining or multi-model pipelines, it demonstrates that explicit modeling of concept competition is essential for scaling foundation segmentation models to truly open-world applications (2604.19648).
