
CoCo-SAM3: Harnessing Concept Conflict in Open-Vocabulary Semantic Segmentation

Published 21 Apr 2026 in cs.CV and cs.AI | (2604.19648v1)

Abstract: SAM3 advances open-vocabulary semantic segmentation by introducing a prompt-driven mask generation paradigm. However, in multi-class open-vocabulary scenarios, masks generated independently from different category prompts lack a unified and inter-class comparable evidence scale, often resulting in overlapping coverage and unstable competition. Moreover, synonymous expressions of the same concept tend to activate inconsistent semantic and spatial evidence, leading to intra-class drift that exacerbates inter-class conflicts and compromises overall inference stability. To address these issues, we propose CoCo-SAM3 (Concept-Conflict SAM3), which explicitly decouples inference into intra-class enhancement and inter-class competition. Our method first aligns and aggregates evidence from synonymous prompts to strengthen concept consistency. It then performs inter-class competition on a unified comparable scale, enabling direct pixel-wise comparisons among all candidate classes. This mechanism stabilizes multi-class inference and effectively mitigates inter-class conflicts. Without requiring any additional training, CoCo-SAM3 achieves consistent improvements across eight open-vocabulary semantic segmentation benchmarks.

Summary

  • The paper introduces CoCo-SAM3, a framework that decouples multi-class inference using synonym aggregation and semantic evidence calibration to address inter- and intra-class conflicts.
  • It reports average mIoU gains of 6.8 points over vanilla SAM3 and 10.7 points over CorrCLIP across eight diverse benchmarks.
  • The approach enables stable segmentation with minimal computational overhead by leveraging frozen backbones and prompt-invariant linguistic cues.

CoCo-SAM3: Harnessing Concept Conflict for Stable Open-Vocabulary Semantic Segmentation

Motivation and Limitations of SAM3 for Open-Vocabulary Multi-Class Inference

Open-vocabulary semantic segmentation (OVSS) presents a unique challenge: assigning pixel-level semantic labels from a potentially unbounded and dynamically specified set of linguistic concepts, without laborious re-annotation or retraining. The prompt-driven mask generation paradigm of SAM3 directly generates masks from textual prompts, offering expressive open-set capabilities. However, two key deficiencies emerge when this system is applied to multi-class OVSS:

  1. Absence of a Unified Inter-Class Evidence Scale: Independently generated masks for each prompt are not calibrated on a uniform, comparable scale, leading to regions of inter-class overlap and unstable competition for label assignment.
  2. Intra-Class Inconsistency from Naming Diversity: Synonymous descriptions of a single concept activate distinct semantic/spatial patterns, resulting in intra-class evidence drift that further destabilizes inter-class boundaries.

These issues fundamentally arise from the prompt-conditioned inference mechanism of SAM3, which was trained with one-to-one associations between queries and masks. The consequence is pronounced multi-class ambiguity, with masks that both overlap and are sensitive to prompt wording.

Figure 1: Inter-class mask conflicts in vanilla SAM3 (left), and quantitative analysis of controlled competition (right), demonstrating mIoU reductions as competition becomes more ambiguous.
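To make the first deficiency concrete, the toy sketch below thresholds two per-class probability maps independently, as a one-prompt-at-a-time pipeline would. The class names, probabilities, threshold, and helper names (`threshold_masks`, `overlap_pixels`) are illustrative assumptions, not values or functions from the paper or from SAM3.

```python
# Toy illustration of independently generated masks (all numbers are made up).
# Each class has its own probability map over four pixels; nothing forces the
# two maps to compete, so a pixel can be claimed by both classes or by neither.

def threshold_masks(probs_by_class, thresh=0.5):
    """Binarize each class's probability map on its own, with no cross-class step."""
    return {c: [p > thresh for p in probs] for c, probs in probs_by_class.items()}

def overlap_pixels(masks):
    """Return indices of pixels that more than one class claims."""
    n = len(next(iter(masks.values())))
    return [i for i in range(n) if sum(m[i] for m in masks.values()) > 1]

def unassigned_pixels(masks):
    """Return indices of pixels that no class claims."""
    n = len(next(iter(masks.values())))
    return [i for i in range(n) if not any(m[i] for m in masks.values())]

probs = {
    "sofa":  [0.9, 0.80, 0.70, 0.2],
    "chair": [0.1, 0.75, 0.95, 0.3],
}
masks = threshold_masks(probs)
overlaps = overlap_pixels(masks)      # pixels claimed by both classes
gaps = unassigned_pixels(masks)       # pixels claimed by neither class
print(overlaps, gaps)
```

Because the two probability maps are not on a shared scale, comparing 0.80 against 0.75 at an overlapping pixel is not obviously meaningful, which is the gap the calibrated competition step is meant to close.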

CoCo-SAM3: Methodological Framework

CoCo-SAM3 explicitly decouples multi-class inference into intra-class enhancement and inter-class competition, introducing two principal mechanisms:

  • Synonym Aggregation: For each concept, an LLM-augmented synonym set expands linguistic representations. Semantic evidence for a concept is computed by matching each synonym with intermediate perception encoder (PE) features and aggregating responses using log-sum-exp, yielding robust, wording-insensitive intra-class scores.
  • Semantic Evidence Calibration (SEC): Dense PE features from a mid-level layer are L2-normalized and compared with each class's text embedding via a dot product, producing pixelwise similarity maps. A cross-class softmax then normalizes this evidence into a pixelwise distribution $\pi_c(x)$ over candidate classes, which captures relative affinity on a shared, unified scale. This prior is combined with SAM3 structural evidence (converted to logits) for joint decision-making.
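The two mechanisms above can be sketched for a single pixel in plain Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the two-dimensional embeddings are invented, the text embeddings are assumed to be L2-normalized as well, the `scale` temperature is a hypothetical parameter, and all function names are illustrative.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def logsumexp(values):
    """Numerically stable log(sum(exp(v))) over a list of floats."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

def class_evidence(pixel_feat, synonym_embs, scale=10.0):
    """Synonym aggregation: one score per class from all of its synonyms.

    `scale` is an assumed temperature; the summary does not specify one.
    """
    f = l2_normalize(pixel_feat)
    sims = [scale * dot(f, l2_normalize(t)) for t in synonym_embs]
    return logsumexp(sims)

def cross_class_softmax(evidence):
    """SEC-style normalization: turn per-class evidence into a distribution."""
    m = max(evidence.values())
    exps = {c: math.exp(e - m) for c, e in evidence.items()}
    total = sum(exps.values())
    return {c: v / total for c, v in exps.items()}

# Hypothetical 2-D text embeddings, two synonyms per class.
synonyms = {
    "sofa":  [[1.0, 0.0], [0.9, 0.1]],
    "chair": [[0.0, 1.0], [0.1, 0.9]],
}
pixel = [0.8, 0.2]  # hypothetical PE feature for one pixel

evidence = {c: class_evidence(pixel, embs) for c, embs in synonyms.items()}
pi = cross_class_softmax(evidence)
print(pi)
```

One design point worth noting: log-sum-exp behaves like a soft maximum, so a single well-matched synonym can carry the class, but its value also grows with the number of synonyms. Whether the paper corrects for unequal synonym-set sizes is not stated in this summary.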

The final scoring function for each class at each pixel is:

$$S_c(x) = \log\left(\frac{P^{\mathrm{sam}}_c(x)}{1 - P^{\mathrm{sam}}_c(x)}\right) + \lambda_{\mathrm{prior}} \log \pi_c(x) + z_c,$$

where $P^{\mathrm{sam}}_c(x)$ is the SAM3 mask probability for class $c$ at pixel $x$, $\lambda_{\mathrm{prior}}$ weights the calibrated semantic prior $\pi_c(x)$, and $z_c$ is the image-level presence logit from SAM3.

Figure 2: System overview. Intra-class synonym aggregation provides robust semantic priors, and unified cross-class calibration enables stable competition for pixelwise assignment.

This framework ensures both strong intra-class coherence and calibrated competition, substantially reducing mask overlap and edge ambiguity.
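The scoring function can be sketched for one pixel as follows. This is a hedged illustration only: the probabilities, presence logits, and the value of `lambda_prior` are invented, the clamping constant `eps` is an added numerical safeguard, and taking the argmax of the fused scores is an assumption about how the final label is chosen.

```python
import math

def logit(p, eps=1e-6):
    """Convert a probability to a logit, clamping to avoid log(0)."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))

def fused_score(p_sam, pi, z, lambda_prior=1.0, eps=1e-6):
    """Mask logit + weighted log semantic prior + image-level presence logit."""
    return logit(p_sam, eps) + lambda_prior * math.log(max(pi, eps)) + z

def assign_label(p_sam_by_class, pi_by_class, z_by_class, lambda_prior=1.0):
    """Score every candidate class at one pixel and return the best one."""
    scores = {
        c: fused_score(p_sam_by_class[c], pi_by_class[c], z_by_class[c], lambda_prior)
        for c in p_sam_by_class
    }
    return max(scores, key=scores.get), scores

# Hypothetical values for a pixel that both masks claim.
p_sam = {"sofa": 0.80, "chair": 0.75}   # per-class mask probabilities
pi    = {"sofa": 0.70, "chair": 0.30}   # cross-class semantic prior (sums to 1)
z     = {"sofa": 1.0,  "chair": 0.5}    # image-level presence logits

label, scores = assign_label(p_sam, pi, z)
print(label, scores)
```

In this toy case the two mask probabilities are close, so the semantic prior and the presence logit decide the pixel. Raising `lambda_prior` gives the calibrated prior more influence relative to the mask evidence.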

Empirical Results and Quantitative Analysis

CoCo-SAM3 was evaluated on eight OVSS benchmarks drawn from Pascal VOC, Pascal Context, COCO-Stuff, ADE20K, and Cityscapes. Key findings include:

  • Superior mIoU Across All Benchmarks: CoCo-SAM3 outperforms prior training-free and training-based methods, including vanilla SAM3, CorrCLIP, and ReME, with average mIoU gains of 6.8 points over vanilla SAM3 and 10.7 points over CorrCLIP.
  • Scalable to Strong Inter-Class Competition: Gains are particularly pronounced under no-background protocols, where inter-class competition is strongest.
  • Minimal Inference Overhead: The computational cost remains only marginally above that of SAM3, with no need for additional models or training.

    Figure 3: Qualitative comparison with CorrCLIP and vanilla SAM3, showing CoCo-SAM3’s robust semantic delineation and reduced inter-class confusion.

Ablation studies validated the contribution of each component:

  • SEC alone adds up to 7.1 mIoU points over the SAM3 baseline.
  • Synonym Aggregation further improves robustness, particularly for classes with high linguistic diversity.
  • Choice of PE Layer: Mid-level features give the best balance between semantic and structural alignment; layer 18 outperforms both lower and higher layers.

    Figure 4: mIoU varies with the PE layer selected for semantic evidence; a mid-level feature (layer 18) yields optimal performance.

Additional Qualitative Evidence

Additional qualitative results across several datasets are consistent with the generalization and robustness of CoCo-SAM3’s semantic prior and fusion strategy.

Figure 5: Additional qualitative results on PC59 demonstrate improved coherence and object delineation.

Figure 6: On COCO-S, CoCo-SAM3 yields reduced fragmentation and correct category assignments for closely related classes.

Figure 7: Results on VOC21 indicate suppression of inter-class mask overlap and improved scene semantics.

Figure 8: Qualitative results on Cityscapes showcase stable segmentation across complex urban scenes.

Figure 9: For ADE20K, the method produces accurate, consistent region labeling even under diverse class sets.

Discussion and Implications

The introduction of a unified-scale semantic prior, coupled with prompt-invariant intra-class aggregation, addresses a core impediment in prompt-driven OVSS from foundation models: cross-class comparability and intra-class stability. By disentangling these inference axes, CoCo-SAM3 enables robust multi-class reasoning without retraining or extensive pre/postprocessing. The methodology:

  • Generalizes to Diverse Class Sets: The system leverages foundation model flexibility for open-world deployment, while mitigating the combinatorial ambiguity inherent to open-vocabulary tasks.
  • Enables Efficient, Training-Free Deployment: With frozen backbones and manageable inference overhead, it is well-suited for scenarios where rapid adaptation to new categories is required.

Potential future extensions include integration with more general LLM-driven prompt expansions, further fusion with visual priors beyond the perception encoder, or adaptation to video and temporal segmentation challenges.

Conclusion

CoCo-SAM3 advances the promptable segmentation paradigm by resolving concept conflict in open-vocabulary multi-class settings via two orthogonal axes: intra-class evidence aggregation and inter-class competition calibration. Achieving leading segmentation accuracy across benchmarks without retraining or multi-model pipelines, it demonstrates that explicit modeling of concept competition is essential for scaling foundation segmentation models to truly open-world applications (2604.19648).
