- The paper introduces CoCo-SAM3, a framework that decouples multi-class inference using synonym aggregation and semantic evidence calibration to address inter- and intra-class conflicts.
- It reports mIoU improvements of up to 10.7 points over existing methods across eight diverse benchmarks.
- The approach enables stable segmentation with minimal computational overhead by leveraging frozen backbones and prompt-invariant linguistic cues.
CoCo-SAM3: Harnessing Concept Conflict for Stable Open-Vocabulary Semantic Segmentation
Motivation and Limitations of SAM3 for Open-Vocabulary Multi-Class Inference
Open-vocabulary semantic segmentation (OVSS) presents a unique challenge: assigning pixel-level semantic labels from a potentially unbounded and dynamically specified set of linguistic concepts, without laborious re-annotation or retraining. The prompt-driven mask generation paradigm of SAM3 directly generates masks from textual prompts, offering expressive open-set capabilities. However, two key deficiencies emerge when this system is applied to multi-class OVSS:
- Absence of a Unified Inter-Class Evidence Scale: Independently generated masks for each prompt are not calibrated on a uniform, comparable scale, leading to regions of inter-class overlap and unstable competition for label assignment.
- Intra-Class Inconsistency from Naming Diversity: Synonymous descriptions of a single concept activate distinct semantic/spatial patterns, resulting in intra-class evidence drift that further destabilizes inter-class boundaries.
These issues fundamentally arise from the prompt-conditioned inference mechanism of SAM3, which was trained with one-to-one associations between queries and masks. The consequence is pronounced multi-class ambiguity, with masks that both overlap and are sensitive to prompt wording.
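The overlap problem described above can be illustrated with a toy sketch. It assumes each prompt yields an independent per-pixel probability map that is thresholded on its own; `overlap_mask`, the threshold of 0.5, and the toy values are hypothetical stand-ins, not SAM3 outputs or API calls.

```python
import numpy as np

def overlap_mask(prob_maps: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Return a boolean (H, W) map of pixels claimed by more than one class.

    prob_maps: (C, H, W) per-class mask probabilities, each produced
    independently (a hypothetical stand-in for per-prompt mask outputs).
    """
    claimed = prob_maps > threshold      # (C, H, W) independent binary masks
    return claimed.sum(axis=0) > 1       # pixels with two or more claims

# Toy 1x4 image with two prompts, e.g. "sofa" and "chair".
probs = np.array([
    [[0.9, 0.8, 0.7, 0.1]],   # prompt 0
    [[0.1, 0.6, 0.9, 0.2]],   # prompt 1
])
conflict = overlap_mask(probs)
print(conflict)
```

Because nothing ties the two maps to a common scale, the middle pixels are claimed by both prompts, and picking a winner by comparing raw probabilities is not guaranteed to be meaningful.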
Figure 1: Inter-class mask conflicts in vanilla SAM3 (left), and quantitative analysis of controlled competition (right), demonstrating mIoU reductions as competition becomes more ambiguous.
CoCo-SAM3: Methodological Framework
CoCo-SAM3 explicitly decouples multi-class inference into intra-class enhancement and inter-class competition, introducing two principal mechanisms:
- Synonym Aggregation: For each concept, an LLM-augmented synonym set expands linguistic representations. Semantic evidence for a concept is computed by matching each synonym with intermediate perception encoder (PE) features and aggregating responses using log-sum-exp, yielding robust, wording-insensitive intra-class scores.
- Semantic Evidence Calibration (SEC): Dense PE features from a mid-level layer are L2-normalized, and their dot products with the text embeddings of each class produce pixelwise similarity maps. A cross-class softmax then normalizes this evidence into a pixelwise distribution $\pi_c(x)$ over candidate classes, which captures relative affinity on a shared, unified scale. This distribution is combined with the SAM3 structural evidence, converted to logits, for joint decision-making.
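The two mechanisms above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `temperature` parameter, the L2-normalization of the text embeddings, and the random toy inputs are all hypothetical; only the log-sum-exp aggregation over synonyms and the cross-class softmax come from the description.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to unit L2 norm."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def logsumexp(x, axis):
    """Numerically stable log-sum-exp along `axis`."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def class_evidence(pixel_feats, synonym_embeds, temperature=0.07):
    """Aggregate synonym similarities into one evidence map per class.

    pixel_feats:    (H, W, D) dense features (stand-in for PE features).
    synonym_embeds: list of (K_c, D) arrays, one per class (K_c synonyms).
    temperature:    assumed scaling factor, not specified in the source.
    Returns (C, H, W) evidence.
    """
    f = l2_normalize(pixel_feats)
    maps = []
    for emb in synonym_embeds:
        t = l2_normalize(emb)  # assumption: text embeddings are normalized too
        sims = np.einsum("hwd,kd->khw", f, t) / temperature  # (K_c, H, W)
        maps.append(logsumexp(sims, axis=0))                 # aggregate synonyms
    return np.stack(maps)

def cross_class_softmax(evidence):
    """Normalize (C, H, W) evidence into a per-pixel distribution over classes."""
    z = evidence - evidence.max(axis=0, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=0, keepdims=True)

# Toy inputs: a 4x5 feature map with 8 channels, two classes with 3 and 2 synonyms.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 5, 8))
synonyms = [rng.normal(size=(3, 8)), rng.normal(size=(2, 8))]
evidence = class_evidence(feats, synonyms)
pi = cross_class_softmax(evidence)
```

Log-sum-exp behaves like a smooth maximum, so a single well-matched synonym can carry the class score without the weaker synonyms dragging it down, and the softmax forces all classes to compete for the same unit of probability at every pixel.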
The final scoring function for each class at each pixel is:
$$S_c(x) = \log\!\left(\frac{P^{\mathrm{sam}}_c(x)}{1 - P^{\mathrm{sam}}_c(x)}\right) + \lambda_{\mathrm{prior}} \log \pi_c(x) + z_c,$$
where $z_c$ is the image-level presence logit from SAM3.
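A minimal sketch of this scoring rule, assuming the SAM3 mask probabilities, the calibrated distribution, and the presence logits are already available as arrays. The names `fuse_scores` and `assign_labels`, the default `lambda_prior=1.0`, the `eps` clipping, and the final per-pixel argmax are illustrative assumptions rather than details confirmed by the source.

```python
import numpy as np

def fuse_scores(p_sam, pi, presence_logits, lambda_prior=1.0, eps=1e-6):
    """Compute the per-class, per-pixel score: logit + weighted log-prior + presence.

    p_sam:           (C, H, W) mask probabilities in [0, 1].
    pi:              (C, H, W) cross-class distribution (sums to 1 over C).
    presence_logits: (C,) image-level presence logit per class.
    lambda_prior:    weight on the semantic prior (assumed default).
    eps:             clipping constant to avoid log(0) (assumption).
    """
    p = np.clip(p_sam, eps, 1.0 - eps)
    structural = np.log(p) - np.log1p(-p)                  # log(p / (1 - p))
    prior = lambda_prior * np.log(np.clip(pi, eps, None))  # lambda * log pi
    return structural + prior + presence_logits[:, None, None]

def assign_labels(scores):
    """Pick the highest-scoring class at each pixel (assumed decision rule)."""
    return scores.argmax(axis=0)

# Toy 1x2 image with two classes and neutral presence logits.
p_sam = np.array([[[0.9, 0.6]], [[0.8, 0.7]]])
pi = np.array([[[0.7, 0.2]], [[0.3, 0.8]]])
z = np.array([0.0, 0.0])
scores = fuse_scores(p_sam, pi, z)
labels = assign_labels(scores)
print(labels)
```

In the toy case both classes pass a 0.5 threshold at both pixels, yet the fused score still selects a single class per pixel because the prior term places the two classes on a shared scale.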
Figure 2: System overview. Intra-class synonym aggregation provides robust semantic priors, and unified cross-class calibration enables stable competition for pixelwise assignment.
This framework ensures both strong intra-class coherence and calibrated competition, substantially reducing mask overlap and edge ambiguity.
Empirical Results and Quantitative Analysis
CoCo-SAM3 was evaluated on eight OVSS benchmarks spanning Pascal VOC, Pascal Context, COCO-Stuff, ADE20K, and Cityscapes. The headline result is an mIoU improvement of up to 10.7 points over existing methods.
Ablation studies validated the contribution of each component.
Additional Qualitative Evidence
Extended evaluations across a variety of datasets further confirm the generalization and robustness of CoCo-SAM3’s semantic prior and fusion strategy.
Figure 5: Additional qualitative results on PC59 demonstrate improved coherence and object delineation.
Figure 6: On COCO-S, CoCo-SAM3 yields reduced fragmentation and correct category assignments for closely related classes.
Figure 7: Results on VOC21 indicate suppression of inter-class mask overlap and improved scene semantics.
Figure 8: Qualitative results on Cityscapes showcase stable segmentation across complex urban scenes.
Figure 9: For ADE20K, the method produces accurate, consistent region labeling even under diverse class sets.
Discussion and Implications
The introduction of a unified-scale semantic prior, coupled with prompt-invariant intra-class aggregation, addresses a core impediment in prompt-driven OVSS from foundation models: cross-class comparability and intra-class stability. By disentangling these inference axes, CoCo-SAM3 enables robust multi-class reasoning without retraining or extensive pre/postprocessing. The methodology:
- Generalizes to Diverse Class Sets: The system leverages foundation model flexibility for open-world deployment, while mitigating the combinatorial ambiguity inherent to open-vocabulary tasks.
- Enables Efficient, Training-Free Deployment: With frozen backbones and manageable inference overhead, it is well-suited for scenarios where rapid adaptation to new categories is required.
Potential future extensions include integration with more general LLM-driven prompt expansions, further fusion with visual priors beyond the perception encoder, or adaptation to video and temporal segmentation challenges.
Conclusion
CoCo-SAM3 advances the promptable segmentation paradigm by resolving concept conflict in open-vocabulary multi-class settings via two orthogonal axes: intra-class evidence aggregation and inter-class competition calibration. Achieving leading segmentation accuracy across benchmarks without retraining or multi-model pipelines, it demonstrates that explicit modeling of concept competition is essential for scaling foundation segmentation models to truly open-world applications (2604.19648).