DualProtoSeg: Simple and Efficient Design with Text- and Image-Guided Prototype Learning for Weakly Supervised Histopathology Image Segmentation

Published 11 Dec 2025 in cs.CV (2512.10314v1)

Abstract: Weakly supervised semantic segmentation (WSSS) in histopathology seeks to reduce annotation cost by learning from image-level labels, yet it remains limited by inter-class homogeneity, intra-class heterogeneity, and the region-shrinkage effect of CAM-based supervision. We propose a simple and effective prototype-driven framework that leverages vision-language alignment to improve region discovery under weak supervision. Our method integrates CoOp-style learnable prompt tuning to generate text-based prototypes and combines them with learnable image prototypes, forming a dual-modal prototype bank that captures both semantic and appearance cues. To address oversmoothing in ViT representations, we incorporate a multi-scale pyramid module that enhances spatial precision and improves localization quality. Experiments on the BCSS-WSSS benchmark show that our approach surpasses existing state-of-the-art methods, and detailed analyses demonstrate the benefits of text description diversity, context length, and the complementary behavior of text and image prototypes. These results highlight the effectiveness of jointly leveraging textual semantics and visual prototype learning for WSSS in digital pathology.

Summary

  • The paper introduces a dual-modal prototype architecture that leverages learnable text and image prototypes to generate refined segmentation masks from weak supervision.
  • The approach achieves state-of-the-art results of 71.35% mIoU and 83.14% mDice on the BCSS-WSSS benchmark, with notable gains on tumor and necrosis regions.
  • The system combines multi-scale feature extraction and adaptive prompt tuning to address intra-class heterogeneity and improve boundary detection in histopathology images.

DualProtoSeg: A Dual-Prototype Vision-Language Framework for Weakly Supervised Histopathology Segmentation

Introduction

DualProtoSeg introduces a hybrid prototype-based methodology that advances weakly supervised semantic segmentation (WSSS) in histopathology by leveraging both image-guided and text-guided prototype learning within a unified framework. In histopathology, the annotation of dense segmentation masks is cost-prohibitive. WSSS using image-level supervision circumvents this, but previous approaches—primarily relying on Class Activation Maps (CAMs) or visual prototypes alone—are vulnerable to intra-class heterogeneity and inter-class homogeneity, while struggling to capture the full morphological diversity present in tissue sections. DualProtoSeg directly addresses these deficits by fusing semantically rich vision–language representations and dual-modal prototype learning.

Framework and Methodology

Dual-Modal Prototype Architecture

The core of DualProtoSeg is a dual-modal prototype bank that integrates both text-based prototypes, constructed from learnable prompt tokens and diverse class descriptions, and image-based prototypes, learned from CONCH ViT-B/16 visual features (Figure 1).

Figure 1: Overview of the Dual-ProtoSeg framework. Multi-scale image features (Image Branch) and prompt-guided text embeddings (Text Branch) form a dual-modal prototype bank, whose similarity to visual features generates multi-scale CAMs for weakly supervised segmentation (Prediction Heads).

This architecture enables the model to align visual and textual representations per class. Each class is defined by multiple textual descriptions, each passed through a prompt-optimized frozen text encoder to yield a set of diverse text prototypes. Image prototypes are learned independently. Text and image prototypes are jointly projected onto multi-scale visual features, and their cosine similarities with local image features across scales generate hierarchical class activation maps, which are then fused and refined to produce dense segmentation pseudo-masks.
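
To make the prototype-matching step concrete, below is a minimal PyTorch sketch (not the authors' code) of how per-class text and image prototypes can be compared against local visual features at one scale to produce class activation maps; the tensor layout and the max-over-prototypes aggregation are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prototype_cams(feats, text_protos, image_protos):
    """
    Compute class activation maps from cosine similarity between local
    features and a dual-modal prototype bank (illustrative sketch).

    feats:        (B, D, H, W)  local visual features at one scale
    text_protos:  (C, K, D)     K text prototypes per class
    image_protos: (C, M, D)     M image prototypes per class
    returns:      (B, C, H, W)  class activation maps in [-1, 1]
    """
    B, D, H, W = feats.shape
    f = F.normalize(feats.flatten(2), dim=1)                  # (B, D, HW)

    protos = torch.cat([text_protos, image_protos], dim=1)    # (C, K+M, D)
    p = F.normalize(protos, dim=-1)

    # Cosine similarity of every prototype with every spatial location.
    sim = torch.einsum('ckd,bdn->bckn', p, f)                 # (B, C, K+M, HW)

    # Aggregate over each class's prototypes (max is one simple choice).
    cam = sim.max(dim=2).values                               # (B, C, HW)
    return cam.view(B, -1, H, W)
```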

Multi-Scale Feature Extraction and Refinement

Multi-scale visual features are extracted by forwarding histopathology image patches through the CONCH ViT-B/16 encoder. Hidden states from several layers are spatially re-arranged and further processed through lightweight refinement and pyramid modules. This design mitigates the oversmoothing typical of ViT representations and provides high-resolution features for prototype matching.
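
As a rough illustration of this step (not the paper's implementation), the sketch below shows how intermediate hidden states from a ViT encoder can be rearranged from token sequences into spatial feature maps at several depths; the layer indices, patch grid, and encoder interface are assumptions.

```python
import torch

def vit_multiscale_features(hidden_states, patch_grid=(14, 14), layers=(3, 6, 9, 12)):
    """
    Rearrange selected ViT hidden states into spatial feature maps
    (illustrative sketch; layer choice and grid size are assumptions).

    hidden_states: list of (B, 1 + N, D) tensors, one per transformer block,
                   where N = H * W patch tokens plus a leading CLS token.
    returns:       list of (B, D, H, W) feature maps, one per selected layer.
    """
    H, W = patch_grid
    maps = []
    for idx in layers:
        tokens = hidden_states[idx - 1][:, 1:, :]   # drop CLS token -> (B, N, D)
        B, N, D = tokens.shape
        assert N == H * W, "patch grid must match token count"
        maps.append(tokens.transpose(1, 2).reshape(B, D, H, W))
    return maps
```

In practice these per-layer maps would then be passed through the refinement and pyramid modules described above before prototype matching.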

Learnable Prompt Tuning and Prototype Bank Construction

Conditional prompt tuning (as in CoOp) parameterizes class-specific learnable tokens for each class and description, combined with variable-length natural language templates. Using $n_{\text{desc}}$ (e.g., 10) textual descriptions per class exposes the model to greater semantic diversity, and learnable context tokens allow the text encodings to adapt to the target task and domain. The resultant dual-modal bank at each scale contains both semantic (text) and morphological (image) anchor points for similarity matching.
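
A minimal sketch of CoOp-style prompt tuning follows, assuming a frozen CLIP-like text encoder that accepts token embeddings directly; the context length, number of descriptions, and the `encode_text_embeddings` interface are hypothetical, not the paper's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptLearner(nn.Module):
    """CoOp-style learnable context prepended to fixed class-description
    token embeddings (illustrative sketch)."""

    def __init__(self, desc_embeddings, ctx_len=16, embed_dim=512):
        super().__init__()
        # desc_embeddings: (C, n_desc, L, embed_dim) precomputed token
        # embeddings of the class descriptions (kept frozen).
        self.register_buffer('desc', desc_embeddings)
        C, n_desc = desc_embeddings.shape[:2]
        # Learnable context tokens shared across classes and descriptions.
        self.ctx = nn.Parameter(torch.randn(ctx_len, embed_dim) * 0.02)
        self.C, self.n_desc = C, n_desc

    def forward(self, text_encoder):
        ctx = self.ctx.expand(self.C, self.n_desc, -1, -1)      # (C, n_desc, ctx_len, D)
        prompts = torch.cat([ctx, self.desc], dim=2)            # prepend learnable context
        flat = prompts.flatten(0, 1)                            # (C * n_desc, L', D)
        # Frozen text encoder maps token embeddings to one vector per prompt
        # (hypothetical interface).
        text_feats = text_encoder.encode_text_embeddings(flat)  # (C * n_desc, D)
        text_feats = F.normalize(text_feats, dim=-1)
        return text_feats.view(self.C, self.n_desc, -1)         # text prototype bank
```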

Dense Mask Generation and Post-processing

Multi-scale CAMs derived from dual prototypes are upsampled, fused, and refined via DenseCRF. The training objective consists of multi-level classification losses and a semantic alignment loss that regularizes text–image prototype pairs, alongside diversity regularization to prevent prototype collapse.
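
The overall training objective can be sketched as a weighted sum of these terms; the weights and the exact form of each term below are assumptions for illustration, not the paper's formulation.

```python
import torch
import torch.nn.functional as F

def total_loss(cam_logits_per_scale, image_labels, text_protos, image_protos,
               w_align=0.5, w_div=0.1):
    """
    Combine multi-level classification losses with a text-image alignment
    term and a diversity regularizer (illustrative sketch).

    cam_logits_per_scale: list of (B, C) class scores pooled from each CAM scale
    image_labels:         (B, C) multi-hot image-level labels (float)
    text_protos:          (C, K, D) L2-normalized text prototypes
    image_protos:         (C, M, D) L2-normalized image prototypes
    """
    # Multi-level classification loss on image-level labels.
    cls = sum(F.binary_cross_entropy_with_logits(s, image_labels)
              for s in cam_logits_per_scale)

    # Semantic alignment: pull each class's mean image prototype toward
    # its mean text prototype.
    t = F.normalize(text_protos.mean(dim=1), dim=-1)   # (C, D)
    v = F.normalize(image_protos.mean(dim=1), dim=-1)  # (C, D)
    align = (1.0 - (t * v).sum(dim=-1)).mean()

    # Diversity regularizer: penalize pairwise similarity among the image
    # prototypes of the same class to prevent collapse.
    sim = torch.einsum('cmd,cnd->cmn', image_protos, image_protos)
    off_diag = sim - torch.diag_embed(torch.diagonal(sim, dim1=-2, dim2=-1))
    div = off_diag.abs().mean()

    return cls + w_align * align + w_div * div
```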

Empirical Evaluation

Quantitative Results

On the BCSS-WSSS benchmark, DualProtoSeg achieves 71.35% mIoU and 83.14% mDice, exceeding PBIP by +1.93 and +1.30 percentage points on these metrics. Notably, it sets new bests for tumor IoU (81.34%) and necrosis IoU (69.83%), and achieves the highest Dice for lymphocyte (79.08%). These consistent improvements under image-level supervision underscore the effectiveness of multimodal prototype ensembles.

Qualitative Outcomes

Segmentation outputs demonstrate favorable boundary crispness and full-structure coverage compared to prior approaches (Figure 2), aligning well with expert annotations.

Figure 2: Qualitative results on BCSS-WSSS test patches. GT denotes ground truth segmentation.

Further analysis reveals that text-based prototypes typically localize broad, semantically consistent regions, while image-based prototypes recover areas overlooked by text guidance and focus on fine-grained morphological details (Figure 3).

Figure 3: Complementary activation patterns of text-based (first row) and image-based (second row) prototypes. The first column shows the predicted mask (top) and the ground-truth mask (bottom). (a) Stroma: image prototypes recover missed regions (blue boxes). (b) Tumor: image prototypes detect fine-grained structures (purple boxes).

Analysis and Ablation

The ablation studies confirm the complementarity of the dual-prototype design. Including image prototypes consistently improves mIoU, most notably for morphologically variable classes. Increasing the number and diversity of class-specific textual descriptions further enhances performance; with 10 descriptions per class and a prompt context length of 16, the model yields its strongest results.

A pivotal finding is that the zero-shot text–image retrieval AUC (using the pretrained CONCH encoder) correlates tightly with downstream segmentation accuracy (Figure 4). Hence, evaluating the quality of candidate class descriptions before training can anticipate mask performance and guide prompt engineering.

Figure 4: Relationship between zero-shot text–image alignment (AUC) and downstream segmentation IoU.
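
As a practical check inspired by this observation, one could score a candidate description set before training by measuring how well its zero-shot similarities to patch embeddings separate positive from negative patches. The sketch below assumes precomputed, L2-normalized image and text embeddings and uses scikit-learn's roc_auc_score.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def description_auc(image_embs, labels, class_text_embs):
    """
    Zero-shot retrieval AUC for one class's description set
    (illustrative sketch, assuming L2-normalized embeddings).

    image_embs:      (N, D) patch embeddings from the frozen vision encoder
    labels:          (N,)   1 if the patch contains the class, else 0
    class_text_embs: (K, D) embeddings of the K candidate descriptions
    """
    # Score each patch by its best similarity to any description of the class.
    scores = (image_embs @ class_text_embs.T).max(axis=1)
    return roc_auc_score(labels, scores)
```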

Theoretical and Practical Implications

DualProtoSeg demonstrates that semantically informed, multimodal prototype assemblies alleviate the limitations of classical CAM and visual-only prototype methods. By exploiting pathology-specific vision–language alignment, it robustly addresses intra-class heterogeneity while improving coarse-to-fine region coverage. The hybrid bank is less susceptible to prototype collapse and is computationally efficient compared to clustering-based paradigms, as all prototype optimization occurs through backpropagation.

The proposed prompt tuning module confirms the value of adapting the text representations to the target domain while keeping the text encoder frozen, providing a systematic path for adaptation with minimal loss of semantic integrity. In practice, the approach unlocks more accurate, label-efficient segmentation in digital pathology and could generalize to other vision–language segmentation tasks in domains where annotation cost or ambiguity is high.

Future Directions

The compelling results invite future exploration of adaptive prompt generation, prompt selection strategies based on class-specific retrieval AUCs, and explicit modeling of relationships between textual concept granularity and visual prototype distribution. Moreover, extending DualProtoSeg to multi-organ or pan-cancer histopathology with open-vocabulary class descriptions is a promising research direction, especially as larger domain-pretrained foundation models become available.

Conclusion

DualProtoSeg presents a methodologically sound and empirically validated approach for weakly supervised histopathology segmentation, combining learnable image and text prototypes within a unified, multi-scale vision–language framework built on the CONCH encoder. Empirical gains are consistent, especially for morphologically ambiguous regions, underlining the practical utility of multimodal prototypes and adaptive prompt learning in low-supervision regimes.
