- The paper introduces a dual-modal prototype architecture that leverages learnable text and image prototypes to generate refined segmentation masks from weak supervision.
- The approach achieves competitive performance with 71.35% mIoU and 83.14% mDice on the BCSS-WSSS benchmark, demonstrating enhanced detection of tumor and necrosis regions.
- The system combines multi-scale feature extraction and adaptive prompt tuning to address intra-class heterogeneity and improve boundary detection in histopathology images.
DualProtoSeg: A Dual-Prototype Vision-Language Framework for Weakly Supervised Histopathology Segmentation
Introduction
DualProtoSeg introduces a hybrid prototype-based methodology that advances weakly supervised semantic segmentation (WSSS) in histopathology by leveraging both image-guided and text-guided prototype learning within a unified framework. In histopathology, dense segmentation masks are cost-prohibitive to annotate. WSSS from image-level labels circumvents this cost, but previous approaches, which rely primarily on Class Activation Maps (CAMs) or visual prototypes alone, are vulnerable to intra-class heterogeneity and inter-class homogeneity and struggle to capture the full morphological diversity of tissue sections. DualProtoSeg addresses these deficits by combining semantically rich vision–language representations with dual-modal prototype learning.
Framework and Methodology
Dual-Modal Prototype Architecture
The core of DualProtoSeg is a dual-modal prototype bank that integrates both text-based prototypes, constructed from learnable prompt tokens and diverse class descriptions, and image-based prototypes, learned from CONCH ViT-B/16 visual features (Figure 1).
Figure 1: Overview of the DualProtoSeg framework. Multi-scale image features (Image Branch) and prompt-guided text embeddings (Text Branch) form a dual-modal prototype bank, whose similarity to visual features generates multi-scale CAMs for weakly supervised segmentation (Prediction Heads).
This architecture enables the model to align visual and textual representations per class. Each class is defined by multiple textual descriptions, each passed through a prompt-optimized frozen text encoder to yield a set of diverse text prototypes. Image prototypes are learned independently. Text and image prototypes are jointly projected onto multi-scale visual features, and their cosine similarities with local image features across scales generate hierarchical class activation maps, which are then fused and refined to produce dense segmentation pseudo-masks.
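The sketch below illustrates how such a prototype bank can be matched against spatial features at one scale. It is a minimal illustration assuming PyTorch tensors and a max-over-prototypes aggregation; the function name and shapes are hypothetical, not the authors' implementation.

```python
# Minimal sketch (PyTorch assumed): cosine-similarity CAMs from a dual-modal
# prototype bank at a single scale. Shapes and the max-over-prototypes
# aggregation are illustrative assumptions.
import torch
import torch.nn.functional as F

def prototype_cams(feats, text_protos, img_protos):
    """
    feats:       (B, C, H, W) visual features at one scale
    text_protos: (K, Pt, C)   text prototypes per class (K classes)
    img_protos:  (K, Pi, C)   learnable image prototypes per class
    returns:     (B, K, H, W) class activation maps at this scale
    """
    B, C, H, W = feats.shape
    f = F.normalize(feats.flatten(2), dim=1)               # (B, C, HW)
    protos = torch.cat([text_protos, img_protos], dim=1)   # (K, Pt+Pi, C)
    p = F.normalize(protos, dim=-1)
    # cosine similarity of every prototype with every spatial location
    sim = torch.einsum("kpc,bcn->bkpn", p, f)              # (B, K, Pt+Pi, HW)
    cam = sim.max(dim=2).values                            # best prototype per class
    return cam.view(B, -1, H, W)
```

Applying this at every scale and fusing the upsampled maps yields the hierarchical CAMs used for pseudo-mask generation.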
Multi-Scale Feature Extraction and Refinement
Multi-scale visual features are extracted by forwarding histopathology image patches through the CONCH ViT-B/16 encoder. Hidden states from several layers are spatially re-arranged and further processed through lightweight refinement and pyramid modules. This design mitigates the oversmoothing typical of ViT representations and provides high-resolution features for prototype matching.
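A minimal sketch of this stage is given below, assuming a HuggingFace-style ViT that exposes per-layer hidden states; the chosen layer indices, refinement blocks, and helper names are illustrative assumptions rather than the paper's exact modules.

```python
# Illustrative sketch of multi-scale ViT feature extraction. The encoder call,
# layer indices, and refinement blocks are assumptions; the paper's
# CONCH ViT-B/16 backbone and pyramid modules are not reproduced here.
import torch
import torch.nn as nn

def tokens_to_map(tokens, h, w):
    """Drop the [CLS] token and reshape (B, 1 + h*w, C) into (B, C, h, w)."""
    patch_tokens = tokens[:, 1:, :]
    B, N, C = patch_tokens.shape
    return patch_tokens.transpose(1, 2).reshape(B, C, h, w)

class MultiScaleFeatures(nn.Module):
    def __init__(self, vit, layer_ids=(3, 6, 9, 12), dim=768):
        super().__init__()
        self.vit, self.layer_ids = vit, layer_ids
        # lightweight per-scale refinement (a stand-in for the paper's modules)
        self.refine = nn.ModuleList([
            nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU())
            for _ in layer_ids
        ])

    def forward(self, x, grid_hw=(14, 14)):
        # assumes a HuggingFace-style ViT that returns all hidden states
        hidden = self.vit(x, output_hidden_states=True).hidden_states
        h, w = grid_hw
        return [blk(tokens_to_map(hidden[i], h, w))
                for i, blk in zip(self.layer_ids, self.refine)]
```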
Learnable Prompt Tuning and Prototype Bank Construction
Conditional prompt tuning (as in CoOp) parameterizes class-specific learnable tokens for each class and description, combined with variable-length natural language templates. Using n_desc (e.g., 10) textual descriptions per class exposes the model to greater semantic diversity, and learnable context tokens allow the text encodings to adapt to the task and domain. The resulting dual-modal bank at each scale contains both semantic (text) and morphological (image) anchor points for similarity matching. A minimal sketch of such a prompt module follows.
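This is a CoOp-style sketch assuming a frozen text encoder that can consume token embeddings directly; the module name, counts, and dimensions are illustrative, not the paper's configuration.

```python
# A minimal CoOp-style prompt sketch, assuming a frozen text encoder that can
# consume token embeddings directly; counts and dimensions are illustrative.
import torch
import torch.nn as nn

class LearnablePrompts(nn.Module):
    def __init__(self, n_classes, n_desc=10, n_ctx=16, embed_dim=512):
        super().__init__()
        # one learnable context per (class, description) pair
        self.ctx = nn.Parameter(
            torch.randn(n_classes, n_desc, n_ctx, embed_dim) * 0.02
        )

    def forward(self, desc_embeds):
        """
        desc_embeds: (n_classes, n_desc, L, embed_dim) frozen token embeddings
                     of the natural-language class descriptions
        returns:     (n_classes, n_desc, n_ctx + L, embed_dim) prompt embeddings
                     to be fed through the frozen text encoder
        """
        return torch.cat([self.ctx, desc_embeds], dim=2)
```

Each of the n_classes × n_desc prompts is encoded into one text prototype, so the bank grows with the number of descriptions rather than with extra encoder parameters.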
Dense Mask Generation and Post-processing
Multi-scale CAMs derived from dual prototypes are upsampled, fused, and refined via DenseCRF. The training objective consists of multi-level classification losses and a semantic alignment loss that regularizes text–image prototype pairs, alongside diversity regularization to prevent prototype collapse.
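A hedged sketch of such a combined objective is shown below; the loss weights, pooling choices, and exact formulations are assumptions, not the paper's reported values.

```python
# Hedged sketch of a combined training objective matching the description
# above; loss weights, pooling, and exact formulations are assumptions.
import torch
import torch.nn.functional as F

def total_loss(logits_per_scale, labels, text_protos, img_protos,
               w_align=0.1, w_div=0.01):
    # multi-level classification on image-level (multi-hot) labels
    cls = sum(F.binary_cross_entropy_with_logits(logits, labels.float())
              for logits in logits_per_scale)
    # semantic alignment: pull each class's text and image prototypes together
    t = F.normalize(text_protos.mean(dim=1), dim=-1)   # (K, C)
    v = F.normalize(img_protos.mean(dim=1), dim=-1)    # (K, C)
    align = (1 - (t * v).sum(dim=-1)).mean()
    # diversity regularization: penalize similar prototypes within a class
    p = F.normalize(img_protos, dim=-1)                # (K, P, C)
    gram = torch.einsum("kpc,kqc->kpq", p, p)
    off_diag = gram - torch.eye(p.size(1), device=p.device)
    div = off_diag.pow(2).mean()
    return cls + w_align * align + w_div * div
```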
Empirical Evaluation
Quantitative Results
On the BCSS-WSSS benchmark, DualProtoSeg achieves 71.35% mIoU and 83.14% mDice, exceeding PBIP by +1.93% and +1.30% on these metrics. Notably, it sets new highs for tumor IoU (81.34%) and necrosis IoU (69.83%) and achieves the best lymphocyte Dice (79.08%). These consistent gains under image-level supervision alone underscore the effectiveness of multimodal prototype ensembles.
Qualitative Outcomes
Segmentation outputs show crisper boundaries and more complete coverage of tissue structures than prior approaches (Figure 2), aligning well with expert annotations.
Figure 2: Qualitative results on BCSS-WSSS test patches. GT denotes ground truth segmentation.
Further analysis reveals that text-based prototypes typically localize broad, semantically consistent regions, while image-based prototypes recover areas overlooked by text guidance and capture fine-grained morphological details (Figure 3).
Figure 3: Complementary activation patterns of text-based (first row) and image-based (second row) prototypes. The first column shows the predicted mask (top) and the ground-truth mask (bottom). (a) Stroma: image prototypes recover missed regions (blue boxes). (b) Tumor: image prototypes detect fine-grained structures (purple boxes).
Analysis and Ablation
The ablation studies confirm the complementarity of the two prototype modalities. Adding image prototypes consistently improves mIoU, most notably for morphologically variable classes. Increasing the number and diversity of class-specific textual descriptions further improves performance; with 10 descriptions per class and a prompt context length of 16, the model yields its strongest results.
A pivotal finding is that zero-shot text–image retrieval AUC, measured with the pretrained CONCH encoder, correlates tightly with downstream segmentation accuracy (Figure 4). Evaluating the quality of class descriptions before training can therefore anticipate mask quality and guide prompt engineering effectively.
Figure 4: Relationship between zero-shot text–image alignment (AUC) and downstream segmentation IoU.
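The check implied by Figure 4 could be approximated as below; `encode_image` and `encode_text` are placeholders for the frozen encoder's interfaces and are not the CONCH API.

```python
# Sketch of the pre-training check implied by Figure 4: score one class's
# descriptions by zero-shot retrieval AUC. `encode_image` / `encode_text` are
# placeholders for the frozen encoder's interfaces, not the CONCH API.
import torch
import torch.nn.functional as F
from sklearn.metrics import roc_auc_score

@torch.no_grad()
def description_auc(encode_image, encode_text, patches, patch_labels,
                    descriptions, class_id):
    """AUC of mean image-text similarity for one class's descriptions."""
    img = F.normalize(encode_image(patches), dim=-1)      # (N, C)
    txt = F.normalize(encode_text(descriptions), dim=-1)  # (D, C)
    scores = (img @ txt.T).mean(dim=1)                    # (N,)
    is_pos = (patch_labels == class_id).long()
    return roc_auc_score(is_pos.cpu().numpy(), scores.cpu().numpy())
```

Classes whose descriptions score a low AUC under this check are natural candidates for prompt re-engineering before any training run.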
Theoretical and Practical Implications
DualProtoSeg demonstrates that semantically informed, multimodal prototype assemblies alleviate the limitations of classical CAM and visual-only prototype methods. By exploiting pathology-specific vision–language alignment, it robustly addresses intra-class heterogeneity while improving coarse-to-fine region coverage. The hybrid bank is less susceptible to prototype collapse and is computationally efficient compared to clustering-based paradigms, as all prototype optimization occurs through backpropagation.
The proposed prompt tuning module confirms the value of adapting text representations to the target domain through learnable prompt tokens while the text encoder remains frozen, providing a systematic path for adaptation with minimal loss of semantic integrity. In practice, the approach enables more accurate, label-efficient segmentation in digital pathology and could generalize to other vision–language segmentation tasks in domains where annotation is costly or ambiguous.
Future Directions
The compelling results invite future exploration of adaptive prompt generation, prompt selection strategies based on class-specific retrieval AUCs, and explicit modeling of relationships between textual concept granularity and visual prototype distribution. Moreover, extending DualProtoSeg to multi-organ or pan-cancer histopathology with open-vocabulary class descriptions is a promising research vector, especially as larger domain-pretrained foundation models become available.
Conclusion
DualProtoSeg presents a methodologically sound and empirically validated approach for weakly supervised histopathology segmentation, combining learnable image and text prototypes within a unified, multi-scale vision–language framework built on the pathology-pretrained CONCH encoder. Empirical gains are consistent, especially for morphologically ambiguous regions, underlining the practical utility of multimodal prototypes and adaptive prompt learning in low-supervision regimes.