F2PASeg: Memory-Augmented Pituitary Segmentation
- F2PASeg is a deep neural framework that uses a memory-augmented, promptable architecture with a novel feature fusion module for precise pituitary segmentation.
- It leverages the PAS dataset with instrument multiplexing augmentation to tackle class imbalance and mimic intraoperative challenges such as occlusion and camera motion.
- The approach achieves state-of-the-art segmentation metrics and real-time throughput, enhancing intraoperative planning and risk mitigation in sellar-phase pituitary surgery.
F2PASeg is a memory-augmented, promptable deep neural framework designed for pixel-wise segmentation of pituitary anatomical structures in endoscopic surgery video. It introduces a novel feature fusion module tailored to robustly segment vital anatomical entities even in the presence of typical intraoperative variations such as occlusions, camera motion, and bleeding. F2PASeg was developed in conjunction with the first large-scale, expert-annotated video dataset for pituitary anatomy segmentation (PAS), addressing key challenges in data scarcity and class imbalance endemic to endoscopic surgical video. The approach achieves state-of-the-art segmentation metrics with real-time throughput, supporting intraoperative planning and risk mitigation in sellar-phase pituitary surgery (Chen et al., 7 Aug 2025).
1. PAS Dataset and Annotation Protocol
The PAS dataset underpins F2PASeg’s design and evaluation. It comprises 7,845 temporally coherent frames at resolutions of 1920×1080 or 720×576, extracted from 120 sellar-phase endoscopic pituitary surgery videos. The videos are divided into 88 training, 12 validation, and 20 test cases, and data augmentation grows the training split to 9,331 images.
Expert annotation distinguishes six key anatomical classes: sella floor (SF), tuberculum sella (TS), ICA prominence (IP), clival recess (CR), optic carotid recess (OCR), and optic prominence (OP). Pixel-level masks are pre-labeled by researchers and verified by neurosurgeons, ensuring clinical reliability.
Class imbalance is pronounced: common anatomies (SF, TS, CR) are frequent, while IP, OCR, OP are rare (<10% of the set). To boost rare class representation and simulate realistic intraoperative artifacts, the authors introduce “instrument multiplexing.” Masks of eight annotated surgical instruments are superimposed onto original frames in temporally correlated order (“video-reuse” pipeline), increasing the prevalence of IP to 40.83% and OCR to 32.50%. This augmentation also models typical occlusion and motion patterns seen in endoscopy.
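The compositing step behind instrument multiplexing can be illustrated with a minimal sketch. It assumes each instrument is available as an RGB cut-out with a binary visibility mask, and that pixels hidden by the instrument lose their anatomy label. The helper names `overlay_instrument` and `multiplex_clip`, the array shapes, and the background label `0` are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def overlay_instrument(frame, labels, inst_rgb, inst_mask, background_id=0):
    """Paste an instrument cut-out onto one frame (hypothetical helper).

    frame:     (H, W, 3) uint8 image
    labels:    (H, W) integer anatomy label map
    inst_rgb:  (H, W, 3) uint8 instrument appearance
    inst_mask: (H, W) bool, True where the instrument is visible
    """
    out_frame = frame.copy()
    out_labels = labels.copy()
    out_frame[inst_mask] = inst_rgb[inst_mask]  # the instrument hides the tissue
    out_labels[inst_mask] = background_id       # assumption: occluded pixels lose their label
    return out_frame, out_labels

def multiplex_clip(frames, labels, inst_rgbs, inst_masks):
    """Apply an instrument sequence to a clip, frame by frame, in temporal order."""
    return [overlay_instrument(f, l, r, m)
            for f, l, r, m in zip(frames, labels, inst_rgbs, inst_masks)]

# Tiny demo: a 4x4 frame labelled as class 3, with a 2x2 instrument patch.
frame = np.zeros((4, 4, 3), dtype=np.uint8)
labels = np.full((4, 4), 3)
inst_rgb = np.full((4, 4, 3), 200, dtype=np.uint8)
inst_mask = np.zeros((4, 4), dtype=bool)
inst_mask[1:3, 1:3] = True
new_frame, new_labels = overlay_instrument(frame, labels, inst_rgb, inst_mask)
```

Reusing one clip that contains a rare class with several different instrument sequences yields several distinct training clips, which is one plausible way such a pipeline raises rare-class prevalence.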
2. Network Architecture and Feature Fusion
F2PASeg adopts the SAM2 backbone, which comprises a pre-trained Vision Transformer (ViT) image encoder, a prompt/memory encoder, and a transformer-based mask decoder. Temporal context is maintained by a FIFO memory bank storing the two most recent prompted frames plus all intervening predicted ones, preserving spatial-temporal cues through a memory-attention module.
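The FIFO behaviour of the memory bank can be sketched as follows. This toy version stores frame indices rather than feature embeddings, and it also keeps predicted frames that follow the latest prompt; both are simplifying assumptions. The class name `MemoryBank` and its methods are hypothetical and do not correspond to the SAM2 code base.

```python
from collections import deque

class MemoryBank:
    """Toy FIFO memory keyed by frame index (illustrative only)."""

    def __init__(self, max_prompted=2):
        # deque(maxlen=...) evicts the oldest prompted frame automatically.
        self.prompted = deque(maxlen=max_prompted)
        # Unprompted (predicted) frames seen since the oldest retained prompt.
        self.predicted = []

    def add(self, frame_idx, is_prompted):
        if is_prompted:
            self.prompted.append(frame_idx)
            oldest = self.prompted[0]
            # Drop predicted frames that precede the oldest retained prompt.
            self.predicted = [i for i in self.predicted if i > oldest]
        else:
            self.predicted.append(frame_idx)

    def contents(self):
        return sorted(list(self.prompted) + self.predicted)

# Demo: 25 frames with a prompt on every 10th frame (0, 10, 20).
bank = MemoryBank()
for idx in range(25):
    bank.add(idx, is_prompted=(idx % 10 == 0))
```

After the loop, the bank holds the prompted frames 10 and 20 together with every predicted frame from 11 onward; frame 0 and its followers have been evicted.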
A central innovation is the feature fusion module, which replaces the standard SAM2 skip-connection in the mask decoder. At two decoder stages (output strides 4 and 8), F2PASeg fuses high-resolution features with upsampled memory-attention features using a residual block and an explicit LoRA (Low-Rank Adaptation) branch:
- Residual fusion: a residual combination of the two feature maps that applies a Conv–BatchNorm–ReLU pipeline and a ReLU activation.
- LoRA branch: a low-rank update built from trainable low-rank matrices and a small conv block.
This dual-pathway design enriches the integration of spatial cues with deep semantic context, improving robustness to intraoperative scene dynamics. The module increases parameters by only ~4.2M (34.8M vs. 39.0M for a full SAM2-t fine-tune).
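To see why a low-rank branch is cheap in trainable weights, the following sketch compares the parameter count of a dense linear map with that of a rank-`r` factorization. The channel width of 256 and the rank of 8 are hypothetical values chosen for illustration; they are not taken from the paper and are unrelated to the parameter totals quoted above.

```python
def dense_params(d_in, d_out):
    """Weights in a full d_out x d_in matrix (bias ignored)."""
    return d_in * d_out

def lora_params(d_in, d_out, rank):
    """Weights in a low-rank pair: A is rank x d_in, B is d_out x rank."""
    return rank * (d_in + d_out)

d_in = d_out = 256   # hypothetical channel width
rank = 8             # hypothetical LoRA rank

full = dense_params(d_in, d_out)       # 65536
low = lora_params(d_in, d_out, rank)   # 4096
ratio = full / low                     # 16.0
```

With these illustrative sizes the low-rank pair needs one sixteenth of the weights of the dense map, and the saving grows as the rank shrinks relative to the channel width.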
Six channel-wise soft masks are output, corresponding to the anatomical classes.
3. Training Regimen and Loss Formulation
F2PASeg is initialized from SAM2-t pretrained weights with the mask decoder frozen; all other modules are fine-tuned. Bounding box prompts are supplied every 10 frames per anatomy, reflecting realistic semi-automated surgical deployment.
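The periodic prompting scheme can be sketched as below. The sketch assumes that frame 0 is prompted and that each box is the tight bounding box of a reference mask; the source states only that boxes are supplied every 10 frames, so both choices, along with the helper names `prompted_frames` and `mask_to_box`, are assumptions.

```python
def prompted_frames(num_frames, interval=10):
    """Indices of frames that receive a bounding-box prompt."""
    return list(range(0, num_frames, interval))

def mask_to_box(mask):
    """Tight (x_min, y_min, x_max, y_max) box around nonzero pixels, or None if empty."""
    coords = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    if not coords:
        return None  # the anatomy is not visible in this frame
    xs = [x for x, _ in coords]
    ys = [y for _, y in coords]
    return min(xs), min(ys), max(xs), max(ys)

# Demo: a 35-frame clip and a small binary mask for one anatomy.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
schedule = prompted_frames(35)
box = mask_to_box(mask)
```

Here `schedule` is `[0, 10, 20, 30]` and `box` is `(1, 1, 2, 2)`; all other frames would rely on the memory bank alone.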
The loss function is a weighted sum of four pixel-level terms:
$\mathcal{L} = \sum_{k=1}^{4} \lambda_k \mathcal{L}_k$
with coefficients $\lambda_k$ following [Ravi et al. 24]. The optimizer is AdamW. Training runs for 40 epochs on dual NVIDIA A100 GPUs (PyTorch 2.5.1, Python 3.12.8).
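The weighted-sum structure of the loss can be illustrated with a small sketch. The individual terms and their weights are not recoverable from the text above, so the two terms used here (soft Dice and binary cross-entropy) and the unit weights are stand-ins chosen purely for illustration; the paper's four terms and coefficients may differ.

```python
import math

def soft_dice_loss(pred, target, eps=1e-6):
    """1 - soft Dice overlap between predicted probabilities and a binary target."""
    inter = sum(p * t for p, t in zip(pred, target))
    return 1.0 - (2.0 * inter + eps) / (sum(pred) + sum(target) + eps)

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy, with probabilities clipped away from 0 and 1."""
    total = 0.0
    for p, t in zip(pred, target):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))
    return total / len(pred)

def weighted_loss(terms, weights):
    """Linear combination of already-computed loss terms."""
    return sum(w * l for w, l in zip(weights, terms))

# Demo on a flattened 4-pixel mask.
pred = [0.9, 0.1, 0.8, 0.2]
target = [1.0, 0.0, 1.0, 0.0]
terms = [soft_dice_loss(pred, target), bce_loss(pred, target)]
total = weighted_loss(terms, weights=[1.0, 1.0])  # placeholder weights
```

Changing the weights shifts the balance between region overlap and per-pixel calibration, which is the kind of trade-off a weighted multi-term loss is meant to control.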
4. Empirical Results: Quantitative and Qualitative Evaluation
Comprehensive testing on the PAS set demonstrates the superiority of F2PASeg over both medical and generalist segmentation baselines in terms of both mIoU and mean Dice.
| Model | mIoU | Dice Mean | IP Dice | OCR Dice | FPS (A100) |
|---|---|---|---|---|---|
| Swin-UNet | 0.1872 | 0.2509 | 0.0114 | 0.0121 | - |
| Trans-UNet | 0.2192 | 0.2847 | 0.0222 | 0.0002 | - |
| DeepLabV3+ | 0.2085 | 0.2434 | 0.0002 | 0.0017 | - |
| LiVOS | 0.4264 | 0.5057 | 0.263 | 0.161 | - |
| SAM | 0.6090 | 0.7188 | 0.599 | 0.725 | - |
| MedSAM | 0.7086 | 0.8166 | 0.737 | 0.779 | - |
| SAM2 | 0.7681 | 0.8397 | 0.730 | 0.805 | - |
| F2PASeg | 0.7701 | 0.8559 | 0.743 | 0.813 | 28.57 |
| F2PASeg + Aug | 0.7796 | 0.8635 | 0.782 | 0.818 | 28.57 |
F2PASeg with augmentation (F2PASeg + Aug) achieves a mean Dice of 86.35% (vs. 83.97% for SAM2), with notable gains on the classes with the lowest pre-augmentation prevalence (for example, IP Dice rises from 0.730 for SAM2 to 0.782). The ablation rows show that fusion and augmentation each contribute: fusion alone lifts mean Dice to 85.59%, and augmentation adds a further 0.76 points. Inference runs at 28.57 FPS on an NVIDIA A100 (2.3× faster than SAM-Med2D), enabling real-time deployment.
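For readers less familiar with the reported metrics, the sketch below computes Dice and IoU for a single pair of flattened binary masks and converts the reported throughput into per-frame latency. The function name `dice_and_iou` and the convention of scoring two empty masks as a perfect match are assumptions; the paper's exact evaluation protocol is not described here.

```python
def dice_and_iou(pred, target):
    """Dice and IoU for two flattened binary masks of equal length."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    p_sum = sum(1 for p in pred if p)
    t_sum = sum(1 for t in target if t)
    union = p_sum + t_sum - inter
    # Assumption: two empty masks count as a perfect match.
    dice = 2 * inter / (p_sum + t_sum) if (p_sum + t_sum) else 1.0
    iou = inter / union if union else 1.0
    return dice, iou

# Demo: 2 overlapping pixels, 3 predicted, 3 in the reference.
pred = [1, 1, 1, 0, 0, 0]
target = [0, 1, 1, 1, 0, 0]
dice, iou = dice_and_iou(pred, target)

# Per-frame latency implied by the reported 28.57 FPS.
latency_ms = 1000.0 / 28.57

# Mean Dice gain of F2PASeg + Aug over SAM2, in percentage points.
dice_gain_points = (0.8635 - 0.8397) * 100
```

For this pair the Dice score is 2/3 and the IoU is 0.5. The reported throughput corresponds to roughly 35 ms per frame, and the mean Dice gain over SAM2 is about 2.4 percentage points.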
Qualitatively, F2PASeg shows better temporal consistency and fewer gaps in the predicted masks during occlusion, endoscope motion, and bleeding than SAM2 and MedSAM.
5. Contributions and Innovations
The paper documents three main contributions:
- PAS Dataset: The first large-scale, expert-annotated, temporally coherent pixel-wise video dataset for pituitary anatomy during the sellar phase. It provides a foundation for future research in surgical scene understanding.
- F2PASeg Architecture: Development of a promptable, memory-augmented segmentation network leveraging residual+LoRA feature fusion for improved anatomical delineation and temporal stability.
- Instrument Multiplexing Augmentation: Introduction of a novel pipeline that simulates realistic intraoperative conditions, correcting intrinsic class imbalance in the source data.
These innovations collectively address key shortcomings in previous work, such as insufficient data diversity, poor robustness to occlusion, and limited temporal modeling.
6. Limitations and Prospective Directions
The approach is constrained to the sellar phase, and evaluation is limited to six anatomical entities. Generalization to the full surgical workflow and to finer anatomical granularity is noted as an open goal. The reliance on periodic bounding-box prompts (every 10 frames) may necessitate semi-manual annotation; automated prompt generation and prompt-free inference are identified as future research directions.
Temporal modeling employs only short memory (FIFO of two prompt frames), limiting context length; adaptive long-range temporal attention is suggested as a future direction to further stabilize segmentation over extended video sequences. Additionally, the extension of PAS and F2PASeg to other surgical sub-phases is anticipated (Chen et al., 7 Aug 2025).