ReMeDI-SAM3: Enhanced Memory for Surgical Segmentation
- The paper introduces a dual-memory strategy with relevance-aware and occlusion-aware modules to enable reliable, occlusion-robust segmentation of surgical instruments.
- It employs a piecewise interpolation scheme for scalable temporal memory, allowing effective tracking through occlusion periods exceeding 50 frames.
- Feature-based re-identification with temporal voting improves identity recovery, leading to significant mcIoU gains on EndoVis benchmarks.
ReMeDI-SAM3 is a training-free memory-enhanced extension of the SAM3 spatio-temporal segmentation framework, designed for long-horizon, occlusion-robust surgical instrument segmentation in endoscopic video. Addressing major limitations of SAM3—namely, indiscriminate memory updates, limited memory capacity, and insufficient identity recovery after occlusions—ReMeDI-SAM3 integrates a dual-memory strategy, a piecewise interpolation scheme for scalable memory, and a feature-based re-identification module with temporal voting. These enhancements enable reliable segmentation and tracking of multiple surgical instruments under frequent occlusions, rapid motion, and challenging visual artifacts, outperforming both zero-shot and training-based baselines (Bundele et al., 18 Dec 2025).
1. Architectural Overview
ReMeDI-SAM3 operates on each instrument instance within a video sequence, using one prompt per target instrument at a reference frame. The architecture builds on the SAM3 pipeline, which consists of:
- A shared vision backbone producing multi-scale feature maps
- A DETR-style detector outputting objectness scores and mask confidence
- A mask decoder conditioned on embedded prompts and an attended memory
- Temporal mask propagation relying on features from past frames
ReMeDI-SAM3 introduces a dual-memory design, partitioning memory into:
- Relevance-aware memory: stores high-confidence reference frames
- Occlusion-aware memory: retains pre-occlusion cues with relaxed confidence constraints
All frames are additionally buffered in a global memory. At each timestep, the current frame's features attend to the stored memory entries to predict the mask. Memory updates and re-identification are regulated dynamically by segmentation confidence and occlusion detection signals. Quality-weighted mask fusion then aggregates the recovered instance masks into the final per-frame segmentation.
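The fusion step is described only at a high level, so the following is a minimal sketch of one way quality-weighted fusion could work, not the authors' procedure. It assumes each instance provides a per-pixel foreground probability map and a scalar quality weight, and that each pixel goes to the instance with the highest weighted score above a cutoff. The names `fuse_masks`, `quality`, and `min_score` are hypothetical.

```python
from typing import Dict, List

Mask = List[List[float]]  # per-pixel foreground probability in [0, 1]


def fuse_masks(masks: Dict[int, Mask],
               quality: Dict[int, float],
               min_score: float = 0.3) -> List[List[int]]:
    """Assign each pixel to the instance with the highest quality-weighted score.

    ASSUMPTION: the weighting rule (quality * probability) and the cutoff
    `min_score` are illustrative; the source does not specify them.
    Label 0 denotes background.
    """
    if not masks:
        return []
    first = next(iter(masks.values()))
    height, width = len(first), len(first[0])
    fused = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            best_id, best_score = 0, min_score
            for inst_id, mask in masks.items():
                score = quality[inst_id] * mask[y][x]
                if score > best_score:
                    best_id, best_score = inst_id, score
            fused[y][x] = best_id
    return fused


# Two overlapping instances on a 1x3 image.
masks = {1: [[0.8, 0.6, 0.1]], 2: [[0.1, 0.9, 0.7]]}
quality = {1: 0.9, 2: 0.5}
fused = fuse_masks(masks, quality)
```

In this toy case the middle pixel is claimed by both instances; the higher-quality instance wins even though its raw probability there is lower.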
2. Relevance-Aware and Occlusion-Aware Memory Management
Vanilla SAM3 indiscriminately inserts every predicted mask into memory, so errors from ambiguous or noisy frames propagate to later predictions. In ReMeDI-SAM3, memory insertion is gated by a reliability score computed from the detector's objectness score and the mask confidence. Only masks whose reliability exceeds a strict threshold are stored in the relevance-aware memory, which retains the most recent high-reliability frames and evicts entries in FIFO order when full.
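The gating rule can be sketched as a bounded FIFO buffer with a threshold check on insertion. This is an illustrative sketch only: the source does not give the exact form of the reliability score, so the product of objectness and mask confidence used here is an assumption, and `FrameEntry`, `RelevanceMemory`, `capacity`, and `strict_threshold` are hypothetical names and values.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FrameEntry:
    frame_idx: int
    objectness: float
    mask_conf: float


def reliability(entry: FrameEntry) -> float:
    # ASSUMPTION: product of the two confidences; the actual formula may differ.
    return entry.objectness * entry.mask_conf


class RelevanceMemory:
    """Keeps only the most recent high-reliability frames."""

    def __init__(self, capacity: int, strict_threshold: float) -> None:
        self.strict_threshold = strict_threshold
        # deque(maxlen=...) drops the oldest entry automatically (FIFO eviction).
        self.frames = deque(maxlen=capacity)

    def maybe_insert(self, entry: FrameEntry) -> bool:
        if reliability(entry) > self.strict_threshold:
            self.frames.append(entry)
            return True
        return False  # low-reliability frame is never written to memory


memory = RelevanceMemory(capacity=3, strict_threshold=0.5)
stream = [
    FrameEntry(0, 0.90, 0.90),
    FrameEntry(1, 0.90, 0.40),  # noisy frame, rejected
    FrameEntry(2, 0.80, 0.90),
    FrameEntry(3, 0.95, 0.90),
    FrameEntry(4, 0.90, 0.95),  # evicts frame 0
]
for entry in stream:
    memory.maybe_insert(entry)
stored = [e.frame_idx for e in memory.frames]
```

After the loop, `stored` holds frames 2, 3, and 4: frame 1 never entered the memory, and frame 0 was evicted once capacity was reached.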
Occlusion events are detected explicitly for each tracked instrument. Upon reappearance, the occlusion-aware memory is populated from pre-occlusion frames in the global buffer whose reliability exceeds a relaxed threshold, supplying discriminative cues for identity recovery. In non-recovery frames, the occlusion-aware memory remains empty, which keeps pre- and post-occlusion contexts separate.
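A minimal sketch of how the occlusion-aware memory could be filled on reappearance is shown below. The occlusion detector itself is not specified in the source, so a boolean `visible` flag stands in for it; the recovery window is reduced to a single frame for brevity; and `relaxed_threshold` and `max_entries` are hypothetical parameters.

```python
from typing import Dict, List, Optional, Tuple

BufferEntry = Tuple[int, float]  # (frame index, reliability score)


def select_occlusion_memory(global_buffer: List[BufferEntry],
                            occlusion_start: int,
                            relaxed_threshold: float,
                            max_entries: int) -> List[BufferEntry]:
    """Pick the latest pre-occlusion frames that pass the relaxed threshold."""
    candidates = [(idx, r) for idx, r in global_buffer
                  if idx < occlusion_start and r > relaxed_threshold]
    return candidates[-max_entries:] if max_entries > 0 else []


def run(stream: List[Tuple[int, float, bool]],
        relaxed_threshold: float = 0.3,
        max_entries: int = 2) -> Dict[int, List[int]]:
    """Return, per frame, the frame indices held in the occlusion memory."""
    global_buffer: List[BufferEntry] = []
    occlusion_start: Optional[int] = None
    log: Dict[int, List[int]] = {}
    for frame_idx, score, visible in stream:
        occlusion_memory: List[BufferEntry] = []  # empty outside recovery
        if not visible:
            # ASSUMPTION: `visible` stands in for the unspecified detector.
            if occlusion_start is None:
                occlusion_start = frame_idx
        elif occlusion_start is not None:
            # Reappearance: fill from frames buffered before the occlusion.
            occlusion_memory = select_occlusion_memory(
                global_buffer, occlusion_start, relaxed_threshold, max_entries)
            occlusion_start = None
        global_buffer.append((frame_idx, score))  # every frame is buffered
        log[frame_idx] = [idx for idx, _ in occlusion_memory]
    return log


stream = [(0, 0.9, True), (1, 0.4, True), (2, 0.2, True),
          (3, 0.0, False), (4, 0.0, False),
          (5, 0.6, True), (6, 0.8, True)]
log = run(stream)
```

Here frame 1 (score 0.4) would fail a strict threshold such as 0.5 but passes the relaxed one, so it is available as a recovery cue at frame 5, while frame 2 (score 0.2) is excluded.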
3. Piecewise Interpolation for Scalable Temporal Memory
SAM3's temporal memory is constrained by seven fixed positional embeddings, limiting effective memory size. Direct extrapolation degrades temporal priors. ReMeDI-SAM3 implements a piecewise interpolation scheme:
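- Fixed boundary embeddings: the first and last positional embeddings are kept unchanged
- Linear resampling of the interior positions to evenly spaced indices, interpolating between neighboring pretrained embeddings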
This technique preserves semantic integrity at terminal positions while enabling expansion of memory capacity without the need for retraining. It supports temporal horizons sufficient for tracking through occlusion periods of over 50 frames.
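The scheme above can be sketched in a few lines. This is one plausible reading of the description, not the authors' code: the two boundary embeddings are copied, and the interior embeddings are linearly resampled onto an evenly spaced grid of the new size. Embeddings are plain lists of floats here, and `expand_positional_embeddings` is a hypothetical name.

```python
from typing import List

Vector = List[float]


def lerp(a: Vector, b: Vector, w: float) -> Vector:
    """Linear interpolation between two vectors with weight w in [0, 1]."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]


def expand_positional_embeddings(embeddings: List[Vector],
                                 new_len: int) -> List[Vector]:
    """Resize a positional table while keeping both boundary entries fixed."""
    n = len(embeddings)
    if n < 3 or new_len < 2:
        raise ValueError("need at least 3 source and 2 target positions")
    interior = embeddings[1:-1]
    m = len(interior)
    k = new_len - 2  # number of interior positions after resizing
    resampled: List[Vector] = []
    for j in range(k):
        # Map target index j onto the source interior axis [0, m - 1].
        t = j * (m - 1) / (k - 1) if k > 1 else (m - 1) / 2.0
        lo = int(t)
        hi = min(lo + 1, m - 1)
        resampled.append(lerp(interior[lo], interior[hi], t - lo))
    return [list(embeddings[0])] + resampled + [list(embeddings[-1])]


# Seven 1-D "embeddings" stand in for the pretrained table.
base = [[float(i)] for i in range(7)]
expanded = expand_positional_embeddings(base, 11)
```

With seven source positions expanded to eleven, the first and last entries are unchanged and the five interior entries are stretched over nine slots. Requesting the original length returns the original table.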
4. Feature-Based Re-Identification with Temporal Voting
Prolonged occlusions can induce identity drift, especially if pre-occlusion memory is contaminated by ambiguous frames. ReMeDI-SAM3 employs a reference feature bank per instrument class, built from frames with:
- High reliability score
- High mask agreement among the top-3 SAM3 hypotheses
Each frame in the bank is represented by a multi-scale feature descriptor.
Upon re-entry after occlusion, similarity scores between current and reference features are aggregated over a multi-frame recovery window:
- Mean self-similarity
- Max other-class similarity
- Average spatial overlap
Identity is confirmed only when the aggregated scores satisfy both confirmation criteria. Otherwise, reassignment or rejection protocols regulate bank updates. Temporal voting over multiple frames systematically counteracts single-frame artifacts (e.g., specular highlights, partial occlusions) that could lead to false positives or identity swaps.
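To illustrate why aggregating over a window helps, here is a hedged sketch of a voting rule. The exact confirmation conditions are not given in the source, so the two used here are assumptions: mean self-similarity must beat the best other-class similarity by a margin, and mean overlap must exceed a minimum. Cosine similarity, per-class averaging before taking the maximum, and the names `vote`, `margin`, and `min_overlap` are likewise illustrative.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


def vote(window: List[Vector], self_ref: Vector, other_refs: List[Vector],
         overlaps: List[float], margin: float = 0.2,
         min_overlap: float = 0.5) -> Dict[str, object]:
    """Aggregate evidence over a recovery window and decide on identity."""
    n = len(window)
    mean_self = sum(cosine(f, self_ref) for f in window) / n
    # ASSUMPTION: average each competing class over the window, then take the max.
    max_other = max(
        (sum(cosine(f, ref) for f in window) / n for ref in other_refs),
        default=0.0)
    mean_overlap = sum(overlaps) / len(overlaps)
    # ASSUMPTION: both conditions below are illustrative stand-ins.
    confirmed = (mean_self - max_other >= margin) and (mean_overlap >= min_overlap)
    return {"mean_self": mean_self, "max_other": max_other,
            "mean_overlap": mean_overlap, "confirmed": confirmed}


self_ref, other_refs = [1.0, 0.0], [[0.0, 1.0]]
window = [[0.9, 0.1], [0.2, 0.98], [0.95, 0.05]]  # middle frame is an artifact
overlaps = [0.8, 0.3, 0.7]
windowed = vote(window, self_ref, other_refs, overlaps)
single = vote([window[1]], self_ref, other_refs, [overlaps[1]])
```

Judged alone, the artifact frame looks like the competing class and would be rejected; averaged with its two neighbors, the evidence still supports the original identity.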
5. Training and Parameterization
All operational thresholds are fixed a priori. No new weights are learned at any stage; all parameterized components (image encoder, detector, prompt encoder, and mask decoder) are reused verbatim from pretrained SAM3. Thus, ReMeDI-SAM3 functions as a training-free wrapper atop the foundation model (Bundele et al., 18 Dec 2025).
6. Empirical Performance and Comparative Results
ReMeDI-SAM3 demonstrates superior performance on the EndoVis17 (8 sequences, 225 frames per sequence) and EndoVis18 (4 validation videos, 149 frames per sequence) datasets, under zero-shot settings:
| Dataset | SAM3 (zero-shot) mcIoU | ReMeDI-SAM3 mcIoU (Δ) |
|---|---|---|
| EndoVis17 | 68.79% | 75.65% (+6.86 pp) |
| EndoVis18 | 66.46% | 82.23% (+15.77 pp) |
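The Δ column can be checked directly from the two mcIoU columns. The short snippet below recomputes the absolute gains in percentage points; the dictionary layout and variable names are illustrative, and the values are taken from the table.

```python
# mcIoU values (percent) copied from the table above.
results = {
    "EndoVis17": {"sam3": 68.79, "remedi": 75.65},
    "EndoVis18": {"sam3": 66.46, "remedi": 82.23},
}

# Absolute improvement in percentage points, rounded to two decimals.
deltas = {name: round(r["remedi"] - r["sam3"], 2) for name, r in results.items()}
```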
Further improvements are reported in mean challenge IoU: +7.2 pp for EndoVis17 and approximately +6 pp for EndoVis18. False positives, such as hallucinated tools post-occlusion, are nearly eliminated due to the combined effects of memory filtering and feature-based re-identification. The approach surpasses all previously published training-free baselines and specialist training-based methods on these benchmarks.
7. Strengths, Limitations, and Future Directions
ReMeDI-SAM3 achieves robust suppression of error propagation induced by noisy memory entries (e.g., due to smoke, specularities, or rapid motion), and its dual-memory design preserves identity-critical cues exclusively for occlusion recovery. The piecewise memory expansion permits stable tracking across extended occlusion events without retraining, and temporal voting effectively prevents identity confusion after reappearance.
Identified limitations include temporary recall dips due to conservative memory gating, especially under severe pose changes, and the need for minor threshold tuning when adapting to markedly different domains or frame rates. Prospective enhancements involve adaptive thresholding responsive to scene dynamics, extension to semantic part-level segmentation in surgery, and transferability to natural image tracking settings with multiple distractors.
ReMeDI-SAM3 establishes a principled, training-free methodology that significantly augments the capabilities of the foundational SAM3 model for challenging, occlusion-heavy video segmentation tasks in surgical settings (Bundele et al., 18 Dec 2025).