
ReMeDI-SAM3: Enhanced Memory for Surgical Segmentation

Updated 22 December 2025
  • The paper introduces a dual-memory strategy with relevance-aware and occlusion-aware modules to enable reliable, occlusion-robust segmentation of surgical instruments.
  • It employs a piecewise interpolation scheme for scalable temporal memory, allowing effective tracking through occlusion periods exceeding 50 frames.
  • Feature-based re-identification with temporal voting improves identity recovery, leading to significant mcIoU gains on EndoVis benchmarks.

ReMeDI-SAM3 is a training-free memory-enhanced extension of the SAM3 spatio-temporal segmentation framework, designed for long-horizon, occlusion-robust surgical instrument segmentation in endoscopic video. Addressing major limitations of SAM3—namely, indiscriminate memory updates, limited memory capacity, and insufficient identity recovery after occlusions—ReMeDI-SAM3 integrates a dual-memory strategy, a piecewise interpolation scheme for scalable memory, and a feature-based re-identification module with temporal voting. These enhancements enable reliable segmentation and tracking of multiple surgical instruments under frequent occlusions, rapid motion, and challenging visual artifacts, outperforming both zero-shot and training-based baselines (Bundele et al., 18 Dec 2025).

1. Architectural Overview

ReMeDI-SAM3 operates on each instrument instance within a video sequence $\mathcal V = \{I_1, \dots, I_T\}$, utilizing a prompt per target instrument at a reference frame $t_0$. The architecture builds upon the SAM3 pipeline, which consists of:

  • A shared vision backbone producing multi-scale feature maps $\mathbf F_{t,\ell}$
  • A DETR-style detector outputting objectness scores $s_t$ and mask confidence $c_t$
  • A mask decoder conditioned on embedded prompts and an attended memory
  • Temporal mask propagation relying on features from $M$ past frames

ReMeDI-SAM3 introduces a dual-memory design, partitioning memory into:

  • Relevance-aware memory $\mathcal U_{\rm rel}$: stores high-confidence reference frames
  • Occlusion-aware memory $\mathcal U_{\rm occ}$: retains pre-occlusion cues with relaxed confidence constraints

All frames are buffered in a global memory $\mathcal U$. At each timestep, current features attend to $\mathcal U_{\rm rel} \cup \mathcal U_{\rm occ}$ for mask prediction $P_t$. Memory updates and re-identification are dynamically regulated based on segmentation confidence and occlusion detection signals. Quality-weighted mask fusion aggregates recovered instance masks into final per-frame segmentations $\{\mathcal S_t\}$.
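The quality-weighted fusion step can be sketched as below; the pixel-wise weighted-argmax rule, the function name, and the 0.5 background cutoff are illustrative assumptions rather than the paper's exact formulation.

```python
import numpy as np

def fuse_masks(masks, weights):
    """Quality-weighted fusion of per-instrument soft masks into one
    per-frame label map (assumed rule: each pixel goes to the instrument
    with the highest reliability-weighted mask probability).

    masks: (N, H, W) soft masks in [0, 1]; weights: N reliability scores.
    Returns an (H, W) integer label map: 0 = background, i = instrument i.
    """
    scored = np.asarray(masks) * np.asarray(weights)[:, None, None]
    labels = scored.argmax(axis=0) + 1          # winning instrument per pixel
    labels[scored.max(axis=0) < 0.5] = 0        # assumed background cutoff
    return labels
```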

2. Relevance-Aware and Occlusion-Aware Memory Management

Vanilla SAM3 indiscriminately inserts every predicted mask into memory, propagating errors from ambiguous or noisy frames. In ReMeDI-SAM3, memory insertion is gated by a reliability score $r_t = s_t \times c_t$, where $s_t$ is the objectness score and $c_t$ the mask confidence. Only masks surpassing a strict threshold $\tau_{\rm rel}$ are stored in $\mathcal U_{\rm rel}$, which retains the $M/2$ most recent high-reliability frames, with FIFO eviction when full.

Reappearance after an occlusion is detected via the transition $s_t = 0 \rightarrow s_{t+1} > 0$. Upon reappearance, $\mathcal U_{\rm occ}$ is populated from pre-occlusion global-buffer frames exceeding a relaxed threshold $\tau_{\rm occ} < \tau_{\rm rel}$, supplying discriminative cues for identity recovery. In non-recovery frames, $\mathcal U_{\rm occ}$ remains empty, ensuring separation of pre- and post-occlusion contexts.
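The two gating rules can be sketched in a few lines; the deque-based FIFO, the global-buffer layout, and all function names are illustrative assumptions, not the paper's implementation:

```python
from collections import deque

def update_memories(u_rel, u_glob, frame, s_t, c_t, tau_rel):
    """Reliability-gated memory insertion (simplified sketch).

    u_rel: deque(maxlen=M//2) holding recent high-reliability frames (FIFO).
    u_glob: global buffer of all frames, later mined for occlusion recovery.
    s_t, c_t: objectness score and mask confidence for the current frame.
    """
    r_t = s_t * c_t                   # reliability score
    u_glob.append((frame, r_t))       # every frame enters the global buffer
    if r_t >= tau_rel:
        u_rel.append(frame)           # deque evicts the oldest entry when full
    return r_t

def build_occlusion_memory(u_glob, tau_occ, n=4):
    """On reappearance, populate the occlusion-aware memory from pre-occlusion
    frames whose reliability exceeds the relaxed threshold tau_occ."""
    return [f for f, r in u_glob if r >= tau_occ][-n:]
```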

3. Piecewise Interpolation for Scalable Temporal Memory

SAM3's temporal memory is constrained by seven fixed positional embeddings, limiting effective memory size. Direct extrapolation degrades temporal priors. ReMeDI-SAM3 implements a piecewise interpolation scheme:

  • Fixed boundary embeddings: $\tilde{\mathbf p}_0 = \mathbf p_0$, $\tilde{\mathbf p}_{M-1} = \mathbf p_6$
  • Linear resampling of the interior positions $\{\mathbf p_1, \dots, \mathbf p_5\}$ to $M-2$ evenly spaced indices:

$$t_k = \frac{k-1}{M-3}, \qquad u_k = 1 + 4 t_k, \qquad \alpha_k = u_k - \lfloor u_k \rfloor$$

$$\tilde{\mathbf p}_k = (1-\alpha_k)\,\mathbf p_{\lfloor u_k \rfloor} + \alpha_k\,\mathbf p_{\lceil u_k \rceil}$$

This technique preserves semantic integrity at the terminal positions while enabling expansion of the memory capacity $M$ without retraining. It supports temporal horizons sufficient for tracking through occlusion periods of over 50 frames.
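The equations above translate directly into code. A NumPy sketch, assuming the seven pretrained embeddings are available as a `(7, D)` array (the function name is illustrative):

```python
import numpy as np

def expand_positional_embeddings(p, M):
    """Piecewise interpolation of SAM3's 7 temporal positional embeddings
    to M memory slots: boundaries are kept fixed, and the 5 interior
    embeddings are linearly resampled to M-2 evenly spaced positions.

    p: (7, D) array of pretrained embeddings; M: target memory size, M >= 7.
    """
    assert p.shape[0] == 7 and M >= 7
    out = np.empty((M, p.shape[1]))
    out[0] = p[0]                            # fixed start boundary
    out[M - 1] = p[6]                        # fixed end boundary
    for k in range(1, M - 1):
        t_k = (k - 1) / (M - 3)              # normalized interior position
        u_k = 1 + 4 * t_k                    # continuous index into p[1..5]
        lo, hi = int(np.floor(u_k)), int(np.ceil(u_k))
        alpha = u_k - lo
        out[k] = (1 - alpha) * p[lo] + alpha * p[hi]
    return out
```

For $M = 7$ the resampling reduces to the identity, so the pretrained temporal prior is recovered exactly.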

4. Feature-Based Re-Identification with Temporal Voting

Prolonged occlusions can induce identity drift, especially if pre-occlusion memory is contaminated by ambiguous frames. ReMeDI-SAM3 employs a reference feature bank $\mathcal B^i$ per instrument class, built from frames with:

  • High reliability ($r_t \ge \tau_{\rm rel}$)
  • High mask agreement among the top-3 SAM3 hypotheses

Each frame is represented by a multi-scale descriptor:

$$\mathbf f^{\,i}_{t,\ell} = \frac{1}{|M^i_t|} \sum_{x \in M^i_t} \mathbf F_{t,\ell}(x)$$
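This descriptor is a masked average pooling of one feature level over the predicted instance mask. A minimal sketch, assuming an `(H, W, D)` feature layout and a boolean mask (names are illustrative):

```python
import numpy as np

def masked_descriptor(feat_map, mask):
    """Average-pool features inside the predicted instance mask.

    feat_map: (H, W, D) feature map for one scale level.
    mask: (H, W) boolean instance mask.
    Returns the D-dim descriptor, or None for an empty mask.
    """
    if not mask.any():
        return None
    return feat_map[mask].mean(axis=0)   # mean over the |M| masked locations
```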

Upon re-entry after occlusion, during a recovery window of $K$ frames, similarity scores between current and reference features are temporally aggregated:

  • Mean self-similarity $s^{\rm self}$
  • Max other-class similarity $s^{\rm other}$
  • Average spatial overlap $\overline{\mathrm{IoU}}$

Identity is confirmed if $s^{\rm self} - s^{\rm other} \geq \delta_{\rm sim}$ and $\overline{\mathrm{IoU}} \leq \delta_{\rm iou}$. Otherwise, reassignment or rejection protocols regulate bank updates. Temporal voting over multiple frames systematically counteracts single-frame artifacts (e.g., specular highlights, partial occlusions) that could lead to false positives or identity swaps.
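The recovery-window check can be sketched as follows, assuming cosine similarity as the feature metric (the paper's exact similarity measure and bank handling may differ; all names are illustrative):

```python
import numpy as np

def verify_identity(cur_feats, bank_self, bank_other, ious,
                    delta_sim, delta_iou):
    """Temporal voting over a K-frame recovery window (simplified sketch).

    cur_feats: (K, D) descriptors of the re-appearing candidate.
    bank_self: (Ns, D) reference bank of the claimed instrument class.
    bank_other: (No, D) pooled reference features of all other classes.
    ious: length-K spatial overlaps with other active instances.
    """
    def cos(a, b):
        # pairwise cosine similarity between row vectors
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        b = b / np.linalg.norm(b, axis=-1, keepdims=True)
        return a @ b.T

    s_self = cos(cur_feats, bank_self).mean()    # mean self-similarity
    s_other = cos(cur_feats, bank_other).max()   # max other-class similarity
    mean_iou = float(np.mean(ious))              # average spatial overlap
    confirmed = (s_self - s_other >= delta_sim) and (mean_iou <= delta_iou)
    return confirmed, s_self, s_other, mean_iou
```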

5. Training and Parameterization

All operational thresholds, including $\tau_{\rm rel}, \tau_{\rm occ}, \delta_{\rm sim}, \delta_{\rm sim}^-, \delta_{\rm iou}, M, K$, are fixed a priori. No new weights are learned at any stage; all parameterized components (image encoder, detector, prompt encoder, and mask decoder) are reused verbatim from pretrained SAM3. Thus, ReMeDI-SAM3 functions as a training-free wrapper atop the foundation model (Bundele et al., 18 Dec 2025).

6. Empirical Performance and Comparative Results

ReMeDI-SAM3 demonstrates superior performance on the EndoVis17 (8 sequences, 225 frames per sequence) and EndoVis18 (4 validation videos, 149 frames per sequence) datasets, under zero-shot settings:

| Dataset | SAM3 (zero-shot) mcIoU | ReMeDI-SAM3 mcIoU (Δ) |
|---|---|---|
| EndoVis17 | 68.79% | 75.65% (+6.86 pp) |
| EndoVis18 | 66.46% | 82.23% (+15.77 pp) |

Further improvements are reported in mean challenge IoU: +7.2 pp for EndoVis17 and approximately +6 pp for EndoVis18. False positives, such as hallucinated tools post-occlusion, are nearly eliminated due to the combined effects of memory filtering and feature-based re-identification. The approach surpasses all previously published training-free baselines and specialist training-based methods on these benchmarks.

7. Strengths, Limitations, and Future Directions

ReMeDI-SAM3 achieves robust suppression of error propagation induced by noisy memory entries (e.g., due to smoke, specularities, or rapid motion), and its dual-memory design preserves identity-critical cues exclusively for occlusion recovery. The piecewise memory expansion permits stable tracking across extended occlusion events without retraining, and temporal voting effectively prevents identity confusion after reappearance.

Identified limitations include temporary recall dips due to conservative memory gating, especially under severe pose changes, and the need for minor threshold tuning when adapting to markedly different domains or frame rates. Prospective enhancements involve adaptive thresholding responsive to scene dynamics, extension to semantic part-level segmentation in surgery, and transferability to natural image tracking settings with multiple distractors.

ReMeDI-SAM3 establishes a principled, training-free methodology that significantly augments the capabilities of the foundational SAM3 model for challenging, occlusion-heavy video segmentation tasks in surgical settings (Bundele et al., 18 Dec 2025).
