ReMeDI-SAM3: Enhanced Memory for Surgical Segmentation

Updated 22 December 2025

The paper introduces a dual-memory strategy with relevance-aware and occlusion-aware modules to enable reliable, occlusion-robust segmentation of surgical instruments.
It employs a piecewise interpolation scheme for scalable temporal memory, allowing effective tracking through occlusion periods exceeding 50 frames.
Feature-based re-identification with temporal voting improves identity recovery, leading to significant mcIoU gains on EndoVis benchmarks.

ReMeDI-SAM3 is a training-free memory-enhanced extension of the SAM3 spatio-temporal segmentation framework, designed for long-horizon, occlusion-robust surgical instrument segmentation in endoscopic video. Addressing major limitations of SAM3—namely, indiscriminate memory updates, limited memory capacity, and insufficient identity recovery after occlusions—ReMeDI-SAM3 integrates a dual-memory strategy, a piecewise interpolation scheme for scalable memory, and a feature-based re-identification module with temporal voting. These enhancements enable reliable segmentation and tracking of multiple surgical instruments under frequent occlusions, rapid motion, and challenging visual artifacts, outperforming both zero-shot and training-based baselines (Bundele et al., 18 Dec 2025).

1. Architectural Overview

ReMeDI-SAM3 operates on each instrument instance within a video sequence $\mathcal V = \{I_1, \dots, I_T\}$, using one prompt per target instrument at a reference frame $t_0$. The architecture builds upon the SAM3 pipeline, which consists of:

A shared vision backbone producing multi-scale feature maps $\mathbf F_{t,\ell}$
A DETR-style detector outputting objectness scores $s_t$ and mask confidence $c_t$
A mask decoder conditioned on embedded prompts and an attended memory
Temporal mask propagation relying on features from $M$ past frames

ReMeDI-SAM3 introduces a dual-memory design, partitioning memory into:

Relevance-aware memory $\mathcal U_{\rm rel}$: stores high-confidence reference frames
Occlusion-aware memory $\mathcal U_{\rm occ}$: retains pre-occlusion cues with relaxed confidence constraints

All frames are buffered in a global memory $\mathcal U$. At each timestep, current features attend to $\mathcal U_{\rm rel} \cup \mathcal U_{\rm occ}$ for mask prediction $P_t$. Memory updates and re-identification are dynamically regulated based on segmentation confidence and occlusion detection signals. Quality-weighted mask fusion aggregates recovered instance masks into final per-frame segmentations $\{ \mathcal S_t \}$.

2. Relevance-Aware and Occlusion-Aware Memory Management

Vanilla SAM3 inserts every predicted mask into memory indiscriminately, so errors from ambiguous or noisy frames propagate to later predictions. In ReMeDI-SAM3, memory insertion is gated by a reliability score $r_t = s_t \times c_t$, where $s_t$ is the objectness score and $c_t$ is the mask confidence. Only masks whose reliability surpasses a strict threshold $\tau_{\rm rel}$ are stored in $\mathcal U_{\rm rel}$, which retains the $M/2$ most recent high-reliability frames and evicts the oldest entry (FIFO) when full.
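The gating rule can be illustrated with a minimal sketch. The class name `RelevanceMemory`, the threshold value, and the toy scores are assumptions made for illustration; they are not taken from the paper or from the SAM3 codebase.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class RelevanceMemory:
    """Reliability-gated FIFO buffer (hypothetical stand-in for U_rel)."""
    capacity: int      # plays the role of M/2 in the text
    tau_rel: float     # strict reliability threshold (value below is assumed)
    frames: deque = field(init=False)

    def __post_init__(self):
        # deque(maxlen=...) drops the oldest entry automatically: FIFO eviction.
        self.frames = deque(maxlen=self.capacity)

    def maybe_insert(self, frame_idx, objectness, mask_conf):
        """Store the frame only if r_t = s_t * c_t surpasses tau_rel."""
        reliability = objectness * mask_conf
        if reliability > self.tau_rel:
            self.frames.append((frame_idx, reliability))
            return True
        return False


# Toy (s_t, c_t) pairs for five consecutive frames.
scores = [(0.9, 0.9), (0.9, 0.4), (0.8, 0.8), (0.95, 0.9), (1.0, 0.7)]
mem = RelevanceMemory(capacity=3, tau_rel=0.5)
for t, (s_t, c_t) in enumerate(scores):
    mem.maybe_insert(t, s_t, c_t)
stored = [idx for idx, _ in mem.frames]
```

In this toy run, frame 1 is rejected because its reliability (0.36) falls below the threshold, and frame 0 is evicted once a fourth reliable frame arrives.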

Occlusion recovery is triggered when the objectness score transitions from $s_t = 0$ to $s_{t+1} > 0$, i.e., when the instrument reappears. Upon reappearance, $\mathcal U_{\rm occ}$ is populated from pre-occlusion frames in the global buffer whose reliability exceeds a relaxed threshold $\tau_{\rm occ} < \tau_{\rm rel}$, supplying discriminative cues for identity recovery. In non-recovery frames, $\mathcal U_{\rm occ}$ remains empty, ensuring separation of pre- and post-occlusion contexts.
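A rough sketch of this reappearance trigger follows. The helper names, the `max_frames` cap, the use of the reliability score as the thresholded quantity, and the toy values are all assumptions for illustration rather than details confirmed by the source.

```python
def detect_reappearance(prev_objectness, curr_objectness):
    """True when the target was absent (s_t == 0) and is detected again (s_{t+1} > 0)."""
    return prev_objectness == 0 and curr_objectness > 0


def populate_occlusion_memory(global_buffer, occlusion_start, tau_occ, max_frames):
    """Select the most recent pre-occlusion frames whose score exceeds tau_occ.

    global_buffer: list of (frame_idx, reliability), ordered by frame_idx.
    occlusion_start: index of the first frame with zero objectness.
    """
    candidates = [
        (idx, r) for idx, r in global_buffer
        if idx < occlusion_start and r > tau_occ
    ]
    return candidates[-max_frames:] if max_frames > 0 else []


objectness = [0.9, 0.7, 0.5, 0.0, 0.0, 0.8]    # toy s_t values
reliability = [0.8, 0.45, 0.2, 0.0, 0.0, 0.6]  # toy r_t values
global_buffer = list(enumerate(reliability))

u_occ = []
for t in range(1, len(objectness)):
    if detect_reappearance(objectness[t - 1], objectness[t]):
        # Walk back to the first frame of the occlusion gap.
        start = t - 1
        while start > 0 and objectness[start - 1] == 0:
            start -= 1
        u_occ = populate_occlusion_memory(global_buffer, start, tau_occ=0.3, max_frames=2)
```

Here the instrument reappears at frame 5 after a gap starting at frame 3, so only frames 0 and 1 qualify; frame 2 precedes the occlusion but is too unreliable even for the relaxed threshold.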

3. Piecewise Interpolation for Scalable Temporal Memory

SAM3's temporal memory is constrained by seven fixed positional embeddings, limiting effective memory size. Direct extrapolation degrades temporal priors. ReMeDI-SAM3 implements a piecewise interpolation scheme:

Fixed boundary embeddings: $\tilde{\mathbf p}_0 = \mathbf p_0$, $\tilde{\mathbf p}_{M-1} = \mathbf p_6$
Linear resampling of the interior positions $\{\mathbf p_1, \dots, \mathbf p_5\}$ to $M-2$ evenly spaced indices:

$t_k = \frac{k-1}{M-3}, \quad u_k = 1 + 4 t_k, \quad \alpha_k = u_k - \lfloor u_k \rfloor, \quad k = 1, \dots, M-2$

$\tilde{\mathbf p}_k = (1-\alpha_k)\, \mathbf p_{\lfloor u_k \rfloor} + \alpha_k\, \mathbf p_{\lceil u_k \rceil}$

This technique preserves semantic integrity at terminal positions while enabling expansion of memory capacity $M$ without the need for retraining. It supports temporal horizons sufficient for tracking through occlusion periods of over 50 frames.
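The resampling formulas above can be sketched in plain Python. The function name `piecewise_interpolate` and the list-of-lists representation of embeddings are illustrative assumptions; a real implementation would operate on the model's embedding tensors.

```python
import math


def piecewise_interpolate(base, M):
    """Expand 7 base positional embeddings to M slots.

    base: list of 7 embedding vectors (lists of floats), p_0 .. p_6.
    The endpoints p_0 and p_6 are copied; the interior p_1 .. p_5 is
    linearly resampled to M - 2 evenly spaced positions.
    """
    if len(base) != 7:
        raise ValueError("expected exactly 7 base embeddings")
    if M < 4:
        raise ValueError("M must be at least 4 so that M - 3 > 0")

    out = [list(base[0])]                      # p~_0 = p_0
    for k in range(1, M - 1):                  # k = 1 .. M-2
        t_k = (k - 1) / (M - 3)
        u_k = 1 + 4 * t_k                      # fractional index in [1, 5]
        lo, hi = math.floor(u_k), math.ceil(u_k)
        alpha = u_k - lo
        out.append([(1 - alpha) * a + alpha * b
                    for a, b in zip(base[lo], base[hi])])
    out.append(list(base[6]))                  # p~_{M-1} = p_6
    return out


# Toy 1-D embeddings p_i = [i] make the resampled indices easy to read off.
base = [[float(i)] for i in range(7)]
same = piecewise_interpolate(base, 7)
expanded = piecewise_interpolate(base, 11)
values = [v[0] for v in expanded]
```

As a sanity check, $M = 7$ gives $u_k = k$ and reproduces the original seven embeddings, while $M = 11$ places the nine interior slots at indices 1, 1.5, 2, ..., 5.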

4. Feature-Based Re-Identification with Temporal Voting

Prolonged occlusions can induce identity drift, especially if pre-occlusion memory is contaminated by ambiguous frames. ReMeDI-SAM3 employs a reference feature bank $\mathcal B^i$ per instrument class, built from frames with:

High reliability ($r_t \ge \tau_{\rm rel}$)
High mask agreement among the top-3 SAM3 hypotheses

Each frame is represented by a multi-scale descriptor:

$\mathbf f^{\,i}_{t,\ell} = \frac{1}{|M^i_t|} \sum_{x\in M^i_t} \mathbf F_{t,\ell}(x)$
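The descriptor is a masked average of backbone features at each scale. A minimal sketch is given below, assuming feature maps stored as nested lists of shape H x W x C and a mask already resized to each level's resolution; both representation choices and the function names are assumptions for illustration.

```python
def masked_average(feature_map, mask):
    """Average the feature vectors of all pixels where the mask is set.

    feature_map: H x W x C nested lists; mask: H x W nested lists of 0/1.
    Returns None for an empty mask, since the average is undefined.
    """
    channels = len(feature_map[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(feature_map, mask):
        for vec, inside in zip(feat_row, mask_row):
            if inside:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        return None
    return [v / count for v in total]


def multiscale_descriptor(feature_maps, masks):
    """One pooled vector per pyramid level (masks given per level)."""
    return [masked_average(f, m) for f, m in zip(feature_maps, masks)]


# Toy 2 x 2 feature map with 2 channels; the mask selects the diagonal pixels.
fmap = [[[1.0, 0.0], [3.0, 2.0]],
        [[5.0, 4.0], [7.0, 6.0]]]
mask = [[1, 0],
        [0, 1]]
descriptor = multiscale_descriptor([fmap], [mask])
```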

Upon re-entry after occlusion, during a recovery window of $K$ frames, similarity scores between current and reference features are temporally aggregated:

Mean self-similarity $s^{\rm self}$
Max other-class similarity $s^{\rm other}$
Average spatial overlap ($\overline{\mathrm{IoU}}$)

Identity is confirmed if $s^{\rm self} - s^{\rm other} \geq \delta_{\rm sim}$ and $\overline{\mathrm{IoU}} \leq \delta_{\rm iou}$. Otherwise, reassignment or rejection protocols regulate bank updates. Temporal voting over multiple frames systematically counteracts single-frame artifacts (e.g., specular highlights, partial occlusions) that could lead to false positives or identity swaps.
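The confirmation rule can be sketched as follows. The function name, the threshold values, the class labels, and the choice to average each other-class similarity over the window before taking the maximum are assumptions; the source does not specify the exact aggregation order or the similarity measure.

```python
from statistics import mean


def confirm_identity(self_sims, other_sims, ious, delta_sim, delta_iou):
    """Aggregate per-frame scores over the recovery window and apply the rule.

    self_sims: similarity to the instrument's own reference bank, per frame.
    other_sims: dict mapping each other class to its per-frame similarities.
    ious: per-frame spatial overlap values.
    """
    s_self = mean(self_sims)
    # Assumed aggregation: temporal mean per class, then max across classes.
    s_other = max((mean(v) for v in other_sims.values()), default=0.0)
    mean_iou = mean(ious)
    return (s_self - s_other >= delta_sim) and (mean_iou <= delta_iou)


# Toy recovery window of K = 3 frames; the middle frame is corrupted.
self_sims = [0.9, 0.4, 0.88]
other_sims = {"scissors": [0.5, 0.6, 0.45], "forceps": [0.3, 0.35, 0.3]}
ious = [0.05, 0.2, 0.1]

voted = confirm_identity(self_sims, other_sims, ious, delta_sim=0.15, delta_iou=0.3)
# The same rule applied to the corrupted frame alone.
single = confirm_identity([0.4], {"scissors": [0.6], "forceps": [0.35]}, [0.2],
                          delta_sim=0.15, delta_iou=0.3)
```

In this toy window, the corrupted middle frame would fail the test on its own, while aggregation over all three frames still confirms the identity, which is the effect temporal voting is meant to provide.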

5. Training and Parameterization

All operational thresholds, including $\tau_{\rm rel}, \tau_{\rm occ}, \delta_{\rm sim}, \delta_{\rm sim}^-, \delta_{\rm iou}, M, K$, are fixed a priori. No new weights are learned at any stage; all parameterized components (image encoder, detector, prompt encoder, and mask decoder) are reused verbatim from pretrained SAM3. Thus, ReMeDI-SAM3 functions as a training-free wrapper atop the foundation model (Bundele et al., 18 Dec 2025).

6. Empirical Performance and Comparative Results

ReMeDI-SAM3 demonstrates superior performance on the EndoVis17 (8 sequences, 225 frames per sequence) and EndoVis18 (4 validation videos, 149 frames per sequence) datasets, under zero-shot settings:

Dataset	SAM3 (zero-shot) mcIoU	ReMeDI-SAM3 mcIoU (Δ)
EndoVis17	68.79%	75.65% (+6.86 pp)
EndoVis18	66.46%	82.23% (+15.77 pp)

Further improvements are reported in mean challenge IoU: +7.2 pp for EndoVis17 and approximately +6 pp for EndoVis18. False positives, such as hallucinated tools post-occlusion, are nearly eliminated due to the combined effects of memory filtering and feature-based re-identification. The approach surpasses all previously published training-free baselines and specialist training-based methods on these benchmarks.

7. Strengths, Limitations, and Future Directions

ReMeDI-SAM3 achieves robust suppression of error propagation induced by noisy memory entries (e.g., due to smoke, specularities, or rapid motion), and its dual-memory design preserves identity-critical cues exclusively for occlusion recovery. The piecewise memory expansion permits stable tracking across extended occlusion events without retraining, and temporal voting effectively prevents identity confusion after reappearance.

Identified limitations include temporary recall dips due to conservative memory gating, especially under severe pose changes, and the need for minor threshold tuning when adapting to markedly different domains or frame rates. Prospective enhancements involve adaptive thresholding responsive to scene dynamics, extension to semantic part-level segmentation in surgery, and transferability to natural image tracking settings with multiple distractors.

ReMeDI-SAM3 establishes a principled, training-free methodology that significantly augments the capabilities of the foundational SAM3 model for challenging, occlusion-heavy video segmentation tasks in surgical settings (Bundele et al., 18 Dec 2025).

References (1)

Memory-Enhanced SAM3 for Occlusion-Robust Surgical Instrument Segmentation (2025)
