
SAM-DAQ: Segment Anything with Depth-adaptive Queries

Updated 15 November 2025
  • The paper introduces a prompt-free adaptation of a frozen SAM2 encoder using parallel depth-guided adapters and a query-driven temporal memory module to reduce manual prompts and GPU overhead.
  • It employs a multi-modal image encoder that fuses RGB and depth features via skip-connected adapters, enhancing spatial-temporal segmentation accuracy.
  • The framework achieves state-of-the-art performance on RGB-D VSOD benchmarks, with measurable improvements in E_ξ, S_α, and F_β and reduced memory demands.

The Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ) is a deep learning framework for RGB-D Video Salient Object Detection (VSOD), the problem of segmenting the most visually prominent objects in video sequences that provide both a color (RGB) image and a depth (D) map for each frame. SAM-DAQ builds upon the Segment Anything Model 2 (SAM2) and addresses three core limitations of state-of-the-art approaches: dependence on manual prompts, high GPU memory requirements of adapter architectures, and the computational expense of temporal attention over large memory banks. The framework introduces a prompt-free, parallel adapter-based multi-modal image encoder and a query-driven temporal memory module, achieving high accuracy and efficiency on RGB-D VSOD tasks.

1. Motivation and Key Innovations

RGB-D VSOD requires accurate spatial-temporal segmentation leveraging both color and depth features across frames. Existing vision foundation models like SAM or SAM2 excel at universal image segmentation but typically rely on manual point or box prompts at inference, which is impractical for automated video tasks. They also incur significant computational costs when adapted to sequential input streams, primarily due to memory-intensive adapter networks and attention over large external memory banks.

SAM-DAQ provides three fundamental advancements:

  1. Prompt-free fine-tuning: Adapts a frozen SAM2 encoder using parallel depth-guided adapters (DPAs), eliminating the need for manual prompts and reducing memory overhead.
  2. Depth-guided multi-modal fusion: Integrates RGB and depth cues via a skip-connected, adapter-based fusion architecture for enhanced spatial representation.
  3. Query-driven temporal memory (QTM): Replaces large external memory banks and prompt embeddings with learnable frame-level and video-level queries, enabling efficient exploitation of temporal consistency.

These mechanisms enable SAM-DAQ to automatically segment salient objects in RGB-D video streams without manual prompts.

2. Architecture and Workflow

The overall SAM-DAQ pipeline consists of sequential modules that process RGB-D frames and generate segmentation predictions for each video frame. The core architectural blocks are as follows:

  1. Input: Each frame consists of an RGB image I_0 \in \mathbb{R}^{3\times H\times W} and a depth map I_D \in \mathbb{R}^{1\times H\times W}.
  2. PAMIE (Parallel Adapter-based Multi-modal Image Encoder):
    • Employs a frozen SAM2 encoder (Hiera) at four spatial scales.
    • Utilizes a depth projector and two parallel DPAs (depth and multi-modal) inserted in a skip-connection configuration at each Hiera block.
    • Produces multi-scale embeddings \{E_I^2, E_I^3, E_I^4\}.
  3. QTM (Query-Driven Temporal Memory):
    • Static frame-level queries Q_f \in \mathbb{R}^{N_f \times d_q} attend to the encoder outputs of each frame.
    • Video-level queries Q_v \in \mathbb{R}^{N_v \times d_q} undergo cross-attention with the frame embeddings and are then iteratively updated.
    • Outputs the concatenated learnable embeddings E_L = \{E_f, \tilde{Q}_v\}.
  4. Mask Decoder: Ingests the multi-scale encoder outputs and the learnable queries to produce the per-frame segmentation P_t.
  5. Memory Update Module: Encodes (E_I^4, P_t) from the current frame to update Q_v for subsequent frames.

All SAM2 encoder parameters remain frozen during training, with gradients flowing only through the adapters, queries, memory updates, and decoder.
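
This training scope can be illustrated with a brief PyTorch-style sketch. Module names such as image_encoder, depth_adapters, mm_adapters, qtm, and mask_decoder are hypothetical placeholders for the corresponding components, not the authors' actual attribute names.

```python
import torch

def configure_trainable_parameters(model):
    """Freeze the SAM2 image encoder; train only adapters, queries, and decoder."""
    # All Hiera (SAM2 encoder) weights stay fixed.
    for p in model.image_encoder.parameters():
        p.requires_grad = False

    # DPAs, QTM (queries + memory-update layers), and the mask decoder remain trainable.
    trainable = []
    for module in (model.depth_adapters, model.mm_adapters, model.qtm, model.mask_decoder):
        trainable += list(module.parameters())
    return trainable

# Usage: the optimizer only ever sees the lightweight trainable components.
# params = configure_trainable_parameters(sam_daq)
# optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.05)
```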

3. Depth-Guided Parallel Adapters and Feature Fusion

The PAMIE module integrates DPAs in a parallel, skip-connected manner for efficient multi-modal fusion:

  • Depth Adapter: Processes depth features F_D^{i-1} with a down-projection, a nonlinearity (GELU or ReLU), and an up-projection, yielding \tilde{F}_D^{i-1}.
  • Multi-modal Adapter: Concatenates RGB and depth features [F_{RGB}^{i-1}; F_D^{i-1}] before the adapter transformation to produce \tilde{F}_{RGB}^{i-1}.

Feature fusion is formalized as:

F_{\text{fused}}^i = \text{Hiera}^i(F_{RGB}^{i-1}) + \text{DS}\Big(\text{Adapter}\big([F_{RGB}^{i-1}, F_D^{i-1}]\big)\Big)

where DS denotes down-sampling to ensure matching spatial resolutions.

Internally, each adapter comprises:

  • Linear down-projection (d_{in} \to d_{mid})
  • Activation (GELU/ReLU)
  • Linear up-projection (d_{mid} \to d_{in})
  • Optional layer normalization before skip addition

The parallel DPA strategy maintains low GPU memory usage (21 GB), compared to sequential adapters or LoRA, which can require approximately 95 GB.
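
The bottleneck adapter and the parallel, skip-connected fusion can be sketched as follows. This is a minimal illustration, not the authors' implementation: token features are assumed to have shape (B, N, d), the down-sampling operator DS is passed in as a callable, and the multi-modal adapter is given an explicit output dimension so its residual matches the Hiera block output.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> GELU -> up-project, with optional LayerNorm (illustrative)."""
    def __init__(self, d_in, d_mid, d_out=None, use_norm=True):
        super().__init__()
        d_out = d_out if d_out is not None else d_in
        self.down = nn.Linear(d_in, d_mid)
        self.act = nn.GELU()
        self.up = nn.Linear(d_mid, d_out)
        self.norm = nn.LayerNorm(d_out) if use_norm else nn.Identity()

    def forward(self, x):
        # x: (B, N, d_in) token features
        return self.norm(self.up(self.act(self.down(x))))

def fused_stage(hiera_block, mm_adapter, downsample, f_rgb, f_depth):
    # F_fused^i = Hiera^i(F_RGB^{i-1}) + DS(Adapter([F_RGB^{i-1}; F_D^{i-1}]))
    adapted = mm_adapter(torch.cat([f_rgb, f_depth], dim=-1))
    return hiera_block(f_rgb) + downsample(adapted)

# Example (dimensions are assumptions): the multi-modal adapter sees 2*d
# concatenated channels and projects back to the Hiera output width d.
# mm_adapter = BottleneckAdapter(d_in=2 * d, d_mid=d // 4, d_out=d)
```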

4. Query-Driven Temporal Memory (QTM) Mechanism

The QTM module unifies temporal memory and prompt embeddings through learnable queries:

  • Initialization: Frame-level queries Q_f are randomly initialized and shared for the duration of each video; video-level queries Q_v are either learned per video or globally.
  • Forward Pass:
  1. Project queries via learned weights: Q'_f = W_f^Q\, Q_f, \quad Q'_v = W_v^Q\, Q_v
  2. Frame-wise attention: E_f = \text{Linear}(Q'_f\, E_I^4)
  3. Video-level cross-attention: \tilde{Q}_v = \text{CrossAttn}(Q'_v, E_f) + Q'_v
  4. Concatenate E_f and \tilde{Q}_v to form E_L
  • Temporal Update:

Memory feature extraction utilizes the mask prediction and encoder output:

F_m = W_m\, \text{MemEnc}(E_I^4, P_t)

Video-level queries are updated:

Q_{v, t+1} = Q_{v, t} + \text{FFN}\Bigl(\text{SelfAttn}\bigl(\text{CrossAttn}(Q_{v, t}, F_m)\bigr)\Bigr)

This configuration allows efficient selection and updating of salient temporal context, outperforming static memory bank attention approaches while maintaining a much smaller memory footprint.
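
A minimal sketch of such a query-driven memory is shown below, assuming (B, N, d)-shaped token inputs and standard multi-head attention layers. The frame-level read-out E_f is implemented here as cross-attention of the frame queries over the encoder tokens, which is one plausible reading of E_f = \text{Linear}(Q'_f\, E_I^4); head counts, FFN widths, and layer choices are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QueryTemporalMemory(nn.Module):
    """Illustrative QTM: frame/video queries plus a memory-style update."""
    def __init__(self, n_frame_q=8, n_video_q=8, d_q=256, d_enc=256, n_heads=8):
        super().__init__()
        self.q_f = nn.Parameter(torch.randn(n_frame_q, d_q))   # frame-level queries Q_f
        self.q_v0 = nn.Parameter(torch.randn(n_video_q, d_q))  # initial video-level queries Q_v
        self.w_f = nn.Linear(d_q, d_q)                          # W_f^Q
        self.w_v = nn.Linear(d_q, d_q)                          # W_v^Q
        self.enc_proj = nn.Linear(d_enc, d_q)                   # maps E_I^4 tokens to d_q
        self.frame_attn = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.mem_cross = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.mem_self = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_q, 4 * d_q), nn.GELU(), nn.Linear(4 * d_q, d_q))

    def forward(self, e_i4, q_v):
        # e_i4: (B, N, d_enc) highest-scale encoder tokens; q_v: (B, N_v, d_q)
        tokens = self.enc_proj(e_i4)
        q_f = self.w_f(self.q_f).unsqueeze(0).expand(tokens.size(0), -1, -1)
        e_f, _ = self.frame_attn(q_f, tokens, tokens)             # frame embeddings E_f
        q_v_proj = self.w_v(q_v)
        q_v_tilde, _ = self.video_attn(q_v_proj, e_f, e_f)
        q_v_tilde = q_v_tilde + q_v_proj                           # residual, as in the text
        e_l = torch.cat([e_f, q_v_tilde], dim=1)                   # E_L fed to the mask decoder
        return e_l, q_v_tilde

    def update(self, q_v, f_m):
        # Q_{v,t+1} = Q_{v,t} + FFN(SelfAttn(CrossAttn(Q_{v,t}, F_m)))
        x, _ = self.mem_cross(q_v, f_m, f_m)
        x, _ = self.mem_self(x, x, x)
        return q_v + self.ffn(x)
```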

5. Training Protocol and Loss Functions

SAM-DAQ employs strict prompt-free training:

  • All SAM2 encoder weights are fixed.
  • Parameters of DPAs, query embedding layers, memory update layers, and the mask decoder are trainable.
  • Optimization uses AdamW (learning rate 1 \times 10^{-4}, weight decay 0.05) with batch size 1 (sampling 10 frames per video per epoch) for 2,000 iterations in total. Frame resolution is 1,024 \times 1,024.
  • Data augmentation is limited to random frame sampling and resizing.

Losses comprise:

  • Final prediction: \mathcal{L}_{\mathrm{pred}} = \mathrm{BCE}(P_t, GT_t)
  • Intermediate supervision at the highest embedding level: \mathcal{L}_{\mathrm{inter}} = \mathrm{BCE}(\tilde{P}^4, GT_t)
  • Total: \mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{pred}} + \lambda_2 \mathcal{L}_{\mathrm{inter}}, with \lambda_1 = 1.0 and \lambda_2 = 0.4 (see the sketch below)
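
A compact PyTorch-style sketch of this objective follows, assuming both the final prediction P_t and the intermediate map \tilde{P}^4 are raw logits; the interpolation step is an assumption for the case where the intermediate map is coarser than the ground truth.

```python
import torch.nn.functional as F

def sam_daq_loss(pred_logits, inter_logits, gt, lambda_pred=1.0, lambda_inter=0.4):
    # If the intermediate map is coarser than GT, upsample it first (assumption).
    if inter_logits.shape[-2:] != gt.shape[-2:]:
        inter_logits = F.interpolate(inter_logits, size=gt.shape[-2:],
                                     mode="bilinear", align_corners=False)
    l_pred = F.binary_cross_entropy_with_logits(pred_logits, gt)
    l_inter = F.binary_cross_entropy_with_logits(inter_logits, gt)
    return lambda_pred * l_pred + lambda_inter * l_inter
```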

6. Algorithmic Workflow

The training and inference loop can be described as:

Given: Video frames {(I_{RGB,t}, I_{D,t})}_{t=1}^T, ground-truth masks {GT_t}
Initialize: Q_f, Q_v randomly; frozen SAM2 encoder; trainable DPAs, query layers, mask decoder.
for each training iteration do
  Sample a video; draw a clip of frames t = 1…T as the batch.
  for each frame t in the clip do
    # 1. Multi-modal encoding (PAMIE)
    D_0 ← LinearDepthProj(I_{D,t})
    R_0 ← PatchEmbed(I_{RGB,t})
    for i = 1…4 do
      tildeD_{i-1} ← Adapter_D(D_{i-1})
      D_i ← Hiera^i(D_{i-1}) + DS(tildeD_{i-1})
      tildeR_{i-1} ← Adapter_RGB([R_{i-1}; D_{i-1}])
      R_i ← Hiera^i(R_{i-1}) + DS(tildeR_{i-1})
    end
    {E_I², E_I³, E_I⁴} ← FPN({R_2, R_3, R_4})

    # 2. Query-Driven Memory (QTM)
    Q'_f ← W_f^Q Q_f;  Q'_v ← W_v^Q Q_v
    E_f ← Linear( Q'_f · E_I⁴ )
    tildeQ_v ← CrossAttn(Q'_v, E_f) + Q'_v
    E_L ← concat(E_f, tildeQ_v)

    # 3. Mask decoding
    P_t ← MaskDecoder(E_I², E_I³, E_I⁴, E_L)

    # 4. Loss computation
    L_pred  ← BCE(P_t, GT_t)
    L_inter ← BCE(sigmoid(Conv(E_I⁴)), GT_t)
    L_total ← L_pred + 0.4·L_inter

    # 5. Backward (only DPAs, QTM layers, decoder)
    ∇θ ← ∂L_total/∂θ;  θ ← θ − lr·AdamW(∇θ)

    # 6. Memory update for next frame
    F_m ← Linear( MemEnc(E_I⁴, P_t) )
    Q_v ← Q_v + FFN(SelfAttn(CrossAttn(Q_v, F_m)))
  end
end

7. Experimental Evaluation and Ablation Outcomes

SAM-DAQ achieves state-of-the-art performance on three public RGB-D VSOD datasets, outperforming prior methods such as KAN-SAM across all reported metrics:

Dataset      Metric   Best Baseline   SAM-DAQ (Ours)
RDVS         E_ξ ↑    0.888           0.913
             S_α ↑    0.854           0.879
             F_β ↑    0.791           0.827
             M ↓      0.028           0.026
ViDSOD-100   E_ξ ↑    0.912           0.918
             S_α ↑    0.892           0.894
             F_β ↑    0.846           0.868
             M ↓      0.025           0.020
DViSal       E_ξ ↑    0.885           0.914
             S_α ↑    0.835           0.840
             F_β ↑    0.783           0.818
             M ↓      0.052           0.046

Ablation studies reveal:

  • Removing multi-modal DPAs reduces E_ξ from 0.913 to 0.899.
  • Sequential adapters or LoRA increase GPU memory from 21 GB to ≈95 GB without accuracy improvements.
  • Disabling the QTM update decreases E_ξ by roughly 0.03, emphasizing the necessity of temporal queries.

A plausible implication is that prompt-free depth-guided adapters and query-driven temporal memory are critical for accurate and efficient VSOD.

8. Limitations and Prospective Extensions

Current SAM-DAQ limitations include:

  • System reliance on depth data quality; noisy depth maps degrade fusion and segmentation.
  • Fixed query counts may be suboptimal for video segments with highly variable object numbers.
  • The present framework focuses on single salient object segmentation; multi-object scenarios are not explicitly addressed.

Future research may explore:

  • Adaptive query generation to accommodate dynamic object counts.
  • Mechanisms to down-select noisy depth frames or assign confidence weights to depth cues.
  • Extension of QTM to instance-level queries for multi-object segmentation.
  • Compression or sparsification of DPAs for real-time deployment in VSOD tasks.

These directions offer potential for enhancing the scalability and robustness of SAM-DAQ in complex video understanding applications.
