
SAM-DAQ: Segment Anything with Depth-adaptive Queries

Updated 15 November 2025
  • The paper introduces a prompt-free adaptation of a frozen SAM2 encoder using parallel depth-guided adapters and a query-driven temporal memory module to reduce manual prompts and GPU overhead.
  • It employs a multi-modal image encoder that fuses RGB and depth features via skip-connected adapters, enhancing spatial-temporal segmentation accuracy.
  • The framework achieves state-of-the-art performance on RGB-D VSOD benchmarks, with measurable improvements in E_ξ, S_α, and F_β and reduced memory demands.

The Segment Anything Model with Depth-guided Adaptive Queries (SAM-DAQ) is a deep learning framework for RGB-D Video Salient Object Detection (VSOD), the problem of segmenting the most visually prominent objects in video sequences that provide both a color (RGB) image and a depth (D) map for each frame. SAM-DAQ builds upon the Segment Anything Model 2 (SAM2) and addresses three core limitations of state-of-the-art approaches: dependence on manual prompts, high GPU memory requirements of adapter architectures, and the computational expense of temporal attention over large memory banks. The framework introduces a prompt-free, parallel adapter-based multi-modal image encoder and a query-driven temporal memory module, achieving high accuracy and efficiency on RGB-D VSOD tasks.

1. Motivation and Key Innovations

RGB-D VSOD requires accurate spatial-temporal segmentation leveraging both color and depth features across frames. Existing vision foundation models like SAM or SAM2 excel at universal image segmentation but typically rely on manual point or box prompts at inference, which is impractical for automated video tasks. They also incur significant computational costs when adapted to sequential input streams, primarily due to memory-intensive adapter networks and attention over large external memory banks.

SAM-DAQ provides three fundamental advancements:

  1. Prompt-free fine-tuning: Adapts a frozen SAM2 encoder using parallel depth-guided adapters (DPAs), eliminating the need for manual prompts and reducing memory overhead.
  2. Depth-guided multi-modal fusion: Integrates RGB and depth cues via a skip-connected, adapter-based fusion architecture for enhanced spatial representation.
  3. Query-driven temporal memory (QTM): Replaces large external memory banks and prompt embeddings with learnable frame-level and video-level queries, enabling efficient exploitation of temporal consistency.

These mechanisms enable SAM-DAQ to automatically segment salient objects in RGB-D video streams without manual prompts.

2. Architecture and Workflow

The overall SAM-DAQ pipeline consists of sequential modules that process RGB-D frames and generate segmentation predictions for each video frame. The core architectural blocks are as follows:

  1. Input: Each frame consists of an RGB image I_0 \in \mathbb{R}^{3\times H\times W} and a depth map I_D \in \mathbb{R}^{1\times H\times W}.
  2. PAMIE (Parallel Adapter-based Multi-modal Image Encoder):
    • Employs a frozen SAM2 encoder (Hiera) at four spatial scales.
    • Utilizes a depth projector and two parallel DPAs (depth and multi-modal) inserted in a skip-connection configuration at each Hiera block.
    • Produces multi-scale embeddings \{E_I^2, E_I^3, E_I^4\}.
  3. QTM (Query-Driven Temporal Memory):
    • Static frame-level queries Q_f \in \mathbb{R}^{N_f \times d_q} attend to the encoder outputs of each frame.
    • Video-level queries Q_v \in \mathbb{R}^{N_v \times d_q} undergo cross-attention with the frame embeddings and are then iteratively updated.
    • Outputs the concatenated learnable embeddings E_L = \{E_f, \tilde{Q}_v\}.
  4. Mask Decoder: Ingests the multi-scale encoder outputs and the learnable queries to produce the per-frame segmentation P_t.
  5. Memory Update Module: Encodes (E_I^4, P_t) from the current frame to update Q_v for subsequent frames.

All SAM2 encoder parameters remain frozen during training, with gradients flowing only through the adapters, queries, memory updates, and decoder.
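
This training scope can be illustrated with a brief PyTorch-style sketch. Module names such as image_encoder, depth_adapters, mm_adapters, qtm, and mask_decoder are hypothetical placeholders for the corresponding components, not the authors' actual attribute names.

```python
import torch

def configure_trainable_parameters(model):
    """Freeze the SAM2 image encoder; train only adapters, queries, and decoder."""
    # All Hiera (SAM2 encoder) weights stay fixed.
    for p in model.image_encoder.parameters():
        p.requires_grad = False

    # DPAs, QTM (queries + memory-update layers), and the mask decoder remain trainable.
    trainable = []
    for module in (model.depth_adapters, model.mm_adapters, model.qtm, model.mask_decoder):
        trainable += list(module.parameters())
    return trainable

# Usage: the optimizer only ever sees the lightweight trainable components.
# params = configure_trainable_parameters(sam_daq)
# optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.05)
```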

3. Depth-Guided Parallel Adapters and Feature Fusion

The PAMIE module integrates DPAs in a parallel, skip-connected manner for efficient multi-modal fusion:

  • Depth Adapter: Processes depth features F_D^{i-1} with a down-projection, a nonlinearity (GELU or ReLU), and an up-projection, yielding \tilde{F}_D^{i-1}.
  • Multi-modal Adapter: Concatenates RGB and depth features [F_{RGB}^{i-1}; F_D^{i-1}] before the adapter transformation to produce \tilde{F}_{RGB}^{i-1}.

Feature fusion is formalized as:

F_{\text{fused}}^i = \text{Hiera}^i(F_{RGB}^{i-1}) + \text{DS}\Big(\text{Adapter}\big([F_{RGB}^{i-1}, F_D^{i-1}]\big)\Big)

where DS denotes down-sampling to ensure matching spatial resolutions.

Internally, each adapter comprises:

  • Linear down-projection (d_{in} \to d_{mid})
  • Activation (GELU/ReLU)
  • Linear up-projection (d_{mid} \to d_{in})
  • Optional layer normalization before skip addition

The parallel DPA strategy maintains low GPU memory usage (21 GB), compared to sequential adapters or LoRA, which can require approximately 95 GB.
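
The bottleneck adapter and the parallel, skip-connected fusion can be sketched as follows. This is a minimal illustration, not the authors' implementation: token features are assumed to have shape (B, N, d), the down-sampling operator DS is passed in as a callable, and the multi-modal adapter is given an explicit output dimension so its residual matches the Hiera block output.

```python
import torch
import torch.nn as nn

class BottleneckAdapter(nn.Module):
    """Down-project -> GELU -> up-project, with optional LayerNorm (illustrative)."""
    def __init__(self, d_in, d_mid, d_out=None, use_norm=True):
        super().__init__()
        d_out = d_out if d_out is not None else d_in
        self.down = nn.Linear(d_in, d_mid)
        self.act = nn.GELU()
        self.up = nn.Linear(d_mid, d_out)
        self.norm = nn.LayerNorm(d_out) if use_norm else nn.Identity()

    def forward(self, x):
        # x: (B, N, d_in) token features
        return self.norm(self.up(self.act(self.down(x))))

def fused_stage(hiera_block, mm_adapter, downsample, f_rgb, f_depth):
    # F_fused^i = Hiera^i(F_RGB^{i-1}) + DS(Adapter([F_RGB^{i-1}; F_D^{i-1}]))
    adapted = mm_adapter(torch.cat([f_rgb, f_depth], dim=-1))
    return hiera_block(f_rgb) + downsample(adapted)

# Example (dimensions are assumptions): the multi-modal adapter sees 2*d
# concatenated channels and projects back to the Hiera output width d.
# mm_adapter = BottleneckAdapter(d_in=2 * d, d_mid=d // 4, d_out=d)
```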

4. Query-Driven Temporal Memory (QTM) Mechanism

The QTM module unifies temporal memory and prompt embeddings through learnable queries:

  • Initialization: Frame-level queries Q_f are randomly initialized and shared for the duration of each video; video-level queries Q_v are either learned per video or globally.
  • Forward Pass:
  1. Project queries via learned weights: Q'_f = W_f^Q\, Q_f, \quad Q'_v = W_v^Q\, Q_v
  2. Frame-wise attention: E_f = \text{Linear}(Q'_f\, E_I^4)
  3. Video-level cross-attention: \tilde{Q}_v = \text{CrossAttn}(Q'_v, E_f) + Q'_v
  4. Concatenate E_f and \tilde{Q}_v to form E_L
  • Temporal Update:

Memory feature extraction utilizes the mask prediction and encoder output:

F_m = W_m\, \text{MemEnc}(E_I^4, P_t)

Video-level queries are updated:

Q_{v, t+1} = Q_{v, t} + \text{FFN}\Bigl(\text{SelfAttn}\bigl(\text{CrossAttn}(Q_{v, t}, F_m)\bigr)\Bigr)

This configuration allows efficient selection and updating of salient temporal context, outperforming static memory bank attention approaches while maintaining a much smaller memory footprint.
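
A minimal sketch of such a query-driven memory is shown below, assuming (B, N, d)-shaped token inputs and standard multi-head attention layers. The frame-level read-out E_f is implemented here as cross-attention of the frame queries over the encoder tokens, which is one plausible reading of E_f = \text{Linear}(Q'_f\, E_I^4); head counts, FFN widths, and layer choices are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class QueryTemporalMemory(nn.Module):
    """Illustrative QTM: frame/video queries plus a memory-style update."""
    def __init__(self, n_frame_q=8, n_video_q=8, d_q=256, d_enc=256, n_heads=8):
        super().__init__()
        self.q_f = nn.Parameter(torch.randn(n_frame_q, d_q))   # frame-level queries Q_f
        self.q_v0 = nn.Parameter(torch.randn(n_video_q, d_q))  # initial video-level queries Q_v
        self.w_f = nn.Linear(d_q, d_q)                          # W_f^Q
        self.w_v = nn.Linear(d_q, d_q)                          # W_v^Q
        self.enc_proj = nn.Linear(d_enc, d_q)                   # maps E_I^4 tokens to d_q
        self.frame_attn = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.video_attn = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.mem_cross = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.mem_self = nn.MultiheadAttention(d_q, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_q, 4 * d_q), nn.GELU(), nn.Linear(4 * d_q, d_q))

    def forward(self, e_i4, q_v):
        # e_i4: (B, N, d_enc) highest-scale encoder tokens; q_v: (B, N_v, d_q)
        tokens = self.enc_proj(e_i4)
        q_f = self.w_f(self.q_f).unsqueeze(0).expand(tokens.size(0), -1, -1)
        e_f, _ = self.frame_attn(q_f, tokens, tokens)             # frame embeddings E_f
        q_v_proj = self.w_v(q_v)
        q_v_tilde, _ = self.video_attn(q_v_proj, e_f, e_f)
        q_v_tilde = q_v_tilde + q_v_proj                           # residual, as in the text
        e_l = torch.cat([e_f, q_v_tilde], dim=1)                   # E_L fed to the mask decoder
        return e_l, q_v_tilde

    def update(self, q_v, f_m):
        # Q_{v,t+1} = Q_{v,t} + FFN(SelfAttn(CrossAttn(Q_{v,t}, F_m)))
        x, _ = self.mem_cross(q_v, f_m, f_m)
        x, _ = self.mem_self(x, x, x)
        return q_v + self.ffn(x)
```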

5. Training Protocol and Loss Functions

SAM-DAQ employs strict prompt-free training:

  • All SAM2 encoder weights are fixed.
  • Parameters of DPAs, query embedding layers, memory update layers, and the mask decoder are trainable.
  • Optimization uses AdamW (learning rate 1 \times 10^{-4}, weight decay 0.05) with batch size 1 (sampling 10 frames per video per epoch) for 2,000 iterations in total. Frame resolution is 1,024 \times 1,024.
  • Data augmentation is limited to random frame sampling and resizing.

Losses comprise:

  • Final prediction: \mathcal{L}_{\mathrm{pred}} = \mathrm{BCE}(P_t, GT_t)
  • Intermediate supervision at the highest embedding level: \mathcal{L}_{\mathrm{inter}} = \mathrm{BCE}(\tilde{P}^4, GT_t)
  • Total: \mathcal{L}_{\mathrm{total}} = \lambda_1 \mathcal{L}_{\mathrm{pred}} + \lambda_2 \mathcal{L}_{\mathrm{inter}}, with \lambda_1 = 1.0 and \lambda_2 = 0.4 (see the sketch below)
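
A compact PyTorch-style sketch of this objective follows, assuming both the final prediction P_t and the intermediate map \tilde{P}^4 are raw logits; the interpolation step is an assumption for the case where the intermediate map is coarser than the ground truth.

```python
import torch.nn.functional as F

def sam_daq_loss(pred_logits, inter_logits, gt, lambda_pred=1.0, lambda_inter=0.4):
    # If the intermediate map is coarser than GT, upsample it first (assumption).
    if inter_logits.shape[-2:] != gt.shape[-2:]:
        inter_logits = F.interpolate(inter_logits, size=gt.shape[-2:],
                                     mode="bilinear", align_corners=False)
    l_pred = F.binary_cross_entropy_with_logits(pred_logits, gt)
    l_inter = F.binary_cross_entropy_with_logits(inter_logits, gt)
    return lambda_pred * l_pred + lambda_inter * l_inter
```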

6. Algorithmic Workflow

The training and inference loop can be described as:

Given: Video frames {(I_{RGB,t}, I_{D,t})}_{t=1}^T, ground-truth masks {GT_t}
Initialize: Q_f, Q_v randomly; frozen SAM2 encoder; trainable DPAs, query layers, mask decoder.
for each training iteration do
  Sample a video; draw a clip of frames t = 1…T as the batch.
  for each frame t in the clip do
    # 1. Multi-modal encoding (PAMIE)
    D_0 ← LinearDepthProj(I_{D,t})
    R_0 ← PatchEmbed(I_{RGB,t})
    for i = 1…4 do
      tildeD_{i-1} ← Adapter_D(D_{i-1})
      D_i ← Hiera^i(D_{i-1}) + DS(tildeD_{i-1})
      tildeR_{i-1} ← Adapter_RGB([R_{i-1}; D_{i-1}])
      R_i ← Hiera^i(R_{i-1}) + DS(tildeR_{i-1})
    end
    {E_I², E_I³, E_I⁴} ← FPN({R_2, R_3, R_4})

    # 2. Query-Driven Memory (QTM)
    Q'_f ← W_f^Q Q_f;  Q'_v ← W_v^Q Q_v
    E_f ← Linear( Q'_f · E_I⁴ )
    tildeQ_v ← CrossAttn(Q'_v, E_f) + Q'_v
    E_L ← concat(E_f, tildeQ_v)

    # 3. Mask decoding
    P_t ← MaskDecoder(E_I², E_I³, E_I⁴, E_L)

    # 4. Loss computation
    L_pred  ← BCE(P_t, GT_t)
    L_inter ← BCE(sigmoid(Conv(E_I⁴)), GT_t)
    L_total ← L_pred + 0.4·L_inter

    # 5. Backward (only DPAs, QTM layers, decoder)
    ∇θ ← ∂L_total/∂θ;  θ ← θ − lr·AdamW(∇θ)

    # 6. Memory update for next frame
    F_m ← Linear( MemEnc(E_I⁴, P_t) )
    Q_v ← Q_v + FFN(SelfAttn(CrossAttn(Q_v, F_m)))
  end
end

7. Experimental Evaluation and Ablation Outcomes

SAM-DAQ achieves state-of-the-art performance on three public RGB-D VSOD datasets, outperforming prior methods such as KAN-SAM across all reported metrics:

Dataset      Metric   Best Baseline   SAM-DAQ (Ours)
RDVS         E_ξ ↑    0.888           0.913
             S_α ↑    0.854           0.879
             F_β ↑    0.791           0.827
             M ↓      0.028           0.026
ViDSOD-100   E_ξ ↑    0.912           0.918
             S_α ↑    0.892           0.894
             F_β ↑    0.846           0.868
             M ↓      0.025           0.020
DViSal       E_ξ ↑    0.885           0.914
             S_α ↑    0.835           0.840
             F_β ↑    0.783           0.818
             M ↓      0.052           0.046

Ablation studies reveal:

  • Removing multi-modal DPAs reduces E_ξ from 0.913 to 0.899.
  • Sequential adapters or LoRA increase GPU memory from 21 GB to ≈95 GB without accuracy improvements.
  • Disabling the QTM update decreases E_ξ by roughly 0.03, emphasizing the necessity of temporal queries.

A plausible implication is that prompt-free depth-guided adapters and query-driven temporal memory are critical for accurate and efficient VSOD.

8. Limitations and Prospective Extensions

Current SAM-DAQ limitations include:

  • System reliance on depth data quality; noisy depth maps degrade fusion and segmentation.
  • Fixed query counts may be suboptimal for video segments with highly variable object numbers.
  • The present framework focuses on single salient object segmentation; multi-object scenarios are not explicitly addressed.

Future research may explore:

  • Adaptive query generation to accommodate dynamic object counts.
  • Mechanisms to down-select noisy depth frames or assign confidence weights to depth cues.
  • Extension of QTM to instance-level queries for multi-object segmentation.
  • Compression or sparsification of DPAs for real-time deployment in VSOD tasks.

These directions offer potential for enhancing the scalability and robustness of SAM-DAQ in complex video understanding applications.
