Depth-guided Parallel Adapters (DPAs)

Updated 16 May 2026

DPAs are lightweight modules that integrate RGB and depth features, enhancing multi-modal fusion in vision transformers such as SAM.
They employ parallel, skip-connected adapter architectures and two-layer MLP bottlenecks to inject trainable corrections efficiently.
Empirical results show that DPAs improve segmentation accuracy and reduce memory usage, enabling prompt-free adaptation for challenging tasks.

Depth-guided Parallel Adapters (DPAs) are lightweight architectural modules designed for efficient multi-modal feature fusion between RGB and depth representations, particularly within frozen vision transformers such as the Segment Anything Model (SAM). They were introduced to overcome the limitations of foundation vision models in tasks like camouflaged object detection (COD) and RGB-D video salient object detection by leveraging complementary spatial and geometric cues from depth without incurring a large memory or computation overhead. DPAs are attached to the main backbone in a parallel, skip-connected fashion, enabling prompt-free and parameter-efficient adaptation to new tasks while significantly enhancing segmentation accuracy on complex benchmarks (Liu et al., 8 Mar 2025, Lin et al., 13 Nov 2025).

1. Parallel Adapter Architecture and Integration

Depth-guided Parallel Adapters are inserted alongside frozen transformer backbones (e.g., ViT in SAM, or Hiera in SAM2). Their design uses parallel branches for RGB and depth (or for their concatenation) that run alongside the frozen feature extractors and interface with them through skip connections.

In one paradigm (Liu et al., 8 Mar 2025), the architecture maintains separate RGB and depth streams with dedicated adapter modules for each modality. For a given block ℓ, the process is:

Each modality supplies an input $X_{\ell-1}^m$ ($m\in\{\text{RGB},\text{Depth}\}$), with $X\in\mathbb{R}^{N\times C}$.
Adapter is a two-layer MLP bottleneck $(C\to d\to C)$ applied to modality-specific high-frequency features extracted via 2D Haar Discrete Wavelet Transform (DWT).
Output is added residually to the input, yielding refined $X_\ell^m$ for the next block.
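The per-modality step above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the SAM-COD implementation: the function names, the weight scale, the zero-initialized up-projection, and the ReLU activation are all choices made here, and the DWT-based high-frequency extraction is omitted so that only the bottleneck and the residual addition are shown.

```python
import numpy as np

def init_adapter(C, d, rng):
    """Two-layer bottleneck C -> d -> C (hypothetical helper).
    The up-projection starts at zero so the adapter is initially an identity
    mapping; this is a common adapter convention, assumed here."""
    return {
        "W_down": rng.normal(0.0, 0.02, size=(C, d)),
        "b_down": np.zeros(d),
        "W_up": np.zeros((d, C)),
        "b_up": np.zeros(C),
    }

def adapter_forward(X, p):
    """X: (N, C) tokens of one modality. Returns X plus the bottleneck correction."""
    h = np.maximum(X @ p["W_down"] + p["b_down"], 0.0)  # ReLU; GELU is another variant
    return X + h @ p["W_up"] + p["b_up"]                # residual addition

rng = np.random.default_rng(0)
N, C, d = 16, 64, 8  # illustrative sizes only

# One dedicated adapter per modality stream.
adapters = {m: init_adapter(C, d, rng) for m in ("RGB", "Depth")}
X_prev = {m: rng.normal(size=(N, C)) for m in ("RGB", "Depth")}
X_next = {m: adapter_forward(X_prev[m], adapters[m]) for m in ("RGB", "Depth")}
```

With the zero-initialized up-projection, each stream initially passes its tokens through unchanged, so training starts from the frozen backbone's behavior.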

Alternatively, in a multi-modal fusion context (Lin et al., 13 Nov 2025), the DPA concatenates $F_\text{RGB}^{i-1}$ and $F_D^{i-1}$ along the channel dimension. The concatenated tensor passes through a bottleneck adapter (Linear↓–GELU–Linear↑) to produce $\widetilde F_\text{RD}^{\,i-1}$, optionally followed by spatial downsampling (DS). The result is then summed with the corresponding frozen backbone stage output, yielding $F_\text{RGB}^i = \mathrm{Hiera}^i(F_\text{RGB}^{i-1}) + \mathrm{DS}(\widetilde F_\text{RD}^{\,i-1})$.
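A rough NumPy sketch of this fusion step is given below. Several pieces are assumptions made for illustration: `frozen_stage` is a hypothetical stand-in for a Hiera stage (pooling followed by a fixed linear map), DS is modeled as 2×2 average pooling, the up-projection width is chosen to match the stage output so the sum is defined, and all sizes are arbitrary.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def avg_pool2x2(F):
    """F: (H, W, C) with even H and W -> (H/2, W/2, C). Assumed stand-in for DS."""
    H, W, C = F.shape
    return F.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

def frozen_stage(F, W_frozen):
    """Hypothetical placeholder for a frozen Hiera stage: pool, then a fixed linear map."""
    return avg_pool2x2(F) @ W_frozen

def dpa(F_rgb, F_d, W_down, W_up):
    """Concatenate along channels, then Linear-down, GELU, Linear-up."""
    fused = np.concatenate([F_rgb, F_d], axis=-1)  # (H, W, 2C)
    return gelu(fused @ W_down) @ W_up             # (H, W, C_out)

rng = np.random.default_rng(0)
H, W, C, C_out, r = 8, 8, 32, 64, 4
F_rgb = rng.normal(size=(H, W, C))
F_d = rng.normal(size=(H, W, C))
W_frozen = rng.normal(size=(C, C_out)) * 0.1      # backbone weights: never updated
W_down = rng.normal(size=(2 * C, C // r)) * 0.02  # trainable adapter weights
W_up = rng.normal(size=(C // r, C_out)) * 0.02    # trainable adapter weights

# F_RGB^i = Hiera^i(F_RGB^{i-1}) + DS(adapter(Cat(F_RGB^{i-1}, F_D^{i-1})))
F_rgb_next = frozen_stage(F_rgb, W_frozen) + avg_pool2x2(dpa(F_rgb, F_d, W_down, W_up))
```

The adapter branch reads the same inputs as the frozen stage and only meets it again at the sum, which is what makes the topology parallel rather than sequential.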

2. Mathematical Formulation and Bottleneck Design

The essential function of a DPA is to inject trainable, low-rank corrections into frozen feature maps via lightweight bottlenecked projections:

Channel fusion stage: $\mathrm{Cat}(F_\text{RGB}^{i-1}, F_D^{i-1})$
Down-projection: $2C_i \to C_i/r$ (bottleneck with reduction ratio $r$)
Nonlinearity: GELU (or ReLU, depending on variant)
Up-projection: from the bottleneck width back to the channel width of the backbone features it is added to
Residual addition: output is added to the transformer's output at each targeted layer.
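To make the "lightweight, low-rank" claim concrete, the parameter count of such a bottleneck can be compared with a single dense projection of the same input and output width. The sketch below is plain arithmetic; the helper name, the values of `C` and `r`, and the assumption that the up-projection returns to `C` channels are illustrative and not taken from either paper.

```python
def bottleneck_params(c_in, c_hidden, c_out, bias=True):
    """Parameter count of Linear(c_in -> c_hidden) followed by Linear(c_hidden -> c_out)."""
    down = c_in * c_hidden + (c_hidden if bias else 0)
    up = c_hidden * c_out + (c_out if bias else 0)
    return down + up

C, r = 256, 4  # illustrative channel width and reduction ratio

# Adapter: 2C -> C/r -> C (the correction has rank at most C/r).
adapter = bottleneck_params(2 * C, C // r, C)

# Reference: one dense layer 2C -> C with bias.
dense = 2 * C * C + C

print(f"adapter: {adapter} params, dense: {dense} params, ratio: {adapter / dense:.2f}")
```

Larger reduction ratios shrink the adapter further at the cost of a lower-rank correction.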

In SAM-COD (Liu et al., 8 Mar 2025), the adapter operates directly on high-frequency content extracted by DWT. For each token embedding, a high-frequency map is assembled from the three Haar high-frequency subbands (conventionally denoted LH, HL, and HH). This processed map is then bottlenecked and residually added back.
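For readers unfamiliar with the transform, a single-level 2D Haar DWT can be written in a few lines of NumPy. This is a generic textbook sketch, not SAM-COD code: the orthonormal 1/2 scaling, the subband labels (naming conventions for LH and HL differ between libraries), and stacking the three high-frequency subbands along a trailing axis are all assumptions made here.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2D Haar DWT of an (H, W) array with even H and W.
    Returns (LL, LH, HL, HH), each of shape (H/2, W/2)."""
    a = x[0::2, 0::2]  # top-left sample of each 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    LL = (a + b + c + d) / 2.0  # low-pass in both directions
    LH = (a - b + c - d) / 2.0  # difference across columns
    HL = (a + b - c - d) / 2.0  # difference across rows
    HH = (a - b - c + d) / 2.0  # diagonal difference
    return LL, LH, HL, HH

x = np.arange(16.0).reshape(4, 4)
LL, LH, HL, HH = haar_dwt2(x)

# Assumed way of packaging the high-frequency content for an adapter: (H/2, W/2, 3).
high_freq = np.stack([LH, HL, HH], axis=-1)
```

On a constant image all three high-frequency subbands are zero, which is why they isolate edges and texture rather than overall intensity.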

In SAM-DAQ (Lin et al., 13 Nov 2025), no explicit frequency separation is performed; instead, direct spatial fusion via concatenation and an MLP bottleneck suffices, making the design applicable to RGB-D video salient object detection.

3. Placement and Skip-connection Strategies

DPAs are always deployed in parallel to frozen transformer blocks, ensuring:

Gradients flow exclusively through the small adapter modules and their adjacent normalization layers (not through the backbone), which limits training-time activation memory.
Both per-modality processing (as in dual-stream) and cross-modal fusion (as in concatenation-adapter) are possible depending on downstream task requirements.
In practice, DPAs are placed after the first block and replicated at each deeper stage/block, enabling hierarchical depth-guided refinement.

In the Parallel Adapter-based Multi-modal Image Encoder (PAMIE) of SAM-DAQ, DPAs appear at successive stages of the Hiera encoder. Each DPA adds its output to the corresponding backbone stage, facilitating prompt-free training and streamlined inference.

4. Training Paradigms and Loss Functions

The DPA-empowered frameworks are optimized using task-specific loss functions, often jointly supervising both RGB and depth predictions:

In SAM-COD (Liu et al., 8 Mar 2025): Joint Dice and Cross-Entropy segmentation losses per stream, plus a KL divergence-based distillation objective to enable bidirectional knowledge transfer. The total loss combines a segmentation term and a distillation term.
KL distillation is employed both from a teacher model (e.g., pretrained PVTv2 for RGB) and between streams (RGB $\leftrightarrow$ Depth) to align representations and promote stronger multi-modal coupling.
In SAM-DAQ (Lin et al., 13 Nov 2025), training is prompt-free: the DPAs serve as the mechanism for injecting “self-prompted” depth cues into a frozen SAM2 encoder, requiring no manual box/point annotations.
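The SAM-COD-style objective described above can be illustrated with a small NumPy sketch. It is a sketch under assumptions, not the paper's loss: the function names, the weighting factor `lam`, the use of binary cross-entropy on foreground probabilities, and the symmetric pixel-wise KL between the two streams are choices made here, and the teacher-model distillation term is left out.

```python
import numpy as np

def dice_loss(p, y, eps=1e-6):
    """Soft Dice loss between a probability map p and a binary mask y."""
    inter = np.sum(p * y)
    return float(1.0 - (2.0 * inter + eps) / (np.sum(p) + np.sum(y) + eps))

def bce_loss(p, y, eps=1e-7):
    """Binary cross-entropy averaged over pixels."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def kl_bernoulli(p, q, eps=1e-7):
    """Mean pixel-wise KL(p || q) between two foreground-probability maps."""
    p = np.clip(p, eps, 1.0 - eps)
    q = np.clip(q, eps, 1.0 - eps)
    return float(np.mean(p * np.log(p / q) + (1.0 - p) * np.log((1.0 - p) / (1.0 - q))))

def total_loss(p_rgb, p_depth, y, lam=1.0):
    """Per-stream Dice + CE, plus bidirectional KL between streams (lam is hypothetical)."""
    seg = sum(dice_loss(p, y) + bce_loss(p, y) for p in (p_rgb, p_depth))
    kd = kl_bernoulli(p_rgb, p_depth) + kl_bernoulli(p_depth, p_rgb)  # RGB <-> Depth
    return seg + lam * kd

y = np.array([[1.0, 0.0], [0.0, 1.0]])          # toy ground-truth mask
p_rgb = np.array([[0.9, 0.2], [0.1, 0.8]])      # toy RGB-stream prediction
p_depth = np.array([[0.7, 0.3], [0.4, 0.6]])    # toy depth-stream prediction
loss = total_loss(p_rgb, p_depth, y)
```

In an autodiff framework the distillation target would typically be detached from the graph; that detail has no counterpart in this gradient-free sketch.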

The only trainable parameters are those of (a) DPAs (adapter MLPs), (b) adjacent normalization, (c) prompt encoder convs (in mask prediction), and (d) mask decoders.

5. Empirical Effectiveness and Efficiency

Both SAM-COD and SAM-DAQ demonstrate that DPAs yield significant performance improvements with modest parameter overhead:

On COD10K, deploying dual-stream DPAs with bidirectional knowledge distillation and mixed-prompt embedding outperforms vanilla SAM by up to +10.4% on the reported segmentation metrics (Liu et al., 8 Mar 2025).
Ablations indicate that DPAs alone account for part of this gain, while knowledge distillation and hybrid prompting contribute further additive improvements.
In SAM-DAQ, the full DPA-equipped PAMIE achieves the strongest overall results, outperforming variants lacking depth, parallelization, or multi-modal fusion. Its memory usage is 21.0 GB in the ablation table, compared to >90 GB for sequential or LoRA-based alternatives (Lin et al., 13 Nov 2025).

Quantitative ablation summary from (Lin et al., 13 Nov 2025):

Adapter Variant	Trainable/Total Params	GPU Mem (GB)	Metric 1	Metric 2	Metric 3	Metric 4
w/o depth projector	— / 237.9M	20.3	0.899	0.870	0.808	0.023
w/o parallel (sequential)	17.4M / 236.0M	91.9	0.860	0.830	0.778	0.028
w/o parallel (LoRA)	56.0M / 274.6M	95.0	0.889	0.877	0.824	0.027
w/o multi-modal fusion	— / 237.9M	17.9	0.876	0.853	0.782	0.029
Ours (parallel DPA + depth)	19.2M / 237.9M	21.0	0.913	0.879	0.827	0.026
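
The parameter and memory columns of the table can be turned into ratios with a few lines of Python. The numbers below are copied from the rows that report a trainable-parameter count; the variable names are illustrative only.

```python
# (trainable params in M, total params in M, GPU memory in GB), copied from the table
variants = {
    "w/o parallel (sequential)": (17.4, 236.0, 91.9),
    "w/o parallel (LoRA)": (56.0, 274.6, 95.0),
    "Ours (parallel DPA + depth)": (19.2, 237.9, 21.0),
}

# Share of parameters that are trainable, in percent.
fractions = {name: 100.0 * t / total for name, (t, total, _) in variants.items()}

# GPU memory of each variant relative to the parallel DPA configuration.
ours_mem = variants["Ours (parallel DPA + depth)"][2]
mem_ratio = {name: mem / ours_mem for name, (_, _, mem) in variants.items()}

for name in variants:
    print(f"{name}: {fractions[name]:.1f}% trainable, {mem_ratio[name]:.1f}x memory")
```

The parallel configuration trains roughly 8% of its parameters, while the sequential and LoRA variants need more than four times its GPU memory.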

6. Design Implications and Practical Considerations

DPAs add only a small overhead to model size (typically a few percent per block, ∼8% total in SAM-DAQ), making them highly appealing for large-scale deployments. Gradient routing solely through adapters minimizes computational overhead, enabling high-resolution training on commodity GPUs. The parallel, skip-connection topology is central for limiting backward-pass memory.

DPAs facilitate scenarios that previously suffered from foundation model inflexibility:

Robust prompt-free adaptation, where no manual annotation is required, as the adapters themselves “guide” the backbone with depth.
Parameter-efficient specialization for modalities like RGB-D, especially critical in tasks like camouflaged or salient object detection where spatial/geometric cues are complementary.

A plausible implication is that DPA-style adapters may generalize to audio-visual or multi-sensor fusion domains with similar design logic.

DPAs mark a distinct trend toward minimally invasive multi-modal adaptation strategies for frozen vision transformers. Rather than retraining or unfreezing large pre-trained backbones, performance-critical applications can be realized by injecting highly specialized adapters in parallel, demonstrating high sample efficiencies and superior performance with minimal hardware requirements (Liu et al., 8 Mar 2025, Lin et al., 13 Nov 2025).

The explicit demonstration of prompt-free, depth-guided fine-tuning establishes a new baseline for resource-constrained and annotation-limited environments. DPAs' design also avoids the marked performance/memory trade-offs observed for sequential adapters or LoRA-based solutions, as confirmed by ablation studies.

These results position DPAs as a compelling architectural primitive for vision foundation models, with empirical evidence that they help bridge inter-modality gaps and reach state-of-the-art segmentation.


References (2)

Improving SAM for Camouflaged Object Detection via Dual Stream Adapters (2025)

SAM-DAQ: Segment Anything Model with Depth-guided Adaptive Queries for RGB-D Video Salient Object Detection (2025)
