PAMIE: Parallel Adapter-based Multi-modal Image Encoder
- PAMIE is a module for multi-modal fusion in RGB-D video salient object detection, using parallel depth-guided adapters to integrate spatial cues.
- It freezes the SAM2 encoder and introduces lightweight adapters alongside transformer blocks to significantly reduce GPU memory consumption.
- Experiments demonstrate PAMIE's effectiveness with improved accuracy metrics on RGB-D benchmarks while enabling prompt-free training.
The Parallel Adapter-based Multi-modal Image Encoder (PAMIE) is a module designed for efficient multi-modal feature fusion and adaptation in the context of RGB-D video salient object detection (VSOD). Developed as a central component of the SAM-DAQ architecture, PAMIE enables fine-tuning of a large, frozen vision foundation model (SAM2) for VSOD without manual prompts or prohibitive GPU memory consumption. PAMIE explicitly integrates depth cues at multiple scales through parallel, lightweight depth-guided adapters, optimizing the fusion of complementary spatial information from RGB and depth modalities.
1. Motivation and Design Principles
PAMIE targets three primary challenges in adapting vision foundation models like SAM2 for RGB-D VSOD: (i) eliminating dependence on manual prompts, (ii) mitigating excessive GPU memory usage associated with sequential adapters, and (iii) exploiting nuanced spatial cues from both RGB and depth modalities across scales. Previous approaches have relied on noisy pseudo-prompts or sequential adapter schemes (e.g., Houlsby adapters, LoRA), the latter requiring gradient backpropagation through the full encoder and incurring memory costs exceeding 90 GB on standard GPUs. Further, simplistic early/late fusion of depth and RGB fails to capitalize on their complementary features at multiple spatial resolutions.
PAMIE addresses these limitations by:
- Freezing all weights in the SAM2 encoder
- Inserting parallel, depth-guided adapters ("DPAs") alongside each transformer block in a skip-connection manner
- Using depth features to guide multi-modal fusion within each adapter
This design supports prompt-free training and scalable deployment for large video datasets and enables effective utilization of foundation model representations for multi-modal tasks.
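To make the adapter placement concrete, the following is a minimal PyTorch sketch contrasting sequential insertion with the parallel, skip-connection placement used by PAMIE; the `Adapter` bottleneck and the `nn.Linear` stand-in for a frozen SAM2 block are illustrative assumptions, not the SAM-DAQ implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter (stand-in; the depth-guided variant is described in Section 3)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

dim = 64
block = nn.Linear(dim, dim).requires_grad_(False)  # stand-in for one frozen SAM2 transformer block
adapter = Adapter(dim)                             # trainable
x = torch.randn(2, 196, dim)                       # (batch, tokens, channels)

# Sequential insertion: the adapter sits on the main path after the block.
y_sequential = adapter(block(x))

# Parallel, skip-connection insertion (PAMIE): the adapter branches off the
# block's input, and its output is added to the frozen block's output.
y_parallel = block(x) + adapter(x)
```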
2. Architectural Overview
PAMIE operates within the SAM-DAQ framework and processes each video frame through parallel RGB and depth branches. Let $I \in \mathbb{R}^{H \times W \times 3}$ be the RGB frame and $D \in \mathbb{R}^{H \times W \times 1}$ be its co-registered depth map.
- Depth Projection: The depth map is projected into the SAM feature space via a "depth projector", $F_d^{0} = \mathrm{DepthProj}(D) \in \mathbb{R}^{h \times w \times C}$,
where $C$ is the encoder's hidden dimension (as configured in SAM-Large).
- Hierarchical Feature Extraction:
The frozen SAM encoder comprises four transformer blocks, denoted $T_1, \dots, T_4$. At each level $i$:
  - Depth Branch: $F_d^{i} = T_i\big(F_d^{i-1}\big)$
  - RGB Branch with Depth Guidance: $F_r^{i} = T_i\big(F_r^{i-1}\big) + \mathrm{DPA}_i\big(\mathrm{Cat}(F_r^{i-1}, F_d^{i-1})\big)$
Here, $\mathrm{Cat}(\cdot, \cdot)$ denotes channel-wise concatenation, and $F_r^{0}$ is the patch embedding of $I$.
At the first hierarchy ($i = 1$), the RGB branch omits the adapter: $F_r^{1} = T_1\big(F_r^{0}\big)$.
- Multi-scale Feature Generation: After $T_2$, $T_3$, and $T_4$, feature maps are extracted via feature pyramid fusion: $\{P_2, P_3, P_4\} = \mathrm{FPN}\big(F_r^{2}, F_r^{3}, F_r^{4}\big)$.
Each $P_i$ feeds subsequent temporal modeling and intermediate supervision modules.
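The PyTorch sketch below traces this hierarchical forward pass under simplifying assumptions: four frozen stages of equal width (the real SAM2 hierarchy changes spatial resolution and channel count per stage), a trainable depth projector, one adapter per level from the second onward, and adapters abbreviated to single linear layers (their full structure is given in Section 3). Class and attribute names such as `PAMIESketch` and `depth_proj` are illustrative, not taken from the SAM-DAQ code.

```python
import torch
import torch.nn as nn

class PAMIESketch(nn.Module):
    """Toy version of PAMIE: frozen stages T_1..T_4, a depth projector, and parallel adapters."""
    def __init__(self, dim: int = 64, num_stages: int = 4):
        super().__init__()
        # Stand-ins for the four frozen SAM2 transformer stages.
        self.stages = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_stages)])
        self.stages.requires_grad_(False)
        # Trainable parts only: the depth projector and the adapters for levels 2..4.
        self.depth_proj = nn.Linear(1, dim)
        self.adapters = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(num_stages - 1)]  # abbreviated DPAs
        )

    def forward(self, rgb_tokens, depth):
        f_r = rgb_tokens              # F_r^0: patch-embedded RGB tokens (assumed given)
        f_d = self.depth_proj(depth)  # F_d^0: depth projected into the encoder feature space
        pyramid = []
        for i, stage in enumerate(self.stages):
            if i == 0:
                f_r_next = stage(f_r)  # level 1: RGB branch omits the adapter
            else:
                fused = torch.cat([f_r, f_d], dim=-1)                # Cat(F_r^{i-1}, F_d^{i-1})
                f_r_next = stage(f_r) + self.adapters[i - 1](fused)  # frozen block + parallel adapter
            f_d = stage(f_d)          # depth branch: frozen block only
            f_r = f_r_next
            if i >= 1:
                pyramid.append(f_r)   # features after T_2, T_3, T_4 feed the feature pyramid
        return pyramid

model = PAMIESketch()
rgb = torch.randn(2, 196, 64)    # (batch, tokens, channels) after patch embedding
depth = torch.randn(2, 196, 1)   # per-token depth values
multi_scale_feats = model(rgb, depth)  # three feature maps for pyramid fusion / temporal modules
```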
3. Depth-guided Parallel Adapters (DPAs)
Each Adapter module in PAMIE comprises a two-layer bottleneck MLP:
- DownProj: a linear layer that reduces the adapter input from $C_i$ channels to a bottleneck of $C_i / r$ channels
- Activation: GELU non-linearity
- UpProj: a linear layer that projects the bottleneck features back up to the block's output dimension
Here $C_i$ corresponds to the adapter's input channel dimension at level $i$, which increases across the SAM-L hierarchy, and the bottleneck factor $r$ is set to $4$ by default, giving a bottleneck hidden dimension of $C_i / 4$ at each level. Only the DPAs and the DepthProj are trained, while the SAM-L encoder remains entirely frozen.
This parallel adapter configuration—rather than sequential insertion—promotes memory-efficient fine-tuning and facilitates direct fusion of RGB and depth features using spatial context at each scale.
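As a reference point, a DPA matching the DownProj → GELU → UpProj description above could look like the following PyTorch module. The token-wise linear projections and the assumption that the channel-wise concatenation of RGB and depth features doubles the DownProj input width are illustrative choices, not details confirmed by the source.

```python
import torch
import torch.nn as nn

class DepthGuidedParallelAdapter(nn.Module):
    """Two-layer bottleneck MLP: DownProj -> GELU -> UpProj, with bottleneck factor r (default 4)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction                    # bottleneck hidden dimension C_i / r
        self.down_proj = nn.Linear(2 * channels, hidden)  # RGB and depth concatenated channel-wise (assumption)
        self.act = nn.GELU()
        self.up_proj = nn.Linear(hidden, channels)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_rgb, f_depth], dim=-1)           # Cat(F_r^{i-1}, F_d^{i-1})
        return self.up_proj(self.act(self.down_proj(x)))

# Usage: one adapter per encoder level i >= 2; its output is added to the
# frozen block's output in the RGB branch (skip-connection placement).
dpa = DepthGuidedParallelAdapter(channels=64)
f_r = torch.randn(2, 196, 64)
f_d = torch.randn(2, 196, 64)
delta = dpa(f_r, f_d)   # same shape as f_r; summed with T_i(F_r^{i-1})
```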
4. Training and Efficiency Considerations
PAMIE dramatically reduces training resource requirements compared to traditional adapter-based PEFT schemes. With 19.2 M trainable parameters (out of 237.9 M in total), a batch size of $1$, and the AdamW optimizer, a typical training session (2000 iterations) completes in approximately 3 hours on a single RTX 3090 GPU.
Resource comparison from single 24 GB GPU training:
| Strategy | Trainable/Total (M) | GPU Mem (GB) |
|---|---|---|
| Sequential Houlsby adapters | 17.4 / 236.0 | 91.9 |
| LoRA in each attention block | 56.0 / 274.6 | 95.0 |
| PAMIE | 19.2 / 237.9 | 21.0 |
Despite freezing the encoder, PAMIE enables effective multi-modal fusion, reducing GPU memory requirements by approximately a factor of four compared to sequential adapters.
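As a minimal illustration of this parameter-efficient setup, the sketch below freezes a toy encoder, keeps only adapter-like modules and a depth projector trainable, counts both parameter groups, and hands just the trainable subset to AdamW. All modules and sizes are placeholders, so the printed counts are not the 19.2 M / 237.9 M reported for PAMIE, and the learning rate and weight decay are left at library defaults rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Toy stand-in: a "frozen encoder" plus small trainable adapter modules.
encoder = nn.Sequential(*[nn.Linear(64, 64) for _ in range(4)]).requires_grad_(False)
adapters = nn.ModuleList([nn.Linear(128, 64) for _ in range(3)])  # trainable
depth_proj = nn.Linear(1, 64)                                     # trainable
model = nn.ModuleDict({"encoder": encoder, "adapters": adapters, "depth_proj": depth_proj})

trainable = [p for p in model.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {n_trainable:,} / total: {n_total:,}")

# Only the trainable subset is registered with the optimizer: no gradients or
# AdamW moment buffers are ever allocated for the frozen encoder weights.
optimizer = torch.optim.AdamW(trainable)
```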
5. Performance Impact on RGB-D VSOD
Integrating PAMIE into SAM-DAQ produces high accuracy on three major RGB-D VSOD benchmarks (RDVS, ViDSOD-100, DViSal), outperforming the prior state-of-the-art KAN-SAM method. The quantitative improvements are, on average, +1.5% in E-measure, +1.0% in S-measure, +2.4% in F-measure, and a 0.003 reduction in MAE.
Each cell reports results on RDVS / ViDSOD-100 / DViSal.
| Method | E-measure ↑ | S-measure ↑ | F-measure ↑ | MAE ↓ |
|---|---|---|---|---|
| KAN-SAM | .888 / .912 / .885 | .854 / .892 / .835 | .791 / .846 / .783 | .028 / .025 / .052 |
| SAM-DAQ (PAMIE) | .913 / .918 / .914 | .879 / .894 / .840 | .827 / .868 / .818 | .026 / .020 / .046 |
This demonstrates that PAMIE's prompt-free, depth-guided parallel adapters enable memory-efficient utilization of SAM2's encoder in video salient object detection tasks.
6. Practical and Methodological Implications
PAMIE exemplifies the synergy of PEFT and multi-modal fusion strategies in large-scale video analysis. By freezing the base encoder and using parallel adapters, the technique drastically curtails memory usage while preserving high-performance feature fusion. The explicit depth guidance further ensures robust multi-scale spatial context modeling, which is critical for VSOD.
A plausible implication is that similar parallel adapter architectures may benefit other multi-modal, prompt-free extension tasks for foundational vision models, particularly where resource limitations or noisy manual prompt generation are obstacles.
7. Connections to Related Research
PAMIE advances adapter-based PEFT methods by addressing notorious runtime bottlenecks in transformer-based vision models. Its design rationales respond directly to the limitations observed in prompt-free fine-tuning approaches and sequential adapter strategies for segmentation models. The approach aligns with the growing emphasis on leveraging frozen foundation models for downstream multi-modal tasks, while minimizing resource footprints and manual intervention.