PAMIE: Parallel Adapter-based Multi-modal Image Encoder
- PAMIE is a module for multi-modal fusion in RGB-D video salient object detection, using parallel depth-guided adapters to integrate spatial cues.
- It freezes the SAM2 encoder and introduces lightweight adapters alongside transformer blocks to significantly reduce GPU memory consumption.
- Experiments demonstrate PAMIE's effectiveness with improved accuracy metrics on RGB-D benchmarks while enabling prompt-free training.
The Parallel Adapter-based Multi-modal Image Encoder (PAMIE) is a module designed for efficient multi-modal feature fusion and adaptation in the context of RGB-D video salient object detection (VSOD). Developed as a central component of the SAM-DAQ architecture, PAMIE enables fine-tuning of a large, frozen vision foundation model (SAM2) for VSOD without manual prompts or prohibitive GPU memory consumption. PAMIE explicitly integrates depth cues at multiple scales through parallel, lightweight depth-guided adapters, optimizing the fusion of complementary spatial information from RGB and depth modalities.
1. Motivation and Design Principles
PAMIE targets three primary challenges in adapting vision foundation models like SAM2 for RGB-D VSOD: (i) eliminating dependence on manual prompts, (ii) mitigating excessive GPU memory usage associated with sequential adapters, and (iii) exploiting nuanced spatial cues from both RGB and depth modalities across scales. Previous approaches have relied on noisy pseudo-prompts or sequential adapter schemes (e.g., Houlsby adapters, LoRA), the latter requiring gradient backpropagation through the full encoder and incurring memory costs exceeding 90 GB on standard GPUs. Further, simplistic early/late fusion of depth and RGB fails to capitalize on their complementary features at multiple spatial resolutions.
PAMIE addresses these limitations by:
- Freezing all weights in the SAM2 encoder
- Inserting parallel, depth-guided adapters ("DPAs") alongside each transformer block in a skip-connection manner
- Using depth features to guide multi-modal fusion within each adapter
This design supports prompt-free training and scalable deployment for large video datasets and enables effective utilization of foundation model representations for multi-modal tasks.
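To make the adapter placement concrete, the following is a minimal PyTorch sketch contrasting sequential insertion with the parallel, skip-connection placement used by PAMIE; the `Adapter` bottleneck and the `nn.Linear` stand-in for a frozen SAM2 block are illustrative assumptions, not the SAM-DAQ implementation.

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Generic bottleneck adapter (stand-in; the depth-guided variant is described in Section 3)."""
    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim // reduction),
            nn.GELU(),
            nn.Linear(dim // reduction, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

dim = 64
block = nn.Linear(dim, dim).requires_grad_(False)  # stand-in for one frozen SAM2 transformer block
adapter = Adapter(dim)                             # trainable
x = torch.randn(2, 196, dim)                       # (batch, tokens, channels)

# Sequential insertion: the adapter sits on the main path after the block.
y_sequential = adapter(block(x))

# Parallel, skip-connection insertion (PAMIE): the adapter branches off the
# block's input, and its output is added to the frozen block's output.
y_parallel = block(x) + adapter(x)
```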
2. Architectural Overview
PAMIE operates within the SAM-DAQ framework and processes each video frame through parallel RGB and depth branches. Let $I \in \mathbb{R}^{H \times W \times 3}$ be the RGB frame and $D \in \mathbb{R}^{H \times W \times 1}$ be its co-registered depth map.
- Depth Projection: The depth map is projected into the SAM feature space via a "depth projector", $F_d^{0} = \mathrm{DepthProj}(D) \in \mathbb{R}^{h \times w \times C}$,
where $C$ is the encoder's hidden dimension (as configured in SAM-Large).
- Hierarchical Feature Extraction:
The frozen SAM encoder comprises four transformer blocks, denoted $T_1, \dots, T_4$. At each level $i$:
  - Depth Branch: $F_d^{i} = T_i\big(F_d^{i-1}\big)$
  - RGB Branch with Depth Guidance: $F_r^{i} = T_i\big(F_r^{i-1}\big) + \mathrm{DPA}_i\big(\mathrm{Cat}(F_r^{i-1}, F_d^{i-1})\big)$
Here, $\mathrm{Cat}(\cdot, \cdot)$ denotes channel-wise concatenation, and $F_r^{0}$ is the patch embedding of $I$.
At the first hierarchy ($i = 1$), the RGB branch omits the adapter: $F_r^{1} = T_1\big(F_r^{0}\big)$.
- Multi-scale Feature Generation: After $T_2$, $T_3$, and $T_4$, feature maps are extracted via feature pyramid fusion: $\{P_2, P_3, P_4\} = \mathrm{FPN}\big(F_r^{2}, F_r^{3}, F_r^{4}\big)$.
Each $P_i$ feeds subsequent temporal modeling and intermediate supervision modules.
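The PyTorch sketch below traces this hierarchical forward pass under simplifying assumptions: four frozen stages of equal width (the real SAM2 hierarchy changes spatial resolution and channel count per stage), a trainable depth projector, one adapter per level from the second onward, and adapters abbreviated to single linear layers (their full structure is given in Section 3). Class and attribute names such as `PAMIESketch` and `depth_proj` are illustrative, not taken from the SAM-DAQ code.

```python
import torch
import torch.nn as nn

class PAMIESketch(nn.Module):
    """Toy version of PAMIE: frozen stages T_1..T_4, a depth projector, and parallel adapters."""
    def __init__(self, dim: int = 64, num_stages: int = 4):
        super().__init__()
        # Stand-ins for the four frozen SAM2 transformer stages.
        self.stages = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_stages)])
        self.stages.requires_grad_(False)
        # Trainable parts only: the depth projector and the adapters for levels 2..4.
        self.depth_proj = nn.Linear(1, dim)
        self.adapters = nn.ModuleList(
            [nn.Linear(2 * dim, dim) for _ in range(num_stages - 1)]  # abbreviated DPAs
        )

    def forward(self, rgb_tokens, depth):
        f_r = rgb_tokens              # F_r^0: patch-embedded RGB tokens (assumed given)
        f_d = self.depth_proj(depth)  # F_d^0: depth projected into the encoder feature space
        pyramid = []
        for i, stage in enumerate(self.stages):
            if i == 0:
                f_r_next = stage(f_r)  # level 1: RGB branch omits the adapter
            else:
                fused = torch.cat([f_r, f_d], dim=-1)                # Cat(F_r^{i-1}, F_d^{i-1})
                f_r_next = stage(f_r) + self.adapters[i - 1](fused)  # frozen block + parallel adapter
            f_d = stage(f_d)          # depth branch: frozen block only
            f_r = f_r_next
            if i >= 1:
                pyramid.append(f_r)   # features after T_2, T_3, T_4 feed the feature pyramid
        return pyramid

model = PAMIESketch()
rgb = torch.randn(2, 196, 64)    # (batch, tokens, channels) after patch embedding
depth = torch.randn(2, 196, 1)   # per-token depth values
multi_scale_feats = model(rgb, depth)  # three feature maps for pyramid fusion / temporal modules
```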
3. Depth-guided Parallel Adapters (DPAs)
Each Adapter module in PAMIE comprises a two-layer bottleneck MLP:
- DownProj: a linear layer that reduces the adapter input from $C_i$ channels to a bottleneck of $C_i / r$ channels
- Activation: GELU non-linearity
- UpProj: a linear layer that projects the bottleneck features back up to the block's output dimension
Here $C_i$ corresponds to the adapter's input channel dimension at level $i$, which increases across the SAM-L hierarchy, and the bottleneck factor $r$ is set to $4$ by default, giving a bottleneck hidden dimension of $C_i / 4$ at each level. Only the DPAs and the DepthProj are trained, while the SAM-L encoder remains entirely frozen.
This parallel adapter configuration—rather than sequential insertion—promotes memory-efficient fine-tuning and facilitates direct fusion of RGB and depth features using spatial context at each scale.
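As a reference point, a DPA matching the DownProj → GELU → UpProj description above could look like the following PyTorch module. The token-wise linear projections and the assumption that the channel-wise concatenation of RGB and depth features doubles the DownProj input width are illustrative choices, not details confirmed by the source.

```python
import torch
import torch.nn as nn

class DepthGuidedParallelAdapter(nn.Module):
    """Two-layer bottleneck MLP: DownProj -> GELU -> UpProj, with bottleneck factor r (default 4)."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        hidden = channels // reduction                    # bottleneck hidden dimension C_i / r
        self.down_proj = nn.Linear(2 * channels, hidden)  # RGB and depth concatenated channel-wise (assumption)
        self.act = nn.GELU()
        self.up_proj = nn.Linear(hidden, channels)

    def forward(self, f_rgb: torch.Tensor, f_depth: torch.Tensor) -> torch.Tensor:
        x = torch.cat([f_rgb, f_depth], dim=-1)           # Cat(F_r^{i-1}, F_d^{i-1})
        return self.up_proj(self.act(self.down_proj(x)))

# Usage: one adapter per encoder level i >= 2; its output is added to the
# frozen block's output in the RGB branch (skip-connection placement).
dpa = DepthGuidedParallelAdapter(channels=64)
f_r = torch.randn(2, 196, 64)
f_d = torch.randn(2, 196, 64)
delta = dpa(f_r, f_d)   # same shape as f_r; summed with T_i(F_r^{i-1})
```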
4. Training and Efficiency Considerations
PAMIE dramatically reduces training resource requirements compared to traditional adapter-based PEFT schemes. With 19.2 M trainable parameters (out of 237.9 M in total), a batch size of $1$, and the AdamW optimizer, a typical training session (2000 iterations) completes in approximately 3 hours on a single RTX 3090 GPU.
Resource comparison from single 24 GB GPU training:
| Strategy | Trainable/Total (M) | GPU Mem (GB) |
|---|---|---|
| Sequential Houlsby adapters | 17.4 / 236.0 | 91.9 |
| LoRA in each attention block | 56.0 / 274.6 | 95.0 |
| PAMIE | 19.2 / 237.9 | 21.0 |
Despite freezing the encoder, PAMIE enables effective multi-modal fusion, reducing GPU memory requirements by approximately a factor of four compared to sequential adapters.
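As a minimal illustration of this parameter-efficient setup, the sketch below freezes a toy encoder, keeps only adapter-like modules and a depth projector trainable, counts both parameter groups, and hands just the trainable subset to AdamW. All modules and sizes are placeholders, so the printed counts are not the 19.2 M / 237.9 M reported for PAMIE, and the learning rate and weight decay are left at library defaults rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Toy stand-in: a "frozen encoder" plus small trainable adapter modules.
encoder = nn.Sequential(*[nn.Linear(64, 64) for _ in range(4)]).requires_grad_(False)
adapters = nn.ModuleList([nn.Linear(128, 64) for _ in range(3)])  # trainable
depth_proj = nn.Linear(1, 64)                                     # trainable
model = nn.ModuleDict({"encoder": encoder, "adapters": adapters, "depth_proj": depth_proj})

trainable = [p for p in model.parameters() if p.requires_grad]
n_trainable = sum(p.numel() for p in trainable)
n_total = sum(p.numel() for p in model.parameters())
print(f"trainable: {n_trainable:,} / total: {n_total:,}")

# Only the trainable subset is registered with the optimizer: no gradients or
# AdamW moment buffers are ever allocated for the frozen encoder weights.
optimizer = torch.optim.AdamW(trainable)
```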
5. Performance Impact on RGB-D VSOD
Integrating PAMIE into SAM-DAQ produces high accuracy on three major RGB-D VSOD benchmarks (RDVS, ViDSOD-100, DViSal), outperforming the prior state-of-the-art KAN-SAM method. The quantitative improvements are, on average, +1.5% in E-measure, +1.0% in S-measure, +2.4% in F-measure, and a 0.003 reduction in MAE.
Each cell reports results on RDVS / ViDSOD-100 / DViSal.
| Method | E-measure ↑ | S-measure ↑ | F-measure ↑ | MAE ↓ |
|---|---|---|---|---|
| KAN-SAM | .888 / .912 / .885 | .854 / .892 / .835 | .791 / .846 / .783 | .028 / .025 / .052 |
| SAM-DAQ (PAMIE) | .913 / .918 / .914 | .879 / .894 / .840 | .827 / .868 / .818 | .026 / .020 / .046 |
This demonstrates that PAMIE's prompt-free, depth-guided parallel adapters enable memory-efficient utilization of SAM2's encoder in video salient object detection tasks.
6. Practical and Methodological Implications
PAMIE exemplifies the synergy of PEFT and multi-modal fusion strategies in large-scale video analysis. By freezing the base encoder and using parallel adapters, the technique drastically curtails memory usage while preserving high-performance feature fusion. The explicit depth guidance further ensures robust multi-scale spatial context modeling, which is critical for VSOD.
A plausible implication is that similar parallel adapter architectures may benefit other multi-modal, prompt-free extension tasks for foundational vision models, particularly where resource limitations or noisy manual prompt generation are obstacles.
7. Connections to Related Research
PAMIE advances adapter-based PEFT methods by addressing notorious runtime bottlenecks in transformer-based vision models. Its design rationales respond directly to the limitations observed in prompt-free fine-tuning approaches and sequential adapter strategies for segmentation models. The approach aligns with the growing emphasis on leveraging frozen foundation models for downstream multi-modal tasks, while minimizing resource footprints and manual intervention.