M$^4$-SAM: Multi-Modal Mixture-of-Experts with Memory-Augmented SAM for RGB-D Video Salient Object Detection

Published 12 May 2026 in cs.CV | (2605.11760v1)

Abstract: The Segment Anything Model 2 (SAM2) has emerged as a foundation model for universal segmentation. Owing to its generalizable visual representations, SAM2 has been successfully applied to various downstream tasks. However, extending SAM2 to the RGB-D video salient object detection (RGB-D VSOD) task encounters three challenges including limited spatial modeling of linear LoRA, insufficient employment of SAM's multi-scale features, and dependence of initialization on explicit prompts. To address the issues, we present Multi-Modal Mixture-of-Experts with Memory-Augmented SAM (M$^4$-SAM), which equips SAM2 with modality-related PEFT, hierarchical feature fusion, and prompt-free memory initialization. Firstly, we inject Modality-Aware MoE-LORA, which employs convolutional experts to encode local spatial priors and introduces a modality dispatcher for efficient multi-modal fine-tuning, into SAM2's encoder. Secondly, we deploy Gated Multi-Level Feature Fusion, which hierarchically aggregates multi-scale encoder features with an adaptive gating mechanism, to balance spatial details and semantic context. Finally, to conduct zero-shot VSOD without manual prompts, we utilize a Pseudo-Guided Initialization, where a coarse mask is regarded as a pseudo prior and used to bootstrap the memory bank. Extensive experiments demonstrate that M$^4$-SAM achieves the state-of-the-art performance across all evaluation metrics on three public RGB-D VSOD datasets.


Authors (6)

Summary

The paper proposes a modality-aware PEFT strategy integrating convolutional MoE-LoRA for efficient multi-modal adaptation in RGB-D VSOD.
It employs hierarchical feature fusion using universal interaction modules and gated receptive blocks to enhance spatial precision and temporal consistency.
Experimental results show significant improvements in E-measure and MAE across datasets, outperforming SAM-based baselines.

Overview and Motivation

M$^4$-SAM addresses the adaptation of foundation segmentation models, specifically SAM2, to RGB-D video salient object detection (VSOD). Traditional VSOD methods using RGB-D modalities are constrained by dataset scale and limited generalization; moreover, direct application of SAM2 encounters three major challenges: restricted spatial modeling in linear LoRA, lack of exploitation of multi-scale features, and dependency on explicit prompts for memory initialization. M$^4$-SAM overcomes these technical issues via modality-aware parameter-efficient fine-tuning (PEFT) using convolutional mixture-of-experts (MoE-LoRA), hierarchical feature fusion with gating, and a prompt-free memory initialization strategy using pseudo-guided mask bootstrapping.

Figure 1: The overall architecture of M$^4$-SAM, integrating modality-aware MoE-LoRA, hierarchical feature fusion, and prompt-free memory initialization for RGB-D VSOD.

Methodological Innovations

Modality-Aware MoE-LoRA Encoder

The encoder employs a shared Hiera backbone augmented with the Modality-Aware MoE-LoRA module. Each LoRA branch is replaced with convolutional experts of varying kernel sizes, introducing locality priors. The MoE Gating mechanism adaptively selects the top-$K$ experts based on input context; this is further extended with modality-specific grouping (RGB, depth, fusion), coordinated by a Modality Dispatcher, enabling unified and efficient RGB-D feature extraction. This design significantly reduces memory overhead in multi-modal adaptation and avoids redundant computation intrinsic to dual-encoder strategies.
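The routing described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the class names `ConvLoRAExpert` and `ModalityAwareMoELoRA`, the placement of the convolution inside the low-rank bottleneck, the gate computed from globally pooled features, the kernel sizes (1, 3, 5), and the rank are all hypothetical choices. Only the ideas of convolutional experts, top-$K$ gating, and per-modality expert groups come from the summary.

```python
import numpy as np

def conv2d_same(x, kernel):
    """Depthwise 'same' cross-correlation: x is (C, H, W), kernel is (k, k) with odd k."""
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    out = np.zeros_like(x)
    for i in range(k):
        for j in range(k):
            out += kernel[i, j] * xp[:, i:i + x.shape[1], j:j + x.shape[2]]
    return out

class ConvLoRAExpert:
    """Low-rank update with a k x k convolution in the bottleneck (assumed layout)."""
    def __init__(self, dim, rank, k, rng):
        self.down = rng.normal(0, 0.02, (rank, dim))   # dim -> rank
        self.kernel = rng.normal(0, 0.02, (k, k))      # local spatial prior
        # LoRA usually zero-initialises the up projection; random here so the demo is non-trivial.
        self.up = rng.normal(0, 0.02, (dim, rank))     # rank -> dim

    def __call__(self, x):                             # x: (dim, H, W)
        z = np.einsum("rd,dhw->rhw", self.down, x)
        z = conv2d_same(z, self.kernel)
        return np.einsum("dr,rhw->dhw", self.up, z)

class ModalityAwareMoELoRA:
    """One expert group per modality; the dispatcher is a plain dictionary lookup here."""
    MODALITIES = ("rgb", "depth", "fusion")

    def __init__(self, dim, rank=4, kernel_sizes=(1, 3, 5), top_k=2, seed=0):
        rng = np.random.default_rng(seed)
        self.top_k = top_k
        self.groups = {m: [ConvLoRAExpert(dim, rank, k, rng) for k in kernel_sizes]
                       for m in self.MODALITIES}
        self.gates = {m: rng.normal(0, 0.02, (len(kernel_sizes), dim))
                      for m in self.MODALITIES}

    def route(self, x, modality):
        """Return indices of the top-K experts and their softmax-renormalised weights."""
        logits = self.gates[modality] @ x.mean(axis=(1, 2))   # (num_experts,)
        top = np.argsort(logits)[-self.top_k:]
        w = np.exp(logits[top] - logits[top].max())
        return top, w / w.sum()

    def __call__(self, x, modality):
        top, w = self.route(x, modality)
        # Weighted sum of the selected experts; this delta is added to the frozen layer output.
        return sum(wi * self.groups[modality][i](x) for i, wi in zip(top, w))

layer = ModalityAwareMoELoRA(dim=8)
x = np.random.default_rng(1).normal(size=(8, 6, 6))
delta_rgb = layer(x, "rgb")   # same shape as x
```

Because the backbone is shared and only the selected experts run for a given input, the trainable state in this sketch is limited to the small expert and gate matrices, which is the kind of saving over a dual-encoder design that the paragraph above refers to.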

Hierarchical Feature Fusion and Decoder

Multi-level RGB and depth features extracted by the encoder are fused through a Universal Interaction Module (UIM) and a Receptive Field Block (RFB) to produce unified multi-modal representations. The hierarchical decoder uses skip connections and upsampling, and generates coarse segmentation masks and edge maps at multiple levels. These outputs serve as auxiliary supervision, which improves spatial precision and boundary delineation.
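To make the idea of multi-level auxiliary supervision concrete, the following is a minimal sketch of a deep-supervision loss over mask and edge predictions. The summary does not state the loss functions, the level weights, or how edge targets are produced, so everything here is an assumption: binary cross-entropy for both outputs, nearest-neighbour upsampling, an edge target derived from the mask by a 4-neighbour erosion, and the hypothetical names `deep_supervision_loss` and `edge_weight`.

```python
import numpy as np

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy between probabilities and a 0/1 target."""
    p = np.clip(pred, eps, 1 - eps)
    return float(-(target * np.log(p) + (1 - target) * np.log(1 - p)).mean())

def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a 2D map by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)

def edge_from_mask(mask):
    """Foreground pixels that have at least one background 4-neighbour."""
    m = np.pad(mask, 1, mode="edge")
    eroded = m[1:-1, 1:-1] * m[:-2, 1:-1] * m[2:, 1:-1] * m[1:-1, :-2] * m[1:-1, 2:]
    return mask - eroded

def deep_supervision_loss(mask_preds, edge_preds, gt_mask, edge_weight=1.0):
    """Sum mask and edge losses over all decoder levels (assumed equal level weights)."""
    gt_edge = edge_from_mask(gt_mask)
    total = 0.0
    for mp, ep in zip(mask_preds, edge_preds):
        factor = gt_mask.shape[0] // mp.shape[0]
        total += bce(upsample_nearest(mp, factor), gt_mask)
        total += edge_weight * bce(upsample_nearest(ep, factor), gt_edge)
    return total

gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0                       # a 4 x 4 salient square
coarse = np.full((4, 4), 0.5)            # an uninformative half-resolution prediction
loss = deep_supervision_loss([coarse], [coarse], gt)
```

Supervising every level gives the shallow, high-resolution stages a direct boundary signal instead of relying only on gradients from the final output.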

Pseudo-Guided Temporal Memory

M$^4$-SAM implements a prompt-free temporal memory design that hierarchically aggregates multi-scale features via a Gated Multi-Level Feature Fusion module. Feature fusion is performed using gated weights that balance shallow and enhanced representations, subsequently concatenated with mid-level decoder features for temporal modeling. Cross-attention between current features and the memory bank facilitates consistent predictions across frames. Memory initialization is pseudo-guided: a coarse mask derived from early decoder layers is used to bootstrap the memory bank without explicit user prompts, exploiting attention-based affinity suppression to mitigate erroneous initialization.
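One common way to realise a gate that balances two feature maps is a sigmoid-weighted convex combination. The sketch below assumes that form; the summary does not give the actual gate equation, so the 1x1 projection over the concatenated inputs and the names `gated_fusion`, `w_gate`, and `b_gate` are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(shallow, enhanced, w_gate, b_gate):
    """Blend two (C, H, W) feature maps with a per-channel, per-pixel gate.

    w_gate: (C, 2C) weights of a 1x1 projection over the concatenated inputs.
    b_gate: (C,) bias.
    """
    stacked = np.concatenate([shallow, enhanced], axis=0)              # (2C, H, W)
    gate = sigmoid(np.einsum("oc,chw->ohw", w_gate, stacked)
                   + b_gate[:, None, None])                            # values in (0, 1)
    return gate * shallow + (1.0 - gate) * enhanced

rng = np.random.default_rng(0)
shallow = rng.normal(size=(4, 5, 5))    # fine spatial detail
enhanced = rng.normal(size=(4, 5, 5))   # semantic context
fused = gated_fusion(shallow, enhanced, np.zeros((4, 8)), np.zeros(4))
```

With zero weights and bias the gate is 0.5 everywhere, so the output is the plain average of the two inputs; training would move the gate toward whichever source is more useful at each location.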

Figure 2: The Gated Multi-Level Feature Fusion module adaptively aggregates multi-scale encoder features to balance spatial detail and semantic context.
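The pseudo-guided initialization and the memory read described in this section can be illustrated with a toy example. This is a sketch under assumptions rather than the paper's procedure: the coarse mask is thresholded at 0.5, the memory stores first-frame tokens with the pseudo label appended as one extra channel, and the read is a single-head scaled dot-product cross-attention. The affinity suppression step mentioned above is not specified in the summary and is omitted. The function names `init_memory_from_pseudo_mask` and `read_memory` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_memory_from_pseudo_mask(feat, coarse_logits, threshold=0.5):
    """Bootstrap the memory bank from the first frame without a user prompt.

    feat: (N, C) first-frame tokens; coarse_logits: (N,) coarse decoder logits.
    """
    prob = 1.0 / (1.0 + np.exp(-coarse_logits))
    pseudo_mask = (prob > threshold).astype(feat.dtype)       # pseudo prior, not ground truth
    values = np.concatenate([feat, pseudo_mask[:, None]], axis=1)
    return {"keys": [feat], "values": [values]}

def read_memory(query, memory):
    """Cross-attention from current-frame tokens (M, C) to all stored memory tokens."""
    keys = np.concatenate(memory["keys"], axis=0)
    values = np.concatenate(memory["values"], axis=0)
    attn = softmax(query @ keys.T / np.sqrt(query.shape[1]), axis=1)
    return attn @ values                                      # (M, C + 1)

rng = np.random.default_rng(0)
first_frame = rng.normal(size=(5, 4))
coarse_logits = np.array([3.0, -3.0, 2.0, -1.0, 0.1])
memory = init_memory_from_pseudo_mask(first_frame, coarse_logits)
readout = read_memory(rng.normal(size=(3, 4)), memory)
```

The last channel of `readout` is an attention-weighted average of the pseudo labels, so it acts as a soft saliency estimate propagated from the first frame to the current one.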

Experimental Results

Extensive experimentation was conducted across three RGB-D VSOD datasets: DViSal, RDVS, and ViDSOD-100. Evaluation metrics include E-measure, S-measure, F-measure, and mean absolute error (MAE).
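For readers unfamiliar with these metrics, the two simplest ones can be written out directly. The sketch below uses the conventions that are common in salient object detection (saliency maps in [0, 1], a fixed threshold of 0.5, and $\beta^2 = 0.3$ for the F-measure); the paper's exact evaluation protocol, such as adaptive or maximum thresholding, is not stated in the summary and may differ. E-measure and S-measure are more involved and are not shown.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and a binary ground-truth mask."""
    return float(np.abs(pred - gt).mean())

def f_measure(pred, gt, threshold=0.5, beta_sq=0.3, eps=1e-8):
    """Weighted harmonic mean of precision and recall at a fixed threshold."""
    binary = pred >= threshold
    gt = gt > 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return float((1 + beta_sq) * precision * recall / (beta_sq * precision + recall + eps))

gt = np.array([[1.0, 1.0], [0.0, 0.0]])
pred = np.array([[0.9, 0.4], [0.2, 0.1]])
error = mae(pred, gt)         # lower is better
score = f_measure(pred, gt)   # higher is better
```

In this toy case one of the two foreground pixels falls below the threshold, so precision is 1 and recall is 0.5; with $\beta^2 = 0.3$ the F-measure weights precision more heavily than recall.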

Numerical Highlights:

On DViSal, M$^4$-SAM achieved an E-measure of 0.925 and an F-measure of 0.828, outperforming the second-best KAN-SAM by 4.5% and 5.7%, respectively.
On RDVS, E-measure reached 0.927, surpassing DCTNet+ by 2.0%.
On ViDSOD-100, E-measure and MAE were 0.936 and 0.016, improvements of 2.6% and 0.009 over KAN-SAM, respectively.
Compared to SAM-based baselines (MDSAM, SAM2-UNet, KAN-SAM), M$^4$-SAM demonstrates average E-measure improvements of 6.9%, 7.6%, and 2.9%, respectively, indicating that the gains are not merely due to backbone selection.

Ablation Studies:

Depth modality contributions are validated: pseudo-depth inputs degrade E-measure by up to 8.1%.
Efficient PEFT strategies are compared: the proposed MoE-LoRA achieves superior accuracy and reduced training memory relative to LoRA, Adapter, Conv-LoRA.
Top-$K$ expert selection in MoE Gating is optimal at $K=2$, maximizing performance by balancing specialization and diversity.
Gated Multi-Level Feature Fusion and mid-level decoder input to memory yield optimal temporal modeling.
Clip length is also ablated; the best-performing setting gives the strongest temporal context aggregation.

Figure 3: Qualitative comparison on challenging video sequences, demonstrating precise boundary preservation and robustness across diverse environments for M$^4$-SAM versus prior SOTA models.

Implications and Future Directions

The M$^4$-SAM framework illustrates how a segmentation foundation model can be adapted to multi-modal video settings. The modality-aware MoE-LoRA design provides an efficient pathway for fine-tuning large encoders across modalities, and the hierarchical fusion mechanism preserves both fine spatial structure and semantic context, which is critical for dense prediction tasks. The pseudo-guided initialization strategy enables prompt-free deployment, eliminating manual intervention by relying on pseudo priors.

Practically, these advances enable robust salient object detection in real-world, unconstrained RGB-D video streams, beneficial for applications in robotics, surveillance, and automated visual analysis. Theoretically, the approach generalizes principles of PEFT into multi-modal fusion regimes, potentially extensible to other vision-language or sensor fusion tasks. Future directions may focus on further scaling generalization, domain adaptation, and application to compound tasks such as video grounding, multi-modal action recognition, and lifelong video understanding.

Conclusion

M$^4$-SAM introduces a modality-aware, prompt-free adaptation of SAM2 for RGB-D VSOD, combining convolutional mixture-of-experts fine-tuning, gated hierarchical feature aggregation, and pseudo-guided temporal memory initialization. The model outperforms prior SOTA approaches on the three RGB-D VSOD benchmarks evaluated (DViSal, RDVS, and ViDSOD-100) and offers a methodology for efficient, scalable multi-modal foundation model adaptation. The framework’s modularity and efficiency suggest promising avenues for future research in advanced multi-modal video tasks.