KAN-SAM: RGB-T Salient Object Detection
- The paper presents KAN-SAM, which employs a minimal fine-tuning strategy (less than 1% of parameters) by integrating Kolmogorov-Arnold adapters into a frozen SAM2 backbone for RGB-T salient object detection.
- KAN-SAM uses a mutually exclusive random masking strategy combined with spline-based fusion to ensure effective cross-modal feature integration and robust model generalization.
- The method performs strongly on the VT5000 benchmark, achieving an average F-measure of 0.909 while reducing computational overhead compared to conventional MLP prompts.
KAN-SAM is a prompt learning-based RGB-thermal salient object detection (RGB-T SOD) method that integrates Kolmogorov-Arnold Network (KAN) adapters into the Segment Anything Model 2 (SAM2) backbone. It leverages the Kolmogorov-Arnold superposition theorem to enable modular, parameter-efficient fusion of thermal prompts with RGB features in large-scale, pretrained image segmentation models. By employing a mutually exclusive random masking (MERM) strategy and tuning less than 1% of the original model's weights, KAN-SAM achieves robust performance on RGB-T SOD benchmarks, demonstrating both computational efficiency and improved generalization in multi-modal settings (Li et al., 8 Apr 2025).
1. Architecture and Integration with SAM2
KAN-SAM minimally adapts the frozen SAM2 backbone to the RGB-T SOD task. The input consists of a registered RGB image and an aligned single-channel thermal image. During training, these inputs pass through a mutually exclusive random masking (MERM) operation, which masks about 10% of pixel locations and zeroes each selected location in only one of the two modalities. This prevents co-masking and encourages cross-modal reasoning.
The images are divided into non-overlapping patches and linearly projected to D-dimensional tokens via a frozen patch embedding module. In the image encoder stages, frozen hierarchical transformer (or CNN) layers process RGB tokens, while thermal features are processed through lightweight convolutional and interpolation operations to produce transformer-compatible representations. At four major encoder stages, KAN adapters merge thermal and RGB features, resulting in “thermally prompted” RGB tokens. These fused features propagate to a feature pyramid network for multi-scale aggregation, and finally, a mask decoder (the only other trainable module) predicts the saliency map. The following schematic abstracts the architectural flow:
Input → MERM → PatchEmbed → [Encoder/Stage ℓ: RGB + KAN-prompted thermal fusion] → FPN → MaskDecoder → Saliency map
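The patch embedding step described above (non-overlapping patches, then a linear projection to D-dimensional tokens) can be illustrated with a minimal NumPy sketch. The function names `patchify` and `patch_embed`, the patch size, and the random weights are illustrative assumptions; this is not SAM2's actual embedding module.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patches, one flattened row each."""
    h, w, c = image.shape
    assert h % patch == 0 and w % patch == 0, "image size must be divisible by patch"
    x = image.reshape(h // patch, patch, w // patch, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * c)

def patch_embed(image, weight, bias, patch):
    """Linear projection of flattened patches to D-dimensional tokens."""
    return patchify(image, patch) @ weight + bias

# Hypothetical sizes chosen only for illustration.
rng = np.random.default_rng(0)
image = rng.uniform(size=(8, 8, 3))
patch, dim = 4, 16
weight = rng.normal(size=(patch * patch * 3, dim))  # frozen in KAN-SAM; random here
bias = np.zeros(dim)
tokens = patch_embed(image, weight, bias, patch)    # shape: (4, 16)
```

In the setting described here the projection weights would come from the pretrained backbone and stay frozen; the random weights only make the sketch runnable.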
2. Kolmogorov-Arnold Network (KAN) Adapters: Mathematical Foundation and Implementation
KAN adapters operationalize the Kolmogorov-Arnold superposition theorem, which guarantees that any continuous multivariate function on a bounded domain can be written as a finite composition of continuous univariate functions and addition. In KAN-SAM, each adapter is a learnable module consisting of a matrix of univariate spline functions, following the classical form of the theorem:
$$f(x_1, \dots, x_n) = \sum_{q=0}^{2n} \Phi_q\!\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right),$$
where $\Phi_q$ and $\phi_{q,p}$ are spline-parameterized univariate functions.
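To make the "matrix of univariate functions" idea concrete, here is a minimal forward-pass sketch of a single KAN-style layer in NumPy. It assumes piecewise-linear splines on a fixed uniform grid, which is a simplification chosen for brevity; the class name `PiecewiseLinearKANLayer` and all hyperparameters are hypothetical and do not come from the paper.

```python
import numpy as np

class PiecewiseLinearKANLayer:
    """y_j = sum_i phi_{j,i}(x_i), with each phi a piecewise-linear spline."""

    def __init__(self, in_dim, out_dim, num_knots=8, x_min=-1.0, x_max=1.0, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        self.knots = np.linspace(x_min, x_max, num_knots)
        # coef[j, i, k] is the value of phi_{j,i} at knot k (the learnable part).
        self.coef = rng.normal(scale=0.1, size=(out_dim, in_dim, num_knots))

    def basis(self, x):
        """Hat (degree-1 B-spline) basis values, shape (N, in_dim, num_knots)."""
        x = np.clip(x, self.knots[0], self.knots[-1])
        step = self.knots[1] - self.knots[0]
        return np.maximum(0.0, 1.0 - np.abs(x[..., None] - self.knots) / step)

    def __call__(self, x):
        b = self.basis(x)  # (N, in_dim, K)
        # Evaluate every phi_{j,i}(x_i) and sum over the inputs i.
        return np.einsum("nik,jik->nj", b, self.coef)

layer = PiecewiseLinearKANLayer(in_dim=3, out_dim=2)
x = np.random.default_rng(1).uniform(-1, 1, size=(5, 3))
y = layer(x)  # shape: (5, 2)
```

Unlike an MLP layer, there is no fixed activation: the nonlinearity lives in the learnable per-edge functions, which is what makes the adapter's behavior inspectable edge by edge.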
At each encoder stage, thermal embeddings are projected via 1×1 convolutions and resized, then injected into the RGB feature flow via a KAN prompt consisting of the following steps:
- Pre-prompt: Thermal is added or concatenated to RGB tokens.
- KAN fusion: The pre-prompted tokens are transformed by the spline adapters.
- Post-prompt: Output fused again with original RGB features using concatenation and 1×1 convolutions.
KAN adapters are deployed at four encoder stages, contributing approximately 1.868 million parameters and adding 3 GFLOPs of overhead, which is less than traditional MLP prompts in both parameter count and computation.
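The pre-prompt, fusion, and post-prompt flow listed above can be sketched on token matrices as follows. This is a rough illustration under several assumptions: the addition variant of the pre-prompt is used, a 1×1 convolution is written as a per-token matrix multiply, the thermal tokens are taken to be already resized, and `np.tanh` stands in as a placeholder for the spline adapter. The names `conv1x1` and `kan_prompt` are hypothetical.

```python
import numpy as np

def conv1x1(tokens, weight):
    """A 1x1 convolution acts per token, i.e. a matrix multiply over channels."""
    return tokens @ weight

def kan_prompt(rgb_tokens, thermal_tokens, w_thermal, kan_fn, w_post):
    # Pre-prompt: project thermal features to the RGB width and add them.
    pre = rgb_tokens + conv1x1(thermal_tokens, w_thermal)
    # KAN fusion: transform the pre-prompted tokens (placeholder callable).
    fused = kan_fn(pre)
    # Post-prompt: concatenate with the original RGB tokens, then project back.
    post = np.concatenate([rgb_tokens, fused], axis=-1)
    return conv1x1(post, w_post)

# Hypothetical sizes: 16 tokens, 32 RGB channels, 8 thermal channels.
rng = np.random.default_rng(0)
rgb_tokens = rng.normal(size=(16, 32))
thermal_tokens = rng.normal(size=(16, 8))
w_thermal = rng.normal(size=(8, 32)) * 0.1
w_post = rng.normal(size=(64, 32)) * 0.1
out = kan_prompt(rgb_tokens, thermal_tokens, w_thermal, np.tanh, w_post)
```

Keeping the original RGB tokens in the post-prompt concatenation means the projection can fall back to the unmodified RGB features when the thermal signal is uninformative.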
3. Prompt Learning and Parameter Efficiency
KAN-SAM adopts a prompt-learning paradigm in which the only trainable parameters are those in the KAN adapters and the mask decoder head. The entire SAM2 backbone (641.7 million parameters), including all patch embedding, FPN, and transformer/CNN encoder layers, remains frozen. The total number of fine-tuned parameters is about 6.08 million (under 1% of the model), distributed as follows:
| Module | Parameters (M) |
|---|---|
| KAN Adapters | 1.868 |
| Mask Decoder | 4.213 |
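The parameter budget in the table can be checked with a few lines of arithmetic. The sketch below uses only the figures quoted in this section; taking the ratio against the 641.7 M backbone count is an assumption about which denominator is meant, though the result stays under 1% either way.

```python
# Trainable parameter counts in millions, as listed in the table above.
trainable_m = {"KAN adapters": 1.868, "Mask decoder": 4.213}
backbone_m = 641.7  # frozen SAM2 backbone, in millions

total_trainable_m = sum(trainable_m.values())
fraction = total_trainable_m / backbone_m  # assumed denominator: backbone only

print(f"trainable: {total_trainable_m:.3f} M "
      f"({100 * fraction:.2f}% of the backbone parameter count)")
```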
This selective fine-tuning enables rapid adaptation to new modalities or tasks while retaining large-scale visual prior knowledge. The mask decoder is trained with a loss that combines a region-level Intersection over Union (IoU) term with a pixel-level Dice term, each using its standard definition.
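For reference, a common soft formulation of the two loss terms looks like the sketch below. The equal weighting (`w_iou = w_dice = 1.0`), the smoothing constant `eps`, and the function names are assumptions made for illustration; the exact weighting used for KAN-SAM is not stated in this text.

```python
import numpy as np

def iou_loss(pred, target, eps=1e-6):
    """Soft IoU loss: 1 - intersection / union, on probabilities in [0, 1]."""
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    return 1.0 - (inter + eps) / (union + eps)

def dice_loss(pred, target, eps=1e-6):
    """Soft Dice loss: 1 - 2 * intersection / (|pred| + |target|)."""
    inter = (pred * target).sum()
    return 1.0 - (2.0 * inter + eps) / (pred.sum() + target.sum() + eps)

def combined_loss(pred, target, w_iou=1.0, w_dice=1.0):
    # Weights are hypothetical; equal weighting is assumed here.
    return w_iou * iou_loss(pred, target) + w_dice * dice_loss(pred, target)

target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
perfect = combined_loss(target, target)         # near 0 for a perfect prediction
disjoint = combined_loss(1.0 - target, target)  # near 2 when nothing overlaps
```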
4. Mutually Exclusive Random Masking (MERM) for Cross-Modal Generalization
The MERM module is a pixel-level masking mechanism applied during training:
- For every pixel location, a random draw $r \sim \mathrm{Uniform}[0,1]$ decides whether the location is masked. When it is, either the RGB or the thermal intensity is set to zero and the other is left unchanged.
- This process masks roughly 10% of pixel locations per sample with no overlap: no pixel is masked in both modalities.
Empirical results indicate that MERM alone slightly decreases performance due to information loss, but in combination with KAN adapters (the full KAN-SAM model), it maximizes generalization by enforcing reliance on both modalities.
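The masking rule described in this section can be sketched in NumPy as follows. The even split of the 10% budget between the two modalities is an assumption (the text only states that the masks never overlap), and the function name `merm` and its signature are illustrative.

```python
import numpy as np

def merm(rgb, thermal, mask_ratio=0.10, rng=None):
    """Mutually exclusive random masking of an (H, W, 3) RGB and (H, W) thermal pair."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = thermal.shape
    r = rng.uniform(0.0, 1.0, size=(h, w))  # one draw per pixel location
    half = mask_ratio / 2.0                 # assumed: budget split evenly
    mask_rgb = r < half
    mask_thermal = (r >= half) & (r < mask_ratio)  # disjoint from mask_rgb
    rgb_out, thermal_out = rgb.copy(), thermal.copy()
    rgb_out[mask_rgb] = 0          # zeroes all three channels at those pixels
    thermal_out[mask_thermal] = 0
    return rgb_out, thermal_out, mask_rgb, mask_thermal

rng = np.random.default_rng(0)
rgb = rng.uniform(0.1, 1.0, size=(256, 256, 3))
thermal = rng.uniform(0.1, 1.0, size=(256, 256))
rgb_m, th_m, m_rgb, m_th = merm(rgb, thermal, rng=rng)
```

Because both masks are derived from the same per-pixel draw, exclusivity holds by construction rather than by rejection sampling.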
5. Training Procedure and Experimental Protocol
KAN-SAM is trained on the VT5000 training split (2,500 pairs) and evaluated on the remainder of VT5000 and on all of VT1000 and VT821. All images are resized to a common resolution and augmented with flips, rotations, and crops. Optimization uses AdamW (batch size 8, momentum 0.9, gradient clipping 0.5, with the learning rate and weight decay given in the paper) for 30 epochs, requiring approximately 6 hours on two RTX 4090 GPUs.
Only the KAN adapters and the mask decoder are updated; all backbone layers remain frozen throughout.
6. Performance Metrics and Comparative Results
KAN-SAM outperforms contemporary RGB-T SOD methods across VT5000, VT1000, and VT821 in terms of average F-measure, maximum F-measure, weighted F-measure, mean absolute error (MAE), enhanced-alignment measure (E-measure), and structure-measure (S-measure):
| Dataset | Avg. F-measure | Max. F-measure | Weighted F-measure | MAE | E-measure | S-measure |
|---|---|---|---|---|---|---|
| VT5000 | 0.909 | 0.931 | 0.905 | 0.020 | 0.957 | 0.927 |
Ablation studies confirm that including both MERM and KAN adapters yields the highest performance on VT5000, outperforming both the SAM2 baseline and other state-of-the-art methods (e.g., ADNet) (Li et al., 8 Apr 2025).
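Two of the reported metrics, MAE and the F-measure, are simple enough to sketch directly. The code below assumes the convention widely used in salient object detection of setting β² = 0.3, a fixed binarization threshold for the single-threshold score, and a sweep over 255 thresholds for the maximum F-measure; these choices and the function names are assumptions, not details confirmed for KAN-SAM's evaluation code.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and ground truth, both in [0, 1]."""
    return np.abs(pred - gt).mean()

def f_measure(pred, gt, threshold=0.5, beta2=0.3):
    """F-measure at one threshold; beta2 = 0.3 is the usual SOD convention."""
    binary = pred >= threshold
    gt_mask = gt > 0.5
    tp = np.logical_and(binary, gt_mask).sum()
    precision = tp / max(binary.sum(), 1)
    recall = tp / max(gt_mask.sum(), 1)
    denom = beta2 * precision + recall
    return 0.0 if denom == 0 else (1.0 + beta2) * precision * recall / denom

def max_f_measure(pred, gt, num_thresholds=255):
    """Maximum F-measure over a sweep of binarization thresholds."""
    thresholds = np.linspace(0.0, 1.0, num_thresholds)
    return max(f_measure(pred, gt, t) for t in thresholds)

gt = np.zeros((8, 8))
gt[2:6, 2:6] = 1.0
perfect_f = f_measure(gt, gt)         # perfect prediction
empty_mae = mae(np.zeros((8, 8)), gt)  # all-background prediction
```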
7. Insights, Limitations, and Future Directions
KAN-SAM demonstrates that spline-based, interpretable prompt adapters can effectively inject cross-modal semantic cues into large frozen segmentation backbones while drastically reducing trainable parameters and computational overhead (−32.8% parameters, −50% FLOPs vs. baseline MLP prompts). The MERM mechanism further enforces robust multi-modal saliency learning by preventing modality collapse.
Limitations include the large overall model size (641M parameters, due to the frozen SAM2 backbone) and the restriction of experiments to three RGB-T SOD benchmarks. These limitations point to future directions such as developing lightweight backbones, expanding to more diverse scenarios, and generalizing KAN-based prompt learning to foundation models beyond SAM2.
References:
- "KAN-SAM: Kolmogorov-Arnold Network Guided Segment Anything Model for RGB-T Salient Object Detection" (Li et al., 8 Apr 2025)