MAFA: Multi-Scale Adaptive Filtering Adapter

Updated 3 July 2026

Multi-Scale Adaptive Filtering Adapter (MAFA) is a dedicated spatial-frequency module that enhances feature maps for light field salient object detection by suppressing noise in small object regions.
It integrates multi-scale convolutional filtering with patch-wise FFT operations and learned frequency masks to improve feature clarity and preserve fine object boundaries.
Empirical evaluations show that incorporating MAFA in the SPLF-SAM pipeline improves Fβ scores and reduces error metrics, demonstrating its efficacy on benchmark light field datasets.

The Multi-Scale Adaptive Filtering Adapter (MAFA) is a dedicated spatial-frequency processing module designed to enhance feature representations for downstream saliency segmentation, specifically addressing the suppression of noise in small object regions. Introduced as a core component of the SPLF-SAM architecture for light field salient object detection (LF SOD), MAFA systematically exploits both spatial and local frequency-domain filtering, enabling model robustness to background clutter while sharpening object boundaries. Within SPLF-SAM, MAFA directly interfaces between the frozen SAM encoder and the unified multi-scale feature embedding block (UMFEB) and decoder, forming a critical link that delivers clean, frequency-refined feature maps to the segmentation and self-prompting machinery (Xu et al., 27 Aug 2025).

1. Architectural Placement and Functional Role

MAFA occupies a central role in the SPLF-SAM pipeline. After the input is processed by the frozen Segment Anything Model (SAM) encoder, the resulting feature tensor $\bm{R}_i$ is first adapted through a point-wise multi-layer perceptron (MLP) to match the input dimensionality for the adapter. The feature map $F$ is then passed through MAFA, which transforms and filters the representation before forwarding it to the UMFEB and self-prompting decoder. This architectural arrangement ensures that the high-level features extracted by the encoder are subsequently disentangled from background noise, with particular focus on preserving small-object edges. The overall pipeline can be schematically represented as:

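$\text{Input} \rightarrow \text{SAM encoder (frozen)} \rightarrow \text{MLP} \rightarrow \text{MAFA} \rightarrow \text{UMFEB} \rightarrow \text{Decoder}$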
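The point-wise MLP that adapts the encoder features can be illustrated with a minimal NumPy sketch. "Point-wise" means the same two linear layers are applied independently at every spatial position, which is equivalent to two $1\times 1$ convolutions. The tensor sizes, the hidden width, and the GELU between the two layers are assumptions made for illustration; the source does not specify them.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def pointwise_mlp(feat, w1, b1, w2, b2):
    """Two-layer MLP applied independently at every spatial position.

    feat: (C, H, W); w1: (C_hidden, C); w2: (C_prime, C_hidden).
    The hidden activation is an assumption, not taken from the source.
    """
    hidden = gelu(np.einsum("oc,chw->ohw", w1, feat) + b1[:, None, None])
    return np.einsum("oc,chw->ohw", w2, hidden) + b2[:, None, None]

rng = np.random.default_rng(0)
C, C_hidden, C_prime, H, W = 6, 12, 8, 16, 16   # hypothetical sizes
feat = rng.standard_normal((C, H, W))           # stand-in for an encoder feature map
w1, b1 = rng.standard_normal((C_hidden, C)) * 0.1, np.zeros(C_hidden)
w2, b2 = rng.standard_normal((C_prime, C_hidden)) * 0.1, np.zeros(C_prime)

adapted = pointwise_mlp(feat, w1, b1, w2, b2)
print(adapted.shape)  # (8, 16, 16)
```

Because no spatial mixing happens here, the output at any pixel depends only on the input channels at that same pixel; all spatial and spectral filtering is left to the adapter that follows.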

2. Layer-wise Operations and Dataflow

The MAFA is structured around parallel multi-scale convolutional filtering with frequency-domain adaptive masking:

Initial Adaptation: Each encoder feature map $F\in\mathbb{R}^{C\times H\times W}$ is transformed by a two-layer point-wise MLP that lifts the channel dimension to $C'$.
Multi-scale Spatial Filtering: The adapted tensor is fed in parallel to $K=4$ depthwise convolutional branches with kernel sizes $K_k \in \{1,3,5,7\}$, each preserving spatial resolution and outputting $C'$ channels, followed by GELU activation:

$F_k = \mathrm{GELU}\bigl(\mathrm{Conv}_k(\mathrm{MLP}(F))\bigr).$

Patch-wise Frequency Filtering: Each branch output $F_k$ is partitioned into non-overlapping $8\times 8$ patches, denoted $P_{k,n}$, and a 2D Fast Fourier Transform (FFT) is applied to each patch:

$\hat{P}_{k,n} = \mathrm{FFT2}(P_{k,n}).$

Learned Spectral Masking: In the local frequency domain, each patch spectrum is multiplied element-wise by a learnable real-valued filter $W_k$:

$\tilde{P}_{k,n} = W_k \odot \hat{P}_{k,n}.$

Inverse Transform and Stitching: The frequency-masked patches are transformed back with the inverse FFT and reassembled into full feature maps $\tilde{F}_k$.
Feature Aggregation: All $K$ filtered feature maps are concatenated along the channel axis and reduced with a $1\times 1$ convolution to $C_{\mathrm{out}}$ channels, yielding the final MAFA output $F_{\mathrm{MAFA}}$.

$F_{\mathrm{MAFA}} = \mathrm{Conv}_{1\times 1}\bigl(\mathrm{Concat}(\tilde{F}_1,\dots,\tilde{F}_K)\bigr).$
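The steps listed above (depthwise branches, patch-wise FFT masking, stitching, concatenation, and $1\times 1$ reduction) can be sketched end to end in NumPy. This is an illustrative sketch, not the authors' implementation: all function names, the small channel counts, the random weights, and the all-ones mask initialisation are assumptions, and the input is taken to be the already adapted tensor. Keeping only the real part after the inverse FFT is also an assumption, needed because a real-valued mask without conjugate symmetry can produce a complex result.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def depthwise_conv(x, kernel):
    """Zero-padded, resolution-preserving depthwise filter (cross-correlation).

    x: (C, H, W); kernel: (C, k, k) with k odd.
    """
    C, H, W = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for dy in range(k):
        for dx in range(k):
            out += kernel[:, dy, dx][:, None, None] * xp[:, dy:dy + H, dx:dx + W]
    return out

def patch_fft_filter(x, mask, p=8):
    """Mask every non-overlapping p x p patch in the 2D FFT domain.

    x: (C, H, W) with H, W divisible by p; mask: (C, p, p), real-valued.
    """
    C, H, W = x.shape
    assert H % p == 0 and W % p == 0
    # (C, H/p, W/p, p, p): one p x p patch per (channel, row block, column block)
    patches = x.reshape(C, H // p, p, W // p, p).transpose(0, 1, 3, 2, 4)
    spec = np.fft.fft2(patches, axes=(-2, -1)) * mask[:, None, None, :, :]
    filtered = np.fft.ifft2(spec, axes=(-2, -1)).real   # assumption: keep real part
    # stitch the patches back into a full (C, H, W) map
    return filtered.transpose(0, 1, 3, 2, 4).reshape(C, H, W)

def mafa_forward(x, conv_kernels, masks, w_out, p=8):
    branches = []
    for kernel, mask in zip(conv_kernels, masks):
        f_k = gelu(depthwise_conv(x, kernel))            # spatial branch
        branches.append(patch_fft_filter(f_k, mask, p))  # spectral masking
    stacked = np.concatenate(branches, axis=0)           # (K * C', H, W)
    return np.einsum("oc,chw->ohw", w_out, stacked)      # 1x1 convolution

rng = np.random.default_rng(0)
C_prime, C_out, H, W = 4, 4, 16, 16                      # hypothetical sizes
x = rng.standard_normal((C_prime, H, W))
conv_kernels = [rng.standard_normal((C_prime, k, k)) / (k * k) for k in (1, 3, 5, 7)]
masks = [np.ones((C_prime, 8, 8)) for _ in range(4)]     # all-ones mask = identity
w_out = rng.standard_normal((C_out, 4 * C_prime)) * 0.1

y = mafa_forward(x, conv_kernels, masks, w_out)
print(y.shape)  # (4, 16, 16)
```

With all-ones masks the spectral stage is an exact identity, so under this assumed initialisation the adapter would start as a plain multi-scale convolution bank and depart from it only as the masks are trained.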

An alternative, data-dependent aggregation (using channel-attention weights $\alpha_k$ applied to the filtered output $\tilde{F}_k$ of each branch $k$) can be introduced:

$F_{\mathrm{MAFA}} = \sum_{k=1}^{K} \alpha_k \odot \tilde{F}_k.$
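One way such a data-dependent aggregation could work is sketched below. The source only states that channel-attention weights are used, so the specific design here is an assumption: global average pooling per branch, a hypothetical linear scoring matrix `w_att`, and a softmax across branches so that the weights for each channel sum to one.

```python
import numpy as np

def attention_aggregate(branches, w_att):
    """Fuse K branch outputs with per-channel attention weights.

    branches: (K, C, H, W) filtered feature maps.
    w_att:    (K, C, C) hypothetical per-branch scoring weights.
    Returns the fused map (C, H, W) and the weights alpha (K, C).
    """
    pooled = branches.mean(axis=(2, 3))                 # (K, C) global average pool
    scores = np.einsum("koc,kc->ko", w_att, pooled)     # (K, C) raw scores
    scores -= scores.max(axis=0, keepdims=True)         # numerical stability
    exp = np.exp(scores)
    alpha = exp / exp.sum(axis=0, keepdims=True)        # softmax over the K branches
    fused = (alpha[:, :, None, None] * branches).sum(axis=0)
    return fused, alpha

rng = np.random.default_rng(0)
K, C, H, W = 4, 3, 8, 8                                 # hypothetical sizes
branches = rng.standard_normal((K, C, H, W))
w_att = np.zeros((K, C, C))                             # zero scores give uniform weights
fused, alpha = attention_aggregate(branches, w_att)
print(np.allclose(alpha, 1.0 / K), np.allclose(fused, branches.mean(axis=0)))  # True True
```

Unlike concatenation followed by a fixed $1\times 1$ convolution, the weights in this variant change with the input, so different images can emphasise different kernel scales.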

3. Frequency-Domain Filtering and Noise Suppression

MAFA’s core innovation lies in its explicit modeling of local spectral content within each patch. By applying learnable frequency-domain masks $W_k$ (one real-valued filter per branch $k$) to the FFT representation of each $8\times 8$ patch, the adapter can selectively suppress spectral components associated with background noise while preserving edges, textures, and fine structures characteristic of small, salient objects. The frequency-domain filters are learned independently for each scale and channel by backpropagation, driven by the terminal binary cross-entropy (BCE) saliency loss. The result is an adaptive spatial-frequency filter bank that raises the signal-to-noise ratio in small regions where the object signal would otherwise be buried.

This direct frequency manipulation addresses limitations of both purely spatial convolutional processing and context-agnostic denoising, enabling SPLF-SAM to perform robustly in scenarios with dense, high-frequency background clutter.
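A small worked example shows why masking in the frequency domain can remove structured noise that is awkward to separate spatially. The patch below is a constant signal plus a checkerboard, whose energy in an $8\times 8$ FFT sits entirely in one bin; zeroing that bin recovers the signal exactly. The mask here is set by hand purely for illustration, whereas in MAFA the masks are learned.

```python
import numpy as np

p = 8
a, b = np.meshgrid(np.arange(p), np.arange(p), indexing="ij")

signal = np.full((p, p), 2.0)        # smooth content: a constant patch
noise = 0.5 * (-1.0) ** (a + b)      # checkerboard, the highest 2D frequency
patch = signal + noise

mask = np.ones((p, p))
mask[p // 2, p // 2] = 0.0           # zero the single bin holding the checkerboard

restored = np.fft.ifft2(np.fft.fft2(patch) * mask).real
print(np.allclose(restored, signal))  # True
```

A learned mask generalises this idea: each of the 64 frequency bins per channel receives its own gain, so the network can attenuate bins dominated by clutter and keep bins that carry object edges.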

4. Hyperparameters, Training Strategy, and Ablation

MAFA parameters and training regimen follow a fixed configuration:

Optimizer: AdamW, with the initial learning rate and weight decay reported by Xu et al. (27 Aug 2025)
Batch size: 8, Epochs: 50, Hardware: single NVIDIA RTX 5070Ti
Adapter specifics: $K=4$ branches, kernel sizes $\{1,3,5,7\}$, patch size $8\times 8$, embedding channels $C'$, output channels $C_{\mathrm{out}}$
Encoder: Frozen; only MAFA, UMFEB, Prompt Bank, and Decoder are trained
Loss: BCE saliency loss $\mathcal{L}_{\mathrm{BCE}}$
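For reference, the standard pixel-wise binary cross-entropy between a predicted saliency map and a binary ground-truth mask can be written as below. This is the textbook definition, not code from the source; the function name and the clipping constant `eps` are illustrative choices.

```python
import numpy as np

def bce_loss(pred, target, eps=1e-7):
    """Mean binary cross-entropy.

    pred:   predicted saliency probabilities in [0, 1].
    target: ground-truth mask with values in {0, 1}.
    """
    pred = np.clip(pred, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred)))

# An uninformative prediction of 0.5 everywhere costs log(2) per pixel.
print(round(bce_loss(np.array([0.5, 0.5]), np.array([1.0, 0.0])), 4))  # 0.6931
```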

Ablative comparisons (from Table 2):

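Model	$F_\beta\uparrow$	$\mathrm{MAE}\downarrow$
SAM + Decoder* (no MAFA)	0.907	0.034
SAM + MAFA* + Decoder* (w/o FFT)	0.904	0.030
SAM + MAFA + Decoder* (full MAFA)	0.912	0.029
Full SPLF-SAM (all modules)	0.955	0.018

Inclusion of the frequency-domain filtering branch within MAFA (full MAFA versus the variant without FFT) yields a net improvement of $0.008$ in $F_\beta$ and a reduction of $0.001$ in MAE. Relative to the baseline without MAFA, full MAFA improves $F_\beta$ by $0.005$ and lowers MAE by $0.005$.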

5. Empirical Performance and Qualitative Effects

On benchmark datasets PKU-LF, DUT-LF, HFUT, and Lytro Illum, the SPLF-SAM model with MAFA attains the best S-measure, E-measure, and lowest MAE, surpassing prior state-of-the-art methods in LF SOD (Xu et al., 27 Aug 2025). Qualitatively, MAFA enables the decoder to recover thin stems, delicate variable-width text, and sub-pixel detail that other segmentation baselines—particularly those omitting frequency filtering—either blur or miss entirely. In multi-object scenes with heavily textured surroundings, MAFA’s frequency-domain masks facilitate precise boundary delineation by effectively isolating relevant high-frequency object features from noise.
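Two of the metrics used in this kind of evaluation are simple to state in code. The sketch below computes MAE and a fixed-threshold F-measure using the weighting $\beta^2 = 0.3$ that is conventional in salient object detection. The threshold of 0.5 is a simplifying assumption, since benchmarks often report maximum, mean, or adaptive-threshold variants, and S-measure and E-measure are not covered here.

```python
import numpy as np

def mae(pred, gt):
    """Mean absolute error between a saliency map and its ground truth."""
    return float(np.mean(np.abs(pred - gt)))

def f_beta(pred, gt, threshold=0.5, beta2=0.3, eps=1e-8):
    """F-measure at a single threshold; beta2 weights precision over recall."""
    binary = pred >= threshold
    gt = gt > 0.5
    tp = np.logical_and(binary, gt).sum()
    precision = tp / (binary.sum() + eps)
    recall = tp / (gt.sum() + eps)
    return float((1 + beta2) * precision * recall / (beta2 * precision + recall + eps))

gt = np.array([[1.0, 1.0], [0.0, 0.0]])
pred = np.array([[0.9, 0.4], [0.2, 0.1]])
print(round(mae(pred, gt), 4))     # 0.25
print(round(f_beta(pred, gt), 4))  # 0.8125
```

In the toy example one of the two salient pixels is missed, giving precision 1.0 and recall 0.5; the resulting 0.8125 shows how $\beta^2 = 0.3$ penalises the missed pixel less than a balanced F1 score would.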

6. Significance and Place Within SPLF-SAM

MAFA acts as an adaptive “clean-up” phase within the SPLF-SAM hierarchy. By combining multi-scale convolutions, local-patch FFT analysis, and learned, patch-specific frequency-domain filters, it provides the downstream prompt-based UMFEB and decoder modules with a substantially noise-suppressed, detail-amplified feature basis. This direct, trainable frequency suppression mechanism is essential for accurate detection of small, high-frequency salient regions in complex light field imagery, and constitutes a foundational advance in the architecture’s capacity for fine-scale object segmentation (Xu et al., 27 Aug 2025).

References (1)

SPLF-SAM: Self-Prompting Segment Anything Model for Light Field Salient Object Detection (2025)
