Frequency-guided Patch Screening (FPS)
- Frequency-guided Patch Screening (FPS) is a learned frequency-domain technique that computes FFT magnitude spectra on overlapping image patches to identify small infrared targets.
- It uses a shallow MLP and geometric-mean fusion to assign pixel-wise target relevance scores, effectively suppressing background noise in low SNR conditions.
- Integrated as the first stage in SEF-DETR, FPS enhances self-attention query initialization, leading to improved detection accuracy and robust performance.
Frequency-guided Patch Screening (FPS) is a learned, frequency-domain module designed to address the challenge of small target discrimination in infrared small target detection (IRSTD) under low signal-to-noise conditions. FPS constructs a high-fidelity, pixel-wise “target relevance” map by analyzing the magnitude spectrum of local image patches via the discrete Fourier transform, with the express goal of suppressing background-dominated features and amplifying areas corresponding to infrared small targets. FPS constitutes the first stage of the SEF-DETR framework, integrating seamlessly with dynamic spatial enhancement and reliability-based fusion mechanisms to guide robust self-attention query initialization and improve end-to-end IRSTD performance (Liu et al., 6 Jan 2026).
1. Mathematical Foundations
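Let $I \in \mathbb{R}^{H \times W}$ be a gray-scale infrared image. FPS extracts $J$ overlapping local patches $P_j$ (for $j = 1, \dots, J$) using a sliding window of size $p \times p$ and a stride smaller than $p$, typically with $p$ chosen to fully enclose small targets (e.g., $p = 32$). For each patch, the discrete 2D Fourier transform is computed:
$$F_j(u, v) = \sum_{m=0}^{p-1} \sum_{n=0}^{p-1} P_j(m, n)\, e^{-2\pi i (um + vn)/p}, \qquad u, v = 0, \dots, p-1.$$
In practice, this transform is computed using the FFT, yielding a complex spectrum $F_j \in \mathbb{C}^{p \times p}$. The magnitude spectrum $A_j = |F_j|$ is flattened to $a_j \in \mathbb{R}^{p^2}$. A shallow MLP, followed by a sigmoid layer, produces a scalar target-relevance score for each patch:
$$s_j = \sigma\big(W_2\, \mathrm{ReLU}(\mathrm{LN}(W_1 a_j + b_1)) + b_2\big) \in (0, 1).$$
Patch overlap is exploited via geometric-mean fusion: for each pixel $(x, y)$ belonging to patches indexed by $\mathcal{J}(x, y)$,
$$S_{\text{freq}}(x, y) = \Big(\prod_{j \in \mathcal{J}(x, y)} s_j\Big)^{1/|\mathcal{J}(x, y)|} = \exp\Big(\frac{1}{|\mathcal{J}(x, y)|} \sum_{j \in \mathcal{J}(x, y)} \log s_j\Big).$$
This produces the frequency-guided density map $S_{\text{freq}}$, in which higher values denote increased likelihood of small target presence based on frequency cues.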
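As a small worked example of the fusion rule, consider a pixel covered by three patches with scores $s_1 = 0.9$, $s_2 = 0.8$, and $s_3 = 0.1$. These values are illustrative only and are not taken from the paper. The product form and the log-domain form of the geometric mean give the same result:
$$\big(0.9 \cdot 0.8 \cdot 0.1\big)^{1/3} = 0.072^{1/3} \approx 0.416, \qquad \exp\Big(\tfrac{1}{3}\big(\ln 0.9 + \ln 0.8 + \ln 0.1\big)\Big) = \exp(-0.877) \approx 0.416.$$
For comparison, the arithmetic mean of the same three scores is $0.6$, so the single low score lowers the fused value considerably more under the geometric mean.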
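Patch-level supervision is provided with binary labels $t_j \in \{0, 1\}$ ($t_j = 1$ if any ground-truth target pixel lies within $P_j$, otherwise $0$), enabling the use of binary cross-entropy loss:
$$\mathcal{L}_{\text{FPS}} = -\frac{1}{J} \sum_{j=1}^{J} \big[\, t_j \log s_j + (1 - t_j) \log(1 - s_j) \,\big].$$
The overall joint objective is $\mathcal{L} = \mathcal{L}_{\text{DETR}} + \lambda\, \mathcal{L}_{\text{FPS}}$, where $\lambda$ is the balancing weight.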
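The patch-level supervision described above can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function names `patch_labels` and `bce_loss`, the row-major patch ordering, the averaging over patches, and the clipping constant `eps` are all assumptions made for the example.

```python
import numpy as np

def patch_labels(target_mask, p, stride):
    """Label each p x p patch 1.0 if it contains any target pixel, else 0.0.

    Patches are enumerated row-major over their top-left corners
    (an assumed ordering, chosen for this sketch).
    """
    H, W = target_mask.shape
    labels = []
    for x in range(0, H - p + 1, stride):
        for y in range(0, W - p + 1, stride):
            labels.append(float(target_mask[x:x + p, y:y + p].any()))
    return np.array(labels)

def bce_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy between patch scores and patch labels."""
    s = np.clip(scores, eps, 1.0 - eps)  # keep log() finite
    return float(-np.mean(labels * np.log(s) + (1.0 - labels) * np.log(1.0 - s)))

# Toy example: an 8x8 mask with a single target pixel at (1, 1).
mask = np.zeros((8, 8), dtype=bool)
mask[1, 1] = True
labels = patch_labels(mask, p=4, stride=2)       # 3 x 3 = 9 patches
uninformed = bce_loss(np.full(9, 0.5), labels)   # all scores at 0.5
```

With every score at 0.5 the loss equals $\ln 2$ regardless of the labels, which gives a convenient reference point when monitoring training.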
2. Algorithmic Procedure
The FPS workflow comprises the following explicit steps:
- Patch Extraction & FFT: Slide a window to extract overlapping patches $P_j$ from $I$. Compute the 2D FFT for each patch to obtain the magnitude spectra.
- MLP-based Scoring: Flatten each patch’s magnitude spectrum and run it through the MLP+sigmoid to generate $s_j$.
- Score Fusion: For each pixel $(x, y)$, fuse the overlapping patch scores using a geometric mean, accumulating in log-space for efficiency.
- Output: Return $S_{\text{freq}}$ as a soft, pixel-wise relevance map.
Pseudocode
```
for j in 1…J:
    (x_j, y_j) = top-left of patch j
    P_j = I[x_j : x_j+p, y_j : y_j+p]
    F_j = FFT2D(P_j)
    A_j = abs(F_j)
    a_j = flatten(A_j)
    h_j = ReLU(LN(W1 @ a_j + b1))
    z_j = W2 @ h_j + b2
    s_j = sigmoid(z_j)
for each pixel (x, y) in S_freq:
    log_sum[x, y] += sum of log(s_j) for all patches covering (x, y)
    count[x, y]   += number of patches covering (x, y)
    S_freq[x, y]   = exp(log_sum[x, y] / count[x, y])
```
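The design favors the geometric mean over the arithmetic mean because a single low $s_j$ among the patches covering a pixel pulls the fused score toward zero, so a pixel scores highly only when all overlapping patches agree.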
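The pseudocode above can be turned into a runnable NumPy sketch along the following lines. It is an illustration under stated assumptions, not the reference implementation. The function name `fps_density_map` is hypothetical, the MLP weights are random rather than trained, layer normalization is applied without learned scale and shift, and the patch size, stride, and hidden width are toy values.

```python
import numpy as np

def fps_density_map(image, p, stride, W1, b1, W2, b2, eps=1e-6):
    """Return (S_freq, scores): the fused pixel-wise map and per-patch scores."""
    H, W = image.shape
    log_sum = np.zeros((H, W))   # running sum of log-scores per pixel
    count = np.zeros((H, W))     # number of patches covering each pixel
    scores = []
    for x in range(0, H - p + 1, stride):
        for y in range(0, W - p + 1, stride):
            patch = image[x:x + p, y:y + p]
            a = np.abs(np.fft.fft2(patch)).ravel()        # flattened magnitude spectrum
            h = W1 @ a + b1
            h = (h - h.mean()) / np.sqrt(h.var() + eps)   # layer norm, no learned affine (assumption)
            h = np.maximum(h, 0.0)                        # ReLU
            s = 1.0 / (1.0 + np.exp(-(W2 @ h + b2)))      # sigmoid score in (0, 1)
            scores.append(s)
            log_sum[x:x + p, y:y + p] += np.log(max(s, eps))  # accumulate in log-space
            count[x:x + p, y:y + p] += 1
    covered = count > 0
    S_freq = np.zeros((H, W))    # pixels covered by no patch stay at 0
    S_freq[covered] = np.exp(log_sum[covered] / count[covered])
    return S_freq, np.array(scores)

# Toy usage with random (untrained) weights; all sizes are illustrative.
rng = np.random.default_rng(0)
p, stride, hidden = 8, 4, 16
W1 = rng.normal(scale=0.01, size=(hidden, p * p))
b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.1, size=hidden)
b2 = 0.0
image = rng.random((32, 32))
S_freq, scores = fps_density_map(image, p, stride, W1, b1, W2, b2)
```

Accumulating `log_sum` and `count` per patch replaces the per-pixel loop of the pseudocode with two slice updates, and it makes the fusion a single exponential at the end. A batched implementation would typically extract all patches at once and run the FFT and MLP on the whole batch.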
3. Role within SEF-DETR Architecture
FPS is the initial module in the SEF-DETR pipeline, followed sequentially by Dynamic Embedding Enhancement (DEE) and Reliability-Consistency-aware Fusion (RCF):
- FPS generates $S_{\text{freq}}$, an initial map of frequency-based target likelihood.
- DEE bilinearly resizes $S_{\text{freq}}$ to the multiple spatial scales of the transformer encoder features; a binary mask is computed via a learned threshold and used to amplify the encoder features at likely target locations.
- RCF applies a lightweight spatial classifier to the enhanced features, producing a spatial confidence map. Pixel-wise consistency and reliability are computed from the correspondence between this confidence map and $S_{\text{freq}}$, yielding final scores for top-K query selection.
This positional and frequency-aware initialization feeds forward into the transformer decoder for robust object detection.
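The data flow from the frequency map to query selection can be sketched roughly as below. Several parts are assumptions made only for illustration and are not confirmed by the source: the multiplicative amplification rule `feat * (1 + alpha * mask)`, the fixed threshold `tau` (learned in the actual model), the align-corners resize convention, and all function names. The reliability and consistency computation of RCF is not reproduced; `top_k_locations` only shows how final scores could be turned into query positions.

```python
import numpy as np

def bilinear_resize(m, out_h, out_w):
    """Resize a 2D map with bilinear interpolation (align-corners convention)."""
    H, W = m.shape
    ys = np.linspace(0, H - 1, out_h)
    xs = np.linspace(0, W - 1, out_w)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    top = m[y0][:, x0] * (1 - wx) + m[y0][:, x1] * wx
    bot = m[y1][:, x0] * (1 - wx) + m[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def enhance_features(feat, s_freq, tau=0.5, alpha=1.0):
    """Amplify a (C, h, w) feature map where the resized S_freq exceeds tau.

    The rule feat * (1 + alpha * mask) is an assumed form, not the paper's.
    """
    _, h, w = feat.shape
    mask = (bilinear_resize(s_freq, h, w) > tau).astype(feat.dtype)
    return feat * (1.0 + alpha * mask[None])

def top_k_locations(score_map, k):
    """Return the (row, col) positions of the k highest scores, best first."""
    flat = score_map.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(flat[idx])[::-1]]
    return np.stack(np.unravel_index(idx, score_map.shape), axis=1)

# Toy usage: a 16x16 frequency map with one confident region.
s_freq = np.zeros((16, 16))
s_freq[4:8, 4:8] = 0.9
feat = np.ones((2, 8, 8))
enhanced = enhance_features(feat, s_freq)
queries = top_k_locations(np.array([[0.1, 0.9], [0.5, 0.3]]), k=2)
```

In a multi-scale encoder, `enhance_features` would be called once per feature level with that level's spatial size.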
4. Empirical Analyses and Visualization
The effectiveness of FPS has been substantiated via ablation studies, frequency-band analyses, and qualitative evaluations on the IRSTD-1k dataset:
- Ablation Study (AP on IRSTD-1k):
| Configuration               | AP   |
|-----------------------------|------|
| DINO (ResNet-50 baseline)   | 37.1 |
| + FPS + DEE                 | 38.3 |
| + FPS + RCF                 | 38.1 |
| Full SEF-DETR (FPS+DEE+RCF) | 38.9 |
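The two partial configurations that include FPS recover 1.2 AP (FPS + DEE) and 1.0 AP (FPS + RCF) of the full model's 1.8 AP gain over the DINO baseline; the table does not isolate FPS on its own.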
- Frequency Band Analysis:
  - High-freq only: AP = 37.8
  - Low-freq only: AP = 38.4
  - Full spectrum: AP = 38.9
Inclusion of the full frequency spectrum maximizes performance; mid-band FFT magnitudes, in particular, correlate with true targets.
- Visualizations:
  - FFT magnitude maps highlight mid-frequency spikes for true targets, absent from background/distractor patches.
  - Baseline self-attention queries often trigger on background clutter; FPS-guided queries concentrate on authentic small targets, as shown in spatial confidence maps.
  - End-to-end detection visualizations demonstrate improved suppression of false alarms and recovery of dim or missed targets.
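For the frequency-band analysis above, one plausible way to restrict a patch's spectrum to a low or high band is a radial mask on the FFT magnitudes, sketched below. The source does not state how the bands were defined, so the radial criterion, the `cutoff` value, and the function name `band_limited_magnitude` are assumptions for illustration.

```python
import numpy as np

def band_limited_magnitude(patch, band="full", cutoff=0.25):
    """FFT magnitude of a square patch with only one frequency band kept.

    cutoff is a hypothetical radial frequency (cycles per pixel) that
    separates the low band from the high band.
    """
    p = patch.shape[0]
    mag = np.abs(np.fft.fft2(patch))
    f = np.fft.fftfreq(p)                                  # per-axis frequencies
    radius = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)    # radial frequency per bin
    if band == "low":
        keep = radius <= cutoff
    elif band == "high":
        keep = radius > cutoff
    elif band == "full":
        keep = np.ones_like(radius, dtype=bool)
    else:
        raise ValueError(f"unknown band: {band}")
    return mag * keep

# Toy usage on a random 8x8 patch.
rng = np.random.default_rng(0)
patch = rng.random((8, 8))
low = band_limited_magnitude(patch, "low")
high = band_limited_magnitude(patch, "high")
full = band_limited_magnitude(patch, "full")
```

Because the two masks are complementary, the low-band and high-band magnitudes sum to the full spectrum, and the zero-frequency (DC) bin always falls in the low band.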
5. Implementation Characteristics and Computational Overhead
FPS introduces modest complexity:
- Parameters: An added MLP with ≈0.27M parameters.
- Computation: +0.08 GFLOPs per input image, mainly from the $p \times p$ patch FFTs.
- Inference Steps: Single MLP forward pass per patch, cumulative geometric-mean fusion via log-domain summation, bilinear interpolation to multiscale maps, and final top-K query selection.
- Training: Joint DETR and FPS loss, with the FPS term scaled by a balancing weight $\lambda$.
FPS is optimized for efficiency, relying on highly parallelizable FFTs and lightweight per-patch inference. Its geometric-mean fusion is implemented with cumulative log-sums, which do not become a computational bottleneck.
6. Functional Significance and Related Methodologies
FPS serves as a frequency-domain “scout” that leverages local patch FFT magnitudes for soft, adaptive localization of possible small targets. This initial map not only suppresses background-dominated features in downstream encoder outputs (via DEE) but also strengthens reliability in query selection (via RCF). In the reported ablation, the two FPS-based partial configurations already recover 1.0 to 1.2 AP of SEF-DETR’s 1.8 AP gain over the DINO baseline, at a cost of roughly 0.27M parameters and 0.08 GFLOPs (Liu et al., 6 Jan 2026).
A plausible implication is that frequency-guided screening may become a generalizable paradigm for small object detection in other low SNR conditions where spatial-only cues are insufficient for robust self-attention signal initialization. No controversy or contradiction regarding FPS has been identified in comparative benchmarks.