Frequency-guided Patch Screening (FPS)

Updated 13 January 2026

Frequency-guided Patch Screening (FPS) is a learned frequency-domain technique that computes FFT magnitude spectra on overlapping image patches to identify small infrared targets.
It uses a shallow MLP and geometric-mean fusion to assign pixel-wise target relevance scores, effectively suppressing background noise in low SNR conditions.
Integrated as the first stage in SEF-DETR, FPS enhances self-attention query initialization, leading to improved detection accuracy and robust performance.

Frequency-guided Patch Screening (FPS) is a learned, frequency-domain module designed to address the challenge of small target discrimination in infrared small target detection (IRSTD) under low signal-to-noise conditions. FPS constructs a high-fidelity, pixel-wise “target relevance” map by analyzing the magnitude spectrum of local image patches via the discrete Fourier transform, with the express goal of suppressing background-dominated features and amplifying areas corresponding to infrared small targets. FPS constitutes the first stage of the SEF-DETR framework, integrating seamlessly with dynamic spatial enhancement and reliability-based fusion mechanisms to guide robust self-attention query initialization and improve end-to-end IRSTD performance (Liu et al., 6 Jan 2026).

1. Mathematical Foundations

Let $I \in \mathbb{R}^{H\times W}$ be a gray-scale infrared image. FPS extracts $J$ overlapping local patches $P_j \in \mathbb{R}^{p\times p}$ (for $j=1,\ldots,J$) using a sliding window of size $p\times p$ and stride $s$, typically with $p$ chosen to fully enclose small targets (e.g., $p=16$ or $32$). For each patch, the discrete 2D Fourier transform is computed:

$F_j(u,v) = \sum_{x=0}^{p-1} \sum_{y=0}^{p-1} P_j(x,y) \cdot e^{-i2\pi(ux/p + vy/p)}$

In practice, this transform is computed using the FFT, yielding a complex spectrum $F_j \in \mathbb{C}^{p\times p}$. The magnitude spectrum $A_j = |F_j| \in \mathbb{R}^{p\times p}$ is flattened to $a_j \in \mathbb{R}^{p^2}$. A shallow MLP with layer normalization (LN), followed by a sigmoid layer, produces a scalar target-relevance score $s_j \in (0,1)$ for each patch:

$h_j = \text{ReLU}(\text{LN}(W_1 a_j + b_1)), \quad z_j = W_2 h_j + b_2, \quad s_j = \sigma(z_j)$
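A minimal NumPy sketch of the per-patch scoring defined above. Several details are assumptions made only for illustration: the hidden width (`hidden = 64`), the random weights, the layer norm without learned scale and shift, and the helper names `extract_patches`, `layer_norm`, and `score_patches`. The source fixes none of these.

```python
import numpy as np

def extract_patches(image, p, stride):
    """Return (patches, top-left coords) for a p x p sliding window."""
    H, W = image.shape
    coords = [(r, c)
              for r in range(0, H - p + 1, stride)
              for c in range(0, W - p + 1, stride)]
    patches = np.stack([image[r:r + p, c:c + p] for r, c in coords])
    return patches, coords

def layer_norm(x, eps=1e-5):
    """LayerNorm over the last axis, without learned scale/shift (simplification)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def score_patches(patches, W1, b1, W2, b2):
    """s_j = sigmoid(W2 ReLU(LN(W1 a_j + b1)) + b2) for every patch."""
    A = np.abs(np.fft.fft2(patches, axes=(-2, -1)))   # magnitude spectra A_j
    a = A.reshape(len(patches), -1)                   # flatten to a_j in R^{p^2}
    h = np.maximum(layer_norm(a @ W1.T + b1), 0.0)    # ReLU(LN(.))
    z = h @ W2.T + b2                                 # logits z_j, shape (J, 1)
    return 1.0 / (1.0 + np.exp(-z[:, 0]))             # sigmoid

rng = np.random.default_rng(0)
p, stride, hidden = 16, 8, 64            # hidden width is an assumption
image = rng.random((64, 64))             # stand-in for an infrared frame
W1 = rng.normal(0, 0.01, (hidden, p * p))
b1 = np.zeros(hidden)
W2 = rng.normal(0, 0.01, (1, hidden))
b2 = np.zeros(1)

patches, coords = extract_patches(image, p, stride)
scores = score_patches(patches, W1, b1, W2, b2)
```

With untrained weights the scores carry no meaning; the sketch only shows the shapes and the order of operations.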

Patch overlap is exploited via geometric-mean fusion: for each pixel $(x,y)$ belonging to $n(x,y)$ patches indexed by $\mathcal{J}(x,y)$,

$S_\text{freq}(x,y) = \left(\prod_{j\in \mathcal{J}(x,y)} s_j \right)^{1/n(x,y)}$

This produces the frequency-guided density map $S_\text{freq} \in [0,1]^{H\times W}$, in which higher values denote increased likelihood of small target presence based on frequency cues.

Patch-level supervision is provided with binary labels $y_j$ (1 if any ground-truth target pixel lies within $P_j$, otherwise 0), enabling the use of binary cross-entropy loss:

$L_\text{freq} = -\frac{1}{J} \sum_{j=1}^J \left[ y_j \log s_j + (1-y_j) \log(1-s_j) \right]$
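The patch labels and the loss can be sketched as follows. This is an illustration under assumptions: the helper names `patch_labels` and `freq_loss`, the toy mask, and the clipping constant `eps` (a numerical guard not mentioned in the source) are all hypothetical.

```python
import numpy as np

def patch_labels(mask, p, stride):
    """y_j = 1 if any ground-truth target pixel falls inside patch j, else 0."""
    H, W = mask.shape
    return np.array([
        float(mask[r:r + p, c:c + p].any())
        for r in range(0, H - p + 1, stride)
        for c in range(0, W - p + 1, stride)
    ])

def freq_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy over the J patches (L_freq)."""
    s = np.clip(scores, eps, 1.0 - eps)   # clipping is an added numerical guard
    return float(-np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s)))

mask = np.zeros((64, 64), dtype=bool)
mask[30:32, 30:32] = True                 # one 2 x 2 target (toy example)
labels = patch_labels(mask, p=16, stride=8)

uninformed = np.full(labels.shape, 0.5)   # every patch scored 0.5
loss = freq_loss(uninformed, labels)      # equals ln 2 for constant 0.5 scores
```

In this toy case only 4 of the 49 patches are positive, which shows how sparse the patch-level labels are for a single small target.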

The overall joint training objective is $L = L_\text{Hungarian}(\text{DETR}) + \lambda \cdot L_\text{freq}$, with $\lambda=2$.

2. Algorithmic Procedure

The FPS workflow comprises the following explicit steps:

Patch Extraction & FFT: Slide a $p\times p$ window to extract $J$ overlapping patches from $I$. Compute the 2D FFT for each patch to obtain the magnitude spectra.
MLP-based Scoring: Flatten each patch’s magnitude spectrum and run it through the MLP+sigmoid to generate $s_j$.
Score Fusion: For each pixel $(x,y)$, fuse the overlapping patch scores using a geometric mean, accumulating in log-space for efficiency.
Output: Return $S_\text{freq}$ as a soft, pixel-wise relevance map.

Pseudocode

log_sum = zeros(H, W)
count = zeros(H, W)
for j in 1…J:
    (r_j, c_j) = top-left of patch j
    P_j = I[r_j : r_j+p, c_j : c_j+p]
    F_j = FFT2D(P_j)
    A_j = abs(F_j)
    a_j = flatten(A_j)
    h_j = ReLU(LN(W1 @ a_j + b1))
    z_j = W2 @ h_j + b2
    s_j = sigmoid(z_j)
    # accumulate this patch's log-score over every pixel it covers
    log_sum[r_j : r_j+p, c_j : c_j+p] += log(s_j)
    count[r_j : r_j+p, c_j : c_j+p] += 1
S_freq = exp(log_sum / count)    # evaluated wherever count > 0
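The fusion step can be made runnable as below. This is a sketch, not the reference implementation: the function name `fuse_scores`, the `eps` guard against `log(0)`, and the choice to leave uncovered pixels at 0 are assumptions.

```python
import numpy as np

def fuse_scores(scores, coords, shape, p, eps=1e-12):
    """Geometric-mean fusion of per-patch scores into a pixel-wise map."""
    log_sum = np.zeros(shape)
    count = np.zeros(shape)
    for s, (r, c) in zip(scores, coords):
        log_sum[r:r + p, c:c + p] += np.log(max(s, eps))  # guard against log(0)
        count[r:r + p, c:c + p] += 1
    fused = np.zeros(shape)
    covered = count > 0                                   # skip uncovered pixels
    fused[covered] = np.exp(log_sum[covered] / count[covered])
    return fused

# Toy case: two 4 x 4 patches overlapping in columns 2-3 of a 4 x 6 image.
coords = [(0, 0), (0, 2)]
scores = [0.9, 0.4]
S_freq = fuse_scores(scores, coords, shape=(4, 6), p=4)
```

Pixels covered by a single patch keep that patch's score, while pixels in the overlap receive the geometric mean of both scores.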

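The design favors the geometric mean over the arithmetic mean because a single low $s_j$ among the patches covering a pixel pulls the fused score down more strongly, so a pixel scores high only when its overlapping patches agree.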
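A small worked comparison illustrates the difference between the two means. The four scores are hypothetical values chosen for illustration, not figures from the source.

```python
import math

def arithmetic_mean(scores):
    return sum(scores) / len(scores)

def geometric_mean(scores):
    # Computed in log-space, mirroring the fusion rule for S_freq.
    return math.exp(sum(math.log(s) for s in scores) / len(scores))

# Hypothetical pixel covered by four patches, one of which scores low.
scores = [0.9, 0.9, 0.9, 0.1]
am = arithmetic_mean(scores)   # 0.7 (up to floating-point rounding)
gm = geometric_mean(scores)    # about 0.52
```

The single dissenting patch lowers the geometric mean well below the arithmetic mean, which is the behavior the fusion rule relies on.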

3. Role within SEF-DETR Architecture

FPS is the initial module in the SEF-DETR pipeline, followed sequentially by Dynamic Embedding Enhancement (DEE) and Reliability-Consistency-aware Fusion (RCF):

FPS generates $S_\text{freq}$, an initial map of frequency-based target likelihood.
DEE bilinearly resizes $S_\text{freq}$ to multiple spatial scales matching transformer encoder features $Q_i$; a binary mask $M_i$ is computed via a learned threshold, and used to amplify the encoder features at likely target locations: $Q_i' = Q_i \odot (1+M_i)$.
RCF applies a lightweight spatial classifier to enhanced features, producing spatial confidence $S_\text{spatial}$. Pixel-wise consistency $C(u,v)$ and reliability $R(u,v)$ are computed based on correspondence between $S_\text{freq}$ and $S_\text{spatial}$, yielding final scores $S_\text{final}(u,v)$ for top-K query selection.
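The DEE amplification $Q_i' = Q_i \odot (1+M_i)$ can be sketched for one scale as follows. The sketch rests on assumptions: a fixed threshold `tau` stands in for the learned threshold, the bilinear resize uses corner-aligned sampling, features are laid out as `(C, h, w)`, and the names `resize_bilinear` and `enhance` are hypothetical.

```python
import numpy as np

def resize_bilinear(m, out_h, out_w):
    """Separable bilinear resize of a 2D map (corner-aligned sampling, an assumption)."""
    h, w = m.shape
    xs = np.linspace(0, w - 1, out_w)
    ys = np.linspace(0, h - 1, out_h)
    # Interpolate along x for every row, then along y for every column.
    rows = np.stack([np.interp(xs, np.arange(w), row) for row in m])      # (h, out_w)
    return np.stack([np.interp(ys, np.arange(h), col) for col in rows.T],
                    axis=1)                                               # (out_h, out_w)

def enhance(Q, S_freq, tau=0.5):
    """Q' = Q * (1 + M), with M = [resized S_freq > tau]; tau is fixed here, not learned."""
    _, h, w = Q.shape
    M = (resize_bilinear(S_freq, h, w) > tau).astype(Q.dtype)
    return Q * (1.0 + M), M

S_freq = np.zeros((32, 32))
S_freq[8:16, 8:16] = 0.9        # toy relevance map with one likely target region
Q = np.ones((4, 8, 8))          # toy encoder features (C, h, w)
Q_enh, M = enhance(Q, S_freq)
```

Features inside the mask are doubled and all others pass through unchanged, so the enhancement never suppresses a location outright.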

This positional and frequency-aware initialization feeds forward into the transformer decoder for robust object detection.
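The top-K selection that seeds the decoder queries can be sketched generically. The formulas combining $C(u,v)$ and $R(u,v)$ into $S_\text{final}$ are not reproduced here, so the sketch takes an arbitrary score map as input; the function name `top_k_positions` and the toy values are hypothetical.

```python
import numpy as np

def top_k_positions(score_map, k):
    """Return the (row, col) positions of the k highest scores, best first."""
    flat = score_map.ravel()
    idx = np.argpartition(flat, -k)[-k:]          # unordered top-k indices
    idx = idx[np.argsort(flat[idx])[::-1]]        # sort descending by score
    return np.stack(np.unravel_index(idx, score_map.shape), axis=1)

S_final = np.zeros((8, 8))         # stand-in for the fused score map
S_final[2, 3] = 0.9
S_final[5, 1] = 0.7
S_final[6, 6] = 0.2
positions = top_k_positions(S_final, k=2)
```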

4. Empirical Analyses and Visualization

The effectiveness of FPS has been substantiated via ablation studies, frequency-band analyses, and qualitative evaluations on the IRSTD-1k dataset:

Ablation Study (AP on IRSTD-1k):

| Configuration | AP |
|---|---|
| DINO (ResNet-50 baseline) | 37.1 |
| + FPS + DEE | 38.3 |
| + FPS + RCF | 38.1 |
| Full SEF-DETR (FPS+DEE+RCF) | 38.9 |

Every configuration that includes FPS improves on the 37.1 AP baseline by at least 1.0 AP, and the full model gains 1.8 AP; the table does not report FPS in isolation.

Frequency Band Analysis:
- High-freq only: AP = 37.8
- Low-freq only: AP = 38.4
- Full spectrum: AP = 38.9

Inclusion of the full frequency spectrum maximizes performance; mid-band FFT magnitudes, in particular, correlate with true targets.
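One way such a band restriction could be set up is with a radial mask on the FFT magnitude, as in the sketch below. The source does not state how the bands were defined, so the radial criterion, the cutoff at `p / 4`, and the names `radial_band_mask` and `band_limited_magnitude` are assumptions. The sketch assumes square patches.

```python
import numpy as np

def radial_band_mask(p, r_lo, r_hi):
    """Boolean p x p mask keeping FFT bins whose radial frequency lies in [r_lo, r_hi)."""
    f = np.fft.fftfreq(p) * p                      # integer frequencies in FFT order
    radius = np.sqrt(f[:, None] ** 2 + f[None, :] ** 2)
    return (radius >= r_lo) & (radius < r_hi)

def band_limited_magnitude(patch, r_lo, r_hi):
    """Magnitude spectrum with bins outside the chosen band zeroed."""
    A = np.abs(np.fft.fft2(patch))
    return A * radial_band_mask(patch.shape[0], r_lo, r_hi)

p = 16
cutoff = p / 4                                     # hypothetical low/high split
patch = np.random.default_rng(0).random((p, p))
low = band_limited_magnitude(patch, 0, cutoff)
high = band_limited_magnitude(patch, cutoff, np.inf)
full = np.abs(np.fft.fft2(patch))
```

Because the two masks partition the spectrum, `low + high` reproduces the full magnitude map, so each band-only variant feeds the MLP a strict subset of the full-spectrum input.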

Visualizations:
- FFT magnitude maps highlight mid-frequency spikes for true targets, absent from background/distractor patches.
- Baseline self-attention queries often trigger on background clutter; FPS-guided queries concentrate on authentic small targets, as shown in spatial confidence maps.
- End-to-end detection visualizations demonstrate improved suppression of false alarms and recovery of dim or missed targets.

5. Implementation Characteristics and Computational Overhead

FPS introduces modest complexity:

Parameters: An added MLP with ≈0.27M parameters.
Computation: +0.08 GFLOPs for a $640\times640$ input, mainly from $\sim1600$ FFTs of size $32\times32$.
Inference Steps: Single MLP forward pass per patch, cumulative geometric-mean fusion via log-domain summation, bilinear interpolation to multiscale maps, and final top-K query selection.
Training: Joint DETR and FPS loss, with typical balancing ($\lambda=2$).
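The quoted figures can be sanity-checked with simple arithmetic. The source states neither the stride nor the hidden width, so `stride=16` and `hidden=256` below are guesses that happen to land near the quoted numbers, and the layer layout in `mlp_param_count` is an assumed reading of the two-layer MLP with LayerNorm.

```python
def patch_count(H, W, p, stride):
    """Number of p x p windows that fit in an H x W image at the given stride."""
    return ((H - p) // stride + 1) * ((W - p) // stride + 1)

def mlp_param_count(p, hidden):
    """Parameters of Linear(p*p -> hidden) + LayerNorm(hidden) + Linear(hidden -> 1)."""
    first = p * p * hidden + hidden    # W1, b1
    norm = 2 * hidden                  # LayerNorm scale and shift
    second = hidden + 1                # W2, b2
    return first + norm + second

J = patch_count(640, 640, p=32, stride=16)     # 39 * 39 = 1521 patches
n_params = mlp_param_count(p=32, hidden=256)   # 263169, roughly 0.26M
```

Under these assumed settings the patch count is of the same order as the roughly 1600 FFTs quoted above, and the parameter count is close to the stated 0.27M.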

FPS is optimized for efficiency, relying on highly parallelizable FFTs and lightweight per-patch inference. Its geometric-mean fusion is implemented with cumulative log-sums, which do not become a computational bottleneck.

FPS serves as a frequency-domain “scout” that leverages local patch FFT magnitudes for soft, adaptive localization of possible small targets. This initial map not only suppresses background-dominated features in downstream encoder outputs (via DEE) but also strengthens reliability in query selection (via RCF). In the reported ablations, the configurations that include FPS gain 1.0 to 1.8 AP over the DINO baseline while imposing minimal computational burden (Liu et al., 6 Jan 2026).

A plausible implication is that frequency-guided screening may become a generalizable paradigm for small object detection in other low SNR conditions where spatial-only cues are insufficient for robust self-attention signal initialization. No controversy or contradiction regarding FPS has been identified in comparative benchmarks.

References (1)

Liu et al. (2026). Breaking Self-Attention Failure: Rethinking Query Initialization for Infrared Small Target Detection.