
Frequency-guided Patch Screening (FPS)

Updated 13 January 2026
  • Frequency-guided Patch Screening (FPS) is a learned frequency-domain technique that computes FFT magnitude spectra on overlapping image patches to identify small infrared targets.
  • It uses a shallow MLP and geometric-mean fusion to assign pixel-wise target relevance scores, effectively suppressing background noise in low SNR conditions.
  • Integrated as the first stage in SEF-DETR, FPS enhances self-attention query initialization, leading to improved detection accuracy and robust performance.

Frequency-guided Patch Screening (FPS) is a learned, frequency-domain module for infrared small target detection (IRSTD), where small targets must be discriminated from background under low signal-to-noise conditions. FPS builds a pixel-wise “target relevance” map by analyzing the magnitude spectrum of local image patches via the discrete Fourier transform, with the goal of suppressing background-dominated features and amplifying regions that correspond to infrared small targets. FPS is the first stage of the SEF-DETR framework, where it works with dynamic spatial enhancement and reliability-based fusion to guide self-attention query initialization and improve end-to-end IRSTD performance (Liu et al., 6 Jan 2026).

1. Mathematical Foundations

Let $I \in \mathbb{R}^{H\times W}$ be a gray-scale infrared image. FPS extracts $J$ overlapping local patches $P_j \in \mathbb{R}^{p\times p}$ (for $j=1,\ldots,J$) using a sliding window of size $p\times p$ and stride $s$, typically with $p$ chosen to fully enclose small targets (e.g., $p=16$ or $32$). For each patch, the discrete 2D Fourier transform is computed:

$$F_j(u,v) = \sum_{x=0}^{p-1} \sum_{y=0}^{p-1} P_j(x,y) \cdot e^{-i2\pi(ux/p + vy/p)}$$

In practice, this transform is computed using the FFT, yielding a complex spectrum $F_j \in \mathbb{C}^{p\times p}$. The magnitude spectrum $A_j = |F_j| \in \mathbb{R}^{p\times p}$ is flattened to $a_j \in \mathbb{R}^{p^2}$. A shallow MLP, followed by a sigmoid layer, produces a scalar target-relevance score $s_j \in (0,1)$ for each patch:

$$h_j = \text{ReLU}(\text{LN}(W_1 a_j + b_1)), \quad z_j = W_2 h_j + b_2, \quad s_j = \sigma(z_j)$$
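The patch transform and MLP scoring defined above can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the helper names (`extract_patches`, `layer_norm`, `score_patches`), the hidden width, the random weights, and the omission of LayerNorm's learnable scale and shift are all assumptions made for the example.

```python
import numpy as np

def extract_patches(image, p, s):
    """Slide a p x p window with stride s; return patches and their top-left corners."""
    H, W = image.shape
    corners = [(r, c) for r in range(0, H - p + 1, s)
                      for c in range(0, W - p + 1, s)]
    patches = np.stack([image[r:r + p, c:c + p] for r, c in corners])
    return patches, np.array(corners)

def layer_norm(x, eps=1e-5):
    # Normalise over the feature axis; learnable scale/shift omitted (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def score_patches(patches, W1, b1, W2, b2):
    """Compute s_j = sigmoid(W2 ReLU(LN(W1 a_j + b1)) + b2) for every patch."""
    A = np.abs(np.fft.fft2(patches, axes=(-2, -1)))   # magnitude spectra A_j
    a = A.reshape(len(patches), -1)                   # flattened a_j in R^{p^2}
    h = np.maximum(layer_norm(a @ W1.T + b1), 0.0)    # hidden activations h_j
    z = h @ W2.T + b2                                 # logits z_j, shape (J, 1)
    return 1.0 / (1.0 + np.exp(-z[:, 0]))             # scores s_j in (0, 1)

rng = np.random.default_rng(0)
p, s, hidden = 16, 8, 32                  # hidden width is a hypothetical choice
image = rng.normal(size=(64, 64))         # stand-in for an infrared image
patches, corners = extract_patches(image, p, s)
W1 = rng.normal(scale=0.01, size=(hidden, p * p)); b1 = np.zeros(hidden)
W2 = rng.normal(scale=0.01, size=(1, hidden));     b2 = np.zeros(1)
scores = score_patches(patches, W1, b1, W2, b2)
print(patches.shape, scores.shape)
```

With untrained random weights the scores carry no meaning; the sketch only shows the shapes and the order of operations.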

Patch overlap is exploited via geometric-mean fusion: for each pixel $(x,y)$ belonging to $n(x,y)$ patches indexed by $\mathcal{J}(x,y)$,

$$S_\text{freq}(x,y) = \left(\prod_{j\in \mathcal{J}(x,y)} s_j \right)^{1/n(x,y)}$$

This produces the frequency-guided density map $S_\text{freq} \in [0,1]^{H\times W}$, in which higher values denote increased likelihood of small target presence based on frequency cues.
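The fusion rule can be sketched by accumulating log-scores over each patch footprint and exponentiating the per-pixel mean. The function name `fuse_geometric_mean`, the `eps` guard, and the choice to leave uncovered pixels at zero are assumptions for this example rather than details from the paper.

```python
import numpy as np

def fuse_geometric_mean(scores, corners, p, shape, eps=1e-12):
    """Per-pixel geometric mean of the scores of all p x p patches covering it."""
    log_sum = np.zeros(shape)
    count = np.zeros(shape)
    for s_j, (r, c) in zip(scores, corners):
        log_sum[r:r + p, c:c + p] += np.log(s_j + eps)  # eps guards against log(0)
        count[r:r + p, c:c + p] += 1
    S_freq = np.zeros(shape)              # pixels covered by no patch stay at 0 (assumption)
    covered = count > 0
    S_freq[covered] = np.exp(log_sum[covered] / count[covered])
    return S_freq

# Two 2x2 patches that overlap in the middle column of a 2x3 image.
scores = np.array([0.9, 0.4])
corners = [(0, 0), (0, 1)]
S = fuse_geometric_mean(scores, corners, p=2, shape=(2, 3))
print(S[0])   # approximately [0.9, 0.6, 0.4]
```

In the overlapping column the fused value is $\sqrt{0.9 \cdot 0.4} = 0.6$, while pixels covered by a single patch keep that patch's score.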

Patch-level supervision is provided with binary labels $y_j$ (1 if any ground-truth target pixel lies within $P_j$, otherwise 0), enabling the use of binary cross-entropy loss:

$$L_\text{freq} = -\frac{1}{J} \sum_{j=1}^J \left[ y_j \log s_j + (1-y_j) \log(1-s_j) \right]$$

The overall joint objective is $L = L_\text{Hungarian}(\text{DETR}) + \lambda \cdot L_\text{freq}$, with $\lambda=2$.
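A small sketch of the patch labels and the loss, assuming the ground truth is available as a binary pixel mask. The names `patch_labels` and `bce_loss`, the clipping constant, and the placeholder value for the detection loss are hypothetical.

```python
import numpy as np

def patch_labels(mask, corners, p):
    """y_j = 1 if any ground-truth target pixel falls inside patch j, else 0."""
    return np.array([float(mask[r:r + p, c:c + p].any()) for r, c in corners])

def bce_loss(scores, labels, eps=1e-7):
    """Mean binary cross-entropy over patches; scores are clipped for stability."""
    s = np.clip(scores, eps, 1 - eps)
    return -np.mean(labels * np.log(s) + (1 - labels) * np.log(1 - s))

mask = np.zeros((4, 4), dtype=bool)
mask[0, 0] = True                               # a single target pixel
corners = [(0, 0), (0, 2), (2, 0), (2, 2)]      # four non-overlapping 2x2 patches
y = patch_labels(mask, corners, p=2)            # only the first patch is positive
scores = np.array([0.8, 0.1, 0.2, 0.3])         # made-up patch scores
L_freq = bce_loss(scores, y)

lam = 2.0                                       # lambda from the text
L_hungarian = 1.5                               # placeholder for the DETR matching loss
L_total = L_hungarian + lam * L_freq
print(y, L_freq, L_total)
```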

2. Algorithmic Procedure

The FPS workflow comprises the following explicit steps:

  1. Patch Extraction & FFT: Slide a $p\times p$ window to extract $J$ overlapping patches from $I$. Compute the 2D FFT for each patch to obtain the magnitude spectra.
  2. MLP-based Scoring: Flatten each patch’s magnitude spectrum and run it through the MLP+sigmoid to generate $s_j$.
  3. Score Fusion: For each pixel $(x,y)$, fuse the overlapping patch scores using a geometric mean, accumulating in log-space for efficiency.
  4. Output: Return $S_\text{freq}$ as a soft, pixel-wise relevance map.

Pseudocode

for j in 1..J:
    (x_j, y_j) = top-left corner of patch j
    P_j = I[x_j : x_j+p, y_j : y_j+p]
    F_j = FFT2D(P_j)
    A_j = abs(F_j)
    a_j = flatten(A_j)
    h_j = ReLU(LN(W1 @ a_j + b1))
    z_j = W2 @ h_j + b2
    s_j = sigmoid(z_j)
for each pixel (x, y) in S_freq:
    log_sum[x, y] = sum of log(s_j) over all patches j covering (x, y)
    count[x, y] = number of patches covering (x, y)
    S_freq[x, y] = exp(log_sum[x, y] / count[x, y])

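The design favors the geometric mean over the arithmetic mean because a single low $s_j$ among the overlapping patches pulls the fused score down sharply, whereas an arithmetic mean would let high-scoring neighbors mask it.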
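A short numeric comparison illustrates the difference between the two means. The four scores are made-up values for a pixel covered by three confident patches and one that rejects it.

```python
import numpy as np

scores = np.array([0.9, 0.9, 0.9, 0.05])       # hypothetical overlapping patch scores

arithmetic = scores.mean()                     # 0.6875
geometric = np.exp(np.log(scores).mean())      # roughly 0.437, computed in log-space

print(f"arithmetic={arithmetic:.3f}, geometric={geometric:.3f}")
```

The single low score moves the geometric mean well below 0.5, while the arithmetic mean stays close to 0.7.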

3. Role within SEF-DETR Architecture

FPS is the initial module in the SEF-DETR pipeline, followed sequentially by Dynamic Embedding Enhancement (DEE) and Reliability-Consistency-aware Fusion (RCF):

  • FPS generates $S_\text{freq}$, an initial map of frequency-based target likelihood.
  • DEE bilinearly resizes $S_\text{freq}$ to multiple spatial scales matching transformer encoder features $Q_i$; a binary mask $M_i$ is computed via a learned threshold, and used to amplify the encoder features at likely target locations: $Q_i' = Q_i \odot (1+M_i)$.
  • RCF applies a lightweight spatial classifier to enhanced features, producing spatial confidence $S_\text{spatial}$. Pixel-wise consistency $C(u,v)$ and reliability $R(u,v)$ are computed based on correspondence between $S_\text{freq}$ and $S_\text{spatial}$, yielding final scores $S_\text{final}(u,v)$ for top-K query selection.

This positional and frequency-aware initialization feeds forward into the transformer decoder for robust object detection.
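The DEE amplification step $Q_i' = Q_i \odot (1+M_i)$ can be sketched as below for a single scale. The source describes a learned threshold; here it is replaced by a fixed, hypothetical value `tau`. The hand-written `bilinear_resize` helper (with align-corners style sampling) and the function name `enhance` are likewise assumptions for the example. The RCF consistency and reliability terms are not sketched because their formulas are not given here.

```python
import numpy as np

def bilinear_resize(x, out_h, out_w):
    """Bilinear resize of a 2D map, sampling on a grid that includes both borders."""
    H, W = x.shape
    rows = np.linspace(0, H - 1, out_h)
    cols = np.linspace(0, W - 1, out_w)
    r0 = np.floor(rows).astype(int); r1 = np.minimum(r0 + 1, H - 1)
    c0 = np.floor(cols).astype(int); c1 = np.minimum(c0 + 1, W - 1)
    fr = (rows - r0)[:, None]                 # vertical interpolation weights
    fc = (cols - c0)[None, :]                 # horizontal interpolation weights
    top = x[r0][:, c0] * (1 - fc) + x[r0][:, c1] * fc
    bot = x[r1][:, c0] * (1 - fc) + x[r1][:, c1] * fc
    return top * (1 - fr) + bot * fr

def enhance(Q, S_freq, tau):
    """Amplify encoder features Q of shape (C, h, w) where the resized map exceeds tau."""
    _, h, w = Q.shape
    M = (bilinear_resize(S_freq, h, w) > tau).astype(Q.dtype)   # binary mask M_i
    return Q * (1.0 + M)[None], M                               # Q_i' = Q_i * (1 + M_i)

S_freq = np.zeros((8, 8))
S_freq[:4, :4] = 1.0                          # a confident region in the top-left corner
Q = np.ones((4, 4, 4))                        # toy encoder feature map (C=4, 4x4)
Q_enh, M = enhance(Q, S_freq, tau=0.5)        # tau is a fixed stand-in for the learned threshold
print(M)
```

Features inside the mask are doubled and all others pass through unchanged.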

4. Empirical Analyses and Visualization

The effectiveness of FPS has been substantiated via ablation studies, frequency-band analyses, and qualitative evaluations on the IRSTD-1k dataset:

  • Ablation Study (AP on IRSTD-1k):

| Configuration | AP |
|---|---|
| DINO (ResNet-50 baseline) | 37.1 |
| + FPS + DEE | 38.3 |
| + FPS + RCF | 38.1 |
| Full SEF-DETR (FPS+DEE+RCF) | 38.9 |

Every FPS-based configuration improves on the baseline, by 1.0 to 1.8 AP.

  • Frequency Band Analysis:
    • High-freq only: AP = 37.8
    • Low-freq only: AP = 38.4
    • Full spectrum: AP = 38.9

Inclusion of the full frequency spectrum maximizes performance; mid-band FFT magnitudes, in particular, correlate with true targets.

  • Visualizations:
    • FFT magnitude maps highlight mid-frequency spikes for true targets, absent from background/distractor patches.
    • Baseline self-attention queries often trigger on background clutter; FPS-guided queries concentrate on authentic small targets, as shown in spatial confidence maps.
    • End-to-end detection visualizations demonstrate improved suppression of false alarms and recovery of dim or missed targets.
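The frequency-band comparison above restricts the input to part of the spectrum. How the bands were defined is not stated here, so the following is only one plausible way to build such masks: a radial cutoff on the normalised frequency of each FFT bin. The function name `band_mask` and the cutoff value are hypothetical.

```python
import numpy as np

def band_mask(p, band, cutoff=0.25):
    """Boolean p x p mask over unshifted FFT bins; cutoff is in cycles per pixel."""
    f = np.fft.fftfreq(p)                        # frequency of each bin along one axis
    radius = np.hypot(f[:, None], f[None, :])    # radial frequency of each 2D bin
    if band == "low":
        return radius <= cutoff
    if band == "high":
        return radius > cutoff
    return np.ones((p, p), dtype=bool)           # full spectrum

patch = np.random.default_rng(0).normal(size=(16, 16))
A = np.abs(np.fft.fft2(patch))                   # magnitude spectrum of one patch
A_low = A * band_mask(16, "low")                 # keeps the DC term and slow variations
A_high = A * band_mask(16, "high")               # keeps fine detail only
print(band_mask(16, "low").sum(), band_mask(16, "high").sum())
```

The two masks partition the spectrum, so `A_low + A_high` recovers the full-spectrum input.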

5. Implementation Characteristics and Computational Overhead

FPS introduces modest complexity:

  • Parameters: An added MLP with ≈0.27M parameters.
  • Computation: +0.08 GFLOPs for a $640\times640$ input, mainly from $\sim1600$ FFTs of size $32\times32$.
  • Inference Steps: Single MLP forward pass per patch, cumulative geometric-mean fusion via log-domain summation, bilinear interpolation to multiscale maps, and final top-K query selection.
  • Training: Joint DETR and FPS loss, with typical balancing ($\lambda=2$).

FPS is optimized for efficiency, relying on highly parallelizable FFTs and lightweight per-patch inference. Its geometric-mean fusion is implemented with cumulative log-sums, which do not become a computational bottleneck.
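As a rough sanity check on the reported overhead, the patch count and FFT cost can be estimated with back-of-the-envelope arithmetic. The stride of 16 is an assumption (it is not stated above), and $5N\log_2 N$ is the conventional flop estimate for a radix-2 complex FFT of $N$ points, not a figure from the paper.

```python
import math

def patches_per_image(size, p, s):
    """Number of p x p windows with stride s on a square size x size image."""
    per_axis = (size - p) // s + 1
    return per_axis ** 2

def fft2_flops(p):
    """Conventional 5 * N * log2(N) estimate for a complex FFT with N = p * p points."""
    n = p * p
    return 5 * n * math.log2(n)

n_patches = patches_per_image(640, p=32, s=16)        # stride 16 is an assumed value
total_gflops = n_patches * fft2_flops(32) / 1e9
print(n_patches, round(total_gflops, 3))              # 1521 0.078
```

Under these assumptions the estimate lands close to the figures quoted above (about 1600 FFTs and +0.08 GFLOPs).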

FPS serves as a frequency-domain “scout” that leverages local patch FFT magnitudes for soft, adaptive localization of possible small targets. This initial map not only suppresses background-dominated features in downstream encoder outputs (via DEE) but also strengthens reliability in query selection (via RCF). The ablations indicate that FPS-based configurations account for SEF-DETR’s gain over the DINO baseline, at a small computational cost (Liu et al., 6 Jan 2026).

A plausible implication is that frequency-guided screening may generalize to small object detection in other low-SNR settings where spatial cues alone are insufficient for robust query initialization. No controversy or contradiction regarding FPS has been identified in comparative benchmarks.
