SEF-DETR: Infrared Detection Transformer
- The paper introduces SEF-DETR, which enhances IR small target detection by integrating frequency-guided patch screening, dynamic embedding enhancement, and reliability-consistency fusion.
- It employs a modified CNN+Transformer architecture—using a DINO-based DETR backbone—to adaptively emphasize target-relevant features in cluttered infrared images.
- Empirical results on public benchmarks demonstrate state-of-the-art performance with superior precision and minimal computational overhead.
The SEF-DETR (Self-attention Enhanced Feature-DEtection TRansformer) framework is a detection architecture specifically designed to address the challenges of infrared small target detection (IRSTD), where targets present extremely low signal-to-noise ratios, minute object scale, and cluttered backgrounds. SEF-DETR extends a CNN+Transformer encoder–decoder pipeline, integrating a three-stage enhancement process—Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF)—to overcome the characteristic embedding dilution in standard DETR for IRSTD tasks. The framework empirically delivers state-of-the-art performance on public benchmarks with marginal computational overhead (Liu et al., 6 Jan 2026).
1. Architecture and Design Principles
SEF-DETR augments a DINO-based DETR backbone (ResNet-50 + encoder–decoder Transformer) for IRSTD. The primary innovation is the SEF pipeline, which systematically distills target-relevant information using:
- FPS – identifies potential target locations in the frequency domain.
- DEE – amplifies target-associated encoder features in frequency-highlighted regions.
- RCF – fuses spatial and frequency cues to select highly reliable decoder queries.
Unlike vanilla DETR, which relies on static, uninformative object queries, SEF-DETR injects target-driven priors into the query initialization by dynamically selecting content-adaptive query embeddings from the enhanced final feature map. This process explicitly addresses the dominance of background noise and the unreliability of standard self-attention in complex IR backgrounds.
2. Core Modules and Mathematical Formulation
2.1 Frequency-guided Patch Screening (FPS)
FPS computes a pixelwise target-relevance density map using local Fourier analysis:
- Patch Extraction: The input image is partitioned into overlapping patches P_i of fixed size s × s.
- Fourier Representation: For each patch, the 2D Fourier transform is computed and its magnitude spectrum F_i serves as the patch descriptor.
- MLP Scoring: F_i is mapped through a two-layer MLP with LayerNorm+ReLU and a sigmoid-activated linear classifier; the output r_i ∈ (0, 1) denotes the patch “target-relevance score,” supervised by a binary cross-entropy frequency loss L_freq, where the patch label is 1 iff P_i overlaps a ground-truth box.
- Density Map Reconstruction: The final density D(x, y) is obtained via geometric-mean fusion over all patches covering (x, y): D(x, y) = (∏_{i : (x, y) ∈ P_i} r_i)^{1/n(x, y)}, with n(x, y) the number of patches covering (x, y).
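The geometric-mean density reconstruction can be sketched as follows. The `score_fn` below is a hypothetical stand-in for the paper's two-layer MLP scorer (here, a sigmoid of the patch's mean high-frequency Fourier magnitude), and the patch size and stride are illustrative, not the paper's values:

```python
import numpy as np

def fps_density(image, patch_size=8, stride=4, score_fn=None):
    """Sketch of FPS density reconstruction: score each overlapping patch
    from its Fourier magnitude, then fuse per-pixel scores by geometric
    mean over all patches covering that pixel."""
    H, W = image.shape
    if score_fn is None:
        # Hypothetical stand-in for the MLP scorer: sigmoid of the mean
        # Fourier magnitude with the DC component removed.
        def score_fn(patch):
            mag = np.abs(np.fft.fft2(patch))
            mag[0, 0] = 0.0  # drop the DC component
            return float(1.0 / (1.0 + np.exp(-mag.mean())))
    log_sum = np.zeros((H, W))
    count = np.zeros((H, W))
    for y in range(0, H - patch_size + 1, stride):
        for x in range(0, W - patch_size + 1, stride):
            s = score_fn(image[y:y + patch_size, x:x + patch_size])
            log_sum[y:y + patch_size, x:x + patch_size] += np.log(s + 1e-8)
            count[y:y + patch_size, x:x + patch_size] += 1
    count = np.maximum(count, 1)  # pixels covered by no patch stay neutral
    return np.exp(log_sum / count)  # geometric mean of covering-patch scores
```

Accumulating log-scores and exponentiating the mean is numerically safer than multiplying many scores in (0, 1) directly.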
2.2 Dynamic Embedding Enhancement (DEE)
DEE strengthens multi-scale encoder features to ensure downstream queries retain a target-aware focus:
- For multi-scale encoder outputs {E_l}, resize the density map D to D_l to match each scale l.
- Apply thresholding with a learnable parameter τ to build binary masks M_l: M_l(x, y) = 1 if D_l(x, y) > τ, else 0.
- Modulate encoder features: Ẽ_l = E_l ⊙ (1 + M_l), where ⊙ denotes element-wise multiplication. Regions with M_l(x, y) = 1 are thus boosted 2×.
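A minimal sketch of the DEE step, assuming nearest-neighbour resizing of the density map and a fixed stand-in value for the learnable threshold τ:

```python
import numpy as np

def dee_enhance(feat, density, tau=0.5):
    """Sketch of Dynamic Embedding Enhancement: resize the FPS density map
    to the feature resolution, threshold it with tau (learnable in the
    paper, fixed here), and double the features inside the binary mask."""
    h, w = feat.shape[-2:]
    # Nearest-neighbour resize of the density map to the feature scale.
    ys = (np.arange(h) * density.shape[0] / h).astype(int)
    xs = (np.arange(w) * density.shape[1] / w).astype(int)
    d = density[np.ix_(ys, xs)]
    mask = (d > tau).astype(feat.dtype)  # binary mask M_l
    return feat * (1.0 + mask)           # masked regions boosted 2x
```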
2.3 Reliability-Consistency-aware Fusion (RCF)
RCF fuses spatial and frequency domain confidences to select the best query locations for the decoder:
- From the enhanced encoder features Ẽ, derive a spatial confidence map C_s.
- For each spatial location (x, y), read the frequency confidence from the resized density map D.
- Fuse confidences: combine C_s(x, y) with the frequency confidence into a final reliability score C_f(x, y).
- Selection: The top-K locations by C_f define where refined queries are drawn for the decoder.
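The fusion-and-selection step might be illustrated as below; the geometric-mean fusion rule is an assumption for illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def rcf_select(spatial_conf, density, k=4):
    """Sketch of Reliability-Consistency-aware Fusion: combine the spatial
    confidence map with the frequency density (geometric mean assumed here
    as the fusion rule) and return the top-k query locations."""
    fused = np.sqrt(spatial_conf * density)
    flat = fused.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(flat[idx])[::-1]]  # sort the top-k by fused score
    ys, xs = np.unravel_index(idx, fused.shape)
    return list(zip(ys.tolist(), xs.tolist())), fused
```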
3. Modified Query Initialization
SEF-DETR’s dynamic query initialization diverges fundamentally from static query approaches in DETR. Rather than image-agnostic learnable vectors, SEF-DETR uses the SEF-enhanced feature tensors and spatial/frequency fusion to select the top-scoring locations and draws decoder queries from them. The procedural steps comprise:
- Enhancement of multi-scale encoder features using the density map D and threshold τ.
- Computation of the spatial confidence C_s and the final fused score C_f.
- Selection of the top-K embeddings from the enhanced features at locations maximizing C_f, augmented with positional encodings.
- Feeding these content-adaptive queries into a standard Transformer decoder for detection.
The method thus establishes image- and context-adaptive object queries, directly addressing embedding dilution and enhancing localization reliability for small targets.
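Under the assumptions above, the query-initialization step reduces to gathering feature vectors at the selected locations; the sinusoidal positional encoding below is a simplified stand-in for the paper's:

```python
import numpy as np

def init_queries(enhanced_feat, locations):
    """Sketch of content-adaptive query initialization: gather embeddings
    from the enhanced feature map (C, H, W) at the top-scoring locations
    and append a toy sinusoidal positional encoding."""
    queries = []
    for y, x in locations:
        content = enhanced_feat[:, y, x]           # content embedding
        pos = np.array([np.sin(y / 10.0), np.cos(x / 10.0)])  # toy pos-enc
        queries.append(np.concatenate([content, pos]))
    return np.stack(queries)  # shape (num_queries, C + 2)
```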
4. Training Objective and Optimization
SEF-DETR utilizes a composite loss that sums the DETR-style Hungarian set prediction loss L_Hungarian and the FPS frequency loss L_freq, where:
- L_Hungarian computes the optimal box/class assignment using the Hungarian algorithm, combining class log-probabilities with a sum of ℓ1 box-distance and GIoU terms,
- L_freq supervises the binary patch-relevance classification in the FPS branch.
Training is fully end-to-end, with all modules differentiable and integrated within the overall network graph.
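The FPS frequency loss is a standard binary cross-entropy over patch relevance scores; a minimal sketch (the Hungarian set prediction term is omitted for brevity):

```python
import numpy as np

def frequency_bce(scores, labels, eps=1e-8):
    """Sketch of the FPS frequency loss: binary cross-entropy between patch
    relevance scores and labels (1 iff the patch overlaps a GT box)."""
    scores = np.clip(scores, eps, 1 - eps)  # guard the logs
    return float(-np.mean(labels * np.log(scores)
                          + (1 - labels) * np.log(1 - scores)))
```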
5. Inference Pipeline
Inference in SEF-DETR proceeds as follows:
- FPS: Compute the density map D from the input IR image.
- DEE: Apply the learned threshold τ to enhance the encoder output features E_l.
- RCF: Construct the spatial confidence map C_s; fuse it with the frequency density to obtain C_f.
- Select the top-K query embeddings from the enhanced features at the highest-scoring locations to form decoder queries.
- Apply the Transformer decoder to predict box deltas and class logits.
- Softmax over classes; either threshold the outputs or select the top-K predictions. No non-maximum suppression (NMS) is applied.
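The NMS-free post-processing reduces to a softmax followed by top-k selection over queries; a minimal sketch:

```python
import numpy as np

def postprocess(logits, boxes, top_k=3):
    """Sketch of NMS-free post-processing: softmax the class logits per
    query, then keep the top-k (score, class, box) triples overall."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    scores = probs.max(axis=-1)
    classes = probs.argmax(axis=-1)
    order = np.argsort(scores)[::-1][:top_k]  # highest-scoring queries first
    return [(float(scores[i]), int(classes[i]), boxes[i]) for i in order]
```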
6. Empirical Results and Comparative Evaluation
SEF-DETR sets new benchmarks for IRSTD tasks across three public datasets (IRSTD-1k, NUAA-SIRST, NUDT-SIRST).
6.1. CNN-based Methods Comparison
| Method | IRSTD-1k (P/R/F1) | NUAA-SIRST (P/R/F1) | NUDT-SIRST (P/R/F1) |
|---|---|---|---|
| MDvsFA | 55.0/48.3/47.5 | 84.5/50.7/59.7 | 60.8/19.2/26.2 |
| AGPCNet | 41.5/47.0/44.1 | 39.0/81.0/52.7 | 36.8/68.4/47.9 |
| ACM | 67.9/60.5/64.0 | 76.5/76.2/76.3 | 73.2/74.5/73.8 |
| ISNet | 71.8/74.1/72.9 | 82.0/84.7/83.4 | 74.2/83.4/78.5 |
| ACLNet | 84.3/65.6/73.8 | 84.8/78.0/81.3 | 86.8/77.2/81.7 |
| DNANet | 76.8/72.1/74.4 | 84.7/83.6/84.1 | 91.4/88.9/90.1 |
| EFLNet | 87.0/81.7/84.3 | 88.2/85.8/87.0 | 96.3/93.1/94.7 |
| YOLOv8m | 85.7/79.5/82.5 | 93.4/88.4/90.8 | 97.2/91.8/94.4 |
| PConv | 86.7/80.9/83.7 | 97.1/89.0/92.9 | 98.0/94.7/96.4 |
| NS-FPN | 89.3/80.9/84.9 | 94.6/93.8/94.2 | 98.3/94.7/96.5 |
| SEF-DETR | 92.4/85.9/89.0 | 94.8/97.3/96.1 | 100.0/96.3/98.1 |
6.2. DETR-like Methods Comparison (IRSTD-1k, AI-TOD metrics)
| Method | #Params | GFLOPs | AP | AP₅₀ | AP₇₅ | AP_vt | AP_t | AP_s |
|---|---|---|---|---|---|---|---|---|
| Deformable-DETR | 40.7 M | 56.6 | 31.1 | 76.1 | 18.2 | 23.9 | 45.0 | 45.4 |
| DAB-DETR | 46.5 M | 71.2 | 34.1 | 78.2 | 22.7 | 28.4 | 43.9 | 49.0 |
| DN-DETR | 46.5 M | 71.2 | 34.2 | 77.3 | 21.7 | 27.5 | 45.3 | 53.2 |
| DINO | 45.1 M | 80.9 | 37.1 | 84.5 | 24.3 | 29.6 | 49.9 | 59.0 |
| SEF-DETR | 45.4 M | 81.0 | 38.9 | 86.7 | 27.1 | 32.8 | 50.8 | 56.1 |
SEF-DETR achieves the highest AP overall and on very-tiny objects (AP_vt +3.2% over DINO), with a minimal computational increase (+0.27M parameters, +0.08 GFLOPs). This substantiates its efficacy and efficiency for the IRSTD regime (Liu et al., 6 Jan 2026).
7. Broader Implications
SEF-DETR demonstrates that frequency-domain priors, dynamic embedding enhancement, and fusion-based query selection can systematically mitigate self-attention-based embedding dilution for small-object IR detection. A plausible implication is that similar approaches could be extended to other regimes characterized by extreme object-background imbalance or targets occupying minimal spatial extent. The framework’s modular design also suggests compatibility with other backbone and detection architectures, potentially generalizing the SEF pipeline beyond IRSTD.