
SEF-DETR: Infrared Detection Transformer

Updated 6 February 2026
  • The paper introduces SEF-DETR, which enhances IR small target detection by integrating frequency-guided patch screening, dynamic embedding enhancement, and reliability-consistency fusion.
  • It employs a modified CNN+Transformer architecture—using a DINO-based DETR backbone—to adaptively emphasize target-relevant features in cluttered infrared images.
  • Empirical results on public benchmarks demonstrate state-of-the-art performance with superior precision and minimal computational overhead.

The SEF-DETR (Self-attention Enhanced Feature-DEtection TRansformer) framework is a detection architecture specifically designed to address the challenges of infrared small target detection (IRSTD), where targets present extremely low signal-to-noise ratios, minute spatial extent, and cluttered backgrounds. SEF-DETR extends a CNN+Transformer encoder–decoder pipeline with a three-stage enhancement process: Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF). Together these stages overcome the characteristic embedding dilution of standard DETR in IRSTD tasks. The framework empirically delivers state-of-the-art performance on public benchmarks with marginal computational overhead (Liu et al., 6 Jan 2026).

1. Architecture and Design Principles

SEF-DETR augments a DINO-based DETR backbone (ResNet-50 + encoder–decoder Transformer) for IRSTD. The primary innovation is the SEF pipeline, which systematically distills target-relevant information using:

  1. FPS – identifies potential target locations in the frequency domain.
  2. DEE – amplifies target-associated encoder features in frequency-highlighted regions.
  3. RCF – fuses spatial and frequency cues to select highly reliable decoder queries.

Unlike vanilla DETR, which relies on static, uninformative object queries, SEF-DETR injects target-driven priors into the query initialization by dynamically selecting content-adaptive query embeddings from the enhanced final feature map. This process explicitly addresses the dominance of background noise and the unreliability of standard self-attention in complex IR backgrounds.

2. Core Modules and Mathematical Formulation

2.1 Frequency-guided Patch Screening (FPS)

FPS computes a pixelwise target-relevance density map $S_{\mathrm{freq}}$ using local Fourier analysis:

  • Patch Extraction: The input image $I\in\mathbb{R}^{H\times W}$ is partitioned into overlapping patches $\{P_j\}_{j=1}^{J}$ of size $p\times p$.
  • Fourier Representation: For each patch,

\mathcal{F}_j = \mathrm{FFT}(P_j), \qquad v_j = \mathrm{flatten}(|\mathcal{F}_j|).

  • MLP Scoring: $v_j$ is mapped through a two-layer MLP with LayerNorm and ReLU, followed by a sigmoid-activated linear classifier,

s_j = \sigma\bigl(W_{\mathrm{cls}}\,\mathrm{ReLU}(\mathrm{LN}(W_2 W_1 v_j))\bigr) \in (0, 1).

$s_j$ denotes the patch “target-relevance score,” supervised by a binary cross-entropy frequency loss,

\mathcal{L}_{\mathrm{freq}} = -\frac{1}{J}\sum_{j=1}^{J}\bigl[y_j \log s_j + (1 - y_j)\log(1 - s_j)\bigr],

where $y_j = 1$ if and only if $P_j$ overlaps a ground-truth box.

  • Density Map Reconstruction: The final density $S_{\mathrm{freq}}(x,y)$ is obtained via geometric-mean fusion over all patches covering $(x,y)$:

S_{\mathrm{freq}}(x,y) = \left(\prod_{k:\,(x,y)\in P_k} s_k\right)^{1/n_{xy}},

with $n_{xy} = \#\{k : (x,y)\in P_k\}$ the number of patches covering $(x,y)$.
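The FPS computation above can be sketched as follows. This is a minimal, untrained illustration: the MLP weights are random placeholders for the learned $W_1$, $W_2$, $W_{\mathrm{cls}}$, and patch size, stride, and hidden width are assumed values, not the paper's settings.

```python
import numpy as np

def layer_norm(z, eps=1e-5):
    """Per-vector LayerNorm (no learned affine, for brevity)."""
    return (z - z.mean()) / (z.std() + eps)

def fps_density(image, p=8, stride=4, hidden=32, seed=0):
    """Frequency-guided Patch Screening sketch: score each overlapping
    patch with an MLP on its flattened FFT magnitude, then fuse the
    patch scores into a pixelwise density map via the geometric mean
    over all patches covering each pixel."""
    rng = np.random.default_rng(seed)
    H, W = image.shape
    # Hypothetical random weights standing in for W1, W2, W_cls.
    W1 = rng.standard_normal((hidden, p * p)) * 0.01
    W2 = rng.standard_normal((hidden, hidden)) * 0.01
    Wc = rng.standard_normal(hidden) * 0.01

    log_sum = np.zeros((H, W))   # accumulates log s_k per pixel
    n_xy = np.zeros((H, W))      # number of patches covering each pixel
    for y in range(0, H - p + 1, stride):
        for x in range(0, W - p + 1, stride):
            v = np.abs(np.fft.fft2(image[y:y+p, x:x+p])).ravel()
            h = np.maximum(layer_norm(W2 @ (W1 @ v)), 0.0)  # ReLU(LN(W2 W1 v))
            s = 1.0 / (1.0 + np.exp(-(Wc @ h)))             # sigmoid score s_j
            log_sum[y:y+p, x:x+p] += np.log(s + 1e-12)
            n_xy[y:y+p, x:x+p] += 1.0
    # Geometric mean = exp of the mean log-score over covering patches.
    return np.exp(log_sum / np.maximum(n_xy, 1.0))
```

Accumulating log-scores and exponentiating the average is numerically safer than multiplying many scores in $(0,1)$ directly.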

2.2 Dynamic Embedding Enhancement (DEE)

DEE strengthens multi-scale encoder features to ensure downstream queries retain a target-aware focus:

  • For the multi-scale encoder outputs $\{Q_i\}_{i=1}^{4}$, resize $S_{\mathrm{freq}}$ to each scale, yielding $S^i_{\mathrm{freq}}$.
  • Apply thresholding with a learnable parameter $a\in(0,1)$ to build binary masks $M_i$:

M_i(x,y) = \begin{cases} 1, & S^i_{\mathrm{freq}}(x,y) > a \\ 0, & \text{otherwise} \end{cases}

  • Modulate the encoder features:

Q'_i = Q_i \odot (1 + M_i)

where $\odot$ denotes element-wise multiplication. Regions with $M_i = 1$ are thus boosted by a factor of 2.
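The DEE modulation reduces to a few lines. In this sketch the threshold `a` is a fixed stand-in for the learnable parameter, and the density map is assumed to be already resized to the feature scale:

```python
import numpy as np

def dee_enhance(Q, S_freq_i, a=0.5):
    """Dynamic Embedding Enhancement sketch: build a binary mask from the
    (resized) frequency density map, then modulate the encoder features as
    Q' = Q * (1 + M), doubling features where M = 1."""
    M = (S_freq_i > a).astype(Q.dtype)   # binary mask M_i
    return Q * (1.0 + M)                 # broadcasts over the channel axis
```

For a feature tensor of shape `(channels, H, W)` and a density map of shape `(H, W)`, NumPy broadcasting applies the same mask to every channel.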

2.3 Reliability-Consistency-aware Fusion (RCF)

RCF fuses spatial and frequency domain confidences to select the best query locations for the decoder:

  • From $Q'_4$, derive a spatial confidence map:

S_{\mathrm{spatial}} = \sigma(\mathrm{Conv}_{1\times 1}(Q'_4))

  • For each spatial location $(u,v)$, calculate the consistency $C$ and reliability $R$:

C = 1 - |S_{\mathrm{spatial}}(u,v) - S_{\mathrm{freq}}(u,v)|, \qquad R = 2\,|S_{\mathrm{freq}}(u,v) - 0.5|

  • Fuse confidences:

S_{\mathrm{final}}(u,v) = S_{\mathrm{spatial}}(u,v) \times \bigl[1 + C(1+R)\bigr]

  • Selection: the top-$K$ locations by $S_{\mathrm{final}}$ determine where the $K$ refined queries $q_k = Q'_4(u_k, v_k) + \mathrm{pos\_enc}(u_k, v_k)$ are drawn for the decoder.
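The fusion and top-K selection can be sketched as below. The positional-encoding term is omitted, and the confidence maps are assumed to be precomputed:

```python
import numpy as np

def rcf_select(Q4, S_spatial, S_freq, K=4):
    """Reliability-Consistency-aware Fusion sketch: combine the spatial and
    frequency confidence maps into S_final, then take the top-K locations'
    embeddings from Q4 (shape: channels x H x W) as decoder queries."""
    C = 1.0 - np.abs(S_spatial - S_freq)        # cross-domain consistency
    R = 2.0 * np.abs(S_freq - 0.5)              # frequency reliability
    S_final = S_spatial * (1.0 + C * (1.0 + R))
    flat_idx = np.argsort(S_final, axis=None)[::-1][:K]   # top-K by score
    us, vs = np.unravel_index(flat_idx, S_final.shape)
    queries = Q4[:, us, vs].T                   # shape (K, channels)
    return queries, np.stack([us, vs], axis=1)
```

Note that a location where both maps agree and the frequency score is far from 0.5 gets the largest multiplicative boost, which is exactly the intended preference for confident, consistent evidence.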

3. Modified Query Initialization

SEF-DETR’s dynamic query initialization diverges fundamentally from the static query schemes of DETR. Rather than image-agnostic learnable vectors, SEF-DETR uses the SEF-enhanced feature tensor $Q'_4$ together with the spatial/frequency fusion score to select the top locations, whose embeddings serve as decoder queries. The procedure comprises:

  • Enhancement of multi-scale encoder features using $S_{\mathrm{freq}}$ and the learnable threshold $a$.
  • Computation of the spatial confidence $S_{\mathrm{spatial}}$ and the fused score $S_{\mathrm{final}}$.
  • Selection of the top-$K$ embeddings from $Q'_4$ at locations maximizing $S_{\mathrm{final}}$, with positional encodings added.
  • Feeding these content-adaptive queries into a standard Transformer decoder for detection.

The method thus establishes image- and context-adaptive object queries, directly addressing embedding dilution and enhancing localization reliability for small targets.

4. Training Objective and Optimization

SEF-DETR utilizes a composite loss comprising the DETR-style Hungarian set prediction loss and the FPS frequency loss:

\mathcal{L} = \mathcal{L}_{\mathrm{Hungarian}} + \lambda\,\mathcal{L}_{\mathrm{freq}}, \qquad \lambda = 2,

where:

  • $\mathcal{L}_{\mathrm{Hungarian}}$ computes the optimal box/class assignment via the Hungarian algorithm, combining classification log-probabilities with a sum of the $\ell_1$ distance and the GIoU loss,

\mathcal{L}_{\mathrm{Hungarian}} = \sum_{i=1}^{N_q} \Bigl[ -\log p_{\hat\sigma(i)}(c^*_i) + \mathbf{1}_{\{c^*_i \neq \varnothing\}} \bigl( \|b_i - b^*_i\|_1 + \mathcal{L}_{\mathrm{GIoU}}(b_i, b^*_i) \bigr) \Bigr]

  • $\mathcal{L}_{\mathrm{freq}}$ supervises the binary patch-relevance classification in the FPS branch.

Training is fully end-to-end, with all modules differentiable and integrated within the overall network graph.
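The composite objective can be illustrated with the sketch below. The BCE term follows the $\mathcal{L}_{\mathrm{freq}}$ definition; the assignment step is a brute-force stand-in for the Hungarian algorithm (viable only for tiny problems), and `hungarian_loss` is treated as an already-computed scalar:

```python
import itertools
import numpy as np

def freq_loss(s, y):
    """BCE frequency loss over patch scores (the FPS supervision)."""
    s = np.clip(s, 1e-7, 1.0 - 1e-7)
    return -np.mean(y * np.log(s) + (1.0 - y) * np.log(1.0 - s))

def optimal_assignment(cost):
    """Brute-force optimal query-to-target assignment for tiny square cost
    matrices; a stand-in for the Hungarian algorithm used in practice."""
    n = cost.shape[0]
    return min(itertools.permutations(range(n)),
               key=lambda perm: sum(cost[i, perm[i]] for i in range(n)))

def total_loss(hungarian_loss, s, y, lam=2.0):
    """Composite objective L = L_Hungarian + lambda * L_freq, lambda = 2."""
    return hungarian_loss + lam * freq_loss(s, y)
```

In a real implementation the matching would use an O(n³) solver such as `scipy.optimize.linear_sum_assignment` rather than enumerating permutations.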

5. Inference Pipeline

Inference in SEF-DETR proceeds as follows:

  1. FPS: compute $S_{\mathrm{freq}}$ from the input IR image.
  2. DEE: apply the learned threshold $a$ to obtain enhanced encoder features $Q'_i$.
  3. RCF: construct $S_{\mathrm{spatial}}$; compute and fuse it with $S_{\mathrm{freq}}$ to obtain $S_{\mathrm{final}}$.
  4. Select the top-$K$ query embeddings from $Q'_4$ to form the decoder queries.
  5. Apply the Transformer decoder to produce box deltas and class logits.
  6. Apply a softmax over classes; either threshold the outputs or select the top-$M$ predictions. No non-maximum suppression (NMS) is applied.
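The NMS-free post-processing in step 6 can be sketched as follows, assuming the DETR convention that the last logit column is the "no object" class:

```python
import numpy as np

def decode_predictions(logits, boxes, top_m=3):
    """Post-processing sketch: per-query softmax over class logits (last
    column assumed to be the "no object" class), then keep the top-M
    scoring predictions directly; no NMS is applied."""
    z = logits - logits.max(axis=1, keepdims=True)   # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    scores = probs[:, :-1].max(axis=1)               # best foreground probability
    labels = probs[:, :-1].argmax(axis=1)
    keep = np.argsort(scores)[::-1][:top_m]          # top-M queries by score
    return scores[keep], labels[keep], boxes[keep]
```

Because each query predicts at most one object under the set-prediction formulation, duplicate suppression is unnecessary, which is why the pipeline omits NMS.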

6. Empirical Results and Comparative Evaluation

SEF-DETR sets new benchmarks for IRSTD tasks across three public datasets (IRSTD-1k, NUAA-SIRST, NUDT-SIRST).

6.1. CNN-based Methods Comparison

Method IRSTD-1k (P/R/F1) NUAA-SIRST (P/R/F1) NUDT-SIRST (P/R/F1)
MDvsFA 55.0/48.3/47.5 84.5/50.7/59.7 60.8/19.2/26.2
AGPCNet 41.5/47.0/44.1 39.0/81.0/52.7 36.8/68.4/47.9
ACM 67.9/60.5/64.0 76.5/76.2/76.3 73.2/74.5/73.8
ISNet 71.8/74.1/72.9 82.0/84.7/83.4 74.2/83.4/78.5
ACLNet 84.3/65.6/73.8 84.8/78.0/81.3 86.8/77.2/81.7
DNANet 76.8/72.1/74.4 84.7/83.6/84.1 91.4/88.9/90.1
EFLNet 87.0/81.7/84.3 88.2/85.8/87.0 96.3/93.1/94.7
YOLOv8m 85.7/79.5/82.5 93.4/88.4/90.8 97.2/91.8/94.4
PConv 86.7/80.9/83.7 97.1/89.0/92.9 98.0/94.7/96.4
NS-FPN 89.3/80.9/84.9 94.6/93.8/94.2 98.3/94.7/96.5
SEF-DETR 92.4/85.9/89.0 94.8/97.3/96.1 100.0/96.3/98.1

6.2. DETR-like Methods Comparison (IRSTD-1k, AI-TOD metrics)

Method #Params GFLOPs AP AP₅₀ AP₇₅ AP_vt AP_t AP_s
Deformable-DETR 40.7 M 56.6 31.1 76.1 18.2 23.9 45.0 45.4
DAB-DETR 46.5 M 71.2 34.1 78.2 22.7 28.4 43.9 49.0
DN-DETR 46.5 M 71.2 34.2 77.3 21.7 27.5 45.3 53.2
DINO 45.1 M 80.9 37.1 84.5 24.3 29.6 49.9 59.0
SEF-DETR 45.4 M 81.0 38.9 86.7 27.1 32.8 50.8 56.1

SEF-DETR achieves the highest AP overall and on very-tiny objects (AP_vt +3.2% over DINO), with a minimal computational increase (+0.27M parameters, +0.08 GFLOPs). This substantiates its efficacy and efficiency for the IRSTD regime (Liu et al., 6 Jan 2026).

7. Broader Implications

SEF-DETR demonstrates that frequency-domain priors, dynamic embedding enhancement, and fusion-based query selection can systematically mitigate self-attention-based embedding dilution for small-object IR detection. A plausible implication is that similar approaches could be extended to other regimes characterized by extreme object-background imbalance or targets occupying minimal spatial extent. The framework’s modular design also suggests compatibility with other backbone and detection architectures, potentially generalizing the SEF pipeline beyond IRSTD.
