SEF-DETR: Infrared Detection Transformer
- The paper introduces SEF-DETR, which enhances IR small target detection by integrating frequency-guided patch screening, dynamic embedding enhancement, and reliability-consistency fusion.
- It employs a modified CNN+Transformer architecture—using a DINO-based DETR backbone—to adaptively emphasize target-relevant features in cluttered infrared images.
- Empirical results on public benchmarks demonstrate state-of-the-art performance with superior precision and minimal computational overhead.
The SEF-DETR (Self-attention Enhanced Feature-DEtection TRansformer) framework is a detection architecture specifically designed to address the challenges of infrared small target detection (IRSTD), where targets present extremely low signal-to-noise ratios, minute object scale, and cluttered backgrounds. SEF-DETR extends a CNN+Transformer encoder–decoder pipeline, integrating a three-stage enhancement process—Frequency-guided Patch Screening (FPS), Dynamic Embedding Enhancement (DEE), and Reliability-Consistency-aware Fusion (RCF)—to overcome the characteristic embedding dilution in standard DETR for IRSTD tasks. The framework empirically delivers state-of-the-art performance on public benchmarks with marginal computational overhead (Liu et al., 6 Jan 2026).
1. Architecture and Design Principles
SEF-DETR augments a DINO-based DETR backbone (ResNet-50 + encoder–decoder Transformer) for IRSTD. The primary innovation is the SEF pipeline, which systematically distills target-relevant information using:
- FPS – identifies potential target locations in the frequency domain.
- DEE – amplifies target-associated encoder features in frequency-highlighted regions.
- RCF – fuses spatial and frequency cues to select highly reliable decoder queries.
Unlike vanilla DETR, which relies on static, uninformative object queries, SEF-DETR injects target-driven priors into the query initialization by dynamically selecting content-adaptive query embeddings from the enhanced final feature map. This process explicitly addresses the dominance of background noise and the unreliability of standard self-attention in complex IR backgrounds.
2. Core Modules and Mathematical Formulation
2.1 Frequency-guided Patch Screening (FPS)
FPS computes a pixelwise target-relevance density map using local Fourier analysis:
- Patch Extraction: The input image is partitioned into overlapping patches P_i of fixed size s × s.
- Fourier Representation: For each patch, the 2D Fourier transform is computed and its magnitude spectrum F_i serves as the patch descriptor.
- MLP Scoring: F_i is mapped through a two-layer MLP with LayerNorm+ReLU and a sigmoid-activated linear classifier; the output r_i ∈ (0, 1) denotes the patch “target-relevance score,” supervised by a binary cross-entropy frequency loss L_freq, where the patch label is 1 iff P_i overlaps a ground-truth box.
- Density Map Reconstruction: The final density D(x, y) is obtained via geometric-mean fusion over all patches covering (x, y): D(x, y) = (∏_{i : (x, y) ∈ P_i} r_i)^{1/n(x, y)}, with n(x, y) the number of patches covering (x, y).
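The geometric-mean density reconstruction can be sketched as follows. The `score_fn` below is a hypothetical stand-in for the paper's two-layer MLP scorer (here, a sigmoid of the patch's mean high-frequency Fourier magnitude), and the patch size and stride are illustrative, not the paper's values:

```python
import numpy as np

def fps_density(image, patch_size=8, stride=4, score_fn=None):
    """Sketch of FPS density reconstruction: score each overlapping patch
    from its Fourier magnitude, then fuse per-pixel scores by geometric
    mean over all patches covering that pixel."""
    H, W = image.shape
    if score_fn is None:
        # Hypothetical stand-in for the MLP scorer: sigmoid of the mean
        # Fourier magnitude with the DC component removed.
        def score_fn(patch):
            mag = np.abs(np.fft.fft2(patch))
            mag[0, 0] = 0.0  # drop the DC component
            return float(1.0 / (1.0 + np.exp(-mag.mean())))
    log_sum = np.zeros((H, W))
    count = np.zeros((H, W))
    for y in range(0, H - patch_size + 1, stride):
        for x in range(0, W - patch_size + 1, stride):
            s = score_fn(image[y:y + patch_size, x:x + patch_size])
            log_sum[y:y + patch_size, x:x + patch_size] += np.log(s + 1e-8)
            count[y:y + patch_size, x:x + patch_size] += 1
    count = np.maximum(count, 1)  # pixels covered by no patch stay neutral
    return np.exp(log_sum / count)  # geometric mean of covering-patch scores
```

Accumulating log-scores and exponentiating the mean is numerically safer than multiplying many scores in (0, 1) directly.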
2.2 Dynamic Embedding Enhancement (DEE)
DEE strengthens multi-scale encoder features to ensure downstream queries retain a target-aware focus:
- For multi-scale encoder outputs {E_l}, resize the density map D to D_l to match each scale l.
- Apply thresholding with a learnable parameter τ to build binary masks M_l: M_l(x, y) = 1 if D_l(x, y) > τ, else 0.
- Modulate encoder features: Ẽ_l = E_l ⊙ (1 + M_l), where ⊙ denotes element-wise multiplication. Regions with M_l(x, y) = 1 are thus boosted 2×.
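A minimal sketch of the DEE step, assuming nearest-neighbour resizing of the density map and a fixed stand-in value for the learnable threshold τ:

```python
import numpy as np

def dee_enhance(feat, density, tau=0.5):
    """Sketch of Dynamic Embedding Enhancement: resize the FPS density map
    to the feature resolution, threshold it with tau (learnable in the
    paper, fixed here), and double the features inside the binary mask."""
    h, w = feat.shape[-2:]
    # Nearest-neighbour resize of the density map to the feature scale.
    ys = (np.arange(h) * density.shape[0] / h).astype(int)
    xs = (np.arange(w) * density.shape[1] / w).astype(int)
    d = density[np.ix_(ys, xs)]
    mask = (d > tau).astype(feat.dtype)  # binary mask M_l
    return feat * (1.0 + mask)           # masked regions boosted 2x
```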
2.3 Reliability-Consistency-aware Fusion (RCF)
RCF fuses spatial and frequency domain confidences to select the best query locations for the decoder:
- From the enhanced encoder features Ẽ, derive a spatial confidence map C_s.
- For each spatial location (x, y), read the frequency confidence from the resized density map D.
- Fuse confidences: combine C_s(x, y) with the frequency confidence into a final reliability score C_f(x, y).
- Selection: The top-K locations by C_f define where refined queries are drawn for the decoder.
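The fusion-and-selection step might be illustrated as below; the geometric-mean fusion rule is an assumption for illustration, not necessarily the paper's exact formula:

```python
import numpy as np

def rcf_select(spatial_conf, density, k=4):
    """Sketch of Reliability-Consistency-aware Fusion: combine the spatial
    confidence map with the frequency density (geometric mean assumed here
    as the fusion rule) and return the top-k query locations."""
    fused = np.sqrt(spatial_conf * density)
    flat = fused.ravel()
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(flat[idx])[::-1]]  # sort the top-k by fused score
    ys, xs = np.unravel_index(idx, fused.shape)
    return list(zip(ys.tolist(), xs.tolist())), fused
```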
3. Modified Query Initialization
SEF-DETR’s dynamic query initialization diverges fundamentally from static query approaches in DETR. Rather than image-agnostic learnable vectors, SEF-DETR uses the SEF-enhanced feature tensors and spatial/frequency fusion to select the top-scoring locations and draws decoder queries from them. The procedural steps comprise:
- Enhancement of multi-scale encoder features using the density map D and threshold τ.
- Computation of the spatial confidence C_s and the final fused score C_f.
- Selection of the top-K embeddings from the enhanced features at locations maximizing C_f, augmented with positional encodings.
- Feeding these content-adaptive queries into a standard Transformer decoder for detection.
The method thus establishes image- and context-adaptive object queries, directly addressing embedding dilution and enhancing localization reliability for small targets.
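Under the assumptions above, the query-initialization step reduces to gathering feature vectors at the selected locations; the sinusoidal positional encoding below is a simplified stand-in for the paper's:

```python
import numpy as np

def init_queries(enhanced_feat, locations):
    """Sketch of content-adaptive query initialization: gather embeddings
    from the enhanced feature map (C, H, W) at the top-scoring locations
    and append a toy sinusoidal positional encoding."""
    queries = []
    for y, x in locations:
        content = enhanced_feat[:, y, x]           # content embedding
        pos = np.array([np.sin(y / 10.0), np.cos(x / 10.0)])  # toy pos-enc
        queries.append(np.concatenate([content, pos]))
    return np.stack(queries)  # shape (num_queries, C + 2)
```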
4. Training Objective and Optimization
SEF-DETR utilizes a composite loss that sums the DETR-style Hungarian set prediction loss L_Hungarian and the FPS frequency loss L_freq, where:
- L_Hungarian computes the optimal box/class assignment using the Hungarian algorithm, combining class log-probabilities with a sum of ℓ1 box-distance and GIoU terms,
- L_freq supervises the binary patch-relevance classification in the FPS branch.
Training is fully end-to-end, with all modules differentiable and integrated within the overall network graph.
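The FPS frequency loss is a standard binary cross-entropy over patch relevance scores; a minimal sketch (the Hungarian set prediction term is omitted for brevity):

```python
import numpy as np

def frequency_bce(scores, labels, eps=1e-8):
    """Sketch of the FPS frequency loss: binary cross-entropy between patch
    relevance scores and labels (1 iff the patch overlaps a GT box)."""
    scores = np.clip(scores, eps, 1 - eps)  # guard the logs
    return float(-np.mean(labels * np.log(scores)
                          + (1 - labels) * np.log(1 - scores)))
```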
5. Inference Pipeline
Inference in SEF-DETR proceeds as follows:
- FPS: Compute the density map D from the input IR image.
- DEE: Apply the learned threshold τ to enhance the encoder output features E_l.
- RCF: Construct the spatial confidence map C_s; fuse it with the frequency density to obtain C_f.
- Select the top-K query embeddings from the enhanced features at the highest-scoring locations to form decoder queries.
- Apply the Transformer decoder to predict box deltas and class logits.
- Softmax over classes; either threshold the outputs or select the top-K predictions. No non-maximum suppression (NMS) is applied.
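The NMS-free post-processing reduces to a softmax followed by top-k selection over queries; a minimal sketch:

```python
import numpy as np

def postprocess(logits, boxes, top_k=3):
    """Sketch of NMS-free post-processing: softmax the class logits per
    query, then keep the top-k (score, class, box) triples overall."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    probs = e / e.sum(axis=-1, keepdims=True)
    scores = probs.max(axis=-1)
    classes = probs.argmax(axis=-1)
    order = np.argsort(scores)[::-1][:top_k]  # highest-scoring queries first
    return [(float(scores[i]), int(classes[i]), boxes[i]) for i in order]
```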
6. Empirical Results and Comparative Evaluation
SEF-DETR sets new benchmarks for IRSTD tasks across three public datasets (IRSTD-1k, NUAA-SIRST, NUDT-SIRST).
6.1. CNN-based Methods Comparison
| Method | IRSTD-1k (P/R/F1) | NUAA-SIRST (P/R/F1) | NUDT-SIRST (P/R/F1) |
|---|---|---|---|
| MDvsFA | 55.0/48.3/47.5 | 84.5/50.7/59.7 | 60.8/19.2/26.2 |
| AGPCNet | 41.5/47.0/44.1 | 39.0/81.0/52.7 | 36.8/68.4/47.9 |
| ACM | 67.9/60.5/64.0 | 76.5/76.2/76.3 | 73.2/74.5/73.8 |
| ISNet | 71.8/74.1/72.9 | 82.0/84.7/83.4 | 74.2/83.4/78.5 |
| ACLNet | 84.3/65.6/73.8 | 84.8/78.0/81.3 | 86.8/77.2/81.7 |
| DNANet | 76.8/72.1/74.4 | 84.7/83.6/84.1 | 91.4/88.9/90.1 |
| EFLNet | 87.0/81.7/84.3 | 88.2/85.8/87.0 | 96.3/93.1/94.7 |
| YOLOv8m | 85.7/79.5/82.5 | 93.4/88.4/90.8 | 97.2/91.8/94.4 |
| PConv | 86.7/80.9/83.7 | 97.1/89.0/92.9 | 98.0/94.7/96.4 |
| NS-FPN | 89.3/80.9/84.9 | 94.6/93.8/94.2 | 98.3/94.7/96.5 |
| SEF-DETR | 92.4/85.9/89.0 | 94.8/97.3/96.1 | 100.0/96.3/98.1 |
6.2. DETR-like Methods Comparison (IRSTD-1k, AI-TOD metrics)
| Method | #Params | GFLOPs | AP | AP₅₀ | AP₇₅ | AP_vt | AP_t | AP_s |
|---|---|---|---|---|---|---|---|---|
| Deformable-DETR | 40.7 M | 56.6 | 31.1 | 76.1 | 18.2 | 23.9 | 45.0 | 45.4 |
| DAB-DETR | 46.5 M | 71.2 | 34.1 | 78.2 | 22.7 | 28.4 | 43.9 | 49.0 |
| DN-DETR | 46.5 M | 71.2 | 34.2 | 77.3 | 21.7 | 27.5 | 45.3 | 53.2 |
| DINO | 45.1 M | 80.9 | 37.1 | 84.5 | 24.3 | 29.6 | 49.9 | 59.0 |
| SEF-DETR | 45.4 M | 81.0 | 38.9 | 86.7 | 27.1 | 32.8 | 50.8 | 56.1 |
SEF-DETR achieves the highest AP overall and on very-tiny objects (AP_vt +3.2% over DINO), with a minimal computational increase (+0.27M parameters, +0.08 GFLOPs). This substantiates its efficacy and efficiency for the IRSTD regime (Liu et al., 6 Jan 2026).
7. Broader Implications
SEF-DETR demonstrates that frequency-domain priors, dynamic embedding enhancement, and fusion-based query selection can systematically mitigate self-attention-based embedding dilution for small-object IR detection. A plausible implication is that similar approaches could be extended to other regimes characterized by extreme object-background imbalance or targets occupying minimal spatial extent. The framework’s modular design also suggests compatibility with other backbone and detection architectures, potentially generalizing the SEF pipeline beyond IRSTD.