FeedbackSTS-Det: Infrared Target Detection
- The paper introduces a novel closed-loop spatio-temporal semantic feedback mechanism that enhances infrared small target detection in challenging conditions.
- It employs a 3D Res-UNet backbone with paired forward and backward refinement modules and an Embedded Sparse Semantic Module to capture long-range dependencies.
- Experimental results on benchmark datasets demonstrate state-of-the-art accuracy, effective false alarm suppression, and reduced computational cost.
FeedbackSTS-Det is a sparse-frame-based spatio-temporal semantic feedback network designed for infrared small target detection (ISTD) under complex backgrounds (Huang et al., 21 Jan 2026). It addresses three challenges: extremely low signal-to-clutter ratio, persistent dynamic interference, and ambiguous target features. To do so, it integrates a closed-loop spatio-temporal semantic feedback mechanism. The architecture features a 3D Res-UNet backbone augmented with paired forward and backward refinement modules and an Embedded Sparse Semantic Module (SSM) that captures long-range temporal dependencies with reduced computational burden. Experiments show state-of-the-art accuracy and robustness in multi-frame ISTD scenarios, particularly in suppressing false alarms.
1. Architecture and Core Design Elements
FeedbackSTS-Det is built upon a 3D Res-UNet backbone with five encoder stages (conv_1 to conv_5) and four decoder stages (dec_conv_1 to dec_conv_4). It employs a base channel width of 8, increasing to 128 at the deepest level to balance computational tractability with expressive capacity. The overall pipeline operates on fixed-length sliding windows of consecutive frames, a strategy maintained identically during both training and inference.
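The two design figures above can be illustrated with a small Python sketch. It assumes that the channel width doubles at every encoder stage (which is consistent with a base of 8 and a deepest width of 128 over five stages) and that the sliding window advances one frame at a time. The function names, the stride of 1, and the 13-frame example sequence are illustrative assumptions, not details taken from the paper.

```python
def encoder_channels(base=8, stages=5):
    """Channel width per encoder stage, assuming the width doubles at each stage."""
    return [base * 2 ** i for i in range(stages)]


def sliding_windows(num_frames, window, stride=1):
    """Return (start, end) index pairs of fixed-length windows over a frame sequence."""
    if window > num_frames:
        return []  # sequence too short for even one window
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


print(encoder_channels())        # [8, 16, 32, 64, 128]
print(sliding_windows(13, 11))   # [(0, 11), (1, 12), (2, 13)]
```

Using the same windowing function for training and inference is one simple way to keep the two pipelines identical.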
At each encoder stage, a Forward Spatio-Temporal Semantic Refinement Module (FSTSRM) replaces standard 3D-conv blocks, while each decoder stage employs a paired Backward Spatio-Temporal Semantic Refinement Module (BSTSRM) fused with the corresponding encoder output. This architecture supports continuous semantic feedback across depth, with each decoder not only receiving information from deeper levels but also iteratively refining semantic cues provided by the encoder.
2. Closed-Loop Spatio-Temporal Semantic Feedback Mechanism
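Central to FeedbackSTS-Det is a closed-loop semantic association scheme established by pairing an FSTSRM with a BSTSRM at each encoder-decoder level. For the input tensor of a given encoder stage:
- FSTSRM splits the input into a context branch and a propagation branch:
  - the context branch computes the preserved context,
  - the propagation branch propagates features forward along the temporal axis,
  - the outputs of the two branches are fused to form the module output.
- BSTSRM, at the matching decoder level, receives the decoder features fused with the corresponding encoder output:
  - it computes the preserved context in the same way as FSTSRM,
  - its reversed propagation branch applies the DR operator, which reverses the temporal order,
  - the outputs of the two branches are again fused to form the module output.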
The encoder forwards semantic cues to the decoder, while the decoder refines and propagates them backward, forming a closed semantic loop that iteratively strengthens true target features and suppresses background-induced errors.
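The difference between forward and time-reversed propagation can be seen in a deliberately simplified Python sketch. Everything in it is a stand-in: each frame is reduced to a single scalar "feature", `propagate` is a hypothetical running blend rather than the learned alignment used in the network, and additive fusion of the two branches is an assumption. It shows only the pattern of reversing the temporal order, propagating, and restoring the order.

```python
def propagate(frames):
    """Toy temporal propagation: blend each frame with the running state from earlier frames."""
    out, state = [], None
    for x in frames:
        state = x if state is None else 0.5 * state + 0.5 * x
        out.append(state)
    return out


def reverse_time(frames):
    """Stand-in for the temporal reversal operator: flip the frame order."""
    return frames[::-1]


def forward_refine(frames):
    """Forward module sketch: context branch plus forward-propagated branch."""
    context = list(frames)              # context kept unchanged (assumption)
    propagated = propagate(frames)      # information flows from past to future
    return [c + p for c, p in zip(context, propagated)]  # additive fusion (assumption)


def backward_refine(frames):
    """Backward module sketch: same context branch, propagation in reversed time."""
    context = list(frames)
    # Reverse, propagate, then restore the original order: future informs past.
    propagated = reverse_time(propagate(reverse_time(frames)))
    return [c + p for c, p in zip(context, propagated)]


frames = [0.0, 0.0, 1.0, 0.0]    # a target response appears only at t = 2
print(forward_refine(frames))    # [0.0, 0.0, 1.5, 0.25]
print(backward_refine(frames))   # [0.125, 0.25, 1.5, 0.0]
```

In this toy setting the forward pass spreads the response to later frames and the backward pass spreads it to earlier frames, so applying both lets every frame draw on cues from both temporal directions.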
3. Embedded Sparse Semantic Module (SSM) and Temporal Modeling
The SSM is designed to capture long-range temporal dependencies efficiently through structured sparse grouping, intra-group alignment, and temporal reassembly:
- Sparse grouping: The input frames are partitioned into disjoint groups according to a sampling interval.
- Intra-group propagation: Each group is processed by a Basic Feedback Module (BFBM) composed of a lightweight two-layer pyramid feature extractor (FE) and a corresponding pyramid deformable alignment (FA) stage, which propagates features between the frames of the group.
- Temporal reassembly: The aligned group outputs are merged and sorted back into temporal order to form the output sequence.
This approach reduces the temporal modeling complexity relative to fully connected temporal interactions, and empirical results identify a sampling interval that balances accuracy and computational efficiency. The grouping mechanism inherently filters transient noise, removing the need for explicit sparsity regularization.
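The grouping and reassembly steps can be sketched in a few lines of Python. The sketch assumes that groups are formed by strided sampling, so that frames whose indices share the same remainder modulo the interval land in the same group; this is one natural reading of "sampling interval" and is not confirmed by the text above. The function names and the string placeholders for per-frame outputs are hypothetical.

```python
def sparse_groups(num_frames, interval):
    """Partition frame indices into `interval` disjoint strided groups (assumed scheme)."""
    return [list(range(k, num_frames, interval)) for k in range(interval)]


def reassemble(groups, outputs):
    """Merge per-group outputs and sort them back into temporal order."""
    indexed = [
        (idx, out)
        for group, group_out in zip(groups, outputs)
        for idx, out in zip(group, group_out)
    ]
    return [out for _, out in sorted(indexed, key=lambda pair: pair[0])]


groups = sparse_groups(11, 2)
print(groups)  # [[0, 2, 4, 6, 8, 10], [1, 3, 5, 7, 9]]

# Placeholder for intra-group processing: tag each frame with its index.
outputs = [[f"f{i}" for i in group] for group in groups]
print(reassemble(groups, outputs))
```

With an 11-frame window and an interval of 2, each group spans the full temporal extent of the window while containing only about half of the frames.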
4. Training Objectives and Pipeline Consistency
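The model is optimized end-to-end using a Soft-IoU loss, which in its standard form reads
$$\mathcal{L}_{\text{SoftIoU}} = 1 - \frac{\sum_i p_i\, g_i}{\sum_i \left(p_i + g_i - p_i\, g_i\right)},$$
where $p$ denotes the predicted foreground mask, $g$ the ground truth, and $i$ indexes pixels.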
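A minimal Python sketch of a Soft-IoU loss follows, using the common definition of one minus the ratio of soft intersection to soft union. The flattened-list inputs and the small stabilizing constant `eps` are assumptions made for illustration; the exact variant used by the authors may differ.

```python
def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss over flattened masks with values in [0, 1].

    `eps` avoids division by zero when both masks are empty (an assumption,
    not a detail from the paper).
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    union = sum(p + g - p * g for p, g in zip(pred, target))
    return 1.0 - (intersection + eps) / (union + eps)


pred = [1.0, 0.0, 1.0, 0.0]     # two predicted foreground pixels
target = [1.0, 0.0, 0.0, 0.0]   # one true foreground pixel
print(round(soft_iou_loss(pred, target), 3))  # 0.5
```

Because the loss is computed from soft probabilities rather than thresholded masks, it remains differentiable, which matters when targets occupy only a handful of pixels.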
Training employs a MultiStepLR schedule, which reduces the learning rate at preset epoch milestones. Importantly, FeedbackSTS-Det maintains a fully consistent pipeline between training and inference: both use the same fixed-length sliding window over frames and identical instantiations of SSM, FSTSRM, and BSTSRM. This consistency is critical for robust temporal behavior and prevents error accumulation or drift during inference.
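The behavior of a multi-step schedule can be reproduced in plain Python: the learning rate is multiplied by a decay factor once for every milestone that has been reached. The base rate, milestones, and decay factor below are hypothetical values chosen only for illustration; they are not the settings reported for FeedbackSTS-Det.

```python
from bisect import bisect_right


def multistep_lr(base_lr, milestones, gamma, epoch):
    """Learning rate at `epoch`: base_lr decayed by gamma once per milestone reached."""
    return base_lr * gamma ** bisect_right(sorted(milestones), epoch)


# Hypothetical settings, for illustration only.
for epoch in (0, 19, 20, 40):
    lr = multistep_lr(base_lr=1e-3, milestones=[20, 40], gamma=0.1, epoch=epoch)
    print(epoch, f"{lr:.1e}")
```

The rate stays constant between milestones and drops by the factor `gamma` exactly at each milestone epoch.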
5. Experimental Evaluation and Comparative Analysis
Experiments were conducted on the NUDT-MIRSDT (120 sequences, 12,000 images) and IRSatVideo-LEO (200 sequences, 91,022 images) datasets. Evaluation uses pixel-level metrics (including mIoU and $F_1$) and object-level metrics (detection rate $P_d$ and ROC/AUC). On NUDT-MIRSDT, FeedbackSTS-Det (T=2, 11-frame input) achieved:
- mIoU = 52.24%
- $F_1$ = 68.63%
- $P_d$ = 97.41%
- AUC = 99.38%
Analogous gains were observed on IRSatVideo-LEO ($P_d$ up to 96.48%, AUC up to 98.16%), substantially outperforming prior model-based and deep learning approaches.
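As a quick consistency check on the NUDT-MIRSDT figures, the reported 68.63% agrees with the F1 score implied by the reported IoU of 52.24%. The identity F1 = 2·IoU / (1 + IoU) holds when both metrics are computed from the same true-positive, false-positive, and false-negative counts; whether the paper aggregates its metrics this way is an assumption here.

```python
def f1_from_iou(iou):
    """F1 (Dice) score implied by an IoU computed from the same TP/FP/FN counts."""
    return 2 * iou / (1 + iou)


print(round(100 * f1_from_iou(0.5224), 2))  # 68.63
```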
Ablation studies confirm the efficacy of the closed-loop feedback: variants that omit encoder or decoder feedback, or that use only forward or only backward feedback, consistently underperform the complete "Full-FB" configuration. Additionally, increasing the SSM interval yields comparable accuracy while reducing computation by up to 80% for long sequences. Naive 3D conversions of 2D networks either fail to converge or require excessive FLOPs. In contrast, FeedbackSTS-Det-B8 converges reliably with only 5.7M parameters and ~30G FLOPs.
6. False Alarm Mitigation and Robustness
The paired FSTSRM and BSTSRM design, incorporating temporal reversal via the DR operator, enables forward propagation of coarse target cues and reverse refinement to filter transient clutter. SSM's sparse temporal grouping naturally suppresses short-lived background phenomena (clouds, waves), emphasizing targets that exhibit consistent spatio-temporal trajectories. The uniformity of the training-inference sliding window ensures that learned registration and feature fusion behaviors generalize robustly in operational settings, minimizing error accumulation.
7. Significance and Distinctive Contributions
FeedbackSTS-Det establishes a framework for state-of-the-art infrared small target detection by integrating:
- Closed-loop spatio-temporal semantic feedback for progressive refinement,
- Structured sparse temporal modeling via SSM for long-range dependency capture,
- A consistent sliding-window pipeline that avoids mismatches between training and inference,
- Efficient parameterization (5.7M parameters, moderate FLOPs) with substantial robustness and false-alarm suppression,
- Superior empirical performance on benchmark ISTD datasets, with detection rates up to 97.41% and AUC up to 99.38% at low false-alarm rates.
These aspects collectively position FeedbackSTS-Det as a significant advance in robust multi-frame ISTD under complex dynamic backgrounds (Huang et al., 21 Jan 2026).