
FeedbackSTS-Det: Infrared Target Detection

Updated 28 January 2026
  • The paper introduces a novel closed-loop spatio-temporal semantic feedback mechanism that enhances infrared small target detection in challenging conditions.
  • It employs a 3D Res-UNet backbone with paired forward and backward refinement modules and an Embedded Sparse Semantic Module to capture long-range dependencies.
  • Experimental results on benchmark datasets demonstrate state-of-the-art accuracy, effective false alarm suppression, and reduced computational cost.

FeedbackSTS-Det is a sparse frames-based spatio-temporal semantic feedback network designed for infrared small target detection (ISTD) under complex backgrounds (Huang et al., 21 Jan 2026). It addresses the challenges of extremely low signal-to-clutter ratio, persistent dynamic interference, and ambiguous target features by integrating a novel closed-loop spatio-temporal semantic feedback mechanism. The architecture features a 3D Res-UNet backbone augmented with paired forward and backward refinement modules and an Embedded Sparse Semantic Module (SSM) that captures long-range temporal dependencies with reduced computational burden. Experimental evidence demonstrates state-of-the-art performance in both accuracy and robustness, particularly in suppressing false alarms, by advancing spatio-temporal semantic modeling for multi-frame ISTD scenarios.

1. Architecture and Core Design Elements

FeedbackSTS-Det is built upon a 3D Res-UNet backbone with five encoder stages (conv_1 to conv_5) and four decoder stages (dec_conv_1 to dec_conv_4). It employs a base channel width of 8, increasing to 128 at the deepest level to balance computational tractability with expressive capacity. The overall pipeline operates on fixed-length sliding windows of $D$ consecutive frames, a strategy maintained identically during both training and inference.
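The channel progression and windowing described above can be sketched in a few lines. This is only an illustration: the per-stage doubling is an assumption (it is consistent with a width of 8 reaching 128 over five stages, but the source does not state the intermediate widths), and the window stride of 1 and the helper names `encoder_widths` and `sliding_windows` are hypothetical.

```python
def encoder_widths(base=8, stages=5):
    """Channel width per encoder stage, assuming the width doubles at each stage."""
    return [base * 2 ** i for i in range(stages)]


def sliding_windows(num_frames, window, stride=1):
    """(start, end) index pairs of fixed-length windows; stride=1 is an assumption."""
    return [(s, s + window) for s in range(0, num_frames - window + 1, stride)]


print(encoder_widths())         # [8, 16, 32, 64, 128]
print(sliding_windows(13, 11))  # [(0, 11), (1, 12), (2, 13)]
```

The same `sliding_windows` call would be used to cut clips at training time and at inference time, which is the consistency property the text emphasizes.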

At each encoder stage, a Forward Spatio-Temporal Semantic Refinement Module (FSTSRM) replaces standard 3D-conv blocks, while each decoder stage employs a paired Backward Spatio-Temporal Semantic Refinement Module (BSTSRM) fused with the corresponding encoder output. This architecture supports continuous semantic feedback across depth, with each decoder not only receiving information from deeper levels but also iteratively refining semantic cues provided by the encoder.

2. Closed-Loop Spatio-Temporal Semantic Feedback Mechanism

Central to FeedbackSTS-Det is a closed-loop semantic association scheme established via paired FSTSRM and BSTSRM at each encoder-decoder level. For an encoder stage $f$ with input tensor $X \in \mathbb{R}^{C\times D\times H\times W}$:

  • FSTSRM splits the input into context and propagation branches:
    • $X_\text{context} = (BN \circ C_{3\times3\times3})^2(X)$,
    • $X_\text{prop} = \text{SSM}(C_{1\times1\times1}(X))$,
    • Output fused as $Y_\text{fwd} = X_\text{context} \oplus X_\text{prop}$.
  • BSTSRM (at decoder level $f$) receives $X_\text{in} = \text{Concat}(Y_\text{dec}^{(f+1)}, Y_\text{fwd}^{(f)}; \text{dim}=1)$:
    • Computes preserved context as above,
    • Reversed propagation branch incorporates: $X_\text{refine} = DR \circ \text{SSM} \circ DR \circ C_{1\times1\times1}(X_\text{in})$, with $DR$ reversing temporal order,
    • Outputs $Y_\text{bwd} = X_\text{context} \oplus X_\text{refine}$.

The encoder forwards semantic cues to the decoder, while the decoder refines and propagates them backward, forming a closed semantic loop that iteratively strengthens true target features and suppresses background-induced errors.
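To see what wrapping SSM between two temporal reversals does, the toy sketch below replaces SSM with a simple causal running sum. The stand-in `toy_ssm` is purely hypothetical and is not the paper's module; the 1×1×1 convolution, the context branch, and the fusion step are omitted. The point is only that a module which propagates information from earlier to later frames propagates from later to earlier frames once it is sandwiched between two reversals.

```python
def reverse_time(frames):
    """DR: reverse the temporal order of a sequence."""
    return frames[::-1]


def toy_ssm(frames):
    """Hypothetical stand-in for SSM: each frame accumulates all earlier frames."""
    out, running = [], 0.0
    for value in frames:
        running += value
        out.append(running)
    return out


def forward_branch(frames):
    """Propagation branch of the forward module: SSM applied in natural order."""
    return toy_ssm(frames)


def backward_branch(frames):
    """Refinement branch of the backward module: DR, then SSM, then DR again."""
    return reverse_time(toy_ssm(reverse_time(frames)))


frames = [1.0, 2.0, 3.0, 4.0]
print(forward_branch(frames))   # [1.0, 3.0, 6.0, 10.0]
print(backward_branch(frames))  # [10.0, 9.0, 7.0, 4.0]
```

In the forward output the last frame has seen everything; in the backward output the first frame has. Pairing the two is what lets every frame draw on both past and future context.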

3. Embedded Sparse Semantic Module (SSM) and Temporal Modeling

The SSM is designed to capture long-range temporal dependencies efficiently through structured sparse grouping, intra-group alignment, and temporal reassembly:

  • Sparse grouping: Partition $D$ frames into $T$ disjoint groups $G_x = \{I_x, I_{x+T}, I_{x+2T}, \ldots\}$, where $T$ is the sampling interval and $M_x = \lfloor (D-x)/T \rfloor$.
  • Intra-group propagation: Uses a Basic Feedback Module (BFBM) comprised of a lightweight two-layer pyramid feature extractor (FE) and corresponding pyramid deformable alignment (FA). For $G_x$, propagation follows $O_{x+kT} = FA(FE(I_{x+kT}), FE(I_{x+(k-1)T}))$.
  • Temporal reassembly: Resulting aligned group outputs $\mathcal{O}_x$ are merged and temporally sorted to $Y_{SSM} \in \mathbb{R}^{C \times D \times H \times W}$.
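The three steps above can be sketched on plain lists. This is a minimal illustration under stated assumptions: frame indices are 0-based, the first frame of each group is passed through unchanged (the source does not say how it is handled), and `align` is a hypothetical placeholder for the FE + FA pair rather than a deformable alignment.

```python
def sparse_groups(num_frames, interval):
    """Partition frame indices 0..D-1 into `interval` disjoint strided groups."""
    return [list(range(x, num_frames, interval)) for x in range(interval)]


def propagate(group, frames, align):
    """Within one group, align each frame against its predecessor in that group."""
    out = {group[0]: frames[group[0]]}  # assumption: first frame passes through
    for prev, cur in zip(group, group[1:]):
        out[cur] = align(frames[cur], frames[prev])
    return out


def ssm_sketch(frames, interval, align):
    """Group, propagate inside each group, then restore temporal order."""
    merged = {}
    for group in sparse_groups(len(frames), interval):
        if group:  # groups are empty when interval exceeds the frame count
            merged.update(propagate(group, frames, align))
    return [merged[i] for i in sorted(merged)]


frames = [10, 11, 12, 13, 14]
print(sparse_groups(len(frames), 2))                          # [[0, 2, 4], [1, 3]]
print(ssm_sketch(frames, 2, lambda cur, prev: cur - prev))    # [10, 11, 2, 2, 2]
```

With the toy `align` (a difference), each output at index `i >= T` depends on frames `i` and `i - T`, mirroring the pairing in the propagation rule, and the output has the same length and order as the input.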

This approach reduces the temporal modeling complexity from $O(D^2)$ (fully connected temporal interactions) to $O(D \cdot T)$, with empirical results showing that $T \in \{2, 3, 4\}$ balances accuracy and computational efficiency. The grouping mechanism inherently filters transient noise, negating the need for explicit sparsity regularization.

4. Training Objectives and Pipeline Consistency

The model is optimized end-to-end using a Soft-IoU loss:

$$\mathcal{L}_{SIoU} = 1 - \frac{\sum_i p_i g_i}{\sum_i p_i + \sum_i g_i - \sum_i p_i g_i}$$

where $p \in [0, 1]^{H\times W}$ denotes the predicted foreground mask and $g \in \{0,1\}^{H\times W}$ the ground truth.
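A direct transcription of the loss on flattened masks is given below as a sketch. The small `eps` term is an assumption added here to avoid division by zero when both masks are empty; it does not appear in the formula above.

```python
def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss on flattened masks.

    pred:   predicted foreground probabilities in [0, 1]
    target: binary ground-truth labels (0 or 1)
    eps:    stabilizer for empty masks (an assumption, not part of the formula)
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    union = sum(pred) + sum(target) - intersection
    return 1.0 - (intersection + eps) / (union + eps)


print(round(soft_iou_loss([1.0, 0.0, 1.0], [1, 0, 1]), 4))  # 0.0
print(round(soft_iou_loss([0.5, 0.5], [1, 0]), 4))          # 0.6667
```

Because the intersection and union are computed from probabilities rather than thresholded masks, the loss stays differentiable with respect to the prediction.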

Training employs a MultiStepLR schedule with learning rate reductions at epochs $\{5, 10, 15, 20, 25, 30\}$. Importantly, FeedbackSTS-Det maintains a fully consistent pipeline between training and inference: both leverage the same fixed-length sliding window over $D$ frames, and identical instantiations of SSM, FSTSRM, and BSTSRM. This consistency is critical for robust temporal behavior and prevents error accumulation or drift during inference.
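The step-decay rule behind a MultiStepLR schedule can be written in closed form, as sketched below. Only the milestone epochs come from the text; the base learning rate `1e-3` and decay factor `gamma=0.5` are placeholder assumptions, since the source gives neither.

```python
from bisect import bisect_right

MILESTONES = [5, 10, 15, 20, 25, 30]  # epochs listed in the text


def lr_at_epoch(epoch, base_lr=1e-3, gamma=0.5, milestones=MILESTONES):
    """Learning rate after multiplying by `gamma` at every milestone already reached.

    base_lr and gamma are hypothetical values chosen for illustration.
    """
    return base_lr * gamma ** bisect_right(milestones, epoch)


for epoch in (0, 4, 5, 12, 30):
    print(epoch, lr_at_epoch(epoch))
```

`bisect_right` counts how many milestones are less than or equal to the current epoch, so the rate drops exactly at epoch 5, 10, and so on, and stays constant after epoch 30.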

5. Experimental Evaluation and Comparative Analysis

Experiments were conducted on the NUDT-MIRSDT (120 sequences, 12,000 images) and IRSatVideo-LEO (200 sequences, 91,022 images) datasets. Evaluation includes pixel-level (mIoU, $F_1$, $F_a$) and object-level (detection rate $P_d$, ROC/AUC) metrics. FeedbackSTS-Det ($T=2$, 11-frame input) achieved on NUDT-MIRSDT:

  • mIoU = 52.24%
  • $F_1$ = 68.63%
  • $P_d$ = 97.41%
  • $F_a = 1.44 \times 10^{-5}$
  • AUC = 99.38%

Analogous gains were observed on IRSatVideo-LEO ($P_d$ up to 96.48%, AUC up to 98.16%), substantially outperforming prior model-based and deep learning approaches.
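As a sanity check on the pixel-level numbers, IoU and $F_1$ are tied by the identity $F_1 = 2\,\mathrm{IoU}/(1+\mathrm{IoU})$ whenever both are computed from the same true-positive, false-positive, and false-negative counts. Whether the paper aggregates both metrics this way is an assumption; the sketch below only shows that the reported NUDT-MIRSDT values are consistent with it.

```python
def iou_from_counts(tp, fp, fn):
    """Intersection over union from pixel counts."""
    return tp / (tp + fp + fn)


def f1_from_counts(tp, fp, fn):
    """F1 (Dice) score from the same pixel counts."""
    return 2 * tp / (2 * tp + fp + fn)


def f1_from_iou(iou):
    """Identity linking the two metrics when they share the same counts."""
    return 2 * iou / (1 + iou)


print(round(f1_from_iou(0.5224), 4))  # 0.6863
```

An mIoU of 52.24% maps to an $F_1$ of about 68.63%, matching the figure reported in the list above.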

Ablation studies confirm the efficacy of the closed-loop feedback: variants omitting encoder or decoder feedback, or employing only forward or backward feedback, consistently underperform the full "Full-FB" configuration. Additionally, increasing SSM interval $T$ yields comparable accuracy while reducing computation by up to 80% for long sequences. Naive 3D conversions of 2D networks fail to converge or require excessive FLOPs, contrasting with FeedbackSTS-Det-B8, which reliably converges with only 5.7M parameters and ~30G FLOPs.

6. False Alarm Mitigation and Robustness

The paired FSTSRM and BSTSRM design, incorporating temporal reversal via the DR operator, enables forward propagation of coarse target cues and reverse refinement to filter transient clutter. SSM's sparse temporal grouping naturally suppresses short-lived background phenomena (clouds, waves), emphasizing targets that exhibit consistent spatio-temporal trajectories. The uniformity of the training-inference sliding window ensures that learned registration and feature fusion behaviors generalize robustly in operational settings, minimizing error accumulation.

7. Significance and Distinctive Contributions

FeedbackSTS-Det establishes a framework for state-of-the-art infrared small target detection by integrating:

  • Closed-loop spatio-temporal semantic feedback for progressive refinement,
  • Structured sparse temporal modeling via SSM for long-range dependency capture,
  • Consistent sliding-window pipeline that avoids train/infer mismatches,
  • Efficient parameterization (5.7M parameters, moderate FLOPs) with substantial robustness and false-alarm suppression,
  • Superior empirical performance on benchmark ISTD datasets, with detection rates up to 97.41% and AUC up to 99.38% at low false-alarm rates.

These aspects collectively position FeedbackSTS-Det as a significant advance in robust multi-frame ISTD under complex dynamic backgrounds (Huang et al., 21 Jan 2026).
