Thermal Weapon Segmentation
- Thermal weapon segmentation is the pixel-level isolation of weapons in thermal imagery, enabling detection in low-light and occluded scenarios.
- It leverages semantic segmentation, edge-guided networks, and transformer-based models to enhance boundary precision and performance.
- Robust multi-modal fusion and tailored loss functions drive high mIoU and efficient real-time detection for critical security applications.
Thermal weapon segmentation refers to the pixel-level isolation and delineation of weapons in thermal imagery, enabling detection in conditions where visible light sensors fail due to darkness, fog, or visual occlusion. The field leverages advances in semantic segmentation, attention mechanisms, multi-modal fusion, and, more recently, transformer architectures to distinguish weapons based on their unique thermal signatures. This article surveys the key principles, representative algorithms, datasets, evaluation methodologies, and research challenges central to thermal weapon segmentation.
1. Fundamentals of Thermal Imaging for Segmentation
Thermal cameras detect infrared radiation, providing imagery based on temperature differences rather than visible color or texture. This modality offers distinct advantages for segmentation:
- Invariance to ambient illumination and insensitivity to shadow, glare, or night-time conditions (Li et al., 2019).
- Enhanced robustness in adverse scenarios such as haze, smog, and occlusion (Li et al., 2019, Zhang et al., 21 May 2025).
- Unique signature of metallic objects (e.g., firearms) that may contrast sharply against human skin or clothing.
However, thermal imagery presents its own characteristic challenges: lower native resolution, reduced contrast, ambiguous or fuzzy object boundaries (thermal crossover), and sensor-induced noise.
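Because these artifacts degrade downstream segmentation, raw frames are commonly normalized and contrast-enhanced first. The following is a minimal preprocessing sketch, a hypothetical OpenCV pipeline not drawn from any cited paper: percentile normalization of the radiometric range, CLAHE for local contrast, and edge-preserving denoising.

```python
import cv2
import numpy as np

def preprocess_thermal(raw: np.ndarray) -> np.ndarray:
    """raw: 16-bit thermal frame (H, W), as produced by many FLIR-class sensors."""
    # Rescale the radiometric range to 8 bits, clipping outlier pixels.
    lo, hi = np.percentile(raw, (1, 99))
    img = np.clip((raw - lo) / max(hi - lo, 1e-6), 0, 1)
    img8 = (img * 255).astype(np.uint8)

    # CLAHE counteracts the low native contrast typical of thermal imagery.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    img8 = clahe.apply(img8)

    # Light bilateral filtering suppresses sensor noise while preserving edges.
    return cv2.bilateralFilter(img8, d=5, sigmaColor=25, sigmaSpace=25)
```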
2. Segmentation Architectures for Thermal Imagery
2.1 Edge-Conditioned and Attention-Guided Networks
Edge-Conditioned CNNs (EC-CNNs) integrate edge prior knowledge via a two-stream architecture: one stream extracts hierarchical edge maps (using the HED edge detector), while the other adapts the segmentation backbone to incorporate these priors through a gated feature-wise transform (GFT) layer. The GFT layer modulates intermediate feature maps via spatially-varying scaling and shifting parameters conditioned on edge maps, enhancing boundary delineation (Li et al., 2019):
$$\mathrm{GFT}(F \mid E) = \gamma(E) \odot F + \beta(E)$$

where the spatially-varying scale $\gamma(E)$ and shift $\beta(E)$ are gated via sigmoid activations on edge features.
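A minimal PyTorch sketch of such a GFT layer follows; the convolutional parameterization of $\gamma$ and $\beta$ and the layer widths are illustrative assumptions, not the EC-CNN authors' exact configuration.

```python
import torch
import torch.nn as nn

class GFT(nn.Module):
    def __init__(self, feat_ch: int, edge_ch: int):
        super().__init__()
        # Predict spatially-varying gamma (scale) and beta (shift) from edge features.
        self.to_gamma = nn.Conv2d(edge_ch, feat_ch, kernel_size=3, padding=1)
        self.to_beta = nn.Conv2d(edge_ch, feat_ch, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, edge: torch.Tensor) -> torch.Tensor:
        gamma = torch.sigmoid(self.to_gamma(edge))  # gated scaling in (0, 1)
        beta = torch.sigmoid(self.to_beta(edge))    # gated shifting in (0, 1)
        return gamma * feat + beta                  # feature-wise modulation

# Usage: modulated = GFT(256, 1)(features, edge_map)  # (B,256,H,W), (B,1,H,W)
```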
Attention-based architectures, such as ARTSeg, employ Residual Recurrent CNN (RRCNN) blocks in their encoders and additive attention modules in decoders to retain high-resolution features and improve localization, yielding superior mean IoU in public benchmarks (Munir et al., 2021). Attention coefficients derived from encoder and decoder features highlight salient, thermally distinguishable regions.
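The additive attention coefficients can be sketched as follows, an Attention-U-Net-style gate consistent with the description above; channel widths are assumed, and spatial sizes of the two inputs are taken to match.

```python
import torch
import torch.nn as nn

class AdditiveAttentionGate(nn.Module):
    def __init__(self, enc_ch: int, dec_ch: int, inter_ch: int):
        super().__init__()
        self.w_enc = nn.Conv2d(enc_ch, inter_ch, kernel_size=1)
        self.w_dec = nn.Conv2d(dec_ch, inter_ch, kernel_size=1)
        self.psi = nn.Conv2d(inter_ch, 1, kernel_size=1)

    def forward(self, enc: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
        # Attention coefficients derived from both encoder and decoder features.
        alpha = torch.sigmoid(self.psi(torch.relu(self.w_enc(enc) + self.w_dec(dec))))
        return enc * alpha  # suppress background, keep thermally salient regions
```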
2.2 Multi-Modal and Late Fusion Strategies
Fusion networks such as PST900’s dual-stream CNN utilize an independently trained RGB branch for “base” segmentation, followed by a fusion stage in which thermal imagery refines mask boundaries and resolves ambiguities that the RGB branch alone cannot (Shivakumar et al., 2019). Empirical results confirm that straightforward concatenation of modalities can degrade performance; late fusion that leverages independently optimized branches and weighted losses for class imbalance yields accurate results for hard-to-segment objects like weapons.
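A hedged sketch of this late-fusion refinement idea follows; the module sizes and frozen-RGB assumption are illustrative, not PST900's exact design.

```python
import torch
import torch.nn as nn

class LateFusionHead(nn.Module):
    """Refines 'base' RGB logits using thermal features."""
    def __init__(self, num_classes: int, thermal_ch: int = 32):
        super().__init__()
        self.refine = nn.Sequential(
            nn.Conv2d(num_classes + thermal_ch, 64, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(64, num_classes, kernel_size=1),
        )

    def forward(self, rgb_logits: torch.Tensor, thermal_feats: torch.Tensor) -> torch.Tensor:
        # Thermal cues sharpen boundaries and recover detail the RGB branch missed.
        return self.refine(torch.cat([rgb_logits, thermal_feats], dim=1))
```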
Hierarchical fusion methods like RSFNet deploy asymmetric encoders (deep ResNet for RGB, shallow for thermal) and use residual spatial fusion (multi-scale convolutions, spatial weighting, and gating) to adaptively combine features based on pseudo-label confidence (Li et al., 2023). Pseudo-labels arise from fine-grained saliency detection and Otsu binarization, ensuring robust feature supervision across lighting conditions.
Spectral-aware approaches (SGFNet) decompose features into low-frequency (global context) and high-frequency (edges, textures) bands via DCT, with spectral-aware channel and spatial attention mechanisms to enhance pixel discrimination of weapons in fused RGB-T input (Zhang et al., 21 May 2025).
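The low/high-frequency split can be illustrated with a standard DCT band mask; the cutoff and masking scheme here are assumptions, and SGFNet's learned spectral-aware attention over the resulting bands is omitted.

```python
import numpy as np
from scipy.fft import dctn, idctn

def split_frequencies(feat: np.ndarray, cutoff: int = 8):
    """feat: 2-D feature/image map. Returns (low_freq, high_freq) components."""
    coeffs = dctn(feat, norm="ortho")
    mask = np.zeros_like(coeffs)
    mask[:cutoff, :cutoff] = 1.0              # keep only the lowest frequencies
    low = idctn(coeffs * mask, norm="ortho")  # global context component
    high = feat - low                         # residual: edges and fine texture
    return low, high
```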
2.3 Transformer-Based Models
Vision Transformer (ViT) architectures surpass traditional CNNs in capturing long-range dependencies and fine geometric details—especially crucial for segmenting small or irregularly shaped weapons in cluttered thermal imagery (Kambhatla et al., 19 Oct 2025). Representative models include SegFormer (Mix Transformer encoder, cross-scale fusion), DeepLabV3+ (Atrous Spatial Pyramid Pooling), SegNeXt (multi-scale convolutional attention), and Swin Transformer (hierarchical shifted windows).
CBAM-integrated transformer models like ArmFormer strategically combine spatial and channel attention with MixVisionTransformer backbones. Hamburger-style decoders fuse multi-scale features with global context, enabling real-time (82.26 FPS) multi-class segmentation across weapon categories with state-of-the-art accuracy and computational efficiency (Kambhatla et al., 19 Oct 2025).
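The channel half of CBAM follows the standard formulation: average- and max-pooled descriptors share an MLP and are summed before a sigmoid gate. A minimal sketch, with the reduction ratio as an assumed hyperparameter:

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, ch: int, reduction: int = 16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(ch, ch // reduction), nn.ReLU(), nn.Linear(ch // reduction, ch)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average descriptor
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max descriptor
        scale = torch.sigmoid(avg + mx)[:, :, None, None]
        return x * scale                     # reweight channels
```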
Thermal conduction-inspired transformers (TCI-Former) employ physics-motivated modules: the Thermal Conduction-Inspired Attention (TCIA) applies finite difference approximation of the pixel movement differential equation to simulate the flow of features to the target, while the Thermal Conduction Boundary Module (TCBM) enhances boundary precision via second-order derivatives akin to Laplacian operators (Chen et al., 3 Feb 2024).
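The second-order operator underlying TCBM-style boundary enhancement can be sketched as a fixed depthwise convolution with a Laplacian kernel; the exact module wiring in TCI-Former differs, and this illustrates only the operator itself.

```python
import torch
import torch.nn.functional as F

# Standard 3x3 Laplacian (second-order derivative) kernel.
LAPLACIAN = torch.tensor([[0., 1., 0.],
                          [1., -4., 1.],
                          [0., 1., 0.]]).view(1, 1, 3, 3)

def boundary_response(feat: torch.Tensor) -> torch.Tensor:
    """feat: (B, C, H, W). Applies the Laplacian to each channel (depthwise)."""
    c = feat.shape[1]
    kernel = LAPLACIAN.to(feat.dtype).repeat(c, 1, 1, 1)  # one kernel per channel
    return F.conv2d(feat, kernel, padding=1, groups=c)
```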
3. Benchmark Datasets and Labeling Strategies
Reliable thermal weapon segmentation depends on representative datasets:
- SODA: 7,168 images, 20 semantic labels, day/night scenes; combines real and synthetic data for robust training (Li et al., 2019).
- PST900: 894 pixel-aligned RGB/thermal image pairs with per-pixel annotations, calibrated for underground scenarios (Shivakumar et al., 2019).
- TICW: 6,000 thermal images focused on concealed weapon detection, 25 subjects, four weapon categories, multiple annotation formats (Bhardwaj et al., 15 Oct 2025).
- Custom Surveillance Dataset: 9,711 FLIR-acquired images from video, binary weapon masks generated using SAM2 and human refinement (Kambhatla et al., 19 Oct 2025).
- SATIR: >100,000 thermal images, pseudo-labeled with SAM (Segment Anything Model) via knowledge distillation, facilitating pretraining of segmentation backbones (Chen et al., 2023).
Pseudo-labeling—generating segmentation masks with models like SAM for unannotated thermal imagery—enables construction of large thermal-pretraining datasets. Response-based knowledge distillation bridges generalization gaps, improving fine-tuning outcomes for weapon segmentation (Chen et al., 2023).
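A hedged sketch of response-based distillation for such pretraining: the student matches the teacher's (e.g., SAM-derived) soft mask logits via a temperature-scaled KL term alongside cross-entropy on the pseudo-labels. Temperature and weighting are assumed hyperparameters.

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, pseudo_labels, T=2.0, alpha=0.5):
    # Soft response matching: per-pixel KL divergence over the class dimension.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    # Hard supervision from the pseudo-label masks.
    ce = F.cross_entropy(student_logits, pseudo_labels)
    return alpha * kl + (1 - alpha) * ce
```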
4. Performance Evaluation and Trade-offs
Performance metrics for thermal weapon segmentation include mean Intersection over Union (mIoU), pixel accuracy (PA), precision, recall, and mean F-score. For instance:
- EC-CNN achieves ~61.9% mIoU on SODA (Li et al., 2019).
- ArmFormer attains 80.64% mIoU and 89.13% mFscore, with only 4.886 GFLOPs and 3.66M parameters (Kambhatla et al., 19 Oct 2025).
- SegFormer-b5 reaches 94.15% mIoU, while SegFormer-b0 delivers 90.84% mIoU at 98.32 FPS (Kambhatla et al., 19 Oct 2025).
- DEF-YOLO surpasses all YOLO baselines with 98.4% mAP@0.5 and 70.3% mAP@0.5:0.95, maintaining ~277 FPS (Bhardwaj et al., 15 Oct 2025).
Speed-accuracy trade-offs are evident in architecture selection; lightweight transformer designs (SegFormer-b0, ArmFormer) enable deployment on edge devices without significant loss in accuracy.
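For concreteness, the standard confusion-matrix computation of per-class IoU and its mean, as used in the comparisons above (this is the textbook definition, not any one paper's evaluation script):

```python
import numpy as np

def mean_iou(pred: np.ndarray, gt: np.ndarray, num_classes: int) -> float:
    """pred, gt: integer label maps of identical shape."""
    conf = np.bincount(
        num_classes * gt.ravel() + pred.ravel(), minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    tp = np.diag(conf).astype(float)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp  # TP + FP + FN per class
    # Ignore classes absent from both prediction and ground truth.
    return float(np.nanmean(tp / np.where(union > 0, union, np.nan)))
```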
Handling class imbalance is critical, especially with under-represented weapon types. Focal loss formulations and tailored aggregation rules (ensemble voting thresholds, weighted fusion based on validation accuracy) mitigate false positives and negatives (Bhardwaj et al., 15 Oct 2025, Egiazarov et al., 2020).
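A minimal binary focal loss in the standard form $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$; the defaults below are the common choices, not necessarily those used by DEF-YOLO.

```python
import torch
import torch.nn.functional as F

def focal_loss(logits: torch.Tensor, targets: torch.Tensor,
               alpha: float = 0.25, gamma: float = 2.0) -> torch.Tensor:
    """logits, targets: same shape; targets in {0, 1} (weapon vs. background)."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)        # prob. of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # Down-weight easy examples; rare weapon pixels dominate the gradient.
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()
```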
5. Challenges, Limitations, and Future Directions
While current thermal segmentation models are robust to low-light and occlusion, several ongoing challenges remain:
- Data Scarcity and Annotation: Many thermal weapon datasets are limited in size or variety; leveraging self-labeling techniques (SAM-based pseudo labels, response-based distillation) helps alleviate this bottleneck (Chen et al., 2023).
- Boundary Ambiguity: Fuzzy boundaries and low contrast in thermal images demand explicit edge supervision modules (GFT, ESM), spectral enhancement, and boundary refinement blocks (TCBM) (Li et al., 2019, Li et al., 2022, Chen et al., 3 Feb 2024).
- Small Object and Occlusion Robustness: Deformable convolutions, spectral fusion, and multi-scale attention are critical to reliably segmenting thin or partially hidden weapons (Bhardwaj et al., 15 Oct 2025, Zhang et al., 21 May 2025).
- Domain Adaptation and Transfer Learning: Refinement of pretrained RGB-based models or transfer from large pseudo-labeled thermal datasets (SATIR) is essential for generalization across modalities and device types (Chen et al., 2023, Kütük et al., 2022).
- Real-Time Deployment: Maintaining high accuracy with low computational footprint (≤5M parameters, >80 FPS) remains an active area of architectural innovation, with transformers and attention modules facilitating efficient feature extraction and context fusion (Kambhatla et al., 19 Oct 2025).
- Multi-class and Multi-modal Fusion: Extending binary segmentation to multi-class (handgun, rifle, knife, revolver) detection, and exploring fusion with millimeter-wave or RGB for ambiguous scenarios (Kambhatla et al., 19 Oct 2025, Zhang et al., 21 May 2025).
A plausible implication is that future thermal weapon segmentation systems will integrate transformer-based architectures, spectral attention, multi-modal fusion, and large-scale pseudo-labeled training, leading to scalable and resilient security infrastructure across diverse operational conditions.
6. Representative Algorithms and Mathematical Formulations
Selected mathematical notations and module operations for thermal weapon segmentation:
| Module | Notation / Operation | Context |
|---|---|---|
| Gated feature-wise transform | $\mathrm{GFT}(F \mid E) = \gamma(E) \odot F + \beta(E)$ | Edge-Conditioned CNN (Li et al., 2019) |
| Spectral enhancement | $F = F_{\text{low}} + F_{\text{high}}$ via DCT band masking with spectral-aware attention | SGFNet (Zhang et al., 21 May 2025) |
| Pixel movement (TCIA) | finite-difference approximation of the conduction equation $\partial u / \partial t = k \nabla^2 u$ | TCI-Former (Chen et al., 3 Feb 2024) |
| CBAM channel attention | $M_c(F) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(F)) + \mathrm{MLP}(\mathrm{MaxPool}(F))\big)$ | ArmFormer (Kambhatla et al., 19 Oct 2025) |
| BCE + Dice loss | $L = L_{\mathrm{BCE}} + \big(1 - \tfrac{2\lvert P \cap G \rvert}{\lvert P \rvert + \lvert G \rvert}\big)$ | SegFormer (Kambhatla et al., 19 Oct 2025) |
| Focal loss | $\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log p_t$ | DEF-YOLO (Bhardwaj et al., 15 Oct 2025) |
These operations are integral to boundary sharpening, attention mechanisms, spectral fusion, and balanced loss functions mitigating class imbalance and facilitating efficient training convergence.
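As a companion to the loss entries in the table, a combined BCE + Dice objective in its standard form; equal weighting of the two terms is an assumed choice.

```python
import torch
import torch.nn.functional as F

def bce_dice_loss(logits: torch.Tensor, targets: torch.Tensor,
                  eps: float = 1e-6) -> torch.Tensor:
    """logits, targets: same shape; targets are binary masks."""
    bce = F.binary_cross_entropy_with_logits(logits, targets)
    probs = torch.sigmoid(logits)
    inter = (probs * targets).sum()
    # Dice directly optimizes region overlap; BCE stabilizes per-pixel gradients.
    dice = 1 - (2 * inter + eps) / (probs.sum() + targets.sum() + eps)
    return bce + dice
```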
7. Impact and Applications in Security Infrastructure
Thermal weapon segmentation is integral to security domains where rapid, reliable detection under adverse conditions is mandatory. Primary applications include:
- Automated surveillance: Airports, schools, public transport, and border control systems (Kambhatla et al., 19 Oct 2025).
- Embedded devices: Security drones, portable cameras, and distributed AI for real-time scene understanding (Kambhatla et al., 19 Oct 2025).
- Military and law enforcement: Tactical operations requiring precision in low-visibility environments (Chen et al., 3 Feb 2024, Bhardwaj et al., 15 Oct 2025).
- Crowded public venues: Enabling low-latency, accurate threat assessment.
The ongoing refinement of architectures and dataset expansion, together with transformer-based designs and spectral fusion, is poised to reshape the standard for robust, scalable, and real-time threat segmentation.
In summary, thermal weapon segmentation synthesizes advances in edge-guided feature modulation, attention mechanisms, multi-modal and spectral-aware fusion, transformer-based architectures, and data-centric strategies—offering high-precision, resilient performance for critical security, surveillance, and defense applications. Continued research in semantic modeling, efficient computation, and large-scale annotation will further enhance its operational impact across real-world environments.