FireNet: Real-Time Wildfire Perimeter Segmentation
- The paper shows that adding PrevPred (previous-prediction input channels) to a pruned residual U-Net recovers most of the accuracy lost to pruning (F1 92 versus 86) while sustaining real-time inference (20 fps) for wildfire perimeter segmentation.
- The method feeds each IR video frame to the network together with the model's previous predictions, and trains it with a Dice loss that copes with the extreme class imbalance between fire and background pixels.
- Quantitative benchmarks on a large, expert-annotated dataset validate FireNet’s operational value and suggest pathways for future real-time disaster management improvements.
FireNet is a real-time segmentation architecture engineered to delineate wildfire perimeters from aerial full-motion infrared (IR) video. It is designed for humanitarian aid and disaster response. FireNet combines a pruned residual U-Net, specialized data handling and a Dice-based loss function. Together these achieve high segmentation accuracy (an F1 score of 92 for the production model) and real-time throughput (20 frames per second) on a single NVIDIA K80 GPU. Its benchmarks rest on a uniquely large, expert-annotated wildfire dataset, and the model has entered operational use in aerial disaster response pipelines (Doshi et al., 2019).
1. Architecture and Model Design
FireNet adopts a residual U-Net structure customized to segment fire boundaries in near-real-time from IR video. The canonical workflow processes a 4-channel input tensor:
- the first channel is the current IR frame;
- the remaining three channels are spatial masks holding the model's fire predictions from three earlier time steps (the PrevPred scheme).

In its production form (Pruned + PrevPred), the architecture comprises:
- Encoder: Four down-sampling stages (pruned from eight in the full model). Each stage has a residual block followed by max pooling. The residual block combines multiple convolutions with batch normalization and leaky ReLU activations. In the pruned variant, the first block has 64 feature maps and each subsequent block has 32.
- Bottleneck: A single conv–BN–ReLU block at minimum spatial resolution.
- Decoder: Four symmetric up-sampling stages using transposed convolutions (stride 2), skip connections from the encoder, and a residual block to refine concatenated features at each level.
- Output Head: A final convolution maps the decoder features to a single-channel prediction. A hard sigmoid activation (a clipped, piecewise-linear approximation of the logistic sigmoid) follows, producing near-binary segmentation outputs.
- Temporal Consistency: The PrevPred modification incorporates prior model outputs as additional channels to improve frame-to-frame consistency at marginal computational cost.
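The sketch below makes the shape bookkeeping concrete. It traces feature-map sizes through a four-stage pruned encoder/decoder and implements a hard sigmoid, in plain Python rather than as a trainable network. Several details are assumptions, not values from the paper:
- the 512×640 frame size is hypothetical;
- the helper names are hypothetical;
- the hard-sigmoid slope (0.2) and offset (0.5) follow the Keras convention.

```python
def hard_sigmoid(x: float, slope: float = 0.2, offset: float = 0.5) -> float:
    """Piecewise-linear sigmoid approximation, clipped to [0, 1].

    slope=0.2 and offset=0.5 follow the Keras convention; this is an assumption,
    not a value taken from the FireNet paper.
    """
    return max(0.0, min(1.0, slope * x + offset))


def pruned_unet_shapes(height: int, width: int, stages: int = 4,
                       first_width: int = 64, later_width: int = 32) -> dict:
    """Trace feature-map shapes through a pruned U-Net (hypothetical helper).

    Encoder stage: residual block (its output is kept as a skip connection),
    then 2x2 max pooling. Decoder stage: stride-2 transposed convolution,
    concatenation with the matching skip, then a residual block.
    """
    factor = 2 ** stages
    if height % factor or width % factor:
        raise ValueError(f"pad the frame so H and W are divisible by {factor}")
    encoder, h, w = [], height, width
    for i in range(stages):
        channels = first_width if i == 0 else later_width
        encoder.append((channels, h, w))  # (channels, H, W) of the skip connection
        h, w = h // 2, w // 2             # 2x2 max pooling halves H and W
    bottleneck = (h, w)
    decoder = []
    for _, skip_h, skip_w in reversed(encoder):
        h, w = h * 2, w * 2               # stride-2 transposed convolution doubles H and W
        assert (h, w) == (skip_h, skip_w)  # upsampled map must match its skip connection
        decoder.append((h, w))
    return {"encoder": encoder, "bottleneck": bottleneck, "decoder": decoder}


shapes = pruned_unet_shapes(512, 640)  # hypothetical sensor resolution
print(shapes["bottleneck"])            # (32, 40)
print(hard_sigmoid(-3.0), hard_sigmoid(0.0), hard_sigmoid(3.0))  # 0.0 0.5 1.0
```

Each of the four pooling stages halves the resolution. Frame height and width must therefore be divisible by 16, or padded to be, so that the transposed convolutions line up with their skip connections.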
3D convolutions and LSTM-based temporal modules were assessed but abandoned due to prohibitive computational demands and negligible performance improvements compared to the lightweight PrevPred input channel methodology.
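The PrevPred scheme needs only a small rolling buffer of past outputs. The sketch below uses a hypothetical `PrevPredBuffer` class. It assumes the three prior masks come from consecutive earlier frames and are ordered newest first; the paper's exact frame offsets and channel order may differ. Masks are nested Python lists to keep the example dependency-free.

```python
from collections import deque


class PrevPredBuffer:
    """Rolling store of the model's last k predicted masks (hypothetical helper).

    Builds a (1 + k)-channel input: the current IR frame followed by the k most
    recent predictions, newest first. Consecutive previous frames are assumed.
    """

    def __init__(self, height: int, width: int, k: int = 3):
        self.height, self.width, self.k = height, width, k
        self.reset()

    def _empty_mask(self):
        return [[0.0] * self.width for _ in range(self.height)]

    def reset(self):
        # At the start of a sequence no predictions exist yet, so use empty masks.
        self.masks = deque((self._empty_mask() for _ in range(self.k)), maxlen=self.k)

    def build_input(self, ir_frame):
        # Channel 0 is the current IR frame; channels 1..k are prior predictions.
        return [ir_frame] + list(self.masks)

    def push(self, prediction):
        # appendleft on a full deque discards the oldest mask from the right end.
        self.masks.appendleft(prediction)


buf = PrevPredBuffer(height=2, width=2)
frame = [[0.3, 0.9], [0.1, 0.7]]
for step in range(4):
    x = buf.build_input(frame)                        # frame + 3 prior masks
    prediction = [[float(step)] * 2 for _ in range(2)]  # stand-in for the network output
    buf.push(prediction)
print(len(x), x[1][0][0])  # 4 2.0
```

Only k extra input channels are added, so the per-frame cost grows by a few channels in the first convolution rather than by a temporal module.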
2. Loss Function and Optimization
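The FireNet training objective is a differentiable (soft) Dice loss, chosen for its robustness to the extreme class imbalance typical of binary fire segmentation:
$$\mathcal{L}_{\text{Dice}} = 1 - \frac{2\sum_i p_i g_i + \epsilon}{\sum_i p_i + \sum_i g_i + \epsilon}$$
where $p_i \in [0, 1]$ are predicted pixel values, $g_i \in \{0, 1\}$ are binary ground-truth labels, and $\epsilon$ is a small constant added for numerical stability. Binary cross-entropy was empirically found to underperform, because it scores pixels independently and does not directly measure region overlap, which is what Dice measures. Beyond standard batch normalization, no additional regularization terms or loss-weighting schemes are used.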
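The following is a minimal sketch of the soft Dice loss in plain Python over flattened pixel lists. It also shows why binary cross-entropy can look acceptable for a degenerate "all background" prediction under heavy imbalance while Dice still penalizes it. The smoothing constant `eps=1.0`, the clipping value and the toy 100-pixel example are assumptions for illustration.

```python
import math


def dice_loss(pred, target, eps=1.0):
    """Soft Dice loss: 1 - (2*sum(p*g) + eps) / (sum(p) + sum(g) + eps).

    pred: predicted pixel values in [0, 1]; target: binary labels (flattened).
    eps=1.0 is an assumed smoothing constant.
    """
    intersection = sum(p * g for p, g in zip(pred, target))
    return 1.0 - (2.0 * intersection + eps) / (sum(pred) + sum(target) + eps)


def bce(pred, target, clip=1e-7):
    """Mean binary cross-entropy, with predictions clipped away from 0 and 1."""
    total = 0.0
    for p, g in zip(pred, target):
        p = min(max(p, clip), 1.0 - clip)
        total -= g * math.log(p) + (1 - g) * math.log(1.0 - p)
    return total / len(pred)


# 100 pixels, only 2 of them fire; the model predicts near-zero everywhere.
target = [1.0, 1.0] + [0.0] * 98
pred = [0.01] * 100
print(round(bce(pred, target), 3))        # 0.102: small, so BCE barely objects
print(round(dice_loss(pred, target), 3))  # 0.74: Dice still penalizes it heavily
```

With smoothing in both numerator and denominator, a perfect prediction and a pair of empty masks both give a loss of exactly 0.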
Optimization uses Adam with He normal initialization, a fixed batch size of 8, and an adaptive learning-rate schedule that halves the learning rate whenever validation loss plateaus.
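The following plain-Python sketch covers two training details: the He normal standard deviation, $\sqrt{2/\text{fan\_in}}$, and a halve-on-plateau learning-rate schedule. The `patience` and `min_lr` values are assumptions; the paper's settings are not given here.

```python
import math


def he_normal_std(fan_in: int) -> float:
    """Standard deviation of He normal initialization: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


class HalveOnPlateau:
    """Halve the learning rate when validation loss stops improving.

    patience=2 and min_lr=1e-6 are assumed values, not FireNet settings.
    """

    def __init__(self, lr: float, patience: int = 2, min_lr: float = 1e-6):
        self.lr, self.patience, self.min_lr = lr, patience, min_lr
        self.best = math.inf
        self.stale_epochs = 0

    def step(self, val_loss: float) -> float:
        if val_loss < self.best:
            self.best, self.stale_epochs = val_loss, 0   # improvement: reset counter
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:      # plateau detected
                self.lr = max(self.lr * 0.5, self.min_lr)
                self.stale_epochs = 0
        return self.lr


sched = HalveOnPlateau(lr=1e-3)
for loss in [0.50, 0.40, 0.41, 0.40, 0.39]:
    lr = sched.step(loss)
print(lr)  # 0.0005
```

Frameworks provide equivalents:
- PyTorch: `torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", factor=0.5)` and `torch.nn.init.kaiming_normal_`.
- Keras: the `ReduceLROnPlateau` callback and the `HeNormal` initializer.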
3. Dataset and Preprocessing
The FireNet training dataset comprises approximately 400,000 IR frames collected during wildfire missions over U.S. forests, of which around 100,000 contain active fire. Annotation was performed by expert contractors under rigorous quality assurance directed by CAL FIRE and the California Air National Guard. It followed a strict “fire perimeter” protocol (the outer boundary of burning or burnt regions).
Frames are split into training, validation, and test sets in an 80/10/10 ratio, with each split sourced from disjoint flight segments to prevent temporal information leakage. All data are stored as single-channel IR frames at original sensor resolution.
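Splitting by flight segment rather than by frame keeps near-duplicate consecutive frames from landing in different splits. The sketch below uses a hypothetical `split_by_segment` helper and toy segment IDs. It shuffles whole segments and fills the splits by cumulative frame count, so the 80/10/10 ratio holds approximately rather than exactly.

```python
import random
from collections import defaultdict


def split_by_segment(frames, ratios=(0.8, 0.1, 0.1), seed=0):
    """Assign whole flight segments to train/val/test (hypothetical helper).

    frames: iterable of (segment_id, frame_id) pairs. No segment is ever shared
    between splits; frame ratios are approximate when segment lengths differ.
    """
    by_segment = defaultdict(list)
    for segment_id, frame_id in frames:
        by_segment[segment_id].append(frame_id)
    segments = sorted(by_segment)
    random.Random(seed).shuffle(segments)  # reproducible segment order

    total = sum(len(v) for v in by_segment.values())
    train_end = round(ratios[0] * total)
    val_end = round((ratios[0] + ratios[1]) * total)
    splits = {"train": [], "val": [], "test": []}
    assigned = 0
    for segment_id in segments:
        if assigned < train_end:
            name = "train"
        elif assigned < val_end:
            name = "val"
        else:
            name = "test"
        splits[name].extend((segment_id, f) for f in by_segment[segment_id])
        assigned += len(by_segment[segment_id])
    return splits


frames = [(seg, f) for seg in range(50) for f in range(10)]  # 50 toy segments
splits = split_by_segment(frames)
print({name: len(items) for name, items in splits.items()})
# {'train': 400, 'val': 50, 'test': 50}
```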
Data augmentation during training consists of random uniform scaling ([1/1.05, 1.05]), small rotations (±5°), horizontal/vertical flips, salt-and-pepper noise, and shear transforms (±5°). Each frame has a 10% chance of being augmented per epoch. For models using PrevPred, further augmentations are applied to prior mask channels: random empty masks (sequence starts), small affine perturbations, and teacher forcing (ground-truth masks fed instead of model predictions in early epochs).
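The sampling logic can be sketched as follows. The text supplies the ranges (scale in [1/1.05, 1.05], ±5° rotation and shear) and the 10% per-frame probability. The following are assumptions:
- applying all transforms together when a frame is augmented;
- the 50% flip and noise probabilities;
- the teacher-forcing cutoff epoch;
- the empty-mask probability.

The functions only sample parameters; the actual warping, which must also be applied to the label mask, is left to an image library. The small affine perturbations of prior masks are not shown.

```python
import random


def sample_frame_augmentation(rng, p_augment=0.1):
    """Sample augmentation parameters for one frame, or None (no augmentation).

    Ranges follow the text; applying every transform together and the 50%
    flip/noise probabilities are assumptions.
    """
    if rng.random() >= p_augment:
        return None
    return {
        "scale": rng.uniform(1 / 1.05, 1.05),
        "rotation_deg": rng.uniform(-5.0, 5.0),
        "shear_deg": rng.uniform(-5.0, 5.0),
        "flip_horizontal": rng.random() < 0.5,
        "flip_vertical": rng.random() < 0.5,
        "salt_and_pepper": rng.random() < 0.5,
    }


def choose_prior_masks(rng, epoch, gt_masks, predicted_masks, empty_mask,
                       teacher_forcing_epochs=5, p_empty=0.05):
    """Pick the PrevPred channels for one training sample (hypothetical policy).

    Empty masks simulate a sequence start; early epochs feed ground truth
    (teacher forcing) instead of the model's own predictions. The epoch cutoff
    and p_empty are assumed values.
    """
    if rng.random() < p_empty:
        return [empty_mask] * len(gt_masks)
    if epoch < teacher_forcing_epochs:
        return gt_masks
    return predicted_masks


rng = random.Random(42)
samples = [sample_frame_augmentation(rng) for _ in range(10_000)]
augmented = [s for s in samples if s is not None]
print(len(augmented) / len(samples))  # roughly 0.1
```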
4. Inference Speed and Deployment Considerations
The production-ready Pruned + PrevPred model achieves 20 frames per second on a single NVIDIA K80 GPU without post-pruning quantization or specialized hardware acceleration. Key contributors to inference efficiency are:
- Aggressive pruning of the architecture (depth reduced to four encoder/decoder stages, thinner channels)
- Elimination of 3D temporal modules; all temporal context is encoded via the PrevPred channel scheme
- Use of full sensor resolution, avoiding expensive resampling and preserving spatial detail
- Efficient implementation of transposed convolutions and fused conv–BN–activation layers on GPU
These choices enable FireNet’s integration in real-time disaster response where computational resources are constrained but accurate, low-latency perimeter estimation is critical to operational decision-making.
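Throughput targets translate directly into per-frame budgets: 20 fps leaves 50 ms per frame, versus 200 ms at the basic U-Net's 5 fps. The timing harness below is a hypothetical sketch for checking a model against such a budget. `infer` stands in for any per-frame inference function.

```python
import time


def frame_budget_ms(fps: float) -> float:
    """Per-frame time budget implied by a target throughput."""
    return 1000.0 / fps


def measure_fps(infer, frames, warmup=5):
    """Time a per-frame inference callable and return frames per second.

    `infer` is a hypothetical callable taking one frame. On a GPU it should
    block until the result is ready (e.g. by copying it to host memory);
    otherwise only asynchronous kernel launches are timed.
    """
    frames = list(frames)
    for frame in frames[:warmup]:  # warm-up excludes one-time setup costs
        infer(frame)
    start = time.perf_counter()
    for frame in frames:
        infer(frame)
    elapsed = time.perf_counter() - start
    return len(frames) / elapsed if elapsed > 0 else float("inf")


print(frame_budget_ms(20))  # 50.0 ms per frame for the production model
print(frame_budget_ms(5))   # 200.0 ms per frame for the basic U-Net
```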
5. Quantitative Assessment and Ablation
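FireNet’s primary evaluation metric is the F1 score, which equals the Dice coefficient for binary segmentation:
$$F_1 = \frac{2\,TP}{2\,TP + FP + FN}$$
Intersection-over-Union (IoU) is also reported:
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
where $TP$, $FP$, and $FN$ are the numbers of true-positive, false-positive, and false-negative pixels.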
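The following is a plain-Python sketch of both metrics on flattened binary masks. Because the two metrics share the same counts, IoU can be recovered from F1 as IoU = F1 / (2 - F1). This identity holds for scores computed from one set of pooled counts, not for averages of per-frame scores. Scoring two empty masks as 1.0 is an assumed convention.

```python
def confusion_counts(pred, target):
    """Pixel-wise TP, FP, FN for binary masks given as flat 0/1 sequences."""
    tp = sum(1 for p, g in zip(pred, target) if p and g)
    fp = sum(1 for p, g in zip(pred, target) if p and not g)
    fn = sum(1 for p, g in zip(pred, target) if g and not p)
    return tp, fp, fn


def f1_and_iou(pred, target):
    """Return (F1, IoU) computed from the same confusion counts."""
    tp, fp, fn = confusion_counts(pred, target)
    denom = tp + fp + fn
    if denom == 0:  # both masks empty: assumed to count as perfect agreement
        return 1.0, 1.0
    return 2 * tp / (2 * tp + fp + fn), tp / denom


pred = [1, 1, 1, 0, 0, 0]
target = [1, 1, 0, 1, 0, 0]
f1, iou = f1_and_iou(pred, target)
print(round(f1, 3), round(iou, 3))  # 0.667 0.5
print(round(f1 / (2 - f1), 3))      # 0.5: IoU recovered from F1
```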
Ablation studies indicate:
| Model Variant | Speed (fps) | F1 (%) |
|---|---|---|
| Basic U-Net | 5 | 94 |
| U-Net + PrevPred | 3 | 95 |
| Pruned w/o PrevPred | 22 | 86 |
| Pruned + PrevPred (prod) | 20 | 92 |
Notably, adding PrevPred to the full U-Net improves F1 (94→95) but at a substantial speed cost (5→3 fps). Pruning boosts speed dramatically (5→22 fps) but lowers F1 to 86. The Pruned + PrevPred model restores much of the lost accuracy (F1 92) at only a modest computational overhead (22→20 fps). It therefore represents the best operational trade-off for production use (Doshi et al., 2019).
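The four rows of the ablation table make the trade-off reasoning explicit. The selection rule (highest F1 subject to a minimum frame rate) and the 15 fps threshold are assumptions for illustration. Converting frame rates to per-frame latency also shows the cost of PrevPred: about 4.5 ms per frame on the pruned network, compared with about 133 ms on the full U-Net.

```python
# (name, fps, F1 %) taken from the ablation table above.
VARIANTS = [
    ("Basic U-Net", 5, 94),
    ("U-Net + PrevPred", 3, 95),
    ("Pruned w/o PrevPred", 22, 86),
    ("Pruned + PrevPred (prod)", 20, 92),
]


def best_realtime_variant(variants, min_fps):
    """Most accurate variant meeting a throughput floor (assumed criterion)."""
    eligible = [v for v in variants if v[1] >= min_fps]
    if not eligible:
        return None
    return max(eligible, key=lambda v: v[2])


def added_latency_ms(fps_before, fps_after):
    """Extra milliseconds per frame when throughput drops between two rates."""
    return 1000.0 / fps_after - 1000.0 / fps_before


print(best_realtime_variant(VARIANTS, min_fps=15)[0])  # Pruned + PrevPred (prod)
print(round(added_latency_ms(22, 20), 1))  # 4.5 ms for PrevPred on the pruned model
print(round(added_latency_ms(5, 3), 1))    # 133.3 ms for PrevPred on the full U-Net
```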
6. Limitations and Future Directions
FireNet’s present limitations include susceptibility to false negatives along newly ignited perimeter edges in scenarios with rapid IR contrast changes or heavy smoke. Areas with low IR signal, such as smoldering ground under dense canopy, may be under-segmented. The PrevPred approach, while computationally efficient, does not guarantee full spatio-temporal smoothness, occasionally resulting in prediction flicker.
Proposed avenues for further development include the incorporation of attention mechanisms for ambiguous regions, evaluation of lightweight spatio-temporal processing layers (e.g., depth-wise 3D convolutions, temporal shift modules), multisensor data fusion (visible-light, multispectral, LiDAR), additional model compression, and quantization for embedded inference. The dataset may be open-sourced under a domain-expert-vetted license to catalyze broader community advancements.
7. Operational Impact and Significance
The deployment of FireNet has enabled semi-automated, expert-level fire perimeter segmentation in aerial disaster response operations, alleviating manual analyst labor while preserving accuracy. Its real-time capability and resource efficiency set a benchmark for follow-on research and underscore the practical value of focused architectural pruning and loss specialization in applied remote sensing for disaster management (Doshi et al., 2019).