PhysFire-WM: Unified Fire Spread Emulation Framework
- PhysFire-WM is a unified framework integrating combustion physics and machine learning to emulate fine-grained fire spread with physical realism.
- It employs a three-module design—physical simulator, multimodal tokenizer, and diffusion transformer—to enforce spatiotemporal constraints on fire propagation.
- Empirical evaluations show improved fire mask accuracy and IR fidelity via cross-task collaborative training and structured physics-based priors.
The PhysFire-WM (Physics-Informed World Model) framework is a unified modeling system designed for fine-grained fire spread emulation. It integrates explicit combustion physics through simulator-driven priors with a diffusion-based transformer backbone, jointly leveraging multimodal data such as infrared imagery and fire masks. The approach combines advances from both physics-driven simulation and machine learning-based emulators, aiming to capture the inherently multi-scale and spatiotemporal dynamics of fire propagation with both physical realism and geometrically accurate boundary delineation (Zhou et al., 19 Dec 2025).
1. Core Architecture and Components
PhysFire-WM comprises three primary interconnected modules:
- Physical Simulator: Numerically integrates a fire-energy PDE system to generate a sequence of physically informed prior masks. Inputs include historical fire boundary masks and environmental data (terrain elevation, wind velocity field, and fuel maps), encapsulating both deterministic and stochastic influences on fire evolution.
- Multimodal Tokenizer: Fuses diverse modalities, including IR video frames, prior masks, control masks, and user prompts, into a unified spatiotemporal context token stream. It combines a pretrained variational autoencoder (VAE) for IR frame compression, a convolutional encoder for mask processing, and learned embeddings for additional controls and prompts.
- Diffusion Transformer: Employs a DiT (Diffusion Transformer) backbone that denoises latent tokens, conditioned on the multimodal token stream. Outputs include both predicted infrared (thermal) sequences and fire boundary masks.
A distinctive aspect is the method for integrating physics priors as both explicit (hard constraints) and implicit (feature-level) guidance within the generative pathway. The simulator output is coupled to the diffusion process via a Video Condition Unit (VCU), enabling hard enforcement of physically plausible fireline geometry and upwind propagation patterns during denoising.
2. Physical Simulations and Structured Priors
2.1 Governing PDE and Numerical Approximation
The simulator at the core of PhysFire-WM is governed by a thermal-energy balance PDE defined over the following quantities:
- a state variable that denotes the temperature or fire boundary indicator at a given spatial point and time
- the wind velocity, terrain height, and fuel availability fields
- a set of physical coefficients
- a temperature-dependent combustion reaction term
The combustion source is parametrized as a convex combination of historical temperature fields, that is, a weighted sum whose weights are non-negative and sum to one, to ensure tractability and differentiability.
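Written out under assumed notation, a convex combination of the $K$ most recent temperature fields would take the form below. The symbols $R$, $u$, $\alpha_k$, $K$, and $\Delta t$ are illustrative choices and are not taken from the source.

```latex
R(x, t) = \sum_{k=1}^{K} \alpha_k \, u(x,\, t - k\,\Delta t),
\qquad \alpha_k \ge 0, \qquad \sum_{k=1}^{K} \alpha_k = 1 .
```

Because the source term is linear in the stored fields and the weights live on a simplex, gradients with respect to the weights are straightforward to compute. One common way to keep learnable weights on the simplex is a softmax over unconstrained parameters; whether the source does this is not stated.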
The PDE is solved via finite-difference discretization, providing a sequence of prior masks that encode expected boundary evolution under physical constraints.
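As a rough illustration of how a finite-difference solver can turn such a PDE into prior masks, the sketch below integrates a generic 1D advection-diffusion-reaction equation with explicit Euler steps and thresholds the result. The equation form, the 1D grid, the function names, and every coefficient value are assumptions made for illustration; the source's actual PDE and discretization may differ.

```python
def step(u, wind, kappa, dx, dt, source):
    """One explicit Euler step of u_t = kappa*u_xx - wind*u_x + source (assumed form)."""
    new = u[:]  # boundary cells keep their values
    for i in range(1, len(u) - 1):
        diffusion = kappa * (u[i + 1] - 2.0 * u[i] + u[i - 1]) / dx ** 2
        # First-order upwind difference: sample the side the wind comes from.
        if wind >= 0:
            grad = (u[i] - u[i - 1]) / dx
        else:
            grad = (u[i + 1] - u[i]) / dx
        new[i] = u[i] + dt * (diffusion - wind * grad + source[i])
    return new


def convex_source(history, weights):
    """Convex combination of past fields; weights must be >= 0 and sum to 1."""
    assert all(w >= 0 for w in weights) and abs(sum(weights) - 1.0) < 1e-9
    n = len(history[0])
    return [sum(w * h[i] for w, h in zip(weights, history)) for i in range(n)]


def simulate_prior_masks(u0, steps, wind=1.0, kappa=0.1, dx=1.0, dt=0.2,
                         weights=(0.7, 0.3), gain=0.05, threshold=0.5):
    """Roll the solver forward and threshold each state into a binary prior mask."""
    history = [u0[:], u0[:]]  # most recent field first
    masks = []
    for _ in range(steps):
        src = [gain * s for s in convex_source(history, weights)]
        u_next = step(history[0], wind, kappa, dx, dt, src)
        history = [u_next, history[0]]
        masks.append([1 if v > threshold else 0 for v in u_next])
    return masks


# Hypothetical initial condition: two burning cells on a 20-cell line.
u0 = [0.0] * 20
u0[5] = u0[6] = 1.0
masks = simulate_prior_masks(u0, steps=10)
```

The default step sizes keep both the diffusion number (`kappa*dt/dx**2`) and the Courant number (`wind*dt/dx`) well below the usual explicit-scheme stability limits.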
2.2 Explicit–Implicit Conditioning
The simulator outputs are injected into the DiT as temporally stacked frames, distinguished by control masks. Real IR frames are labeled with "all-zero" masks (preserving content), while prior-mask frames are associated with "all-ones" masks (enforcing physical guidance). This dual pathway imposes both hard and soft constraints on the diffusion process, biasing sampling toward physically admissible fire front predictions.
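The labeling convention described above can be sketched as follows. This is a minimal illustration only: `build_condition` is a hypothetical helper, frames are represented as plain nested lists, and the real model operates on latent tensors rather than raw pixels.

```python
def build_condition(ir_frames, prior_masks):
    """Stack context frames along time and attach a per-frame control mask.

    Real IR frames get an all-zero control mask (content is preserved);
    simulator prior masks get an all-ones control mask (guidance is enforced).
    Each frame is an H x W nested list. Hypothetical helper, not the source's API.
    """
    frames, controls = [], []
    for frame in ir_frames:
        frames.append(frame)
        controls.append([[0] * len(row) for row in frame])
    for mask in prior_masks:
        frames.append(mask)
        controls.append([[1] * len(row) for row in mask])
    return frames, controls


# Two observed 2x2 IR frames followed by one simulator prior mask.
ir = [[[0.1, 0.2], [0.3, 0.4]], [[0.2, 0.3], [0.4, 0.5]]]
prior = [[[0, 1], [1, 1]]]
frames, controls = build_condition(ir, prior)
```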
3. Cross-task Collaborative Training (CC-Train)
The Cross-task Collaborative Training (CC-Train) strategy is central to PhysFire-WM's learning scheme. It addresses the informational sparsity of binary mask modeling (where gradients vanish in non-burning regions) by joint diffusion-based prediction of IR frames and fire masks. Key mechanisms include:
- Parameter Sharing: Both IR and mask prediction tasks share the encoder, tokenizer, and transformer backbone, with a LoRA (Low-Rank Adaptation) layer ensuring co-adaptation of representations.
- Gradient Coordination: The total loss aggregates a diffusion-based IR prediction loss and a binary cross-entropy mask loss:
$$\mathcal{L} = \mathcal{L}_{\mathrm{IR}} + \lambda\,\mathcal{L}_{\mathrm{mask}}$$
where $\mathcal{L}_{\mathrm{IR}}$ is a velocity-field loss and $\mathcal{L}_{\mathrm{mask}}$ is a pixelwise binary cross-entropy loss. The weighting parameter $\lambda$ balances thermal and geometric fidelity.
- Gradient Borrowing: The dense and global gradients from the IR task ameliorate the vanishing gradient problem in the mask prediction, driving shared representations toward features relevant for both thermal field reconstruction and spatial delineation, even in fire-absent zones.
Ablation results demonstrate that training the mask task alone significantly underperforms the coordinated setting, underscoring the efficacy of this cross-task signal sharing (Zhou et al., 19 Dec 2025).
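To make the loss aggregation concrete, here is a minimal sketch that combines a velocity-field loss with a pixelwise binary cross-entropy term through a weighting parameter (called `lam` here). It assumes the velocity-field loss is a mean squared error between predicted and target velocities and that the weight multiplies the mask term; the function names and the flattened-list inputs are illustrative, not the source's implementation.

```python
import math


def mse(pred, target):
    """Mean squared error, used here as a stand-in for the velocity-field loss."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def bce(probs, labels, eps=1e-7):
    """Pixelwise binary cross-entropy over flattened mask probabilities."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def cc_train_loss(v_pred, v_target, mask_probs, mask_labels, lam=1.0):
    """Assumed aggregation: IR loss plus lam times the mask loss."""
    return mse(v_pred, v_target) + lam * bce(mask_probs, mask_labels)


# Toy values: a near-perfect velocity prediction and a mostly empty mask.
loss = cc_train_loss([0.1, -0.2], [0.0, -0.2], [0.9, 0.1, 0.1], [1, 0, 0], lam=0.5)
```

In a mostly fire-free mask, the cross-entropy term contributes little signal per pixel, while the first term stays informative everywhere; this is the intuition behind the gradient-borrowing argument above.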
4. Multimodal Data Flow and Representation
PhysFire-WM is designed for heterogeneous and temporally aligned data ingestion:
- Inputs:
  - Infrared video, normalized and compressed via a pretrained VAE
  - Binary fire masks
  - Environmental maps, spatially matched to IR/mask resolution
  - Control masks and user prompts
- Context Token Fusion:
  - Convolutional encoders project prior masks to latent tensors matching the VAE output
  - Embeddings represent control masks and prompts
  - All features are concatenated and linearly projected, forming the context token sequence for repeated cross-attention in the DiT
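The concatenate-then-project step can be sketched as below. This is a toy illustration with plain lists: the per-token feature sizes, the broadcasting of a single prompt embedding to every token, and the helper names `linear` and `fuse_tokens` are all assumptions rather than details from the source.

```python
def linear(vec, weight, bias):
    """Apply y = W x + b, where weight is a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b for row, b in zip(weight, bias)]


def fuse_tokens(ir_latents, mask_latents, control_emb, prompt_emb, weight, bias):
    """Concatenate per-token features channel-wise, then project to the context width.

    The prompt embedding is shared across all tokens (an assumption).
    """
    tokens = []
    for ir, mk, ctl in zip(ir_latents, mask_latents, control_emb):
        features = ir + mk + ctl + prompt_emb  # list concatenation = channel concat
        tokens.append(linear(features, weight, bias))
    return tokens


# One token: 2 IR channels, 1 mask channel, 1 control channel, 1 prompt channel.
weight = [[1, 0, 0, 0, 0],
          [0, 0, 1, 0, 1]]
bias = [0.0, 0.5]
context = fuse_tokens([[1.0, 2.0]], [[3.0]], [[0.0]], [1.0], weight, bias)
```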
This architecture supports flexible conditioning, allowing the model to synthesize both physically consistent and data-driven spatiotemporal patterns.
5. Specialized Loss Functions and Training Schedule
The total loss for PhysFire-WM incorporates three terms:
- $\mathcal{L}_{\mathrm{IR}}$: DiT velocity-field loss for IR prediction
- $\mathcal{L}_{\mathrm{mask}}$: Binary cross-entropy loss for mask prediction
- $\mathcal{L}_{\mathrm{reg}}$: L2 regularization for convex weights in the combustion source parametrization
Training alternates mini-batches for each prediction task within each epoch, updating shared parameters at every step. The AdamW optimizer is used, with LoRA rank 128 facilitating efficient adaptation on multi-GPU setups.
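One way to picture the alternating schedule is the sketch below, which interleaves IR and mask mini-batches and calls an update function once per batch. The strict one-to-one interleaving, the handling of unequal batch counts, and the names `alternating_schedule` and `train_epoch` are assumptions; the source does not specify these details.

```python
from itertools import zip_longest


def alternating_schedule(ir_batches, mask_batches):
    """Yield (task, batch) pairs, alternating IR and mask mini-batches.

    If one task has more batches, its leftovers run at the end of the epoch.
    Batches themselves must not be None, since None marks an exhausted list.
    """
    for ir_batch, mask_batch in zip_longest(ir_batches, mask_batches):
        if ir_batch is not None:
            yield "ir", ir_batch
        if mask_batch is not None:
            yield "mask", mask_batch


def train_epoch(ir_batches, mask_batches, update):
    """Call update(task, batch) once per mini-batch and return the step count."""
    steps = 0
    for task, batch in alternating_schedule(ir_batches, mask_batches):
        update(task, batch)  # shared parameters would be updated here
        steps += 1
    return steps


# Record the order in which tasks are visited during one toy epoch.
log = []
n_steps = train_epoch(["ir0", "ir1"], ["m0", "m1", "m2"],
                      lambda task, batch: log.append((task, batch)))
```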
6. Empirical Evaluation and Performance
PhysFire-WM was validated on a drone-collected multimodal fire dataset (226 aligned IR + mask videos). Quantitative metrics for the single-region 17-in/17-out prediction benchmark showed:
| Metric | PhysFire-WM Value | Relative Δ vs. best prior |
|---|---|---|
| Mask AUPRC | 0.89 | ↑6.8% |
| Mask IoU | 0.89 | ↑15.1% |
| IR PSNR [dB] | 23.62 | ↑3.7% |
| IR SSIM | 0.80 | ↑7.1% |
| LPIPS | 0.09 | ↓27.4% |
| FVD | 0.001 | ↓83.3% |
For cross-region generalization on unseen areas, Mask IoU was 0.81, IR PSNR reached 23.26 dB, and FVD was 0.00. Key findings from ablations include that removing the physical prior degrades IR PSNR to 22.76 dB and AUPRC to 0.82; using mask-only training (with prior) yields AUPRC 0.85, while CC-Train raises it to 0.89. Visualizations indicate that fire front shapes produced by PhysFire-WM respect upwind propagation, energy diffusion, and elliptical spread patterns, outperforming purely data-driven or physics-agnostic models (Zhou et al., 19 Dec 2025).
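For readers unfamiliar with two of the reported metrics, Mask IoU and PSNR follow standard definitions that can be sketched as below. The flattened-list inputs, the convention that two empty masks score 1.0, and the assumption that IR intensities lie in [0, 1] are illustrative choices; the source's exact evaluation code is not given.

```python
import math


def mask_iou(pred, target):
    """Intersection over union of two flattened binary masks."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0  # both empty: treated as a perfect match


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB for flattened frames (assumed range [0, max_value])."""
    err = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if err == 0:
        return math.inf
    return 10.0 * math.log10(max_value ** 2 / err)


# Toy example: one of three marked pixels overlaps; a uniform 0.1 intensity error.
iou = mask_iou([1, 1, 0, 0], [1, 0, 1, 0])
quality = psnr([0.0, 0.0], [0.1, 0.1])
```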
7. Relation to High-resolution Coupled Fire-Atmosphere Models
The design of PhysFire-WM is informed by principles established in coupled atmosphere–wildland fire models such as WRF-Fire (Mandel et al., 2011), which use level-set methods for fireline evolution, semi-empirical spread-rate closure, and explicit physical coupling between fire surface fluxes and atmospheric state. Both systems share core modeling primitives: level-set boundary representations, tile-callable parallelism, heterogeneous-fuel support, and explicit–implicit numerical schemes. However, PhysFire-WM introduces a differentiable, data-driven emulator framework capable of ingesting complex observational modalities and learning cross-domain correlations, while remaining anchored in explicit physical law through its simulator prior component.
A plausible implication is that PhysFire-WM provides a template for integrating simulation-based priors within generative ML architectures for other spatiotemporal dynamical systems, supporting high-resolution forecasting under explicit physical constraints.