FireScope Wildfire Risk Mapping
- FireScope is a visual language model-based framework that fuses satellite imagery, climate data, expert risk rasters, and causal reasoning to produce high-resolution wildfire risk maps.
- It employs a two-stage pipeline—using a chain-of-thought oracle and a FiLM-conditioned vision encoder–decoder—to address multimodal causality and cross-continental generalization challenges.
- Quantitative results validate improved out-of-distribution performance and enhanced interpretability, benchmarked via the comprehensive FireScope-Bench dataset.
FireScope is a vision-language model (VLM)-based framework for high-resolution wildfire risk prediction that integrates satellite imagery, climate data, expert-derived risk rasters, and natural language causal reasoning traces to produce interpretable and generalizable spatial risk maps. By leveraging a chain-of-thought (CoT) reasoning approach, the system addresses major challenges in wildfire risk mapping, including the integration of multimodal causal factors, transferability across continental domains, and transparent model interpretability. FireScope-Bench, the accompanying dataset, supports systematic evaluation and benchmarking for multimodal, reasoning-driven wildfire risk models (Markov et al., 21 Nov 2025).
1. Objectives and Key Challenges
The primary objective of FireScope is to predict continuous, high-resolution (30 m/pixel) wildfire risk rasters—quantifying the expected structural damage due to fire—by fusing satellite visual data, climatic variables, and explicit causal reasoning. Wildfire risk prediction presents unique technical hurdles:
- Multimodal causality: Risk is jointly determined by visual cues (vegetation, topography) and non-visual, non-local climatic factors (temperature, precipitation, humidity, wind).
- Generalization: Empirical risk models trained in a specific region may exhibit significant domain shift when deployed cross-continentally, e.g., from the USA to Europe, due to spatial heterogeneity in biomes and fire regimes.
- Interpretability: Standard deep learning methods act as opaque function approximators, lacking mechanisms for experts to understand or interrogate predictive decisions (Markov et al., 21 Nov 2025).
2. FireScope-Bench Dataset
FireScope-Bench is a large-scale, multimodal benchmark designed to evaluate reasoning-intensive wildfire risk modeling:
| Modality | USA Configuration | Europe OOD Configuration |
|---|---|---|
| Risk rasters | USFS "Risk to Potential Structures"; 30 m/pixel, 341×341 tiles, quintile-normalized [0,1] | — |
| Imagery | Sentinel-2, L2A, 2021, 10 m/pixel; 1024×1024 per tile | Control: 2021; Events: year prior to fire |
| Climate data | NASA POWER climatology (monthly, 50 km), 60-dim features | Same aggregation as USA |
| Wildfire masks | — | 3,000 EFFIS burned areas (>5 km²), 2,000 controls (2018–2025) |
Evaluation spans in-distribution (ID) metrics (MSE, MAE, SSIM), OOD event discrimination (Brier, ROC AUC, ECE), pixel segmentation (ROC AUC, IoU@0.5), and Oracle classification (QWK). The US portion consists of 40,000 train, 4,000 validation, and 4,000 test samples (“large” split), plus prototyping subsets. The European test set allows systematic cross-continental generalization studies (Markov et al., 21 Nov 2025).
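The quintile normalization applied to the risk rasters (mapping raw values into [0,1]) can be sketched as follows. This is a minimal NumPy sketch; the exact binning procedure used by FireScope-Bench is an assumption, illustrated here as five equal-mass quantile bins mapped onto [0, 1].

```python
import numpy as np

def quintile_normalize(raster: np.ndarray) -> np.ndarray:
    """Assign each pixel to one of five equal-mass quantile bins,
    then map bin indices {0..4} onto the unit interval [0, 1]."""
    edges = np.quantile(raster, [0.2, 0.4, 0.6, 0.8])  # inner bin edges
    bins = np.digitize(raster, edges)                  # 0 .. 4 per pixel
    return bins / 4.0

# Mock 341x341 risk tile (the benchmark's tile size) with random values.
tile = np.random.default_rng(0).random((341, 341))
norm = quintile_normalize(tile)
```

Rank-based binning of this kind makes tiles from different regions directly comparable, since each tile's risk distribution is spread over the same five levels.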
3. Model Architecture and Training
FireScope utilizes a two-stage “reasoning → generation” pipeline:
- Chain-of-Thought Oracle
- Backbone: Qwen2.5-VL-7B-Instruct (7B parameters)
- Input: Concatenated climate features and image tokens
- Output: (a) a natural-language chain-of-thought (CoT) reasoning trace; (b) a discrete scalar risk score
- Supervision: fine-tuned via Group-Relative Policy Optimization (GRPO) to optimize a reward for accurate, well-formatted outputs
- FiLM-Conditioned Vision Encoder–Decoder
- Encoders: MiT-B5 SegFormer (frozen), AlphaEarth (frozen), U-Net (learnable)
- Conditioning: Scalar risk score injected through FiLM at each decoder block
- Loss: composite raster loss combining pixel-wise regression terms (MSE/MAE) with a structural-similarity (SSIM) term
- Spatial metrics: MSE, MAE, SSIM (matching the FireScope-Bench ID evaluation)
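FiLM conditioning injects the Oracle's scalar risk score into each decoder block as a per-channel affine transform of the feature map. A minimal NumPy sketch follows; the small MLP that produces the FiLM parameters, its sizes, and the activation are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def film_condition(features: np.ndarray, risk_score: float,
                   w1, b1, w2, b2) -> np.ndarray:
    """Modulate a (C, H, W) feature map with FiLM: gamma * F + beta.

    A small MLP maps the scalar risk score to per-channel gamma and
    beta vectors; in FireScope this happens at every decoder block.
    """
    h = np.tanh(w1 * risk_score + b1)   # hidden layer, shape (hidden,)
    params = w2 @ h + b2                # shape (2 * C,)
    C = features.shape[0]
    gamma, beta = params[:C], params[C:]
    return gamma[:, None, None] * features + beta[:, None, None]

# Mock weights and an 8-channel 32x32 decoder feature map.
C, hidden = 8, 16
w1, b1 = rng.normal(size=hidden), np.zeros(hidden)
w2, b2 = rng.normal(size=(2 * C, hidden)), np.zeros(2 * C)
feats = rng.normal(size=(C, 32, 32))
out = film_condition(feats, risk_score=0.6, w1=w1, b1=b1, w2=w2, b2=b2)
```

Because the conditioning signal is a single scalar, every spatial location receives the same modulation, which is the expressiveness bottleneck discussed in Section 6.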
Training protocols leverage AdamW, standard learning rates per encoder, cosine-annealing schedules, and mixed-precision hardware acceleration (bfloat16, NVIDIA H200). GRPO advantages are estimated by standardizing rewards within each group of $G$ sampled completions:

$$A_i = \frac{r_i - \operatorname{mean}\!\left(\{r_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{r_j\}_{j=1}^{G}\right)}$$

The GRPO policy target is the clipped surrogate

$$J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G} \min\!\big(\rho_i A_i,\ \operatorname{clip}(\rho_i,\, 1-\epsilon,\, 1+\epsilon)\, A_i\big)\right] - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right),$$

where $\rho_i = \pi_\theta(o_i \mid q)\, /\, \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$. Decoder training uses cross-entropy (Oracle, non-CoT) or the composite raster loss (vision module).
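The group-relative advantage and clipped surrogate can be sketched in a few lines of NumPy. This is a hedged sketch of the standard GRPO formulation; the reward and probability-ratio values are mock illustrations, not FireScope's actual reward.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """Group-relative advantages: standardize rewards within the
    group of G completions sampled for the same prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def clipped_surrogate(ratios: np.ndarray, advantages: np.ndarray,
                      eps: float = 0.2) -> float:
    """PPO-style clipped objective averaged over the group."""
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1 - eps, 1 + eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())

rewards = np.array([1.0, 0.0, 0.5, 1.0])  # mock per-completion rewards
adv = grpo_advantages(rewards)
ratios = np.array([1.1, 0.9, 1.0, 1.3])   # mock pi_theta / pi_theta_old
obj = clipped_surrogate(ratios, adv)
```

Standardizing within the group removes the need for a learned value baseline, which is what makes GRPO attractive for reward-tuning a 7B VLM.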
4. Quantitative Results
FireScope demonstrates that language-based chain-of-thought reasoning improves OOD generalization and interpretability in wildfire risk prediction. Salient results using the FireScope-Bench protocol:
- Oracle Coarse Classification:
- Qwen (CoT): OOD ROC = 0.748, ID QWK = 0.766
- Qwen (no CoT): OOD ROC = 0.701, ID QWK = 0.751
- Climate MLP: OOD ROC = 0.524
- FWI index: OOD ROC = 0.551
- GPT-5: OOD ROC = 0.636
- Raster Generation, OOD (Europe):
- SegFormer baseline (image only): Brier = 0.222, ROC = 0.705
- + Climate: degrades OOD generalization
- + Oracle no-CoT: modest OOD improvement
- + Oracle CoT (FireScope): SegFormer+CoT, Brier = 0.205, ROC = 0.727; U-Net+CoT, Brier = 0.191, ROC = 0.750 (best)
- Pixel segmentation: FireScope achieves pixel-level ROC AUC = 0.658 and IoU = 0.184
- Raster Generation, In-Distribution (USA):
- AlphaEarth+climate: MSE ≈ 0.025, SSIM ≈ 0.552, MAE ≈ 0.110
- FireScope (CoT): MSE ≈ 0.028, SSIM ≈ 0.547, MAE ≈ 0.119
Notably, reasoning-driven risk conditioning does not degrade in-distribution (USA) performance, while yielding clear OOD improvements with increasing geographic/temporal divergence from the training set (Markov et al., 21 Nov 2025).
5. Qualitative Analysis and Model Interpretability
FireScope’s chain-of-thought conditioning yields verifiably interpretable reasoning traces:
- CoT Example Analysis: Reasoning traces systematically downweight visually correlated but causally irrelevant features (e.g., shadowed areas during low humidity) and surface physically meaningful climatic determinants.
- Expert Evaluation (blind, 50 areas): Given only Oracle CoT summaries, human experts produced risk assignments agreeing at QWK = 0.33 and 0.11, compared with QWK = 0.50 and 0.59 when given “golden” CoTs, indicating that the traces encode human-interpretable decision logic.
- Automated Faithfulness: Fidelity testing (intentional CoT perturbation) yields an average predicted-risk shift of 33% toward the perturbed level; paraphrase consistency is 91% (paraphrasing the CoT yields nearly identical raster outputs).
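Quadratic weighted kappa (QWK), the agreement statistic used for the Oracle classification and expert-evaluation scores above, can be computed as follows. A NumPy sketch with mock five-level risk ratings; it is a generic QWK implementation, not code from the paper.

```python
import numpy as np

def quadratic_weighted_kappa(a, b, n_classes: int = 5) -> float:
    """Agreement between two raters on ordinal labels, penalizing
    disagreements by squared class distance (1 = perfect,
    0 = chance, negative = worse than chance)."""
    a, b = np.asarray(a), np.asarray(b)
    observed = np.zeros((n_classes, n_classes))
    for i, j in zip(a, b):
        observed[i, j] += 1
    # Expected co-occurrence under independent marginals.
    expected = np.outer(np.bincount(a, minlength=n_classes),
                        np.bincount(b, minlength=n_classes)) / len(a)
    idx = np.arange(n_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (n_classes - 1) ** 2
    return float(1.0 - (weights * observed).sum() / (weights * expected).sum())

r1 = [0, 1, 2, 3, 4, 2, 1]   # mock expert ratings (quintile classes)
r2 = [0, 1, 2, 3, 4, 2, 1]   # mock Oracle-derived ratings
```

The quadratic weighting makes QWK well suited to ordinal risk levels: confusing adjacent quintiles costs far less than confusing the extremes.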
These results substantiate that natural-language reasoning traces in FireScope carry significant, causally meaningful, and human-auditable signal not present in standard black-box raster prediction models.
6. Generalization Properties, Limitations, and Practical Deployment
FireScope’s generalization improves as spatial and temporal OOD distance from the training set increases (highest accuracy at 60° N and for older European fire events), a plausible consequence of explicit causal reasoning guiding risk assessment. Pure climate features and conventional indices (FWI) perform substantially worse than VLM-based CoT models. However, FireScope’s scalar risk conditioning bottleneck restricts the expressiveness of spatially varying reasoning; further enhancements could involve token-level or region-aware CoT embeddings.
For deployment, the framework’s inference overhead remains compatible with modern GPU/TPU infrastructure, and the transparency of CoT generation enables real-time trust monitoring for expert-in-the-loop systems. Systematic auditing is further facilitated by the pairing of pixel-level risk rasters and corresponding reasoning summaries, supporting both operational fire management and regulatory oversight.
Ethical and societal implications center on the enhanced auditability and explainability of risk models, enabling rapid evaluation by fire-modeling agencies and first responders. FireScope-Bench is positioned as an open, extensible resource for future advancement of interpretable, robust, and generalizable spatial risk modeling.