FireScope-Bench: Wildfire Risk Benchmark
- FireScope-Bench is a large-scale, multimodal dataset and benchmark that integrates high-resolution Sentinel-2 imagery, climate normals, and expert risk rasters for wildfire prediction.
- It supports rigorous model evaluation using metrics like MSE, MAE, SSIM, ROC AUC, and IoU in both in-distribution and out-of-distribution settings.
- The framework’s chain-of-thought vision-language model enables interpretable spatial predictions, enhancing generalization and causal reasoning in risk assessment.
FireScope-Bench is a large-scale, multimodal dataset and benchmark for high-resolution, reasoning-intensive wildfire risk prediction. It combines Sentinel-2 satellite imagery, coarse-resolution climate normals, and expert-defined continuous wildfire risk rasters across the continental United States, supplemented by actual wildfire events and matched control tiles from Europe to enable systematic evaluation of cross-regional generalization. FireScope-Bench underpins the FireScope framework, which pairs a chain-of-thought (CoT) vision-language model (VLM) Oracle with a lightweight vision encoder–decoder, enabling interpretable, causally grounded spatial prediction at the raster level (Markov et al., 21 Nov 2025).
1. Dataset Composition and Preprocessing
FireScope-Bench is constructed to maximize spatial, temporal, and multimodal diversity:
- Spatial Coverage and Tiling:
- United States: The benchmark covers 5.7 million km², divided into 50,000 geographically stratified, non-overlapping tiles (341×341 pixels, ≈100 km² each, 30 m/pixel). The splits are 40,000 training, 4,000 validation, and 4,000 test tiles.
- Europe: 3,000 tiles recording actual wildfire events (2018–2025) and 2,000 control tiles sampled per country, used exclusively for out-of-distribution testing.
- Sentinel-2 Imagery:
- Imagery is provided at 10 m/pixel, Level-2A (bottom-of-atmosphere reflectance).
- US risk rasters and European controls reflect the summer (June 22–September 22) of 2021, while European fire events use imagery from the summer preceding each event.
- Preprocessing includes cloud masking and construction of a pixel-wise median mosaic, followed by band-wise z-score normalization:

  $$\tilde{x}_{p,b} = \frac{x_{p,b} - \mu_b}{\sigma_b},$$

  where $x_{p,b}$ is the median seasonal reflectance for pixel $p$ and band $b$, and $\mu_b$, $\sigma_b$ are the per-band mean and standard deviation.
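The band-wise z-score step can be sketched in NumPy as follows; the function name and the small epsilon guard are illustrative assumptions, and the paper's exact statistics pipeline (e.g. whether statistics come from the training split) may differ:

```python
import numpy as np

def normalize_bands(mosaic, band_mean=None, band_std=None):
    """Band-wise z-score normalization of a median mosaic.

    mosaic: array of shape (bands, H, W) holding the per-pixel median
    seasonal reflectance. Statistics are computed per band over all
    pixels unless precomputed (e.g. training-set) values are supplied.
    """
    if band_mean is None:
        band_mean = mosaic.mean(axis=(1, 2), keepdims=True)
    if band_std is None:
        band_std = mosaic.std(axis=(1, 2), keepdims=True)
    # Epsilon avoids division by zero on constant bands (illustrative choice)
    return (mosaic - band_mean) / (band_std + 1e-8)

# Toy 4-band, 341x341 tile
tile = np.random.rand(4, 341, 341).astype(np.float32)
z = normalize_bands(tile)
```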
Climate Normals (NASA POWER):
- Variables: near-surface temperature, precipitation, relative humidity, wind speed, and wind direction, aggregated monthly over 12 months, yielding a 60-dimensional vector (5 variables × 12 months) per tile.
- Spatial resolution is 50 km; each climate vector is interpolated to the tile centroid.
- Climate features are independently standardized per feature $j$:

  $$\tilde{c}_j = \frac{c_j - \mu_j}{\sigma_j}.$$
Expert-Defined Risk Raster:
- Source: Wildfire Risk to Communities project, providing continuous values for expected consequence to built structures.
- The raster has a native resolution of 30 m/pixel and is quintile-transformed for even distribution of risk values.
- No smoothing is applied, preserving high-frequency spatial detail.
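One plausible reading of the quintile transform is an equal-mass rank binning, sketched below in NumPy; the exact binning and output scaling used by the project are not specified here, so the `n_bins` mapping to $\{0, 0.25, \ldots, 1\}$ is an assumption for illustration:

```python
import numpy as np

def quintile_transform(risk, n_bins=5):
    """Map continuous risk values to equal-mass quantile bins in [0, 1].

    Ranks every pixel by its empirical CDF position, then assigns it to
    one of n_bins equally populated bins, evening out a skewed risk
    distribution while preserving ordering.
    """
    flat = risk.ravel()
    order = flat.argsort().argsort()                 # rank of each pixel
    cdf = (order + 0.5) / flat.size                  # empirical CDF in (0, 1)
    binned = np.floor(cdf * n_bins) / (n_bins - 1)   # 0, 0.25, ..., 1.0
    return binned.reshape(risk.shape).clip(0.0, 1.0)

raster = np.random.lognormal(size=(341, 341))  # skewed toy raster
q = quintile_transform(raster)
```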
- Input Tensor Construction:
  Normalized Sentinel imagery bands $\tilde{S}_p$ and the broadcast climate vector $\tilde{c}$ are concatenated per pixel:

  $$X_p = [\,\tilde{S}_p \,;\, \tilde{c}\,],$$

  resulting in $X \in \mathbb{R}^{(B + 60)\times H \times W}$, where $B$ is the number of Sentinel-2 bands and $H = W = 341$.
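The per-pixel concatenation above amounts to broadcasting the tile-level climate vector across the spatial grid and stacking it under the imagery channels, as in this sketch (the band count of 4 in the example is a toy assumption):

```python
import numpy as np

def build_input(bands_z, climate_z):
    """Concatenate normalized imagery with a broadcast climate vector.

    bands_z:   (B, H, W) z-scored Sentinel-2 bands
    climate_z: (60,) standardized climate features for the tile
    returns:   (B + 60, H, W) per-pixel input tensor
    """
    _, H, W = bands_z.shape
    # Replicate the tile-level climate vector at every pixel location
    clim = np.broadcast_to(climate_z[:, None, None], (climate_z.size, H, W))
    return np.concatenate([bands_z, clim], axis=0)

x = build_input(np.zeros((4, 341, 341)), np.zeros(60))
```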
2. Benchmark Protocol and Evaluation
Benchmarking in FireScope-Bench is structured to assess both in-distribution performance (USA) and cross-continental generalization (Europe):
- Data Splits:
- USA: 40,000 train, 4,000 val, 4,000 test.
- Quick-experiment subset: 1,000 train, 100 val, 100 test tiles.
- Europe: 3,000 wildfire event tiles, 2,000 control tiles.
- Evaluation Metrics:
- In-Distribution (Continuous Raster):
- Mean Squared Error (MSE): $\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}(\hat{y}_i - y_i)^2$
- Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N}\lvert \hat{y}_i - y_i \rvert$
- Structural Similarity Index (SSIM) [Wang et al. 2004], computed with a Gaussian window.
- Out-of-Distribution (Wildfire Events):
- Brier Score: $\mathrm{BS} = \frac{1}{N}\sum_{i=1}^{N}(p_i - o_i)^2$, where $p_i$ is the predicted fire probability and $o_i \in \{0, 1\}$ the observed outcome.
- ROC AUC: area under the receiver operating characteristic curve for fire vs. control discrimination.
- Expected Calibration Error (ECE), using 15 bins.
- Intersection over Union (IoU) for burned-area masks.
- Ordinal Area-level Risk: Quadratic Weighted Kappa (QWK) over 10 bins.
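For concreteness, the scalar metrics above can be written as small NumPy helpers; these are generic textbook implementations (equal-width ECE bins, hard binary IoU), not the benchmark's reference code:

```python
import numpy as np

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

def mae(pred, target):
    return float(np.mean(np.abs(pred - target)))

def brier(prob, outcome):
    # prob: predicted fire probability; outcome: 1 = fire, 0 = control
    return float(np.mean((prob - outcome) ** 2))

def iou(pred_mask, true_mask):
    # Intersection over Union for binary burned-area masks
    inter = np.logical_and(pred_mask, true_mask).sum()
    union = np.logical_or(pred_mask, true_mask).sum()
    return float(inter / union) if union else 0.0

def ece(prob, outcome, n_bins=15):
    # Expected Calibration Error over equal-width confidence bins
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    total = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        m = (prob >= lo) & ((prob < hi) if hi < 1.0 else (prob <= hi))
        if m.any():
            total += m.mean() * abs(prob[m].mean() - outcome[m].mean())
    return float(total)
```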
- Baselines:
Each vision backbone — U-Net, SegFormer MiT-B5, AlphaEarth embedding — is tested under four conditioning regimes:
  1. image only;
  2. image + raw climate vector;
  3. image + Oracle scalar (VLM pre-CoT);
  4. image + CoT Oracle.
3. FireScope Model Framework and Chain-of-Thought Oracle
Built on FireScope-Bench, the FireScope model pioneers explicit reasoning in spatial prediction:
- Model Architecture:
- Stage 1: a Chain-of-Thought Oracle (VLM), fine-tuned by Group-Relative Policy Optimization (GRPO RL), outputs a stepwise reasoning trace $\tau$ and a scalar risk estimate $r$:

  $$(\tau, r) = \mathrm{Oracle}(\text{image}, \text{climate}).$$

- Stage 2: a Vision Encoder–Decoder $f_\theta$, conditioned via FiLM blocks on $r$, generates the risk raster $\hat{Y}$:

  $$\hat{Y} = f_\theta(X, r).$$
Oracle Fine-Tuning (GRPO):
- The reward combines a classification term ($R_{\mathrm{cls}}$) and an output-formatting term ($R_{\mathrm{fmt}}$):

  $$R = R_{\mathrm{cls}} + R_{\mathrm{fmt}}.$$

- The GRPO objective uses PPO-style clipping and a KL regularizer:

  $$J(\theta) = \mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\big(\rho_i A_i,\ \mathrm{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,A_i\big)\right] - \beta\, D_{\mathrm{KL}}\big(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big),$$

  where $\rho_i = \pi_\theta(o_i \mid q) / \pi_{\theta_{\mathrm{old}}}(o_i \mid q)$ and $A_i = \big(R_i - \mathrm{mean}(R_{1..G})\big) / \mathrm{std}(R_{1..G})$ is the group-relative advantage.
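The core GRPO mechanics (group-standardized advantages and the clipped surrogate, with the KL term omitted for brevity) can be sketched as follows; `eps=0.2` is a conventional PPO default assumed here, not a value from the paper:

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: standardize rewards within one group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + 1e-8)

def clipped_objective(log_ratio, advantages, eps=0.2):
    """PPO-style clipped surrogate over a group of sampled responses.

    log_ratio: per-sample log(pi_theta / pi_old); the KL regularizer
    against the reference policy is omitted in this sketch.
    """
    rho = np.exp(log_ratio)
    unclipped = rho * advantages
    clipped = np.clip(rho, 1 - eps, 1 + eps) * advantages
    return float(np.minimum(unclipped, clipped).mean())
```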
Encoder–Decoder Training:
- The loss combines Smooth-L1, SSIM, and gradient-matching terms:

  $$\mathcal{L} = \mathcal{L}_{\mathrm{SmoothL1}} + \lambda_s\,\big(1 - \mathrm{SSIM}(\hat{Y}, Y)\big) + \lambda_g\,\mathcal{L}_{\mathrm{grad}}.$$

- FiLM conditioning is applied per block, using the Oracle output $r$ to produce the affine parameters: $h' = \gamma(r)\odot h + \beta(r)$.
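A minimal sketch of the FiLM transform and the Smooth-L1 term, assuming standard definitions (per-channel affine modulation; Huber loss with threshold `delta`), with `gamma`/`beta` standing in for the small networks that map the Oracle output to modulation parameters:

```python
import numpy as np

def film(h, gamma, beta):
    """Feature-wise Linear Modulation: per-channel affine transform.

    h:           (C, H, W) feature map inside an encoder-decoder block
    gamma, beta: (C,) parameters predicted from the Oracle's risk output
    """
    return gamma[:, None, None] * h + beta[:, None, None]

def smooth_l1(pred, target, delta=1.0):
    """Smooth-L1 (Huber) term of the raster loss."""
    d = np.abs(pred - target)
    return float(np.mean(np.where(d < delta, 0.5 * d**2 / delta, d - 0.5 * delta)))
```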
- Training Schedule:
- SegFormer: learning rate , 500 epochs
- U-Net/AlphaEarth: learning rate , 1000 epochs.
4. Generalization and Model Interpretability
FireScope-Bench facilitates robust assessment of spatial generalization and interpretability:
- Cross-Continental Generalization:
- When trained in the USA and tested in Europe, SegFormer conditioned on the CoT Oracle achieves a Brier score of ≈ 0.205 vs. 0.222 for image-only, ROC AUC of ≈ 0.727 vs. 0.705, and IoU@0.5 of ≈ 0.184 vs. 0.179. Typical pixel-level ROC AUC gains are ≈ 0.04 and IoU gains ≈ 0.01, while in-distribution accuracy (MSE/MAE/SSIM) remains within ±5% of the best baseline values.
- Reasoning Trace Examples:
- Oracle traces follow causal logic, identifying vegetation density, dryness, humidity, wind, and topography and delivering a stepwise conclusion (e.g., “FINAL ANSWER: 7”).
- Conditioning vision models on the CoT Oracle results in pixel-level risk predictions that align with expert expectations, notably on slopes and ridges.
- Interpretability Metrics:
- Fidelity: Artificial CoT perturbations shift raster predictions ≈ 33% toward the manipulated extreme.
- Consistency: Paraphrased CoTs yield highly similar outputs (consistency ≈ 0.91).
- Expert Study: Domain experts using CoT summaries reach QWK scores of 0.33 and 0.11, while GPT-5-synthesized CoT summaries reach QWK up to 0.59, indicating that the traces convey meaningful, improvable causal information.
5. Impact and Applications
FireScope-Bench enables systematic investigation of reasoning-driven spatial modeling for wildfire risk, supporting cross-continental generalization studies:
- Comparative Evaluation:
Provides rigorous baselines and metrics for multimodal raster prediction, benchmarking both standard and reasoning-enhanced approaches in variable conditions.
- Generalization:
Demonstrates that explicit language-based causal reasoning in VLMs provides a powerful prior, improving out-of-distribution performance without sacrificing spatial fidelity.
- Interpretability:
Facilitates quantification and analysis of reasoning traces, supporting evaluation of model fidelity and consistency with expert domain knowledge.
- Research Utility:
FireScope-Bench and the FireScope framework present the first empirical evidence that language-based reasoning enhances generalization for visual generation tasks in wildfire risk modeling and establish a foundation for developing interpretable, robust spatial models that integrate multimodal evidence (Markov et al., 21 Nov 2025).
6. Future Directions and Research Significance
The release of FireScope-Bench and the FireScope framework marks a foundational step for reasoning-driven spatial modeling:
- Extensibility:
The dataset and benchmark permit expansion to further climatic regions and hazard domains, supporting generalized research in causal, multimodal risk evaluation.
- Methodological Advancement:
The paradigm of chain-of-thought VLM guidance for visual generation is applicable in broader geospatial contexts, including climate resilience, disaster prediction, and infrastructure planning.
- Interpretability Studies:
Results motivate future investigation of CoT design, expert feedback loops, and causal trace optimization to further close the gap between human-expert and model reasoning.
This suggests that FireScope-Bench will serve as a robust testbed and foundation for developing, rigorously evaluating, and interpreting generalizable, multimodal approaches to wildfire and other geospatial risk prediction.