SustainFM Benchmark: Geospatial AI for SDGs
- SustainFM Benchmark is a comprehensive framework for evaluating pre-trained geospatial models using accuracy metrics, energy efficiency, and societal impact assessments.
- It leverages diverse Earth Observation data and modular architectures with task-specific decoders to enable efficient fine-tuning and broad transferability.
- The evaluation protocol incorporates metrics like RMSE, mF1, and CO₂ emissions reporting, ensuring model performance aligns with the UN Sustainable Development Goals.
SustainFM Benchmark refers to a comprehensive benchmarking framework for evaluating geospatial foundation models (FMs) in the context of Earth Observation (EO) and their alignment with the United Nations Sustainable Development Goals (SDGs). It extends benchmarking beyond traditional accuracy to encompass real-world utility, societal impact, transferability, and sustainability metrics including energy and resource efficiency (Ghamisi et al., 30 May 2025).
1. Rationale and Scope
SustainFM addresses limitations in prior geospatial and sustainability-focused AI evaluation: disparate datasets, inconsistent metrics, and the lack of system-level sustainability criteria. The framework specifically targets SDG-centric tasks to determine how large, pre-trained geospatial models can be operationalized for societal benefit. It covers sixteen learning tasks, including asset wealth estimation, health indices, environmental hazard detection, and disaster mapping. Each task is mapped to the relevant SDGs and draws on EO data spanning over 200 countries and regions, which gives the benchmark broad geographic coverage and direct societal relevance.
2. Data Modalities and System Architecture
SustainFM is built on diverse EO data sources:
- Satellite imagery: Landsat, Sentinel-1/2, Google Earth, PlanetScope, Gaofen-2, VIIRS, etc. (mostly optical; Sentinel-1 is synthetic aperture radar).
- Multisensor fusion: Supports spatial resolutions ranging from sub-meter to 30 m.
- Survey and tabular data: E.g., Demographic and Health Surveys (DHS) for asset/health estimation.
The architecture is modular: a pre-trained FM encoder is paired with task-specific decoders (e.g., UperNet for segmentation, Siamese UperNet for change detection, fully-connected layers for regression). Fine-tuning freezes the encoder and updates only the decoder, which enables efficient adaptation with minimal labeled data. This design allows broad transfer across tasks, rapid convergence, and resource efficiency.
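The freeze-the-encoder recipe can be illustrated with a toy, dependency-free sketch. Everything in it is hypothetical: `frozen_encoder` stands in for a pre-trained FM encoder, the decoder is a single linear regression head, and the data are synthetic. It is not the SustainFM implementation, only a small instance of the idea that gradient updates touch the decoder alone.

```python
import random

random.seed(0)

# Hypothetical stand-in for a pre-trained encoder: fixed weights, never updated.
ENCODER_W = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]  # 3 inputs -> 2 features

def frozen_encoder(x):
    """Map a raw input vector to features using fixed (frozen) weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in ENCODER_W]

def decoder(features, weights, bias):
    """Task-specific regression head: a single linear layer."""
    return sum(w * f for w, f in zip(weights, features)) + bias

def train_decoder(samples, targets, lr=0.05, epochs=200):
    """Fit only the decoder parameters with SGD; the encoder is left untouched."""
    weights, bias = [0.0, 0.0], 0.0
    # Because the encoder is frozen, its outputs can be computed once and reused.
    feats = [frozen_encoder(x) for x in samples]
    for _ in range(epochs):
        for f, y in zip(feats, targets):
            err = decoder(f, weights, bias) - y
            weights = [w - lr * err * fi for w, fi in zip(weights, f)]
            bias -= lr * err
    return weights, bias

# Synthetic regression task whose target is linear in the encoder features.
samples = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(50)]
targets = [2.0 * f[0] - 1.0 * f[1] + 0.5 for f in map(frozen_encoder, samples)]

encoder_before = [row[:] for row in ENCODER_W]
weights, bias = train_decoder(samples, targets)
mse = sum((decoder(frozen_encoder(x), weights, bias) - y) ** 2
          for x, y in zip(samples, targets)) / len(samples)
```

Caching the encoder features before the training loop is one practical consequence of freezing: the expensive forward pass through the large model runs once per sample rather than once per epoch.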
3. Evaluation Dimensions and Metrics
SustainFM's multidimensional evaluation protocol is as follows:
- Accuracy: Task-dependent metrics, with root mean squared error (RMSE) for regression and mean F1 (mF1) for classification/segmentation.
- Generalization/transferability: Ability to adapt pre-trained FMs to new geographies and modalities in few-shot or zero-shot setups.
- Data efficiency: Performance with limited labeled data compared to full retraining.
- Energy efficiency: Training/inference energy consumption (kWh) and CO₂ emissions (kg), measured via CodeCarbon.
- Operational impact: Convergence speed, measured by epochs or wall-clock time to reach target accuracy.
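The RMSE for regression tasks is:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left(y_i - \hat{y}_i\right)^2}$$
and the mean F1 is:
$$\mathrm{mF1} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{F1}_c$$
where $N$ is the number of samples, $y_i$ and $\hat{y}_i$ are the true and predicted values, $\mathrm{F1}_c$ is the F1 score of class $c$, and $C$ is the number of classes.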
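Both accuracy metrics are simple to compute. The following is a minimal sketch in plain Python, assuming integer class labels numbered from 0 to one less than the number of classes, and unweighted (macro) averaging over classes. The function names are illustrative and are not taken from the benchmark's code.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error over paired regression targets and predictions."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)

def mean_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores (macro F1)."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn  # F1 = 2TP / (2TP + FP + FN)
        # Assumption: a class with no true or predicted samples scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes

regression_error = rmse([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
segmentation_score = mean_f1([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
```

In the second call, class 0 has an F1 of 2/3 and class 1 has an F1 of 4/5, so the mean is 11/15. How to score a class that is absent from both labels and predictions is a design choice; this sketch assigns it zero, whereas some toolkits skip such classes instead.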
4. Empirical Insights and Comparative Results
Foundation models are shown to deliver strong generalization and typically outperform traditional baselines such as ViT and ResNet-50 across SDG tasks. Their pretrained representations allow accurate adaptation to diverse domains (e.g., flood mapping, urban change, health assessment) with minimal data. Results indicate that decoder-only fine-tuning reduces training time and energy use, and therefore CO₂ emissions, without performance degradation in most cases. However, FMs are not universally superior; results depend on data modality, task difficulty, and model architecture.
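To make the energy comparison concrete, the arithmetic behind an emissions estimate can be sketched as energy consumed multiplied by the carbon intensity of the electricity grid. All numbers below are hypothetical placeholders chosen for illustration; they are not results from the benchmark, and measured values would come from a tracker such as CodeCarbon rather than from this calculation.

```python
def co2_kg(energy_kwh, grid_intensity_kg_per_kwh):
    """Estimate emissions as energy consumed times grid carbon intensity."""
    return energy_kwh * grid_intensity_kg_per_kwh

def relative_saving(baseline, candidate):
    """Fraction of the baseline saved by the candidate (0.25 means 25% less)."""
    return (baseline - candidate) / baseline

# Hypothetical numbers for illustration only; not results from the benchmark.
GRID_INTENSITY = 0.4      # kg CO2 per kWh, an assumed grid average
full_finetune_kwh = 10.0  # assumed energy for updating encoder and decoder
decoder_only_kwh = 4.0    # assumed energy for updating the decoder alone

full_co2 = co2_kg(full_finetune_kwh, GRID_INTENSITY)
decoder_co2 = co2_kg(decoder_only_kwh, GRID_INTENSITY)
saving = relative_saving(full_co2, decoder_co2)
print(f"full: {full_co2:.2f} kg, decoder-only: {decoder_co2:.2f} kg, saving: {saving:.0%}")
```

Because the grid intensity multiplies both runs, the relative saving depends only on the energy ratio, while the absolute kilograms saved scale with how carbon-intensive the local grid is.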
5. Sustainability, Impact, and Recommendations
SustainFM integrates sustainability and ethical principles:
- Reports energy and carbon footprint for all experiments, highlighting resource costs of large-scale models.
- Encourages impact-driven evaluation, shifting focus from pure accuracy to societal benefit and responsible model deployment.
- Advocates for transparent reporting of environmental costs and careful mitigation of dataset and model biases.
- Recommends domain- and physics-informed pretraining, and cross-disciplinary collaboration for actionable SDG outcomes.
Key application examples include:
- Poverty and health prediction (SDGs 1, 3, 4, 5): combining EO and DHS survey features.
- Flood and wildfire mapping (SDGs 13, 15): robust, fast disaster response with the Ombria and HLS datasets.
- Urban/industrial change and conflict zone mapping (SDGs 9, 11, 16): fine-grained detection (Google Earth, PlanetScope, GAZADeepDav).
6. Challenges and Future Research Directions
Open technical and societal questions include:
- Handling severe domain shift, especially across sensor types and geographies.
- Fairness and explainability in high-stakes decision contexts.
- Sustaining benchmark relevance as new EO and multimodal data become available.
- Integrating additional modalities (e.g., social media, Internet-of-Things sensors).
- Designing methods that align resource efficiency with real-world impact (e.g., scaling, lifelong/online learning for evolving EO data).
7. Significance in Responsible AI and SDG Research
SustainFM catalyzes a new standard for benchmarking geospatial AI against SDGs, foregrounding resource efficiency, societal alignment, and environmental accountability. By complementing accuracy metrics with transferability and sustainability evaluations, it provides a framework for the development, deployment, and dissemination of AI methods that can directly support global sustainable development efforts (Ghamisi et al., 30 May 2025).