Earth Embeddings Reveal Diverse Urban Signals from Space

Published 3 Apr 2026 in cs.LG and cs.CY | (2604.03456v1)

Abstract: Conventional urban indicators derived from censuses, surveys, and administrative records are often costly, spatially inconsistent, and slow to update. Recent geospatial foundation models enable Earth embeddings, compact satellite image representations transferable across downstream tasks, but their utility for neighborhood-scale urban monitoring remains unclear. Here, we benchmark three Earth embedding families, AlphaEarth, Prithvi, and Clay, for urban signal prediction across six U.S. metropolitan areas from 2020 to 2023. Using a unified supervised-learning framework, we predict 14 neighborhood-level indicators spanning crime, income, health, and travel behavior, and evaluate performance under four settings: global, city-wise, year-wise, and city-year. Results show that Earth embeddings capture substantial urban variation, with the highest predictive skill for outcomes more directly tied to built-environment structure, including chronic health burdens and dominant commuting modes. By contrast, indicators shaped more strongly by fine-scale behavior and local policy, such as cycling, remain difficult to infer. Predictive performance varies markedly across cities but remains comparatively stable across years, indicating strong spatial heterogeneity alongside temporal robustness. Exploratory analysis suggests that cross-city variation in predictive performance is associated with urban form in task-specific ways. Controlled dimensionality experiments show that representation efficiency is critical: compact 64-dimensional AlphaEarth embeddings remain more informative than 64-dimensional reductions of Prithvi and Clay. This study establishes a benchmark for evaluating Earth embeddings in urban remote sensing and demonstrates their potential as scalable, low-cost features for SDG-aligned neighborhood-scale urban monitoring.

Authors (9)

Summary

The paper demonstrates that Earth embeddings, especially AlphaEarth, consistently achieve higher R² scores across various urban indicators, highlighting superior information density and scalability.
The methodology employs a unified supervised learning pipeline on datasets from six US metropolitan areas to reveal spatial heterogeneity in domain-specific urban signal predictability.
The study underscores the need for city-specific calibration and fusion strategies to enhance operational reliability and transferability of geospatial foundation models.

Authoritative Summary of "Earth Embeddings Reveal Diverse Urban Signals from Space" (2604.03456)

Context and Motivation

Urban governance increasingly demands high-frequency, comparable neighborhood-scale data for actionable monitoring aligned with the SDGs. Traditional sources—censuses, surveys, administrative records—suffer from latency, inconsistencies, and fragmentation, impeding timely responses. Earth Observation (EO) data is a promising alternative, yet prior EO-based urban proxies relied on labor-intensive, task-specific feature engineering with poor spatial transferability. Geospatial foundation models such as AlphaEarth, Prithvi, and Clay represent a paradigm shift, producing general-purpose "Earth embeddings" from globally distributed satellite imagery. However, their utility for capturing latent human-centric urban signals at neighborhood granularity is largely unknown.

Benchmarking Earth Embedding Families

This study systematically benchmarks AlphaEarth, Prithvi-EO-2.0, and Clay embeddings for neighborhood-scale urban signal prediction across six U.S. metropolitan statistical areas (MSAs: Atlanta, Chicago, Houston, Los Angeles, New York, and Seattle) over four years (2020–2023). Four domains are considered: crime, income, health, and travel behavior. Fourteen indicators represent these domains, including obesity, diabetes, mental/physical health, median income, violent/petty crime, and commuting mode shares.

Figure 1: Study area, domain coverage, population and land area, and conceptual framework for embedding-based urban signal prediction.

A unified supervised learning pipeline evaluates each embedding family under global, city-wise, year-wise, and city-year settings, using OLS, Random Forest, XGBoost, and LightGBM. Models are compared by test $R^2$ on held-out data.
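As a rough illustration of this evaluation protocol, the sketch below computes $R^2$ and partitions samples under the four settings. The record fields (`city`, `year`, `y`), the function names, and the toy records are assumptions made for illustration, not the paper's code. In the study, each partition would be used to fit and test OLS, Random Forest, XGBoost, or LightGBM models.

```python
from collections import defaultdict


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Each setting is defined by the key used to partition the samples.
SETTINGS = {
    "global": lambda r: "all",
    "city-wise": lambda r: r["city"],
    "year-wise": lambda r: r["year"],
    "city-year": lambda r: (r["city"], r["year"]),
}


def group_records(records, setting):
    """Partition records according to one of the four evaluation settings."""
    key = SETTINGS[setting]
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    return dict(groups)


# Hypothetical toy records; real inputs would also carry embedding features.
records = [
    {"city": "Atlanta", "year": 2020, "y": 1.0},
    {"city": "Atlanta", "year": 2021, "y": 2.0},
    {"city": "Seattle", "year": 2020, "y": 3.0},
    {"city": "Seattle", "year": 2021, "y": 4.0},
]

for name in SETTINGS:
    print(name, len(group_records(records, name)))
```

A perfect prediction gives $R^2 = 1$, and always predicting the mean of the test targets gives $R^2 = 0$, which is why negative values are possible for models that do worse than the mean.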

Urban Signal Predictability and Domain Dependence

Earth embeddings recover a substantial share of urban variation. AlphaEarth consistently achieves the highest $R^2$ across domains, followed by Prithvi and then Clay. Notable results:

Health indicators: Best predicted, e.g., % Obesity ($R^2=0.69$ AlphaEarth, $0.67$ Prithvi), % Inactivity ($0.63$ AlphaEarth).
Travel modes: Car and transit shares are highly predictable (% Drive Alone $0.74$ AlphaEarth), whereas cycling (% Bike) is poorly predicted.
Income and crime: Moderate $R^2$ (e.g., log-income with AlphaEarth); violent crime is more predictable than petty crime.

Domain-level aggregation reinforces AlphaEarth’s superiority: mean $R^2$ of 0.59 (health), 0.48 (travel), 0.44 (income), and 0.42 (crime). Prithvi trails in all domains, and Clay lags especially for income.

Figure 2: Comparative predictive performance ($R^2$) of AlphaEarth, Prithvi, and Clay embeddings for 14 urban indicators and domain aggregates.

Upper-tail performance (the city-year $R^2$ distribution) favors AlphaEarth, with frequent high $R^2$ values, particularly in health and travel.

Cross-City Heterogeneity and Urban Form Modulation

Predictive performance varies markedly across MSAs. Atlanta, Seattle, Chicago, and Los Angeles are “easier” cities, with higher AlphaEarth $R^2$ for health and income; Houston and New York are “harder,” especially for travel in Houston. In some domains, Clay equals or outperforms AlphaEarth (e.g., crime in Houston), reflecting embedding complementarity.

Hierarchical clustering reveals two city clusters: one (Seattle, Atlanta, Chicago) with uniformly high predictability; another (Houston, Los Angeles, New York) with weaker performance in certain domains.
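The clustering step can be sketched as agglomerative clustering over per-city vectors of domain-level $R^2$. The summary does not state the linkage or distance used, so average linkage with Euclidean distance is an assumption here, and the city labels `A` to `D` and their scores are hypothetical values rather than results from the paper.

```python
import math


def euclid(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def agglomerative(points, n_clusters):
    """Average-linkage agglomerative clustering.

    points: dict mapping a label to its feature vector.
    Returns a list of sets of labels.
    """
    clusters = [{name} for name in points]

    def linkage(c1, c2):
        dists = [euclid(points[a], points[b]) for a in c1 for b in c2]
        return sum(dists) / len(dists)

    while len(clusters) > n_clusters:
        # Find the closest pair of clusters and merge them.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] | clusters[j]
        del clusters[j]
    return clusters


# Hypothetical per-city R^2 vectors ordered as (health, travel, income, crime).
city_scores = {
    "A": [0.60, 0.50, 0.50, 0.40],
    "B": [0.62, 0.52, 0.48, 0.42],
    "C": [0.30, 0.20, 0.40, 0.30],
    "D": [0.32, 0.18, 0.38, 0.33],
}

result = agglomerative(city_scores, n_clusters=2)
print(sorted(sorted(c) for c in result))
```

With only six cities in the study, any such grouping is sensitive to the choice of linkage and distance, so it is best read as descriptive.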

Figure 3: City-wise domain prediction, clustering of MSAs, and task-specific relationships between urban form indicators and performance.

Exploratory analysis links urban form to prediction skill: density is negatively correlated (Spearman rank correlation) with crime and income predictability, and walkability is positively associated with health predictability. Higher functional entropy is associated with lower travel predictability. A plausible interpretation is that predictability depends on how well observable spatial signatures align with latent urban signals, and that this alignment varies by domain and city.
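For reference, the Spearman rank correlation used in this kind of analysis is the Pearson correlation of the ranks. The following is a minimal standard-library sketch; the function names are illustrative, and the inputs in the usage lines are toy values, not the study's data.

```python
def rank(values):
    """Return 1-based average ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(rank(x), rank(y))


# Toy check: a monotone increasing relation gives +1, a decreasing one gives -1.
print(spearman([1, 2, 3, 4], [10, 20, 30, 40]))
print(spearman([1, 2, 3, 4], [4, 3, 2, 1]))
```

Because the comparison is across six MSAs, each correlation rests on very few points, which is consistent with the analysis being described as exploratory.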

Temporal Robustness

Year-wise analysis (2020–2023) shows temporally consistent domain-level $R^2$ for all embeddings. AlphaEarth’s performance remains stable across years in health, travel, income, and crime. Prithvi exhibits greater annual fluctuation (e.g., crime dips in 2021), while Clay is temporally flat but less accurate. The absence of notable drift suggests that the embeddings encode stable structural correlates rather than transient dynamics.

Figure 4: Domain-level annual $R^2$ for AlphaEarth, Prithvi, and Clay, demonstrating strong temporal robustness.
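One simple way to quantify this kind of temporal stability is the spread of yearly scores within a domain. The helper below is an illustrative sketch, not the paper's metric, and the yearly values are hypothetical placeholders.

```python
import statistics


def temporal_spread(r2_by_year):
    """Return (range, population std) of yearly R^2 values for one domain."""
    values = list(r2_by_year.values())
    return max(values) - min(values), statistics.pstdev(values)


# Hypothetical yearly R^2 values for a single domain and embedding.
yearly = {2020: 0.50, 2021: 0.52, 2022: 0.49, 2023: 0.51}

value_range, value_std = temporal_spread(yearly)
print(f"range={value_range:.3f} std={value_std:.3f}")
```

A small range relative to the differences between cities would match the pattern described here: strong spatial heterogeneity alongside temporal robustness.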

Information Density and Representation Efficiency

AlphaEarth’s compact 64-d representation is more information-dense than 64-d reductions of the higher-dimensional Prithvi and Clay embeddings (768-d and 1024-d, respectively). Reducing Prithvi and Clay via factor analysis (FA), Isomap, kernel PCA (kPCA), PCA, or random projection always decreases performance relative to their original dimensionality. Notably, random projections retain relatively more utility for Clay, which suggests that its information is distributed across many dimensions rather than concentrated in a few high-variance directions.
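Of the reduction methods listed, random projection is the simplest to sketch. The example below uses a dense Gaussian projection matrix scaled by $1/\sqrt{k}$; the summary does not say which random projection variant was used, so this choice, the function names, and the constant input vector are assumptions for illustration.

```python
import random


def random_projection_matrix(d_in, d_out, seed=0):
    """Dense Gaussian projection matrix of shape (d_in, d_out), scaled by 1/sqrt(d_out)."""
    rng = random.Random(seed)
    scale = 1.0 / d_out ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d_out)] for _ in range(d_in)]


def project(vec, matrix):
    """Project a d_in-dimensional vector to d_out dimensions (vec @ matrix)."""
    d_out = len(matrix[0])
    return [sum(v * row[j] for v, row in zip(vec, matrix)) for j in range(d_out)]


# Reduce a 768-d vector to 64 dimensions; the input here is a placeholder.
matrix = random_projection_matrix(768, 64, seed=0)
embedding = [1.0] * 768
reduced = project(embedding, matrix)
print(len(reduced))
```

Unlike PCA, the projection is data-independent: it does not favor high-variance directions, which is why it can preserve information that is spread evenly across dimensions.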

Figure 5: Domain-wise mean $R^2$ for original and reduced representations; AlphaEarth’s 64-d embeddings outperform dimension-matched variants of Prithvi and Clay.

This finding underscores the importance of representation quality and domain alignment rather than vector size. Compact embeddings substantially lower storage and computational costs without sacrificing predictive skill.
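The storage side of this argument is simple arithmetic. The sketch below assumes uncompressed 32-bit floats and one million stored vectors; both are assumptions chosen for illustration, and actual products may use other encodings.

```python
BYTES_PER_FLOAT32 = 4


def storage_bytes(n_vectors, dim, bytes_per_value=BYTES_PER_FLOAT32):
    """Raw storage needed for n_vectors embeddings of the given dimensionality."""
    return n_vectors * dim * bytes_per_value


n = 1_000_000  # hypothetical number of stored embedding vectors
for dim in (64, 768, 1024):
    size = storage_bytes(n, dim)
    ratio = size // storage_bytes(n, 64)
    print(f"{dim:>4}-d: {size / 1e9:.3f} GB ({ratio}x the 64-d footprint)")
```

Under these assumptions, 768-d and 1024-d vectors need 12 and 16 times the storage of 64-d vectors, and the cost of downstream model fitting grows with dimensionality as well.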

Practical and Theoretical Implications

Scalable low-cost proxies: Earth embeddings, particularly AlphaEarth, serve as effective features for high-frequency urban monitoring, especially for health and dominant travel modes.
Domain/deployment caution: There is significant spatial heterogeneity and city dependence, demanding city-specific calibration and domain-shift diagnostics in operational settings.
Complementarity and multi-embedding fusion: Cross-domain embedding complementarity (e.g., Clay's occasional domain superiority) motivates ensemble approaches for robust transferability.
Temporal stability: Embeddings’ robustness across years positions them for repeated monitoring without frequent retraining, facilitating longitudinal urban analytics.
Representation efficiency: Compact, information-dense embeddings are preferable for practical implementations, in line with the emerging trend of treating embeddings as geospatial layers.
Limitations and future work: Further global validation, finer spatial granularity (addressing MAUP), longer temporal windows, and multi-modal fusion (incorporating behavioral and ground-level signals) are critical future directions.

Conclusion

This work provides a rigorous baseline for evaluating how well geospatial foundation models capture diverse urban signals from overhead imagery. AlphaEarth embeddings demonstrate superior information density and cross-domain predictive power, yet performance depends strongly on domain and city, even though it is robust across years. Practical deployments should leverage compact, high-quality embeddings, prioritize local calibration, and consider fusion strategies to overcome domain blind spots and support scalable urban analytics aligned with SDG monitoring. Further work should extend these findings to a broader array of urban contexts and integrate behavioral signals for more comprehensive human-centric urban sensing.