Geometry Foundation Models Overview
- Geometry Foundation Models (GFMs) are deep learning architectures that embed continuous 2D and 3D geometric data for direct extraction of physical measurements.
- They employ frozen backbones and lightweight probes trained via contrastive or self-supervised methods to recover metrics like joint angles, depth, and object pose.
- GFMs enable cross-modal applications in vision, language, and graphs, allowing for task-specific geometry extraction without requiring network retraining.
A Geometry Foundation Model (GFM) is a foundation model architecture that encodes fine-grained, continuous geometric structure directly in its learned features, supporting efficient extraction and transfer of 2D and 3D physical measurements across a wide array of vision, vision-language, and graph domains. In contrast with language-centric models or classical geometry pipelines, GFMs are trained (often in self-supervised, contrastive, or hybrid paradigms) to enable direct geometric readouts—such as joint angles, depth, pose, or object shape—using lightweight probes or adapters without requiring per-task network retraining. GFMs appear as both vision-based and graph-based architectures, converging on the principle that transferable, structured geometric knowledge can be encoded and retrieved in a unified model backbone.
1. Conceptual Foundations of Geometry Foundation Models
GFMs are characterized by their capacity to encode metric, topological, and continuous geometric information in model features, such that it can be efficiently decoded or used for downstream tasks. Central to the GFM paradigm is the decoupling of the foundation model backbone from task-specific decoders, typically via "probing" or linear readouts. For visual GFMs, this involves freezing a vision-language model (VLM) or vision transformer (ViT) backbone after pretraining, then attaching specialized lightweight linear or reduced-rank probes to recover physical quantities such as hand joint angles, object pose, or camera intrinsics (Shkolnikov, 6 Mar 2026). In graph domains, GFMs encode the topology and metrics of input graphs as embeddings in Riemannian or product-manifold latent spaces, supporting cross-domain structural inference and transfer (Yu et al., 23 Mar 2026, Sun et al., 5 Feb 2025, He et al., 9 May 2026).
A key insight is that standard VLMs and VLM-derived models encode much richer metric geometry in their features than their language output paths can express. The gap is quantified by comparing probing accuracy on continuous geometric measurements against text-based predictions of the same quantities (Shkolnikov, 6 Mar 2026).
2. Methodological Components: Architecture, Training, and Probing
Visual GFMs
The core methodology in vision-centric GFMs is as follows (Shkolnikov, 6 Mar 2026):
- Frozen Feature Extraction: An RGB image is fed into a frozen encoder (e.g., CLIP, DINOv3, SigLIP2), and features are taken from an intermediate or late layer.
- Feature Aggregation: Spatial tokens are mean-pooled (excluding classification or register tokens) to yield a single feature vector per image.
- Linear or Reduced-Rank Probe: A lightweight linear probe or reduced-rank ridge regression (RRR) is fit over continuous geometric targets, with the weight matrix truncated to low rank; for hand pose the probe totals roughly 6,000 parameters (Shkolnikov, 6 Mar 2026).
- Training Objective: Probes are fit using ridge regression, with the penalty and the rank selected by cross-validation.
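The probing recipe above can be sketched in NumPy. This is a minimal illustration, not the cited study's implementation: the function names, the default `lam` and `rank` values, and the assumption that special tokens come first in the token sequence are all hypothetical. The rank truncation follows the standard reduced-rank ridge construction (a ridge fit followed by projection onto the leading right-singular vectors of the fitted values on the ridge-augmented design).

```python
import numpy as np

def mean_pool_spatial(tokens, n_special=1):
    """Average spatial tokens, skipping the first n_special tokens.

    tokens: (n_tokens, d) activations from one layer of a frozen encoder.
    Assumes (hypothetically) that CLS/register tokens are stored first.
    """
    return tokens[n_special:].mean(axis=0)

def fit_rrr_probe(X, Y, lam=1.0, rank=2):
    """Reduced-rank ridge regression: ridge fit, then rank truncation.

    X: (n, d) pooled features; Y: (n, k) continuous geometric targets.
    Returns factors A (d, rank) and B (rank, k) plus the training means.
    """
    x_mean, y_mean = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - x_mean, Y - y_mean
    d = Xc.shape[1]
    # Full-rank ridge solution.
    W = np.linalg.solve(Xc.T @ Xc + lam * np.eye(d), Xc.T @ Yc)
    # Fitted values on the ridge-augmented design [Xc; sqrt(lam) * I].
    fitted = np.vstack([Xc, np.sqrt(lam) * np.eye(d)]) @ W
    # Leading right-singular vectors span the best rank-r output subspace.
    _, _, Vt = np.linalg.svd(fitted, full_matrices=False)
    B = Vt[:rank]   # (rank, k)
    A = W @ B.T     # (d, rank)
    return A, B, x_mean, y_mean

def predict_rrr(X, A, B, x_mean, y_mean):
    """Apply the low-rank probe to new pooled features."""
    return (X - x_mean) @ A @ B + y_mean
```

The factored form stores `rank * (d + k)` weights instead of `d * k`, which is what keeps such a probe small relative to the backbone. In practice `lam` and `rank` would be chosen by cross-validation rather than fixed as here.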
Performance Metrics:
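- Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} \lvert y_i - \hat{y}_i \rvert$, where $y_i$ is the measured target and $\hat{y}_i$ the probe prediction.
- Coefficient of Determination: $R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}$, where $\bar{y}$ is the mean of the targets.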
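Both metrics use their standard definitions and can be computed as in the short sketch below. The function names are illustrative. For multi-dimensional targets the $R^2$ here pools residuals over all outputs, which is one common convention and an assumption, not necessarily the one used in the cited study.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """Mean of |y - y_hat| over all samples (and outputs, if any)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

def r2_score(y_true, y_pred):
    """1 - SS_res / SS_tot, with SS_tot taken around the per-output mean."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2)
    return float(1.0 - ss_res / ss_tot)
```

For angle targets, MAE is reported in the same unit as the labels (degrees, for example), while $R^2$ is unitless and makes results comparable across targets with different ranges.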
Crucially, contrastive, self-supervised, and hybrid pretraining objectives yield nearly identical probe accuracy on hand pose, which the cited study takes as evidence that the training objective drives geometric fidelity more than architectural details (Shkolnikov, 6 Mar 2026).
Graph GFMs follow an analogous principle, mapping graphs or substructures into geometric manifolds (often via Riemannian, product-manifold, or metric-measure spaces), aligning arbitrary graphs to geometric bases or intrinsic manifolds, and using structure-aware re-encoding to support cross-domain inference (Yu et al., 23 Mar 2026, Sun et al., 5 Feb 2025, He et al., 9 May 2026, Liu et al., 11 May 2026).
3. Representative GFM Designs Across Modalities
| Class | Core Representation | Extraction Mechanism | Exemplary models |
|---|---|---|---|
| Visual GFM | Deep backbone features | Linear / RRR probe | SigLIP2, DINOv3, CLIP, InternViT (Shkolnikov, 6 Mar 2026) |
| Graph GFM | Riemannian manifold, product bundle, or GW barycenter | Projection, attention, mixture-of-experts | RiemannGFM (Sun et al., 5 Feb 2025), SCGFM (He et al., 9 May 2026), R-GFM (Liu et al., 11 May 2026) |
| VLA+GFM | Geometry tokens into VLA policy head | Cross-attention, token fusion | VGGT, GR00T-N1.5 (Yang et al., 23 May 2026) |
Details of Selected GFMs:
- Image-based/Hand-pose GFM: A ~6,000-parameter RRR probe on frozen features recovers hand pose with about one third of the MAE of the text-path output (Shkolnikov, 6 Mar 2026).
- Graph GFMs: RiemannGFM uses a product of hyperbolic and spherical manifolds representing trees/cycles, and learns geometry using contrastive objectives over tangent vectors (Sun et al., 5 Feb 2025). SCGFM encodes graphs as metric-measure spaces, aligns them to geometric bases via Gromov-Wasserstein distances, and re-encodes heterogeneous node features through the learned transport plan (He et al., 9 May 2026). R-GFM introduces a multi-scale graph-of-graphs structure and a mixture-of-Riemannian-experts, treating curvature and scale as primary modeling axes, and demonstrating up to 49% gains in 1-shot node classification (Liu et al., 11 May 2026).
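To make the idea of a product of hyperbolic and spherical manifolds concrete, the sketch below computes geodesic distances in a Poincaré ball (curvature -1), on a unit sphere, and in their product. It is a generic textbook construction, not the RiemannGFM code: the function names, the fixed curvatures, and the choice of the Poincaré model are assumptions made for illustration.

```python
import numpy as np

def poincare_distance(u, v):
    """Geodesic distance in the Poincare ball of curvature -1.

    Both points must lie strictly inside the unit ball.
    """
    sq = np.sum((u - v) ** 2)
    denom = (1.0 - np.sum(u ** 2)) * (1.0 - np.sum(v ** 2))
    return float(np.arccosh(1.0 + 2.0 * sq / denom))

def sphere_distance(u, v):
    """Great-circle distance between two unit vectors."""
    return float(np.arccos(np.clip(np.dot(u, v), -1.0, 1.0)))

def product_distance(p, q):
    """Distance on the product manifold.

    p and q are (hyperbolic_part, spherical_part) tuples; the product
    metric combines the factor distances in an l2 fashion.
    """
    d_h = poincare_distance(p[0], q[0])
    d_s = sphere_distance(p[1], q[1])
    return float(np.hypot(d_h, d_s))
```

Hyperbolic factors embed tree-like structure with low distortion and spherical factors suit cyclic structure, which is the usual motivation for combining the two in one latent space.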
4. Empirical Findings and Theoretical Insights
Layerwise and Architectural Analysis in Visual GFMs
- Universal Mid-layer Geometry: Across all tested architectures, geometric extraction accuracy peaks at intermediate transformer layers, with attention heads in layers 18–22 aggregating a disproportionate share of the geometric signal (Shkolnikov, 6 Mar 2026).
- Functional vs. Representational Convergence: Despite low representational similarity (CKA as low as 0.41 between models), encoders from different paradigms exhibit statistically equivalent geometric accuracy, as established by TOST equivalence tests with Holm correction (Shkolnikov, 6 Mar 2026).
- Decoding Bottleneck: Autoregressive language decoders and text output pathways discard fine geometry, with accuracy declining after the early decoder layers. LoRA adapters partly relieve the bottleneck, enabling text-readable angles within 6.5° MAE (up to 79% of the probe's score), but text generation remains the limiting step (Shkolnikov, 6 Mar 2026).
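Representational similarity of the kind reported above is commonly measured with centered kernel alignment. The sketch below implements the linear variant on two feature matrices extracted from the same inputs. Whether the cited study uses linear or kernel CKA is not stated here, so the linear form and the function name are assumptions.

```python
import numpy as np

def linear_cka(X, Y):
    """Linear CKA between feature matrices X (n, p) and Y (n, q).

    Rows must correspond to the same n inputs. The score lies in [0, 1]
    and is invariant to orthogonal transforms and isotropic scaling.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    cross = np.linalg.norm(Yc.T @ Xc, ord="fro") ** 2
    norm_x = np.linalg.norm(Xc.T @ Xc, ord="fro")
    norm_y = np.linalg.norm(Yc.T @ Yc, ord="fro")
    return float(cross / (norm_x * norm_y))
```

Two encoders can score low on this measure and still support equally accurate probes, because a linear probe only needs the target to be linearly decodable from each feature space, not the two spaces to be aligned with each other.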
Graph GFM Results
- Structural Vocabulary and Manifold Choice: Graph domain GFMs benefit from encoding shared substructures (trees, cycles) using product Riemannian geometry, supporting zero/few-shot transfer and outperforming text-based and Euclidean baselines in arbitrary domains (Sun et al., 5 Feb 2025, He et al., 9 May 2026).
- Adaptive Curvature and Scale: Mixture-of-experts and graph-of-graphs approaches (R-GFM) dynamically select curvature and sampling scale per node and per domain, strictly improving generalization and reducing noise versus fixed-hop or single-curvature models, with formal guarantees (Liu et al., 11 May 2026).
5. Extension to 3D Perception, Embodied AI, and World Modeling
GFMs have materially impacted embodied AI, robotics, and world modeling:
- Vision-Language-Action Fusion: Injecting GFM tokens into VLAs (e.g., via cross-attention or spatial forcing) closes the geometric gap between policy and perception. Early-fusion with gating provides the most robust gains in real-robot and multi-task settings (Yang et al., 23 May 2026).
- Dynamic and Temporal Geometry: Dynamic GFMs integrate temporally consistent point map tokens and compress scene dynamics, supporting robust, efficient navigation and interaction in dynamic or real-world scenes (Liu et al., 22 Mar 2026).
- Forecasting in GFM Latents: Predicting the evolution of GFM features themselves, rather than pixels, yields temporally coherent world models with 3–5× faster inference and improved depth/point-cloud accuracy compared to pixel-based baselines (Sun et al., 13 Mar 2026).
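One way to picture gated fusion of geometry tokens into a policy stream is a cross-attention layer with a tanh-gated residual, sketched below in NumPy. This is an illustrative single-head construction under stated assumptions, not the fusion module of any cited system: the function names, the weight matrices `Wq`, `Wk`, `Wv`, and the scalar `gate` are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def gated_cross_attention(policy_tokens, geom_tokens, Wq, Wk, Wv, gate):
    """Fuse geometry tokens into policy tokens with a gated residual.

    policy_tokens: (n_p, d) queries; geom_tokens: (n_g, d) keys/values.
    gate: scalar; tanh(gate) = 0 leaves the policy tokens unchanged.
    Returns the fused tokens and the attention weights.
    """
    Q = policy_tokens @ Wq
    K = geom_tokens @ Wk
    V = geom_tokens @ Wv
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1)
    fused = policy_tokens + np.tanh(gate) * (attn @ V)
    return fused, attn
```

Initializing the gate at zero means the fused model starts out identical to the original policy, so geometric information is blended in gradually during training instead of perturbing pretrained behavior at step zero.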
6. Implementation, Scalability, and Practical Considerations
- Probe-Only versus Finetuning: The probe-only paradigm (frozen backbone with task-specific probes) offers modularity and minimal compute, with negligible risk to the performance of unrelated tasks. LoRA and full finetuning enable text output and slightly higher fidelity, but at increased cost and with potential negative transfer (Shkolnikov, 6 Mar 2026).
- Multi-Task Design: GFMs can handle multiple geometric tasks simultaneously (e.g., hand pose, head pose, object pose, gaze, intrinsics) by attaching independent reduced-rank probes to the same frozen backbone. Each probe adds only a small fraction of the backbone's parameter count and needs a few thousand labeled examples per new task (Shkolnikov, 6 Mar 2026).
- Scalability: Graph GFMs scale via efficient variants of Gromov-Wasserstein alignment, adaptive subgraph sampling, and constant-curvature product bundles, supporting embeddings and inference for graphs with up to 5M nodes within commodity GPU memory (He et al., 9 May 2026, Liu et al., 11 May 2026).
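The modularity of the probe-only paradigm can be illustrated with a small sketch in which features are extracted once and each task gets its own closed-form probe. For brevity it uses plain ridge regression in place of the reduced-rank variant; the function names and the task names in the usage note are hypothetical.

```python
import numpy as np

def fit_ridge_probe(F, Y, lam=1.0):
    """Closed-form ridge probe on cached features F (n, d), targets Y (n, k)."""
    f_mean, y_mean = F.mean(axis=0), Y.mean(axis=0)
    Fc = F - f_mean
    W = np.linalg.solve(Fc.T @ Fc + lam * np.eye(F.shape[1]), Fc.T @ (Y - y_mean))
    b = y_mean - f_mean @ W
    return W, b

def fit_task_probes(F, targets, lam=1.0):
    """Fit one independent probe per task on the same frozen features.

    targets: dict mapping a task name to its (n, k_task) target array.
    """
    return {name: fit_ridge_probe(F, Y, lam) for name, Y in targets.items()}

def predict(F, probe):
    """Apply a fitted (W, b) probe to features."""
    W, b = probe
    return F @ W + b
```

Calling `fit_task_probes(F, {"hand_pose": Y_hand, "head_pose": Y_head})` returns two probes that share nothing except the cached features, so adding or refitting one task cannot change the predictions of another.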
7. Broader Implications and Outlook
GFMs substantiate the claim that powerful continuous geometric knowledge can be stored and transferred by large models in the absence of explicit supervision per task. The distinction between geometry-sensing (probe extraction from a frozen backbone) and geometry-telling (text/pathway readout) is formalized and measured, quantifying a "text bottleneck" and providing clear recipe-based methodologies for practitioners (Shkolnikov, 6 Mar 2026). Progress in architectural integration—across modalities (vision, graph, language), temporal domains, and multi-task agents—indicates that geometry-centric modeling will underpin the next wave of foundation models in embodied AI, simulation, and graph reasoning (Sun et al., 5 Feb 2025, Yu et al., 23 Mar 2026, Yang et al., 23 May 2026).
Active challenges include scalable manifold fitting, efficient Riemannian optimization, robust real-world geometry under distribution shift, and unified multimodal integration with LLMs. Nonetheless, GFMs provide a principled, empirically verified pathway for encoding and exploiting geometry by large-scale, generalizable models.