Three-Dimensional Foundation Models (3DFMs)
- Three-Dimensional Foundation Models (3DFMs) are scalable models that learn volumetric representations from diverse 3D data using self-supervised or weakly supervised methods.
- They employ architectures like 3D CNNs, vision transformers, and multi-modal fusion to address tasks in medical imaging, materials science, and urban reconstruction.
- Pretrained on large unlabeled datasets via masked modeling, contrastive learning, and distillation, 3DFMs generalize effectively in data-scarce and heterogeneous environments.
Three-Dimensional Foundation Models (3DFMs) comprise a class of large-scale, general-purpose models designed to learn transferable volumetric representations from diverse 3D data at scale, using self-supervised or weakly supervised objectives. Built on architectures such as 3D convolutional neural networks, transformers with global attention, or multi-modal fusion, these models aim to provide domain-generalizable 3D priors for downstream tasks including medical image analysis, materials informatics, point cloud segmentation, 3D scene reconstruction, and geometric reasoning over image collections. By combining structured architectural priors with cross-modal self-supervision, 3DFMs yield robust representations that generalize in data-scarce and heterogeneous real-world scenarios, often surpassing models trained from scratch or domain-specific self-supervised variants.
1. Architectural Frameworks and Data Representations
Contemporary 3DFMs adopt a variety of architectural strategies, closely linked to the data modality and end-use domain:
- 3D Vision Transformers and Convolutions: For volumetric or voxelized data, models such as the polycrystal informatics 3DFM use a hierarchical pipeline: input 3D tensors (e.g., with quaternion orientation channels) are patchified, passed through Conv3D layers for token embedding, and combined with 3D positional encodings before multi-block transformer processing. A global [CLS] token summarizes spatial–orientational dependencies (Wei et al., 7 Dec 2025); see the encoder sketch after this list.
- 3D Medical Foundation Models: In CT-FM and Triad, the encoder comprises either a large-scale 3D residual network (CT-FM) or 3D vision transformers/autoencoders (Triad), augmented with positional encodings to retain geometric context. Semantic alignment heads tie image representations to structured metadata, driving modality-aware factorization in the latent space (Pai et al., 15 Jan 2025, Wang et al., 19 Feb 2025).
- Image-Driven 3D Reasoning Backbones: A class of 3DFMs for scene geometry (VGGT, WorldMirror, π³) employ per-frame image patch encoders, a shared transformer with alternating intra-/cross-view modules, and separate lightweight decoder heads for pose, depth, and point-cloud regression. Backbones learn an internal “3D language,” facilitating geometric inference from unstructured image bundles (Zhang et al., 27 Nov 2025, Liu et al., 29 Sep 2025).
- Point Cloud and Urban Data Modeling: For urban-scale data, models integrate point-based transformers or sparse 3D CNNs with occupancy/mesh-based outputs, as in studies using the BuildingWorld dataset (Huang et al., 9 Nov 2025). Data are often normalized, voxelized, and augmented with multiple geometric features.
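The voxel-grid pipeline described above can be sketched as follows; this is a minimal illustration, and the 4-channel quaternion input, patch size, embedding width, depth, and 64³ grid are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class Voxel3DFMEncoder(nn.Module):
    """Illustrative 3D ViT encoder: Conv3D patch embedding + positional
    encodings + transformer blocks + global [CLS] token.
    All hyperparameters are placeholders, not published values."""
    def __init__(self, in_ch=4, patch=8, dim=384, depth=6, heads=6, grid=64):
        super().__init__()
        self.patch_embed = nn.Conv3d(in_ch, dim, kernel_size=patch, stride=patch)
        n_patches = (grid // patch) ** 3
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=4 * dim,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                      # x: (B, 4, D, H, W) quaternion channels
        tok = self.patch_embed(x)              # (B, dim, d, h, w)
        tok = tok.flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tok = torch.cat([cls, tok], dim=1) + self.pos_embed
        tok = self.blocks(tok)
        return tok[:, 0]                       # [CLS] summary of spatial-orientational context

# Usage: embed a batch of 64^3 microstructure volumes with 4 orientation channels.
z = Voxel3DFMEncoder()(torch.randn(2, 4, 64, 64, 64))   # -> (2, 384)
```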
2. Pretraining Paradigms and Self-Supervised Objectives
Foundational to 3DFMs is large-scale pretraining, commonly on unlabeled or weakly labeled corpora:
- Masked Modeling and Denoising: Polycrystal 3DFMs mask random patch subsets and reconstruct the quaternion orientation field, with a decoder reconstruction loss of the form
  $$\mathcal{L}_{\mathrm{rec}} = \frac{1}{|\mathcal{M}|} \sum_{i \in \mathcal{M}} \left\| \hat{\mathbf{q}}_i - \mathbf{q}_i \right\|_2^2,$$
  where $\mathcal{M}$ denotes the set of masked patches, driving the encoder to capture nonlocal spatial–orientational structure (Wei et al., 7 Dec 2025).
- Contrastive Instance Discrimination: CT-FM leverages SimCLR-style intra-volume contrast, maximizing agreement between augmented volumetric views while repelling all others via an NT-Xent objective of the form
  $$\ell_{i,j} = -\log \frac{\exp(\mathrm{sim}(z_i, z_j)/\tau)}{\sum_{k \neq i} \exp(\mathrm{sim}(z_i, z_k)/\tau)},$$
  where $\mathrm{sim}$ is cosine similarity and $\tau$ a temperature (Pai et al., 15 Jan 2025).
- Vision–Language Semantic Alignment: Triad applies joint visual–textual supervision, aligning visual latents with organ-independent metadata embeddings via a log-ratio loss (Wang et al., 19 Feb 2025).
- 2D→3D Distillation: D-DITR regresses per-point features of a 3D backbone toward projected 2D feature-map targets from a fixed vision transformer under a cosine-distance loss, bootstrapping 3D structure from massive 2D pretraining (Zeid et al., 24 Mar 2025); a minimal sketch of this objective follows this list.
- Patch/Token-Level Global Attention: In image-driven 3DFMs, all-view attention mechanisms aggregate global geometric cues, enabling emergent geometric priors beyond explicit correspondence (Zhang et al., 27 Nov 2025).
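The 2D→3D distillation objective can be sketched as a cosine-distance regression from frozen 2D ViT features onto per-point 3D features. The point-to-pixel index and all dimensions below are assumptions for illustration, not the DITR implementation.

```python
import torch
import torch.nn.functional as F

def distill_2d_to_3d(point_feats, pixel_feats, pix_idx):
    """Cosine-distance distillation of frozen 2D ViT features onto 3D point features.

    point_feats : (N, C) per-point features from the trainable 3D backbone
    pixel_feats : (H*W, C) features from a frozen 2D vision transformer
    pix_idx     : (N,) index of the pixel each point projects to (camera
                  projection is assumed to be computed elsewhere)
    """
    target = pixel_feats[pix_idx].detach()          # stop gradients into the 2D teacher
    cos = F.cosine_similarity(point_feats, target, dim=-1)
    return (1.0 - cos).mean()                       # cosine *distance*, averaged over points

# Toy usage with random tensors standing in for real features/projections.
loss = distill_2d_to_3d(torch.randn(1000, 384, requires_grad=True),
                        torch.randn(224 * 224, 384),
                        torch.randint(0, 224 * 224, (1000,)))
loss.backward()
```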
3. Core Domains and Benchmark Results
3DFMs have been validated across a spectrum of domains and tasks:
- Medical Imaging (CT-FM, Triad):
- Whole-body anatomical segmentation: Mean Dice 0.8981 (CT-FM), with robust error margins (Pai et al., 15 Jan 2025).
- Tumor/lesion segmentation: Triad yields up to +17.97% Dice improvement on lung tumor tasks over baselines (Wang et al., 19 Feb 2025).
- Registration and classification: Consistent accuracy boosts, with Triad–SwinUNETR on OASIS brain MRI showing 88.70% Dice (+6.90%) (Wang et al., 19 Feb 2025).
- Materials Science:
- Stiffness prediction: the pretrained encoder attains $R^2$ of 0.82–0.85, compared with $R^2 < 0.09$ for a from-scratch baseline under scarce or noisy labels (Wei et al., 7 Dec 2025).
- Nonlinear mechanics: ODMN–3DFM pipeline attains mean relative errors ≤3.93% on unseen microstructures (Wei et al., 7 Dec 2025).
- Urban Environments:
- 3D building reconstruction: BuildingWorld workflows recommend IoU, Chamfer Distance, and point-to-mesh metrics; pretrained Point Transformers facilitate sim-to-real transfer (Huang et al., 9 Nov 2025).
- 3D Vision and Scene Geometry:
- Dense novel view synthesis: VGGT-X scales to 1000+ input images, substantially closing the SSIM gap to COLMAP-initialized training (0.7821 vs. 0.8148) (Liu et al., 29 Sep 2025).
- Extreme-view pose estimation: After alignment, VGGT's Median Rotation Error drops from 31.6° to 12.7° and Relative Accuracy@30° rises from 48.8% to 67.9% on MegaUnScene (Zhang et al., 27 Nov 2025).
| 3DFM Domain | Pretraining Corpus | Downstream Task | Main Metric, Result |
|---|---|---|---|
| CT-FM | 148k 3D CT scans | Segmentation | Dice 0.8981 ± 0.0022 |
| Triad | 131k 3D MRI volumes | Tumor segmentation | +17.97% Dice (MWHS-Lung) |
| Polycrystal FM | 100k FCC microstructures | Stiffness prediction | R² 0.82–0.85 (pretrained) |
| D-DITR | 2D features → 3D point clouds | 3D segmentation | +8.1 mIoU (1% labels) |
| VGGT-X | Internet images | Dense NVS | SSIM 0.7821 |
4. Evaluation Methodologies and Metrics
3DFM evaluations leverage both domain-standard and cross-domain metrics:
- Image/Volume Segmentation: Dice ($\mathrm{Dice}(A, B) = \tfrac{2|A \cap B|}{|A| + |B|}$), cross-entropy. Few-shot performance assessed via small-label regimes.
- Structure and Geometry: Intersection-over-Union (IoU) and Chamfer Distance,
  $$d_{\mathrm{CD}}(P, Q) = \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \|p - q\|_2^2 + \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \|q - p\|_2^2,$$
  with point-to-mesh distances further detailing surface alignment (Huang et al., 9 Nov 2025); see the metric sketch after this list.
- Retrieval and Clustering: Content-based retrieval uses Precision@3, AP; representation clustering via Silhouette scores and t-SNE visualization (Pai et al., 15 Jan 2025).
- Pose/Depth/Map Estimation: Median Rotation Error (MRE), Relative Accuracy@30°, metric-scaled reconstruction errors (ACC, CMP) on curated Internet benchmarks (Zhang et al., 27 Nov 2025). Pose translation angular error (MTE) quantifies translation quality under large viewpoint changes.
- Mechanics Modeling: $R^2$ for stiffness regression, mean/max relative errors for stress–strain prediction in materials modeling (Wei et al., 7 Dec 2025).
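A minimal NumPy sketch of the Dice and Chamfer metrics above; the brute-force nearest-neighbour search is for clarity only (real pipelines typically use KD-trees or GPU kernels).

```python
import numpy as np

def dice(pred, gt):
    """Dice overlap between two binary masks (any shape)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return 2.0 * inter / (pred.sum() + gt.sum() + 1e-8)

def chamfer(P, Q):
    """Symmetric Chamfer distance between point sets P (N,3) and Q (M,3),
    using squared nearest-neighbour distances (brute force for clarity)."""
    d2 = ((P[:, None, :] - Q[None, :, :]) ** 2).sum(-1)   # (N, M) pairwise squared distances
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

# Toy usage on random data.
print(dice(np.random.rand(32, 32, 32) > 0.5, np.random.rand(32, 32, 32) > 0.5))
print(chamfer(np.random.rand(500, 3), np.random.rand(600, 3)))
```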
5. Emergence, Generalization, and Adaptation Mechanisms
- Emergent Geometric Priors: 3DFMs trained on overlapping views learn global attention patterns that transfer to extreme, non-overlapping baseline conditions. This is evidenced by cross-view attention focusing on semantically aligned but spatially disjoint regions, yielding nontrivial pose reasoning “far beyond classical 2D correspondence” (Zhang et al., 27 Nov 2025).
- Transferability in Data-Scarce Regimes: Pretrained 3DFMs consistently outperform scratch and 2D-trained baselines in low-label evaluations, for example +8.1 mIoU in 3D segmentation with 1% of labels (Zeid et al., 24 Mar 2025), or large boosts in physics-consistent property inference with thin or noisy labels ($R^2 < 0.09$ vs. $> 0.80$) (Wei et al., 7 Dec 2025).
- Efficient Adaptation: Lightweight alignment adjusts only backbone bias terms (~80k parameters out of many millions) without touching the decoder heads, significantly improving pose estimation under extreme viewpoints while preserving depth/point accuracy (Zhang et al., 27 Nov 2025); a minimal sketch of this bias-only scheme follows below. Chunked and mixed-precision engineering enables scaling to dense multi-image 3DFM regimes (VGGT-X) (Liu et al., 29 Sep 2025).
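A minimal sketch of bias-only adaptation, under the assumption that alignment amounts to unfreezing the backbone's bias parameters while keeping all other weights and the decoder heads frozen; module names and sizes are stand-ins.

```python
import torch

def enable_bias_only_tuning(backbone: torch.nn.Module, heads: torch.nn.Module):
    """Freeze all parameters, then re-enable gradients only for backbone bias terms.
    Decoder heads stay frozen so depth/point predictions are preserved."""
    for p in backbone.parameters():
        p.requires_grad = False
    for p in heads.parameters():
        p.requires_grad = False
    trainable = 0
    for name, p in backbone.named_parameters():
        if name.endswith(".bias"):
            p.requires_grad = True
            trainable += p.numel()
    return trainable

# Toy usage with stand-in modules (a real 3DFM backbone/heads would be loaded instead).
backbone = torch.nn.Sequential(torch.nn.Linear(384, 384), torch.nn.Linear(384, 384))
heads = torch.nn.Linear(384, 7)
print(enable_bias_only_tuning(backbone, heads), "bias parameters unfrozen")
opt = torch.optim.AdamW([p for p in backbone.parameters() if p.requires_grad], lr=1e-4)
```

Only the unfrozen bias terms are handed to the optimizer, which keeps the adapted parameter count small relative to the full backbone.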
6. Datasets, Simulation Pipelines, and Standardization
- Large-Scale and Synthetic Data: BuildingWorld offers a 5 million–building LoD2 dataset with diverse real/simulated LiDAR, and Cyber City augments this with virtually infinite urban layouts and controllable sensor parameters, supporting robust pretraining, sim-to-real transfer, and curriculum learning for occlusion and density (Huang et al., 9 Nov 2025).
- Medical Imaging Repositories: CT-FM uses 148k CTs across 69 publicly available cohorts; Triad leverages 131k MRI volumes, all preprocessed for geometric consistency and metadata alignment (Pai et al., 15 Jan 2025, Wang et al., 19 Feb 2025).
- Novel 3D Vision Benchmarks: MegaUnScene comprises 476 Internet scenes, curated for unseen geometry and extreme-view evaluation. UnScenePairs/UnSceneRecon provide no-overlap pairwise and dense reconstruction splits benchmarked for generalization (Zhang et al., 27 Nov 2025).
7. Limitations and Open Problems
- Data Modalities and Domain Transfer: Generalization across 3D imaging modalities (CT→MRI→PET) remains limited; Triad and CT-FM both show strong within-modality transfer, but adaptation across modalities requires new hybrid strategies (Wang et al., 19 Feb 2025, Pai et al., 15 Jan 2025).
- Computational Scalability: Dense-view scenarios in NVS strain transformer memory budgets (VGGT VRAM grows roughly quadratically with the number of input views); solutions include feature pruning, mixed precision, and chunked attention, yet tight integration between 3DFMs and volumetric fitting remains an open topic (Liu et al., 29 Sep 2025).
- Extreme Geometry and Robustness: Although bias-only alignment dramatically improves rotation estimation, translation accuracy under large camera baselines and metric scale recovery are still suboptimal (Zhang et al., 27 Nov 2025).
- Semantic/Structural Integration: Current models only partially unify language, image, and geometry; joint multimodal encoders and advanced self-supervised objectives (e.g., masked autoencoding, generative modeling at scale) are proposed as the next phase.
In summary, Three-Dimensional Foundation Models constitute a robust, scalable framework for cross-domain 3D reasoning, integrating advances in self-supervised learning, volumetric feature encoding, and geometric generalization. As benchmark datasets, cross-modal supervision, and efficient transformer architectures continue to mature, 3DFMs are increasingly poised to anchor 3D perception and modeling pipelines across sciences, engineering, and computer vision (Wei et al., 7 Dec 2025, Pai et al., 15 Jan 2025, Wang et al., 19 Feb 2025, Zeid et al., 24 Mar 2025, Zhang et al., 27 Nov 2025, Liu et al., 29 Sep 2025, Huang et al., 9 Nov 2025).