3D Foundation Models (FMs)

Updated 11 December 2025
  • 3D Foundation Models are large-scale pre-trained neural networks that extract generalizable features from volumetric and sparse data, enabling efficient adaptation across various disciplines.
  • They utilize self-supervised objectives such as masked region modeling, geometry-aware prediction, and contrastive learning to capture complex spatial structures.
  • Scalable architectures and adapter-based fine-tuning enhance their performance in applications ranging from particle physics and medical imaging to materials science and seismic analysis.

Three-dimensional (3D) Foundation Models (FMs) are large-scale, pre-trained neural networks designed to extract and transfer general representations from volumetric or spatially sparse 3D data. They extend the successes of foundation models in language and vision to a diverse constellation of domains, including high-energy physics, medical imaging, seismic interpretation, motion analysis, materials informatics, and 3D point clouds. By leveraging self-supervised objectives, scalable architectures, and voluminous unlabeled datasets, 3D FMs provide a platform for efficient adaptation to a wide range of downstream scientific and engineering tasks with minimal labeled data requirements.

1. Core Architectures and Input Representations

3D FMs operate on a variety of input formats reflecting the diversity of 3D data across disciplines, with backbones ranging from linear-time state-space models to ViT-style Transformers adapted to volumetric or sparse inputs (see Section 3).

Input serialization is frequently necessary, especially for sparse data. FM4NPP, for example, introduces Hierarchical Raster Scan (HRS) to serialize 3D detector hits, preserving both local trajectory continuity and global event structure (Park et al., 13 Aug 2025); volumetric models typically partition the grid into 3D patches.
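The precise HRS construction is specified in Park et al. (13 Aug 2025); the following is only a minimal sketch of the general idea, assuming integer voxel coordinates and a hypothetical coarse-block-then-fine sort key:

```python
import numpy as np

def hierarchical_raster_order(hits, coarse=16):
    """Order sparse 3D detector hits by coarse block first, then by fine
    position within each block (hypothetical HRS-style key, not the
    published algorithm)."""
    # hits: (N, 3) integer voxel coordinates
    block = hits // coarse                      # coarse block index per hit
    fine = hits % coarse                        # offset within its block
    # lexicographic key: (block_z, block_y, block_x, fine_z, fine_y, fine_x)
    key = np.concatenate([block[:, ::-1], fine[:, ::-1]], axis=1)
    order = np.lexsort(key.T[::-1])             # column 0 is the primary key
    return hits[order]

# Example: serialize five random hits from a 64^3 detector grid
hits = np.random.randint(0, 64, size=(5, 3))
print(hierarchical_raster_order(hits))
```

Sorting by coarse block keeps hits from the same detector region adjacent in the token sequence (global event structure), while the fine key preserves local trajectory continuity within each block.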

2. Pretraining Objectives and Self-Supervision

3D FMs rely on self-supervised pretraining, with objectives such as masked region modeling, geometry-aware prediction, and contrastive learning tailored to the structural and physical properties of the data.

Pretraining datasets typically span the 10⁵–10⁷ sample range, capitalizing on rich unlabeled or synthetically generated data (e.g., 11M particle events, 100,000 RVEs, 148,000 CT scans).
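As an illustration of the masked-region-modeling objective, here is a minimal sketch (not any specific paper's implementation) of random masking over non-overlapping 3D patches of a dense volume; the patch size, mask ratio, and downstream encoder/decoder are assumptions:

```python
import torch

def random_mask_3d_patches(volume, patch=8, mask_ratio=0.4):
    """Split a (C, D, H, W) volume into non-overlapping 3D patches and
    mask a random subset; a model is then trained to reconstruct the
    masked patches from the visible ones."""
    C, D, H, W = volume.shape                   # D, H, W divisible by `patch`
    p = patch
    patches = volume.reshape(C, D // p, p, H // p, p, W // p, p)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * p ** 3)
    n_mask = int(mask_ratio * patches.shape[0])
    perm = torch.randperm(patches.shape[0])
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return patches[visible_idx], patches[masked_idx], visible_idx, masked_idx

# Usage: encode only the visible patches, predict the masked ones, and
# minimize e.g. an MSE reconstruction loss on the masked targets.
visible, masked, vis_idx, msk_idx = random_mask_3d_patches(torch.randn(1, 64, 64, 64))
```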

3. Scalability, Architectural Choices, and Neural Scaling

The efficacy of 3D FMs depends on scalable architectures, efficient serialization, and parameterizations optimized for large models:

  • Efficient Sequence Models: Linear-time state-space models (SSMs) such as Mamba2 process long event sequences efficiently in physics FMs at model sizes up to 188M parameters, enabling direct scaling to multi-million-token datasets (Park et al., 13 Aug 2025).
  • ViT and Transformer Backbones: Adaptations of ViT to 3D inputs via volumetric patch embedding, trilinear-interpolated positional encodings, and depth-aware augmentation/pooling enable generalization across volume resolutions and domains; see the sketch after this list (Veenboer et al., 30 Nov 2025, Ghamizi et al., 16 Jun 2025, Archibong et al., 26 May 2025, Baharani et al., 8 Feb 2025).
  • Mask Ratio Optimization: In masked modeling, transferability peaks at intermediate masking ratios (e.g., 40% for materials FMs); reconstruction error increases monotonically with mask ratio, but downstream generalization does not (Wei et al., 7 Dec 2025).
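The positional-encoding adaptation above can be illustrated with a short sketch; `resize_3d_pos_embed` is a hypothetical helper, and the grid sizes and embedding dimension are illustrative only:

```python
import torch
import torch.nn.functional as F

def resize_3d_pos_embed(pos_embed, src_grid, dst_grid):
    """Trilinearly interpolate learned 3D positional embeddings so a ViT
    pre-trained on one patch-grid resolution can be applied to another."""
    n, dim = pos_embed.shape                      # n == prod(src_grid)
    pe = pos_embed.reshape(1, *src_grid, dim).permute(0, 4, 1, 2, 3)
    pe = F.interpolate(pe, size=dst_grid, mode="trilinear", align_corners=False)
    return pe.permute(0, 2, 3, 4, 1).reshape(-1, dim)

# e.g. adapt embeddings learned on an 8x8x8 patch grid to a 12x12x12 grid
pe = torch.randn(8 * 8 * 8, 768)
pe_new = resize_3d_pos_embed(pe, (8, 8, 8), (12, 12, 12))   # (1728, 768)
```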

Neural scaling properties are documented, with loss decreasing as a power law with respect to model size, data size, and compute, and marginal plateauing at extreme scales (Park et al., 13 Aug 2025).
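A generic functional form for such scaling behavior (the standard form from the neural-scaling literature; the specific parameterization fitted in the cited work is not reproduced here) is

$$\mathcal{L}(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \mathcal{L}_{\infty},$$

where N is model size, D dataset size, N_c, D_c and the exponents are fitted constants, and the additive constant is the irreducible loss whose growing relative weight at extreme scales produces the observed plateauing.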

4. Adaptation and Fine-Tuning: Adapter Paradigms and Task Interfaces

Fine-tuning strategies are structured to extract maximal task-specific utility from the general representations of 3D FMs:

Adapters often involve a single linear mapping (e.g., for specialization to classification or instance assignment) and exhibit strong, monotonic improvement in downstream performance as FM size grows, as sketched below.
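A minimal sketch of this adapter paradigm, assuming a generic frozen backbone that returns pooled per-sample embeddings (class and parameter names are hypothetical):

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Single linear mapping on top of frozen foundation-model features."""
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # keep the FM frozen
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, n_classes)  # the only trained weights

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)                # (B, feat_dim) embeddings
        return self.head(feats)                     # task-specific logits

# During fine-tuning, only the head's parameters go to the optimizer:
# optimizer = torch.optim.AdamW(adapter.head.parameters(), lr=1e-3)
```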

5. Generalization, Evaluation, and Benchmarks

Evaluation protocols for 3D FMs encompass diverse scientific tasks:

  • Particle Physics: Adjusted Rand Index (ARI) for track finding, together with tracking efficiency and purity; larger FM4NPP models significantly outperform baselines such as Exa.TrkX (ARI 0.9448 vs. 0.8765) (Park et al., 13 Aug 2025).
  • Medical Imaging: Dice coefficient is the most common metric for segmentation (e.g., CT-FM achieving Dice 0.8981 on the TotalSegmentator v2 dataset; see the metric sketch after this list), with AUC/AP for classification/retrieval, test–retest robustness, and occlusion-based feature deviation mapping for interpretability (Pai et al., 15 Jan 2025, Veenboer et al., 30 Nov 2025, Ghamizi et al., 16 Jun 2025).
  • Seismic Interpretation: Mean IoU and pixel accuracy for segmentation tasks, surpassing conventional supervised U-Net models by 5–10 pp (Archibong et al., 26 May 2025).
  • Materials Science: R² for stiffness prediction (>0.8 for pre-trained models vs. 0.08 from scratch), mean/max relative errors for stress–strain prediction (<9%) (Wei et al., 7 Dec 2025).
  • Point Clouds and Open-World 3D: mIoU, mAP25, zero- and few-shot classification accuracy on ModelNet40, ScanObjectNN, and PartNet; triplet-alignment and part segmentation benchmarks (Thengane et al., 30 Jan 2025).
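The segmentation metrics cited above can be made concrete with a short sketch; the binary-mask formulation below is the standard definition, with multi-class Dice/mIoU obtained by averaging the per-class scores:

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Dice = 2|P ∩ T| / (|P| + |T|) for boolean masks."""
    inter = (pred & target).sum().float()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-6):
    """IoU = |P ∩ T| / |P ∪ T| for boolean masks."""
    inter = (pred & target).sum().float()
    union = (pred | target).sum().float()
    return (inter + eps) / (union + eps)

# pred/target: boolean voxel masks of identical shape, e.g. (D, H, W)
pred = torch.zeros(64, 64, 64, dtype=torch.bool); pred[:32] = True
target = torch.zeros(64, 64, 64, dtype=torch.bool); target[16:48] = True
print(dice_coefficient(pred, target).item(), iou(pred, target).item())  # 0.5, 0.33
```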

Zero-shot adaptation, task-agnostic frozen features, and scalable benchmarks are emphasized, including public releases of model weights, configurations, and evaluation protocols (Veenboer et al., 30 Nov 2025, Pai et al., 15 Jan 2025).

6. Representation Analysis and Transfer Properties

3D FM embeddings generally exhibit the following:

  • Task-agnostic, Generalizable Features: Embedding clusters do not align a priori with specific target classes or phenomena; specialization occurs via simple linear projections or adapters (Park et al., 13 Aug 2025).
  • Linear Specialization: Single linear heads can produce well-separated clusters for classes or instances, as verified through PCA/t-SNE/UMAP; see the sketch after this list (Park et al., 13 Aug 2025).
  • Strong Transfer Across Tasks: Encoders pre-trained with self-supervision transfer robustly to both physics (e.g., predicting nonlinear mechanical response) and perceptual tasks (e.g., anatomical clustering in medical imaging) (Wei et al., 7 Dec 2025, Pai et al., 15 Jan 2025).
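A minimal sketch of this style of representation analysis, using randomly generated stand-ins for frozen FM embeddings and downstream labels (all data below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

emb = np.random.randn(1000, 512)           # stand-in for frozen encoder outputs
labels = np.random.randint(0, 10, 1000)    # stand-in for downstream classes

# 2D projection to inspect whether clusters emerge (t-SNE/UMAP are analogous)
proj = PCA(n_components=2).fit_transform(emb)

# Linear specialization: a single linear classifier on the frozen features
probe = LogisticRegression(max_iter=1000).fit(emb, labels)
print("linear-probe accuracy:", probe.score(emb, labels))
```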

Scaling the model improves the quality and universality of representations, as measured by downstream metrics and data efficiency.

7. Future Directions, Limitations, and Cross-Domain Applicability

Despite remarkable progress, the field of 3D FMs faces critical challenges and open research areas:

  • Scaling and Compute Efficiency: Efficient SSMs and sparse 3D attention architectures remain key to scaling models to larger data and compute budgets, especially for point clouds and irregular domains (Park et al., 13 Aug 2025, Thengane et al., 30 Jan 2025, Ghamizi et al., 16 Jun 2025).
  • Multimodal Integration: Combining language, vision, physics, and other sensory modalities in 3D representation learning is a central challenge, with initiatives in cross-modal alignment via triplet or dual-encoder paradigms (Thengane et al., 30 Jan 2025, Ghamizi et al., 16 Jun 2025).
  • Robustness and Fairness: Handling heterogeneous acquisition protocols, missing modalities, data privacy (federated learning), and bias mitigation are identified as necessary for clinical and industrial deployment (Ghamizi et al., 16 Jun 2025, Veenboer et al., 30 Nov 2025).
  • Generalization Beyond Synthetic Data: Extending pretraining and adaptation protocols to experimentally acquired or real-world 3D data (e.g., X-ray tomography, EBSD, field seismic surveys) remains nontrivial (Wei et al., 7 Dec 2025, Archibong et al., 26 May 2025).
  • Open-source Benchmarks and Model Release: Establishing unified 3D benchmarks and codebases catalyzes reproducible research and facilitates rapid progress by the community (Veenboer et al., 30 Nov 2025, Pai et al., 15 Jan 2025).
  • Applications Expansion: 3D FMs are being rapidly adopted in high-energy physics, geosciences, biomedicine, materials design, robotics, and perception, with extensions to structure-based drug discovery, climate science, and autonomous systems plausible.

Trade-offs between task-specific supervision and the breadth of pretraining, memory footprint (especially for large token vocabularies), and the need for improved masking/augmentation strategies are highlighted as ongoing bottlenecks (Baharani et al., 8 Feb 2025, Wei et al., 7 Dec 2025).

