3D Foundation Models (FMs)

Updated 11 December 2025
  • 3D Foundation Models are large-scale pre-trained neural networks that extract generalizable features from volumetric and sparse data, enabling efficient adaptation across various disciplines.
  • They utilize self-supervised objectives such as masked region modeling, geometry-aware prediction, and contrastive learning to capture complex spatial structures.
  • Scalable architectures and adapter-based fine-tuning enhance their performance in applications ranging from particle physics and medical imaging to materials science and seismic analysis.

Three-dimensional (3D) Foundation Models (FMs) are large-scale, pre-trained neural networks designed to extract and transfer general representations from volumetric or spatially sparse 3D data. They extend the successes of foundation models in language and vision to a diverse constellation of domains, including high-energy physics, medical imaging, seismic interpretation, motion analysis, materials informatics, and 3D point clouds. By leveraging self-supervised objectives, scalable architectures, and voluminous unlabeled datasets, 3D FMs provide a platform for efficient adaptation to a wide range of downstream scientific and engineering tasks with minimal labeled data requirements.

1. Core Architectures and Input Representations

3D FMs operate on a variety of input formats reflecting the diversity of 3D data across disciplines, with backbones ranging from linear-time state-space models to ViT-style Transformers adapted to volumetric or sparse inputs (see Section 3).

Input serialization is frequently necessary, especially for sparse data. FM4NPP, for example, introduces Hierarchical Raster Scan (HRS) to serialize 3D detector hits, preserving both local trajectory continuity and global event structure (Park et al., 13 Aug 2025); volumetric models typically partition the grid into 3D patches.
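The precise HRS construction is specified in Park et al. (13 Aug 2025); the following is only a minimal sketch of the general idea, assuming integer voxel coordinates and a hypothetical coarse-block-then-fine sort key:

```python
import numpy as np

def hierarchical_raster_order(hits, coarse=16):
    """Order sparse 3D detector hits by coarse block first, then by fine
    position within each block (hypothetical HRS-style key, not the
    published algorithm)."""
    # hits: (N, 3) integer voxel coordinates
    block = hits // coarse                      # coarse block index per hit
    fine = hits % coarse                        # offset within its block
    # lexicographic key: (block_z, block_y, block_x, fine_z, fine_y, fine_x)
    key = np.concatenate([block[:, ::-1], fine[:, ::-1]], axis=1)
    order = np.lexsort(key.T[::-1])             # column 0 is the primary key
    return hits[order]

# Example: serialize five random hits from a 64^3 detector grid
hits = np.random.randint(0, 64, size=(5, 3))
print(hierarchical_raster_order(hits))
```

Sorting by coarse block keeps hits from the same detector region adjacent in the token sequence (global event structure), while the fine key preserves local trajectory continuity within each block.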

2. Pretraining Objectives and Self-Supervision

3D FMs rely on self-supervised pretraining, with objectives such as masked region modeling, geometry-aware prediction, and contrastive learning tailored to the structural and physical properties of the data.

Pretraining datasets typically span the 10⁵–10⁷ sample range, capitalizing on rich unlabeled or synthetically generated data (e.g., 11M particle events, 100,000 RVEs, 148,000 CT scans).
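As an illustration of the masked-region-modeling objective, here is a minimal sketch (not any specific paper's implementation) of random masking over non-overlapping 3D patches of a dense volume; the patch size, mask ratio, and downstream encoder/decoder are assumptions:

```python
import torch

def random_mask_3d_patches(volume, patch=8, mask_ratio=0.4):
    """Split a (C, D, H, W) volume into non-overlapping 3D patches and
    mask a random subset; a model is then trained to reconstruct the
    masked patches from the visible ones."""
    C, D, H, W = volume.shape                   # D, H, W divisible by `patch`
    p = patch
    patches = volume.reshape(C, D // p, p, H // p, p, W // p, p)
    patches = patches.permute(1, 3, 5, 0, 2, 4, 6).reshape(-1, C * p ** 3)
    n_mask = int(mask_ratio * patches.shape[0])
    perm = torch.randperm(patches.shape[0])
    masked_idx, visible_idx = perm[:n_mask], perm[n_mask:]
    return patches[visible_idx], patches[masked_idx], visible_idx, masked_idx

# Usage: encode only the visible patches, predict the masked ones, and
# minimize e.g. an MSE reconstruction loss on the masked targets.
visible, masked, vis_idx, msk_idx = random_mask_3d_patches(torch.randn(1, 64, 64, 64))
```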

3. Scalability, Architectural Choices, and Neural Scaling

The efficacy of 3D FMs depends on scalable architectures, efficient serialization, and parameterizations optimized for large models:

  • Efficient Sequence Models: Linear-time state-space models (SSMs) such as Mamba2 process long event sequences efficiently in physics FMs at model sizes up to 188M parameters, enabling direct scaling to multi-million-token datasets (Park et al., 13 Aug 2025).
  • ViT and Transformer Backbones: Adaptations of ViT to 3D inputs via volumetric patch embedding, trilinear-interpolated positional encodings, and depth-aware augmentation/pooling enable generalization across volume resolutions and domains; see the sketch after this list (Veenboer et al., 30 Nov 2025, Ghamizi et al., 16 Jun 2025, Archibong et al., 26 May 2025, Baharani et al., 8 Feb 2025).
  • Mask Ratio Optimization: In masked modeling, transferability peaks at intermediate masking ratios (e.g., 40% for materials FMs); reconstruction error increases monotonically with mask ratio, but downstream generalization does not (Wei et al., 7 Dec 2025).
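The positional-encoding adaptation above can be illustrated with a short sketch; `resize_3d_pos_embed` is a hypothetical helper, and the grid sizes and embedding dimension are illustrative only:

```python
import torch
import torch.nn.functional as F

def resize_3d_pos_embed(pos_embed, src_grid, dst_grid):
    """Trilinearly interpolate learned 3D positional embeddings so a ViT
    pre-trained on one patch-grid resolution can be applied to another."""
    n, dim = pos_embed.shape                      # n == prod(src_grid)
    pe = pos_embed.reshape(1, *src_grid, dim).permute(0, 4, 1, 2, 3)
    pe = F.interpolate(pe, size=dst_grid, mode="trilinear", align_corners=False)
    return pe.permute(0, 2, 3, 4, 1).reshape(-1, dim)

# e.g. adapt embeddings learned on an 8x8x8 patch grid to a 12x12x12 grid
pe = torch.randn(8 * 8 * 8, 768)
pe_new = resize_3d_pos_embed(pe, (8, 8, 8), (12, 12, 12))   # (1728, 768)
```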

Neural scaling properties are documented, with loss decreasing as a power law with respect to model size, data size, and compute, and marginal plateauing at extreme scales (Park et al., 13 Aug 2025).
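A generic functional form for such scaling behavior (the standard form from the neural-scaling literature; the specific parameterization fitted in the cited work is not reproduced here) is

$$\mathcal{L}(N, D) \approx \left(\frac{N_c}{N}\right)^{\alpha_N} + \left(\frac{D_c}{D}\right)^{\alpha_D} + \mathcal{L}_{\infty},$$

where N is model size, D dataset size, N_c, D_c and the exponents are fitted constants, and the additive constant is the irreducible loss whose growing relative weight at extreme scales produces the observed plateauing.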

4. Adaptation and Fine-Tuning: Adapter Paradigms and Task Interfaces

Fine-tuning strategies are structured to extract maximal task-specific utility from the general representations of 3D FMs:

Adapters often involve a single linear mapping (e.g., for specialization to classification or instance assignment) and exhibit strong, monotonic improvement in downstream performance as FM size grows, as sketched below.
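A minimal sketch of this adapter paradigm, assuming a generic frozen backbone that returns pooled per-sample embeddings (class and parameter names are hypothetical):

```python
import torch
import torch.nn as nn

class LinearAdapter(nn.Module):
    """Single linear mapping on top of frozen foundation-model features."""
    def __init__(self, backbone, feat_dim, n_classes):
        super().__init__()
        self.backbone = backbone.eval()
        for p in self.backbone.parameters():        # keep the FM frozen
            p.requires_grad_(False)
        self.head = nn.Linear(feat_dim, n_classes)  # the only trained weights

    def forward(self, x):
        with torch.no_grad():
            feats = self.backbone(x)                # (B, feat_dim) embeddings
        return self.head(feats)                     # task-specific logits

# During fine-tuning, only the head's parameters go to the optimizer:
# optimizer = torch.optim.AdamW(adapter.head.parameters(), lr=1e-3)
```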

5. Generalization, Evaluation, and Benchmarks

Evaluation protocols for 3D FMs encompass diverse scientific tasks:

  • Particle Physics: Adjusted Rand Index (ARI) for track finding, together with tracking efficiency and purity; larger FM4NPP models significantly outperform baselines such as Exa.TrkX (ARI 0.9448 vs. 0.8765) (Park et al., 13 Aug 2025).
  • Medical Imaging: Dice coefficient is the most common metric for segmentation (e.g., CT-FM achieving Dice 0.8981 on the TotalSegmentator v2 dataset; see the metric sketch after this list), with AUC/AP for classification/retrieval, test–retest robustness, and occlusion-based feature deviation mapping for interpretability (Pai et al., 15 Jan 2025, Veenboer et al., 30 Nov 2025, Ghamizi et al., 16 Jun 2025).
  • Seismic Interpretation: Mean IoU and pixel accuracy for segmentation tasks, surpassing conventional supervised U-Net models by 5–10 pp (Archibong et al., 26 May 2025).
  • Materials Science: R² for stiffness prediction (>0.8 for pre-trained models vs. 0.08 from scratch), mean/max relative errors for stress–strain prediction (<9%) (Wei et al., 7 Dec 2025).
  • Point Clouds and Open-World 3D: mIoU, mAP25, zero- and few-shot classification accuracy on ModelNet40, ScanObjectNN, and PartNet; triplet-alignment and part segmentation benchmarks (Thengane et al., 30 Jan 2025).
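The segmentation metrics cited above can be made concrete with a short sketch; the binary-mask formulation below is the standard definition, with multi-class Dice/mIoU obtained by averaging the per-class scores:

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Dice = 2|P ∩ T| / (|P| + |T|) for boolean masks."""
    inter = (pred & target).sum().float()
    return (2 * inter + eps) / (pred.sum() + target.sum() + eps)

def iou(pred, target, eps=1e-6):
    """IoU = |P ∩ T| / |P ∪ T| for boolean masks."""
    inter = (pred & target).sum().float()
    union = (pred | target).sum().float()
    return (inter + eps) / (union + eps)

# pred/target: boolean voxel masks of identical shape, e.g. (D, H, W)
pred = torch.zeros(64, 64, 64, dtype=torch.bool); pred[:32] = True
target = torch.zeros(64, 64, 64, dtype=torch.bool); target[16:48] = True
print(dice_coefficient(pred, target).item(), iou(pred, target).item())  # 0.5, 0.33
```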

Zero-shot adaptation, task-agnostic frozen features, and scalable benchmarks are emphasized, including public releases of model weights, configurations, and evaluation protocols (Veenboer et al., 30 Nov 2025, Pai et al., 15 Jan 2025).

6. Representation Analysis and Transfer Properties

3D FM embeddings generally exhibit the following:

  • Task-agnostic, Generalizable Features: Embedding clusters do not align a priori with specific target classes or phenomena; specialization occurs via simple linear projections or adapters (Park et al., 13 Aug 2025).
  • Linear Specialization: Single linear heads can produce well-separated clusters for classes or instances, as verified through PCA/t-SNE/UMAP; see the sketch after this list (Park et al., 13 Aug 2025).
  • Strong Transfer Across Tasks: Encoders pre-trained with self-supervision transfer robustly to both physics (e.g., predicting nonlinear mechanical response) and perceptual tasks (e.g., anatomical clustering in medical imaging) (Wei et al., 7 Dec 2025, Pai et al., 15 Jan 2025).
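A minimal sketch of this style of representation analysis, using randomly generated stand-ins for frozen FM embeddings and downstream labels (all data below is synthetic and purely illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

emb = np.random.randn(1000, 512)           # stand-in for frozen encoder outputs
labels = np.random.randint(0, 10, 1000)    # stand-in for downstream classes

# 2D projection to inspect whether clusters emerge (t-SNE/UMAP are analogous)
proj = PCA(n_components=2).fit_transform(emb)

# Linear specialization: a single linear classifier on the frozen features
probe = LogisticRegression(max_iter=1000).fit(emb, labels)
print("linear-probe accuracy:", probe.score(emb, labels))
```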

Scaling the model improves the quality and universality of representations, as measured by downstream metrics and data efficiency.

7. Future Directions, Limitations, and Cross-Domain Applicability

Despite remarkable progress, the field of 3D FMs faces critical challenges and open research areas:

  • Scaling and Compute Efficiency: Efficient SSMs and sparse 3D attention architectures remain key to scaling models to larger data and compute budgets, especially for point clouds and irregular domains (Park et al., 13 Aug 2025, Thengane et al., 30 Jan 2025, Ghamizi et al., 16 Jun 2025).
  • Multimodal Integration: Combining language, vision, physics, and other sensory modalities in 3D representation learning is a central challenge, with initiatives in cross-modal alignment via triplet or dual-encoder paradigms (Thengane et al., 30 Jan 2025, Ghamizi et al., 16 Jun 2025).
  • Robustness and Fairness: Handling heterogeneous acquisition protocols, missing modalities, data privacy (federated learning), and bias mitigation are identified as necessary for clinical and industrial deployment (Ghamizi et al., 16 Jun 2025, Veenboer et al., 30 Nov 2025).
  • Generalization Beyond Synthetic Data: Extending pretraining and adaptation protocols to experimentally acquired or real-world 3D data (e.g., X-ray tomography, EBSD, field seismic surveys) remains nontrivial (Wei et al., 7 Dec 2025, Archibong et al., 26 May 2025).
  • Open-source Benchmarks and Model Release: Establishing unified 3D benchmarks and codebases catalyzes reproducible research and facilitates rapid progress by the community (Veenboer et al., 30 Nov 2025, Pai et al., 15 Jan 2025).
  • Applications Expansion: 3D FMs are being rapidly adopted in high-energy physics, geosciences, biomedicine, materials design, robotics, and perception, with extensions to structure-based drug discovery, climate science, and autonomous systems plausible.

Trade-offs between task-specific supervision and the breadth of pretraining, memory footprint (especially for large token vocabularies), and the need for improved masking/augmentation strategies are highlighted as ongoing bottlenecks (Baharani et al., 8 Feb 2025, Wei et al., 7 Dec 2025).

