3D Few-Shot Segmentation Overview

Updated 27 March 2026

3D few-shot segmentation is a paradigm enabling the segmentation of novel classes in volumetric and point cloud data with extremely limited annotations using meta-learning and prototype-driven strategies.
It employs diverse techniques, including prototypical networks, bidirectional RNNs, and graph-based methods to improve metrics like mIoU and Dice scores across medical and 3D shape benchmarks.
Recent advances incorporate training-free approaches and foundation model adaptation to enhance robustness and efficiency under severe annotation scarcity.

3D few-shot segmentation addresses the problem of segmenting novel anatomical structures or semantic entities in volumetric or point cloud data, given extremely limited annotated examples. This paradigm is motivated by the high cost and limited availability of annotated 3D data, especially in medical imaging and autonomous systems, and seeks to maximize generalizability under severe annotation scarcity using meta-learning, prototype-driven or correlation-based techniques. Unlike conventional supervised 3D segmentation, which assumes abundant labeled data and static class taxonomies, few-shot segmentation dynamically adapts to previously unseen classes, organs, or parts with only a handful of labeled references. This article surveys the core methodologies, architectural innovations, and empirical performance characteristics of contemporary 3D few-shot segmentation, spanning both volumetric (voxel/slice-based) and point cloud representations.

1. Problem Formulation and Benchmarks

Under the N-way K-shot episodic paradigm, one is given a support set $\mathcal{S} = \{(X_s^{n,k}, Y_s^{n,k})\}$ consisting of $K$ labeled examples for each of the $N$ target classes ($n = 1, \dots, N$; $k = 1, \dots, K$), alongside query data $\mathcal{Q} = \{X_q\}$ for segmentation. The goal is to produce accurate per-voxel or per-point semantic masks for the classes present in the support set, with performance measured by class mean intersection-over-union (mIoU), Dice coefficient, or related metrics (Mozafari et al., 2024, Zhang et al., 2023, Zheng et al., 2024).
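The two metrics named above can be made concrete with a small NumPy sketch. The function names and the convention of scoring 1.0 when a class is absent from both prediction and ground truth are illustrative assumptions; individual benchmarks may handle empty classes differently.

```python
import numpy as np

def dice_and_iou(pred, target, class_id):
    """Dice and IoU for one class, given integer label arrays of equal shape."""
    p = (pred == class_id)
    t = (target == class_id)
    inter = np.logical_and(p, t).sum()
    p_sum, t_sum = p.sum(), t.sum()
    union = p_sum + t_sum - inter
    # Assumed convention: a class absent from both arrays scores 1.0.
    dice = 1.0 if p_sum + t_sum == 0 else 2.0 * inter / (p_sum + t_sum)
    iou = 1.0 if union == 0 else inter / union
    return float(dice), float(iou)

def mean_iou(pred, target, class_ids):
    """Class-averaged IoU (mIoU) over the listed class ids."""
    return float(np.mean([dice_and_iou(pred, target, c)[1] for c in class_ids]))

# Toy 2x3 label maps with classes {0, 1}.
pred = np.array([[0, 1, 1], [0, 1, 0]])
target = np.array([[0, 1, 0], [0, 1, 1]])
dice_1, iou_1 = dice_and_iou(pred, target, class_id=1)
miou = mean_iou(pred, target, class_ids=[0, 1])
```

The same functions apply unchanged to voxel grids or flat per-point label vectors, since only element-wise comparisons are used.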

Table: Common Protocols and Datasets

Domain	Task structure	Example datasets
Medical imaging	Organ or lesion segmentation	MICCAI 2015 Abdomen, CHAOS 2019, BTCV, ACDC, BraTS
3D shapes	Part segmentation	ShapeNet Part, PartNet-E, CAM/CAD custom datasets
Scenes (point clouds)	Semantic segmentation	S3DIS, ScanNet, ScanNet200, SemanticKITTI (LiDAR)

Protocols extend to generalized and incremental settings, requiring models to retain base-class knowledge while learning new classes ("generalized few-shot segmentation") (Xu et al., 2023, An et al., 20 Mar 2025, Thengane et al., 6 Mar 2026).

2. Meta-Learning and Prototype-Based Approaches in Volumetric Data

Prototypical approaches dominate few-shot segmentation for 3D volumes. The standard pipeline constructs class prototypes from support slices using masked feature averaging. For a class $c$, the prototype is:

$p_c = \frac{1}{|S_c|} \sum_{(x^{(s)}, y^{(s)}) \in S_c} \mathrm{MaskedAvgPool}\left( \phi(x^{(s)}), y^{(s)} = c \right)$

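Each query voxel or pixel is assigned a softmax probability over classes based on the negative cosine distance between its feature and each prototype; because softmax is invariant to a constant shift, this is equivalent to a softmax over cosine similarities (Mozafari et al., 2024).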
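A minimal NumPy sketch of this pipeline follows, assuming features are stored channel-first as (C, ...) arrays and that at least one support example contains the class of interest. The function names and the `temperature` scaling factor are illustrative assumptions, not taken from the cited work.

```python
import numpy as np

def masked_avg_pool(features, mask):
    """Average the (C, ...) feature map over locations where mask is True."""
    C = features.shape[0]
    f = features.reshape(C, -1)
    m = mask.reshape(-1).astype(f.dtype)
    return (f * m).sum(axis=1) / max(m.sum(), 1.0)

def class_prototype(support_feats, support_labels, c):
    """Mean of masked-average-pooled features over support examples containing c."""
    pooled = [masked_avg_pool(f, y == c)
              for f, y in zip(support_feats, support_labels) if (y == c).any()]
    return np.mean(pooled, axis=0)

def classify_query(query_feats, prototypes, temperature=20.0):
    """Softmax over scaled cosine similarity; returns probabilities of shape (N, ...)."""
    C = query_feats.shape[0]
    q = query_feats.reshape(C, -1)
    qn = q / (np.linalg.norm(q, axis=0, keepdims=True) + 1e-8)
    pn = prototypes / (np.linalg.norm(prototypes, axis=1, keepdims=True) + 1e-8)
    logits = temperature * (pn @ qn)               # (N, P)
    logits -= logits.max(axis=0, keepdims=True)    # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=0, keepdims=True)
    return probs.reshape((prototypes.shape[0],) + query_feats.shape[1:])

# Toy episode: one 2x2 support slice with two feature channels.
labels = np.array([[0, 1], [1, 0]])
feats = np.stack([(labels == 1), (labels == 0)]).astype(float)  # (C=2, 2, 2)
prototypes = np.stack([class_prototype([feats], [labels], c) for c in (0, 1)])
probs = classify_query(feats, prototypes)
predicted = probs.argmax(axis=0)
```

In practice the feature maps would come from the encoder $\phi$; here they are hand-built so that each class occupies its own channel.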

Recent work exploits the unlabeled query volume for semi-supervised inference: high-confidence pseudo-labels from the query are mined and used to refine prototypes at inference time, yielding augmented prototypes:

$p_c' = \frac{|S_c| \cdot p_c + |Q^{(\text{pseudo})}_c| \cdot p^Q_c}{|S_c| + |Q^{(\text{pseudo})}_c|}$

This inference-time pseudo-labeling achieves consistent gains (+1.6 to +4.6 Dice points) across abdominal CT and MRI datasets, with optimal results obtained when aggregating prototypes over a moderate window of neighboring slices (e.g., $\pm 7$ slices) (Mozafari et al., 2024).
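The weighted combination of support and pseudo-labeled query prototypes can be sketched as below. The confidence threshold of 0.9, the function names, and the choice to count confident query locations as the query-side weight are assumptions made for illustration; the weights on both sides must be expressed in the same unit for the average to be meaningful.

```python
import numpy as np

def mine_query_prototype(query_feats, probs_c, threshold=0.9):
    """Average query features at locations whose class-c probability is confident.

    query_feats: (C, P) features; probs_c: (P,) probabilities for class c.
    Returns (prototype, count); the prototype is None when nothing is confident.
    """
    confident = probs_c >= threshold
    n = int(confident.sum())
    if n == 0:
        return None, 0
    return query_feats[:, confident].mean(axis=1), n

def fuse_prototypes(p_support, n_support, p_query, n_query):
    """Count-weighted average of support and pseudo-labeled query prototypes."""
    if p_query is None or n_query == 0:
        return p_support
    return (n_support * p_support + n_query * p_query) / (n_support + n_query)

# Toy example with two feature channels and four query locations.
query_feats = np.array([[1.0, 1.0, 0.0, 0.0],
                        [0.0, 0.0, 1.0, 1.0]])
probs_c = np.array([0.95, 0.50, 0.99, 0.20])
p_q, n_q = mine_query_prototype(query_feats, probs_c)
p_refined = fuse_prototypes(np.array([1.0, 0.0]), 3, np.array([0.0, 1.0]), 1)
```

When no query location passes the threshold, the support prototype is returned unchanged, so the refinement can only add information from the query.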

Bidirectional RNNs, particularly convolutional GRUs, are integrated to enforce inter-slice consistency. Such architectures encode support and query slices, then propagate context bidirectionally along the axial dimension, leading to improved organ segmentation compared to standard prototypical or U-Net baselines (Kim et al., 2020). Brief fine-tuning on the support set further improves adaptation across domains and modalities.

3. Point Cloud and Part Segmentation: Meta-Learning, Correlation, and Graph Models

Few-shot 3D point cloud segmentation models can be categorized as follows:

a) Multi-Prototype and Graph-Based Methods:

Class distributions in 3D are often multi-modal. Techniques such as multi-prototype representation and transductive label propagation (on prototype–query affinity graphs) robustly capture intra-class variability (Zhao et al., 2020, Wang et al., 2022). Segment graphs constructed from 2D foundation model segmentations (e.g., SAM segments) and their geometric/topological relations further boost part consistency and fine structure delineation (Hu et al., 18 Dec 2025).
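To illustrate transductive label propagation over a prototype–query affinity graph, the sketch below uses the standard closed-form solution $F = (I - \alpha S)^{-1} Y$ with a symmetrically normalized Gaussian affinity matrix $S$. The dense (non-kNN) graph, the kernel width `sigma`, and the value of `alpha` are simplifying assumptions and do not reproduce the exact graph construction of the cited methods.

```python
import numpy as np

def label_propagation(proto_feats, proto_labels, query_feats, n_classes,
                      alpha=0.9, sigma=1.0):
    """Propagate prototype labels to query points over a dense affinity graph."""
    X = np.vstack([proto_feats, query_feats])
    # Pairwise squared distances and Gaussian affinities without self-loops.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Symmetric normalization S = D^{-1/2} W D^{-1/2}.
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    # One-hot labels on prototype nodes, zeros on query nodes.
    Y = np.zeros((X.shape[0], n_classes))
    Y[np.arange(len(proto_labels)), proto_labels] = 1.0
    F = np.linalg.solve(np.eye(X.shape[0]) - alpha * S, Y)
    return F[len(proto_labels):].argmax(axis=1)

# Two prototypes for class 0 (a multi-modal class) and one for class 1.
proto_feats = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
proto_labels = np.array([0, 0, 1])
query_feats = np.array([[0.5, 0.1], [5.0, 4.8]])
query_pred = label_propagation(proto_feats, proto_labels, query_feats, n_classes=2)
```

Allowing several prototypes per class, as in the toy example, is what lets the graph represent multi-modal class distributions.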

b) Transformer and Correlation-Based Models:

Class-specific transformers and stratified attention mechanisms operate at multiple spatial scales (fine to coarse sub-volumes), preserving fine-grained support–query interactions. Stratified class-specific attention avoids early pooling, retaining all pixel-level or point-level relationships (Zhang et al., 2023). Correlation-based optimization, such as COSeg, directly enriches class–point and class–class interactions via multi-prototypical and hyper correlation augmentation, outperforming feature-only meta-learning (An et al., 2024).

c) Meta-Learning in Function Space:

Meta-learners train on distributions of part segmentation tasks, learning initializations or priors allowing rapid adaptation to new segmentation functions. In Meta-3DSeg, a meta-level variational module predicts optimal parameter shifts for a part segmentation backbone, using per-task statistics aggregated across multiple support-query episodes (Hao et al., 2021).

4. Training-Free, Non-Parametric, and Foundation Model Approaches

Recent work demonstrates that training-free and non-parametric models can deliver competitive few-shot segmentation accuracy, reducing domain gaps and computational burden:

Training-Free Networks:

Frameworks such as TFS3D and Seg-NN extract dense representations via trigonometric positional encodings and fixed, hand-crafted filters, forgoing any learned weights. Segmentation is performed by cosine similarity to support-derived prototypes. Adding a lightweight trainable attention adapter (QUEST), as in TFS3D-T or Seg-PN, further closes the gap to parametric models, with reported mIoU improvements of 6–18 points over the prior state of the art and a 90% reduction in training time (Zhu et al., 2023, Zhu et al., 2024).
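The core idea of a weight-free encoder followed by prototype matching can be illustrated with a heavily simplified sketch. It encodes raw coordinates with sine and cosine functions at a few frequencies and classifies by cosine similarity to per-class mean encodings. The frequency schedule, function names, and the omission of the hand-crafted filters and multi-stage aggregation are all simplifying assumptions, so this is not the published TFS3D or Seg-NN architecture.

```python
import numpy as np

def trig_encode(xyz, n_freqs=4, base=2.0):
    """Encode (P, 3) coordinates with sin/cos at geometrically spaced frequencies."""
    freqs = base ** np.arange(n_freqs)                     # (F,)
    angles = xyz[:, :, None] * freqs[None, None, :]        # (P, 3, F)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return enc.reshape(xyz.shape[0], -1)                   # (P, 3 * 2F)

def nonparametric_segment(support_xyz, support_labels, query_xyz, n_classes):
    """Label query points by cosine similarity to per-class mean encodings.

    Assumes every class id in range(n_classes) has at least one support point.
    """
    s = trig_encode(support_xyz)
    q = trig_encode(query_xyz)
    protos = np.stack([s[support_labels == c].mean(axis=0) for c in range(n_classes)])
    protos = protos / (np.linalg.norm(protos, axis=1, keepdims=True) + 1e-8)
    q = q / (np.linalg.norm(q, axis=1, keepdims=True) + 1e-8)
    return (q @ protos.T).argmax(axis=1)

support_xyz = np.array([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]])
support_labels = np.array([0, 1])
encoded = trig_encode(support_xyz)
query_pred = nonparametric_segment(support_xyz, support_labels, support_xyz, n_classes=2)
```

No parameter in this sketch is learned, which is the property that removes the training stage entirely.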

Foundation Model Adaptation:

Foundation models trained on large-scale 2D data (e.g., SAM2) are adapted for few-shot 3D medical segmentation. FATE-SAM reuses a frozen encoder, memory-attention module, and mask decoder, assembling query-specific segmentation masks by retrieving a handful of support slices as memory. Volumetric consistency is maintained by propagating masked embeddings, without any re-training or prompts, achieving substantial improvements over both fine-tuned and zero-shot counterparts across diverse medical datasets (He et al., 15 Jan 2025).

5. Multi-Surrogate Fusion, Multimodal, and Cross-Domain Innovations

To address morphological variability and sparse annotations, new fusion modules exploit both local and global correlations:

Multi-Surrogate Fusion:

MSFSeg synthesizes multiple surrogates (coherence, diversity, channel-attention, stabilization) over multi-scale query-support correlations. Each surrogate captures complementary structural information, and their fusion via 3D convolutions yields robust scene-wise generalization. The method is reported to outperform cost-aggregation transformers and multitask few-shot methods by up to 5.45 Dice points on conventional and cross-volume 3D benchmarks (Zheng et al., 2024).

Multimodal Cross-correlation:

MultiModal-FSS (MM-FSS) integrates 3D, 2D, and natural language features. During training, point cloud representations are aligned to a vision-language space using 2D images and text encodings. At test time, multimodal correlation fusion and adaptive cross-modal calibration attenuate base-class bias and encourage semantic consistency, yielding significant mIoU gains (10–15 points) on S3DIS and ScanNet (An et al., 2024).

6. Generalized, Incremental, and Forgetting-Free Segmentation

Practical deployments mandate retention of both base and novel class performance in continuously evolving environments.

Generalized Few-Shot Segmentation (GFS-3DSeg):

Models predict both base and novel classes, employing geometric words to encode recurring shape primitives and geometric prototypes for class disambiguation. Prototype-guided pseudo-labeling and adaptive infilling with vision-language models are used to synthesize dense supervision from few-shot support, narrowing the gap between base-class and novel-class mIoU as summarized by their harmonic mean (Xu et al., 2023, An et al., 20 Mar 2025).
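The harmonic mean of base-class and novel-class mIoU rewards balanced performance: it is pulled toward the lower of the two values, so a model cannot score well by excelling on base classes alone. A short sketch follows, with an illustrative function name and made-up input values rather than reported results.

```python
def harmonic_mean_miou(miou_base, miou_novel):
    """Harmonic mean of base-class and novel-class mIoU (both in [0, 1])."""
    if miou_base + miou_novel == 0:
        return 0.0
    return 2.0 * miou_base * miou_novel / (miou_base + miou_novel)

# Illustrative values only: an imbalanced model versus a balanced one
# with the same arithmetic mean of 0.45.
imbalanced = harmonic_mean_miou(0.60, 0.30)
balanced = harmonic_mean_miou(0.45, 0.45)
```

With the same arithmetic mean, the balanced model obtains the higher harmonic mean, which is why the metric is sensitive to the base–novel gap.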

Incremental Few-Shot Segmentation (IFS-PCS):

SCOPE leverages scene context by mining pseudo-instances from background regions (via class-agnostic segmenters) and fusing these with few-shot prototypes during novel class registration. Contextual prototype enrichment is parameter-free and maintains stability-plasticity equilibrium (minimal base-class forgetting and substantial novel-class IoU gains) (Thengane et al., 6 Mar 2026).

Forgetting-Free Methods and LiDAR:

In driving scenarios with sparse supervision, data augmentation through tracking (TeFF) expands annotated frames using video object segmentation, while LoRA-based adaptation restricts forgetting by constraining trainable parameters. These strategies are reported to yield the highest novel-class mIoU for few-shot LiDAR segmentation while keeping computational load low (Zhou et al., 2024, Mei et al., 2023).
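How low-rank adaptation constrains the trainable parameters can be seen in a minimal NumPy sketch of a single linear layer. The frozen weight $W$ is augmented by a scaled low-rank product $BA$, with $B$ initialized to zero so that the adapted layer initially reproduces the base layer. The class name, rank, and scaling value are illustrative assumptions; this is a generic LoRA layer, not the specific adaptation scheme of the cited LiDAR work.

```python
import numpy as np

class LoRALinear:
    """Linear layer y = x W^T + (alpha / r) * x A^T B^T with W kept frozen."""

    def __init__(self, weight, rank=4, alpha=8.0, seed=0):
        rng = np.random.default_rng(seed)
        d_out, d_in = weight.shape
        self.weight = weight                               # frozen base weight
        self.A = rng.normal(0.0, 0.01, size=(rank, d_in))  # trainable
        self.B = np.zeros((d_out, rank))                   # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        base = x @ self.weight.T
        update = (x @ self.A.T) @ self.B.T
        return base + self.scale * update

    def n_trainable(self):
        return self.A.size + self.B.size

rng = np.random.default_rng(1)
W = rng.normal(size=(16, 32))
layer = LoRALinear(W, rank=4)
x = rng.normal(size=(5, 32))
y = layer(x)
```

Here only 192 of the layer's values are trainable, against 512 in the frozen weight; the base-class behavior encoded in $W$ cannot be overwritten, which is the mechanism by which forgetting is limited.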

7. Practical Considerations, Limitations, and Future Directions

Key operational insights include:

Effective use of volumetric context (e.g., ±7 slices for organ-focused 3D segmentation) enhances prototype reliability (Mozafari et al., 2024).
Single-step pseudo-labeling outperforms iterative self-training, which accumulates errors.
Combining, rather than replacing, support and confident query prototypes is optimal.
Uniform, dense point sampling is crucial to avoid foreground leakage artifacts in point cloud segmentation (An et al., 2024).
Training-free and non-parametric methods are highly competitive for deployment when annotation or compute constraints demand efficiency.

Current limitations include sensitivity to annotation noise, domain gap in unseen modalities, and lack of shape priors for thin or ambiguous structures (He et al., 15 Jan 2025, Zheng et al., 2024). Proposed extensions span end-to-end 3D backbones, adaptive prototype banks, cross-modal foundation model distillation, and open-set label discovery in dynamic or multimodal environments (Thengane et al., 6 Mar 2026, An et al., 20 Mar 2025).

In summary, 3D few-shot segmentation is advancing rapidly through synergistic use of meta-learning, prototype expansion, attention-based fusion, and foundation model adaptation. The field is evolving towards robust, scalable, and annotation-efficient models that uphold segmentation accuracy under the dual constraints of extreme data scarcity and continual class evolution.