- The paper introduces vesselFM, a specialized model that integrates real and synthetic datasets for universal 3D blood vessel segmentation.
- It employs innovative data synthesis and flow matching techniques to overcome domain shifts, achieving superior Dice and clDice scores over baselines.
- The model’s zero-, one-, and few-shot capabilities make it a versatile tool for precise segmentation in both clinical and pre-clinical settings.
The paper "vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation" (2411.17386) addresses the critical challenge of segmenting 3D blood vessels in medical images across diverse modalities and anatomical regions. Existing supervised methods struggle with generalization due to significant domain shifts (imaging artifacts, vascular patterns, scale variations, SNR, background tissues) and require extensive, costly voxel-level annotations for each new dataset. While general-purpose foundation models have emerged, they typically fail to handle the unique complexities of vessel segmentation.
vesselFM is proposed as a foundation model specifically tailored for universal 3D blood vessel segmentation, designed to generalize effortlessly to unseen domains in a zero-shot, one-shot, or few-shot manner. The key to its generalization capability lies in its training data, which comes from three heterogeneous sources:
- Dreal: A large, curated dataset of real 3D vascular images with voxel-level annotations. It comprises 115,461 patches of size 128³ drawn from 23 datasets covering various modalities (MRA, CTA, vEM, OCTA, CT, etc.), anatomical regions (brain, kidney, liver), and organisms (human, mouse, rat), and aims to capture a broad range of real-world vascular patterns and domain variations. Pre-processing involves tiling, resampling, mask post-processing (smoothing, binarization), cropping, and intensity clipping to ensure quality and a standardized input size.
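The pre-processing steps above can be sketched roughly as follows. The percentile clipping thresholds, min-max normalization, and non-overlapping tiling are illustrative assumptions; the paper's exact settings may differ:

```python
import numpy as np

def preprocess_volume(vol, patch=128, clip_pct=(0.5, 99.5)):
    """Clip intensities to percentiles, min-max normalize, and tile
    the volume into non-overlapping patch**3 blocks (hypothetical
    parameters, not the paper's exact pipeline)."""
    lo, hi = np.percentile(vol, clip_pct)
    vol = np.clip(vol, lo, hi)
    vol = (vol - lo) / (hi - lo + 1e-8)
    # Crop each axis down to a multiple of the patch size, then tile.
    dz, dy, dx = (s - s % patch for s in vol.shape)
    vol = vol[:dz, :dy, :dx]
    patches = (vol
               .reshape(dz // patch, patch, dy // patch, patch, dx // patch, patch)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(-1, patch, patch, patch))
    return patches
```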
- Ddrand: Synthetic data generated using an elaborate domain randomization scheme tailored for 3D blood vessels. This pipeline involves:
  - Foreground generation: Starting with realistic vascular patches derived from corrosion casts, spatial transformations (cropping, flipping, rotation, dilation, zooming, elastic deformation, smoothing) are applied to create diverse synthetic masks (Msyn). Subsequently, artifact transformations (bias field, noise, smoothing, dropout, shift, hull) are applied to emulate real-world imaging artifacts.
  - Background generation: Background images (B) are created with various geometries (spheres, polyhedrons, none) and textures modeled using versatile Perlin noise patterns, along with plain backgrounds.
  - Fore- and background merging: Synthetic masks $T(\mathcal{M}_\text{syn})$ are merged into the background images B using addition/subtraction or replacement, ensuring foreground intensities remain distinct from the background. Finally, a wide range of intensity transformations (bias field, Gaussian noise, k-space spikes, contrast adjustment, Gaussian smoothing, Rician noise, Gibbs noise, sharpening, histogram transformation) is applied to further increase domain diversity.
This source is designed to comprehensively cover the general domain of 3D vascular images with semi-randomized styles. 500,000 image-mask pairs of size 128³ were generated for Ddrand.
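As a rough illustration of the merging step, the sketch below replaces background voxels under the mask with distinct foreground intensities and then applies a global noise transform. Smoothed Gaussian noise stands in for the Perlin textures, and all parameter values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def merge_fore_background(mask, size=64, fg_intensity=0.8, seed=0):
    """Merge a synthetic vessel mask into a textured background by
    replacement. Low-frequency smoothed noise approximates the Perlin
    textures used in the paper (an assumption for this sketch)."""
    rng = np.random.default_rng(seed)
    background = gaussian_filter(rng.random((size,) * 3), sigma=4)  # low-freq texture
    img = background.copy()
    # Replacement merging: foreground intensities stay distinct from background.
    img[mask > 0] = fg_intensity + 0.05 * rng.standard_normal(int((mask > 0).sum()))
    # Simplified stand-in for the final intensity transformation stage.
    img += 0.02 * rng.standard_normal(img.shape)
    return img
```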
- Dflow: Synthetic data sampled from a mask- and class-conditioned flow matching-based generative model (F). Flow matching is used as an alternative to diffusion models, trained to learn a time-dependent velocity field $v_\theta(x_t, m, c, t)$ that maps samples from a normal distribution ($x_0$) to data distribution samples ($x_1$) via the ODE

$$\frac{dx_t}{dt} = v_\theta(x_t, m, c, t).$$

The model is trained using the Conditional Flow Matching (CFM) objective, minimizing $\|v_\theta(x_t, m, c, t) - u_t(x_t \mid x_1)\|^2$, where $u_t(x_t \mid x_1) = (x_1 - x_t)/(1 - t)$ for the time-linear forward process $x_t = t x_1 + (1 - t) x_0$. Mask conditioning is achieved by concatenating the mask channel-wise, and class conditioning by adding class embeddings to time embeddings. F is trained on data from Dreal and Ddrand but samples images conditioned exclusively on synthetic masks Msyn to avoid incorporating annotator biases from Dreal. Dflow effectively broadens the distributions present in Dreal in a data-driven manner. 10,000 image-mask pairs of size 128³ were sampled for Dflow.
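The CFM objective can be written as a short training-step sketch. Note that for the time-linear process, the target $(x_1 - x_t)/(1 - t)$ simplifies to $x_1 - x_0$; the network interface `v_theta(x_t, mask, cls, t)` and tensor shapes below are placeholders, not vesselFM's actual implementation:

```python
import torch

def cfm_loss(v_theta, x1, mask, cls):
    """One Conditional Flow Matching step for the linear forward
    process x_t = t*x1 + (1-t)*x0. `v_theta` is any network taking
    (x_t, mask, class, t); argument names are illustrative."""
    x0 = torch.randn_like(x1)                       # sample from the prior
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    xt = t * x1 + (1 - t) * x0                      # time-linear interpolation
    target = x1 - x0                                # equals (x1 - xt) / (1 - t)
    pred = v_theta(xt, mask, cls, t)
    return ((pred - target) ** 2).mean()
```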
vesselFM uses a UNet architecture [isensee2021nnu] as the segmentation model. It is trained on the combined dataset (Dreal, Ddrand, Dflow) with sampling weights roughly proportional to the datasets' sizes (70% Ddrand, 20% Dreal, 10% Dflow). Training uses a combination of Dice and cross-entropy losses, linear warm-up, and cosine annealing. Fine-tuning for one- and few-shot tasks is performed with lighter data augmentation.
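A minimal version of the combined Dice and cross-entropy loss might look as follows; the equal weighting of the two terms and the smoothing constant are assumptions, since the paper's exact formulation is not given here:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-6):
    """Combined soft-Dice + binary cross-entropy loss for binary
    vessel segmentation (a common formulation; term weighting is an
    assumption, not the paper's stated configuration)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return dice + ce
```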
The model was evaluated on four unseen datasets: SMILE-UHURA (human brain MRA), BvEM (mouse brain vEM), OCTA (mouse brain OCTA), and MSD8 (human liver CT), covering clinical and pre-clinical relevance. Evaluation metrics included Dice and clDice. vesselFM was compared against state-of-the-art medical segmentation foundation models: tUbeNet [holroyd2023tube], VISTA3D [he2024vista3d], SAM-Med3D [wang2024sammed3d], and MedSAM-2 [zhu2024medical].
Quantitative Results: vesselFM consistently and significantly outperforms all baseline models across zero-, one-, and few-shot settings on all evaluation datasets. In the zero-shot task, vesselFM achieves considerably higher Dice and clDice scores than baselines, demonstrating strong generalization even to challenging modalities like BvEM and OCTA, and outperforming models like VISTA3D trained on similar data (MSD8). General-purpose SAM-like models failed in the zero-shot setting for vessel segmentation. Fine-tuning vesselFM in one- or few-shot scenarios further improves performance, and ablations show that pre-training on the three data sources is crucial compared to training from scratch.
Qualitative Results: Visual results confirm the quantitative findings, showing that vesselFM segments blood vessels accurately with high fidelity, free of common artifacts, and preserving tubular structure. It also demonstrates the ability to segment other tubular structures (axons, colon) in some modalities, highlighting a strong inductive bias towards tubular shapes.
Ablation Studies:
- Data Sources: Ablating the training data sources confirmed the importance of combining all three. Augmenting Dreal with Ddrand and Dflow resulted in a significant performance increase (e.g., +9.21 Dice on SMILE-UHURA zero-shot).
- Flow Matching: Ablations on the flow matching model F showed that training F on Ddrand is beneficial, conditioning on synthetic masks Msyn is better than real masks Mreal (due to diversity and lack of annotator bias), and flow matching outperforms the diffusion-based Med-DDPM both quantitatively and qualitatively in generating realistic vessel images.
- Architecture: Ablating the segmentation network architecture showed that the chosen UNet variant [isensee2021nnu] performed best compared to other UNet-based and Transformer-based models like SwinUNETR, UNETR, 3D UX-Net, and MedNeXt.
Implementation Considerations:
- Data Requirements: Training requires a large, diverse dataset like Dreal, synthetic data generation pipelines (Ddrand), and a generative model like flow matching (Dflow). Curating and generating these datasets are substantial undertakings.
- Computational Resources: Training vesselFM and the generative model F is computationally intensive; sampling the 10,000 Dflow pairs from F alone took three days on a single RTX A6000 GPU, while vesselFM itself was trained on a single V100 GPU. Scaling to larger datasets or models would require more resources.
- Architecture: The UNet architecture from nnU-Net is used, known for its effectiveness in medical image segmentation but potentially less parameter-efficient than Transformers for very large models.
- Deployment: The model is designed for zero-, one-, or few-shot application, making it practical for scenarios with limited annotated data in new domains. Inference operates on 128³ patches, so larger volumes require strategies such as sliding-window inference.
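The sliding-window strategy mentioned above can be sketched as overlap-averaging of patch predictions. This is a generic illustration (patch/stride values and uniform averaging are assumptions; vesselFM's released inference code may differ), and it assumes every volume dimension is at least the patch size:

```python
import torch

@torch.no_grad()
def sliding_window_predict(model, volume, patch=128, stride=64):
    """Average overlapping patch predictions over a volume larger than
    the training patch size (generic sketch, not vesselFM's exact code).
    Assumes each dimension of `volume` is >= `patch`."""
    D, H, W = volume.shape
    logits = torch.zeros_like(volume)
    count = torch.zeros_like(volume)
    # Window start positions, always including one flush with the end.
    starts = lambda dim: list(range(0, dim - patch, stride)) + [dim - patch]
    for z in starts(D):
        for y in starts(H):
            for x in starts(W):
                crop = volume[z:z+patch, y:y+patch, x:x+patch]
                pred = model(crop[None, None])[0, 0]   # (1,1,p,p,p) in and out
                logits[z:z+patch, y:y+patch, x:x+patch] += pred
                count[z:z+patch, y:y+patch, x:x+patch] += 1
    return logits / count
```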
- Limitations: Currently limited to binary blood vessel segmentation. The model also shows a tendency to segment other tubular structures, which might be considered a limitation depending on the specific application, although it also suggests a strong tubular-shape prior.
The authors have open-sourced the checkpoints and code, providing a practical, out-of-the-box tool for researchers and clinicians. Future work aims to extend vesselFM to segment other tubular structures, improve connectivity post-processing (potentially using graph-based methods), and handle multi-class segmentation tasks. The research significantly pushes the state-of-the-art in 3D blood vessel segmentation, potentially enabling more precise analysis, diagnosis, and treatment of vascular disorders.