- The paper introduces vesselFM, a specialized model that integrates real and synthetic datasets for universal 3D blood vessel segmentation.
- It employs innovative data synthesis and flow matching techniques to overcome domain shifts, achieving superior Dice and clDice scores over baselines.
- The model’s zero-, one-, and few-shot capabilities make it a versatile tool for precise segmentation in both clinical and pre-clinical settings.
The paper "vesselFM: A Foundation Model for Universal 3D Blood Vessel Segmentation" (2411.17386) addresses the critical challenge of segmenting 3D blood vessels in medical images across diverse modalities and anatomical regions. Existing supervised methods struggle with generalization due to significant domain shifts (imaging artifacts, vascular patterns, scale variations, SNR, background tissues) and require extensive, costly voxel-level annotations for each new dataset. While general-purpose foundation models have emerged, they typically fail to handle the unique complexities of vessel segmentation.
vesselFM is proposed as a foundation model specifically tailored for universal 3D blood vessel segmentation, designed to generalize effortlessly to unseen domains in a zero-shot, one-shot, or few-shot manner. The key to its generalization capability lies in its training data, which comes from three heterogeneous sources:
- Dreal: A large, curated dataset of real 3D vascular images with voxel-level annotations. It comprises 115,461 patches of size 128³ drawn from 23 datasets covering various modalities (MRA, CTA, vEM, OCTA, CT, etc.), anatomical regions (brain, kidney, liver), and organisms (human, mouse, rat), and aims to capture a broad range of real-world vascular patterns and domain variations. Pre-processing involves tiling, resampling, mask post-processing (smoothing, binarization), cropping, and intensity clipping to ensure quality and a standardized input size.
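The pre-processing steps above can be sketched roughly as follows. The percentile clipping thresholds, min-max normalization, and non-overlapping tiling are illustrative assumptions; the paper's exact settings may differ:

```python
import numpy as np

def preprocess_volume(vol, patch=128, clip_pct=(0.5, 99.5)):
    """Clip intensities to percentiles, min-max normalize, and tile
    the volume into non-overlapping patch**3 blocks (hypothetical
    parameters, not the paper's exact pipeline)."""
    lo, hi = np.percentile(vol, clip_pct)
    vol = np.clip(vol, lo, hi)
    vol = (vol - lo) / (hi - lo + 1e-8)
    # Crop each axis down to a multiple of the patch size, then tile.
    dz, dy, dx = (s - s % patch for s in vol.shape)
    vol = vol[:dz, :dy, :dx]
    patches = (vol
               .reshape(dz // patch, patch, dy // patch, patch, dx // patch, patch)
               .transpose(0, 2, 4, 1, 3, 5)
               .reshape(-1, patch, patch, patch))
    return patches
```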
- Ddrand: Synthetic data generated using an elaborate domain randomization scheme tailored for 3D blood vessels. This pipeline involves:
  - Foreground generation: Starting with realistic vascular patches derived from corrosion casts, spatial transformations (cropping, flipping, rotation, dilation, zooming, elastic deformation, smoothing) are applied to create diverse synthetic masks (Msyn). Subsequently, artifact transformations (bias field, noise, smoothing, dropout, shift, hull) are applied to emulate real-world imaging artifacts.
  - Background generation: Background images (B) are created with various geometries (spheres, polyhedrons, none) and textures modeled using versatile Perlin noise patterns, along with plain backgrounds.
  - Fore- and background merging: Synthetic masks $T(\mathcal{M}_\text{syn})$ are merged into the background images B using addition/subtraction or replacement, ensuring foreground intensities remain distinct from the background. Finally, a wide range of intensity transformations (bias field, Gaussian noise, k-space spikes, contrast adjustment, Gaussian smoothing, Rician noise, Gibbs noise, sharpening, histogram transformation) is applied to further increase domain diversity.
This source is designed to comprehensively cover the general domain of 3D vascular images with semi-randomized styles. 500,000 image-mask pairs of size 128³ were generated for Ddrand.
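As a rough illustration of the merging step, the sketch below replaces background voxels under the mask with distinct foreground intensities and then applies a global noise transform. Smoothed Gaussian noise stands in for the Perlin textures, and all parameter values are illustrative assumptions rather than the paper's settings:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def merge_fore_background(mask, size=64, fg_intensity=0.8, seed=0):
    """Merge a synthetic vessel mask into a textured background by
    replacement. Low-frequency smoothed noise approximates the Perlin
    textures used in the paper (an assumption for this sketch)."""
    rng = np.random.default_rng(seed)
    background = gaussian_filter(rng.random((size,) * 3), sigma=4)  # low-freq texture
    img = background.copy()
    # Replacement merging: foreground intensities stay distinct from background.
    img[mask > 0] = fg_intensity + 0.05 * rng.standard_normal(int((mask > 0).sum()))
    # Simplified stand-in for the final intensity transformation stage.
    img += 0.02 * rng.standard_normal(img.shape)
    return img
```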
- Dflow: Synthetic data sampled from a mask- and class-conditioned flow matching-based generative model (F). Flow matching is used as an alternative to diffusion models, trained to learn a time-dependent velocity field $v_\theta(x_t, m, c, t)$ that maps samples from a normal distribution ($x_0$) to data distribution samples ($x_1$) via the ODE

$$\frac{dx_t}{dt} = v_\theta(x_t, m, c, t).$$

The model is trained using the Conditional Flow Matching (CFM) objective, minimizing $\|v_\theta(x_t, m, c, t) - u_t(x_t \mid x_1)\|^2$, where $u_t(x_t \mid x_1) = (x_1 - x_t)/(1 - t)$ for the time-linear forward process $x_t = t x_1 + (1 - t) x_0$. Mask conditioning is achieved by concatenating the mask channel-wise, and class conditioning by adding class embeddings to time embeddings. F is trained on data from Dreal and Ddrand but samples images conditioned exclusively on synthetic masks Msyn to avoid incorporating annotator biases from Dreal. Dflow effectively broadens the distributions present in Dreal in a data-driven manner. 10,000 image-mask pairs of size 128³ were sampled for Dflow.
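The CFM objective can be written as a short training-step sketch. Note that for the time-linear process, the target $(x_1 - x_t)/(1 - t)$ simplifies to $x_1 - x_0$; the network interface `v_theta(x_t, mask, cls, t)` and tensor shapes below are placeholders, not vesselFM's actual implementation:

```python
import torch

def cfm_loss(v_theta, x1, mask, cls):
    """One Conditional Flow Matching step for the linear forward
    process x_t = t*x1 + (1-t)*x0. `v_theta` is any network taking
    (x_t, mask, class, t); argument names are illustrative."""
    x0 = torch.randn_like(x1)                       # sample from the prior
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))
    xt = t * x1 + (1 - t) * x0                      # time-linear interpolation
    target = x1 - x0                                # equals (x1 - xt) / (1 - t)
    pred = v_theta(xt, mask, cls, t)
    return ((pred - target) ** 2).mean()
```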
vesselFM uses a UNet architecture [isensee2021nnu] as the segmentation model. It is trained on the combined dataset (Dreal, Ddrand, Dflow) with sampling weights roughly proportional to the datasets' sizes (70% Ddrand, 20% Dreal, 10% Dflow). Training uses a combination of Dice and cross-entropy losses, linear warm-up, and cosine annealing. Fine-tuning for one- and few-shot tasks is performed with lighter data augmentation.
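A minimal version of the combined Dice and cross-entropy loss might look as follows; the equal weighting of the two terms and the smoothing constant are assumptions, since the paper's exact formulation is not given here:

```python
import torch
import torch.nn.functional as F

def dice_ce_loss(logits, target, eps=1e-6):
    """Combined soft-Dice + binary cross-entropy loss for binary
    vessel segmentation (a common formulation; term weighting is an
    assumption, not the paper's stated configuration)."""
    prob = torch.sigmoid(logits)
    inter = (prob * target).sum()
    dice = 1 - (2 * inter + eps) / (prob.sum() + target.sum() + eps)
    ce = F.binary_cross_entropy_with_logits(logits, target)
    return dice + ce
```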
The model was evaluated on four unseen datasets: SMILE-UHURA (human brain MRA), BvEM (mouse brain vEM), OCTA (mouse brain OCTA), and MSD8 (human liver CT), covering clinical and pre-clinical relevance. Evaluation metrics included Dice and clDice. vesselFM was compared against state-of-the-art medical segmentation foundation models: tUbeNet [holroyd2023tube], VISTA3D [he2024vista3d], SAM-Med3D [wang2024sammed3d], and MedSAM-2 [zhu2024medical].
Quantitative Results: vesselFM consistently and significantly outperforms all baseline models across zero-, one-, and few-shot settings on all evaluation datasets. In the zero-shot task, vesselFM achieves considerably higher Dice and clDice scores than baselines, demonstrating strong generalization even to challenging modalities like BvEM and OCTA, and outperforming models like VISTA3D trained on similar data (MSD8). General-purpose SAM-like models failed in the zero-shot setting for vessel segmentation. Fine-tuning vesselFM in one- or few-shot scenarios further improves performance, and ablations show that pre-training on the three data sources is crucial compared to training from scratch.
Qualitative Results: Visual results confirm the quantitative findings, showing that vesselFM segments blood vessels accurately with high fidelity, free of common artifacts, and preserving tubular structure. It also demonstrates the ability to segment other tubular structures (axons, colon) in some modalities, highlighting a strong inductive bias towards tubular shapes.
Ablation Studies:
- Data Sources: Ablating the training data sources confirmed the importance of combining all three. Augmenting Dreal with Ddrand and Dflow resulted in a significant performance increase (e.g., +9.21 Dice on SMILE-UHURA zero-shot).
- Flow Matching: Ablations on the flow matching model F showed that training F on Ddrand is beneficial, conditioning on synthetic masks Msyn is better than real masks Mreal (due to diversity and lack of annotator bias), and flow matching outperforms the diffusion-based Med-DDPM both quantitatively and qualitatively in generating realistic vessel images.
- Architecture: Ablating the segmentation network architecture showed that the chosen UNet variant [isensee2021nnu] performed best compared to other UNet-based and Transformer-based models like SwinUNETR, UNETR, 3D UX-Net, and MedNeXt.
Implementation Considerations:
- Data Requirements: Training requires a large, diverse dataset like Dreal, synthetic data generation pipelines (Ddrand), and a generative model like flow matching (Dflow). Curating and generating these datasets are substantial undertakings.
- Computational Resources: Training vesselFM and the generative model F is computationally intensive; sampling the 10,000 Dflow pairs from F alone took three days on a single RTX A6000 GPU, while vesselFM itself was trained on a single V100 GPU. Scaling to larger datasets or models would require more resources.
- Architecture: The UNet architecture from nnU-Net is used, known for its effectiveness in medical image segmentation but potentially less parameter-efficient than Transformers for very large models.
- Deployment: The model is designed for zero-, one-, or few-shot application, making it practical for scenarios with limited annotated data in new domains. Inference operates on 128³ patches, so larger volumes require strategies such as sliding-window inference.
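The sliding-window strategy mentioned above can be sketched as overlap-averaging of patch predictions. This is a generic illustration (patch/stride values and uniform averaging are assumptions; vesselFM's released inference code may differ), and it assumes every volume dimension is at least the patch size:

```python
import torch

@torch.no_grad()
def sliding_window_predict(model, volume, patch=128, stride=64):
    """Average overlapping patch predictions over a volume larger than
    the training patch size (generic sketch, not vesselFM's exact code).
    Assumes each dimension of `volume` is >= `patch`."""
    D, H, W = volume.shape
    logits = torch.zeros_like(volume)
    count = torch.zeros_like(volume)
    # Window start positions, always including one flush with the end.
    starts = lambda dim: list(range(0, dim - patch, stride)) + [dim - patch]
    for z in starts(D):
        for y in starts(H):
            for x in starts(W):
                crop = volume[z:z+patch, y:y+patch, x:x+patch]
                pred = model(crop[None, None])[0, 0]   # (1,1,p,p,p) in and out
                logits[z:z+patch, y:y+patch, x:x+patch] += pred
                count[z:z+patch, y:y+patch, x:x+patch] += 1
    return logits / count
```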
- Limitations: Currently limited to binary blood vessel segmentation. The model also shows a tendency to segment other tubular structures, which might be considered a limitation depending on the specific application, although it also suggests a strong tubular-shape prior.
The authors have open-sourced the checkpoints and code, providing a practical, out-of-the-box tool for researchers and clinicians. Future work aims to extend vesselFM to segment other tubular structures, improve connectivity post-processing (potentially using graph-based methods), and handle multi-class segmentation tasks. The research significantly pushes the state-of-the-art in 3D blood vessel segmentation, potentially enabling more precise analysis, diagnosis, and treatment of vascular disorders.