- The paper introduces Dino U-Net, integrating a frozen DINOv3 encoder with a U-Net decoder via a fidelity-aware projection module for high-precision segmentation.
- It demonstrates superior performance, improving average Dice by 1.87 points and reducing average HD95 by 3.04 relative to the best baseline across seven diverse medical imaging datasets.
- By freezing the encoder and training only lightweight modules, the approach achieves parameter efficiency and robust generalization in 2D medical image segmentation.
Dino U-Net: Leveraging High-Fidelity Dense Features from Foundation Models for Medical Image Segmentation
Introduction
Dino U-Net introduces a hybrid encoder-decoder architecture that pairs a frozen DINOv3 vision transformer encoder with a U-Net decoder, targeting high-precision medical image segmentation. The central innovations are the exploitation of high-fidelity dense features from DINOv3, a self-supervised vision foundation model, and a fidelity-aware projection module (FAPM) that preserves and adapts these features for downstream segmentation. The architecture is evaluated across seven diverse medical imaging datasets, demonstrating consistent improvements over state-of-the-art baselines in both segmentation accuracy and parameter efficiency.
Figure 1: Overview of the Dino U-Net architecture, highlighting the frozen DINOv3 encoder, adapter, FAPM, and U-Net decoder. Only the adapter, FAPM, and decoder are trainable.
Architecture and Methodology
Encoder: Frozen DINOv3 with Adapter
The encoder leverages a frozen DINOv3 backbone, pre-trained on large-scale natural image data. To bridge the domain gap and fuse semantic and spatial information, an adapter module is introduced. The adapter employs a dual-branch design: a spatial prior module (SPM) extracts low-level spatial features, while the DINOv3 backbone provides high-level semantic features. These are fused via deformable cross-attention in a series of interaction blocks, producing multi-scale, semantically enriched feature maps.
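To make the fusion concrete, here is a minimal PyTorch sketch of one interaction block. It substitutes standard multi-head cross-attention for the deformable cross-attention described in the paper, and the module and argument names are illustrative rather than taken from any released code.

```python
import torch
import torch.nn as nn

class InteractionBlock(nn.Module):
    """Fuses low-level spatial-prior tokens with frozen DINOv3 tokens.

    Simplified sketch: standard multi-head cross-attention stands in for
    the paper's deformable cross-attention; names are illustrative.
    """
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, spatial_tokens: torch.Tensor,
                dino_tokens: torch.Tensor) -> torch.Tensor:
        # Spatial-prior tokens (queries) attend to the frozen semantic
        # tokens (keys/values); a residual keeps the low-level signal.
        q = self.norm_q(spatial_tokens)
        kv = self.norm_kv(dino_tokens)
        fused, _ = self.attn(q, kv, kv, need_weights=False)
        return spatial_tokens + fused
```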
Fidelity-Aware Projection Module (FAPM)
The FAPM addresses the challenge of reducing high-dimensional DINOv3 features to the lower-dimensional space required by the U-Net decoder. It operates in two stages, sketched in code after the list:
- Orthogonal Decomposition: Multi-scale features are projected into a shared low-rank space using a 1x1 convolution, separating scale-invariant context from scale-specific details.
- Progressive Refinement: Each scale undergoes further refinement via depthwise separable convolutions, squeeze-and-excitation (SE) blocks, and residual connections, restoring expressive capacity and ensuring high-fidelity feature transfer.
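A minimal PyTorch sketch of these two stages follows, assuming a single 1x1 projection shared across all scales (plausibly the parameter-sharing mechanism noted in the ablation) and per-scale depthwise-separable refinement with squeeze-and-excitation and a residual; all names and hyperparameters are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-excitation channel reweighting."""
    def __init__(self, ch: int, r: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(ch, ch // r, 1), nn.ReLU(inplace=True),
            nn.Conv2d(ch // r, ch, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.fc(x)

class FAPM(nn.Module):
    """Fidelity-aware projection, sketched from the paper's description.

    Stage 1: one 1x1 conv, shared across scales, projects DINOv3 features
    into a common low-rank space (the sharing is what saves parameters).
    Stage 2: per-scale depthwise-separable refinement with SE and a
    residual connection restores expressive capacity.
    """
    def __init__(self, in_ch: int, out_ch: int, num_scales: int = 4):
        super().__init__()
        self.shared_proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # stage 1
        self.refine = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(out_ch, out_ch, 3, padding=1, groups=out_ch),  # depthwise
                nn.Conv2d(out_ch, out_ch, 1),                            # pointwise
                nn.BatchNorm2d(out_ch), nn.GELU(),
                SEBlock(out_ch),
            )
            for _ in range(num_scales)
        ])

    def forward(self, feats: list[torch.Tensor]) -> list[torch.Tensor]:
        # feats: one (B, in_ch, H_i, W_i) map per scale.
        out = []
        for f, block in zip(feats, self.refine):
            z = self.shared_proj(f)   # shared low-rank projection
            out.append(z + block(z))  # per-scale refinement + residual
        return out
```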
Figure 2: Detailed architecture of the FAPM, illustrating the orthogonal decomposition and progressive refinement stages.
Decoder: Standard U-Net
The decoder is a conventional U-Net, receiving the refined, multi-scale skip connections from the FAPM. This design enables precise localization and boundary delineation, leveraging the rich representations from the encoder.
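For reference, one conventional decoder stage of this kind could look as follows; this is a textbook U-Net block consuming an FAPM-refined skip connection, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One standard U-Net decoder stage: upsample, concatenate the
    refined skip connection, then two 3x3 conv layers."""
    def __init__(self, in_ch: int, skip_ch: int, out_ch: int):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, kernel_size=2, stride=2)
        self.conv = nn.Sequential(
            nn.Conv2d(out_ch + skip_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor, skip: torch.Tensor) -> torch.Tensor:
        x = self.up(x)
        x = torch.cat([x, skip], dim=1)  # fuse the FAPM-refined skip
        return self.conv(x)
```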
Experimental Evaluation
Datasets and Baselines
Dino U-Net is evaluated on seven public datasets spanning endoscopy, fundus imaging, ultrasound, microscopy, cardiac MRI, prostate MRI, and surgical endoscopy. Baselines include nnU-Net, SegResNet, UNet++, U-Mamba, U-KAN, Swin U-Mamba, and SAM2-UNet, covering both conventional and foundation model-based approaches.
Implementation Details
All experiments are conducted in PyTorch, following nnU-Net's standardized preprocessing, data augmentation, and training protocols. Only the adapter, FAPM, and decoder are trainable; the DINOv3 backbone remains frozen. Models are trained with a combined Dice and cross-entropy loss, using the Adam optimizer with polynomial learning-rate decay. Inference employs sliding-window prediction with Gaussian weighting.
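A sketch of this setup is below, assuming the model exposes a `backbone` attribute for the frozen encoder; attribute names, learning rate, and epoch count are illustrative (nnU-Net conventionally uses 1000 epochs and power-0.9 polynomial decay).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_dice_loss(logits: torch.Tensor, target: torch.Tensor,
                   eps: float = 1e-5) -> torch.Tensor:
    """Soft Dice over one-hot targets; a common formulation that may
    differ in detail from the authors' exact implementation."""
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    return 1 - ((2 * inter + eps) / (denom + eps)).mean()

def configure_training(model: nn.Module, max_epochs: int = 1000,
                       lr: float = 1e-4):
    """Freeze the DINOv3 backbone, then set up Adam + polynomial decay.
    Assumes `model.backbone` holds the encoder (illustrative name)."""
    for p in model.backbone.parameters():
        p.requires_grad = False  # only adapter, FAPM, decoder stay trainable
    trainable = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(trainable, lr=lr)
    scheduler = torch.optim.lr_scheduler.PolynomialLR(
        optimizer, total_iters=max_epochs, power=0.9)
    ce = nn.CrossEntropyLoss()
    loss_fn = lambda logits, y: ce(logits, y) + soft_dice_loss(logits, y)
    return optimizer, scheduler, loss_fn
```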
Quantitative and Qualitative Results
Dino U-Net consistently outperforms all baselines across datasets in both Dice Similarity Coefficient and 95% Hausdorff Distance (HD95). The variant built on the 7B-parameter DINOv3 backbone achieves the highest average Dice (76.43%) and lowest average HD95 (18.76), improving Dice by 1.87 points and reducing HD95 by 3.04 relative to the best baseline. Notably, even the smallest Dino U-Net S model (5.1M parameters) surpasses all baselines in average performance, demonstrating strong parameter efficiency.
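For readers reproducing these numbers, the two reported metrics can be computed per case roughly as follows; this uses a common surface-based HD95 formulation (as in medpy) and may differ in detail from the paper's evaluation pipeline, e.g. in pixel-spacing handling.

```python
import numpy as np
from scipy.ndimage import binary_erosion, distance_transform_edt

def dice_coefficient(pred: np.ndarray, gt: np.ndarray,
                     eps: float = 1e-8) -> float:
    """Dice similarity coefficient for binary masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum()
    return float((2.0 * inter + eps) / (pred.sum() + gt.sum() + eps))

def hd95(pred: np.ndarray, gt: np.ndarray) -> float:
    """95th-percentile symmetric Hausdorff distance between mask surfaces."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    if not pred.any() or not gt.any():
        return float("inf")
    # Surface = foreground pixels removed by one erosion step.
    pred_surf = pred & ~binary_erosion(pred)
    gt_surf = gt & ~binary_erosion(gt)
    # Distance from each surface pixel to the other mask's surface.
    d_pred_to_gt = distance_transform_edt(~gt_surf)[pred_surf]
    d_gt_to_pred = distance_transform_edt(~pred_surf)[gt_surf]
    return float(max(np.percentile(d_pred_to_gt, 95),
                     np.percentile(d_gt_to_pred, 95)))
```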
Figure 3: Qualitative comparison of segmentation results across seven datasets, showing Dino U-Net's superior boundary delineation and completeness.
Ablation Study: FAPM
Ablation experiments confirm the critical role of FAPM. Removing FAPM and replacing it with simple 1x1 convolutions leads to a 0.56–0.79% drop in Dice and a 0.09–1.75 mm increase in HD95, with boundary precision affected most. For the largest model, FAPM also reduces the parameter count thanks to its parameter-sharing mechanism.
Figure 4: Ablation study results, demonstrating FAPM's impact on parameter efficiency and segmentation accuracy.
Discussion
The results substantiate several key claims:
- General-Purpose Features vs. Segmentation-Specific Features: DINOv3's self-supervised, general-purpose representations provide a more robust inductive bias for medical segmentation than segmentation-optimized encoders like SAM. This is attributed to DINOv3's ability to capture texture, context, and material properties, which are critical for differentiating ambiguous or low-contrast anatomical structures.
- Scalability: Segmentation performance scales with the size of the DINOv3 backbone, with no observed saturation up to 7B parameters. This suggests that further scaling of foundation models may yield additional gains in medical imaging tasks.
- Parameter Efficiency: By freezing the encoder and training only lightweight modules, Dino U-Net achieves state-of-the-art results with fewer trainable parameters, reducing overfitting risk and computational burden during training.
- FAPM's Role: The fidelity-aware projection is essential for preserving high-fidelity features during dimensionality reduction, directly impacting boundary accuracy and overall segmentation quality.
Limitations and Future Directions
- 2D Limitation: The current architecture is restricted to 2D segmentation. Extending to 3D volumetric data (e.g., CT, MRI) is a logical next step.
- Inference Cost: While training is efficient, inference with the 7B backbone is resource-intensive. Model distillation or pruning could mitigate this.
- Generalization: Further evaluation on rare pathologies and cross-domain transfer would strengthen claims of generalization.
Conclusion
Dino U-Net demonstrates that integrating a frozen, large-scale self-supervised vision foundation model with a U-Net decoder, augmented by a fidelity-aware projection module, yields superior and efficient medical image segmentation. The approach leverages the representational power of billion-parameter models while maintaining practical training requirements. This work suggests that future advances in medical imaging will increasingly depend on the effective adaptation and projection of foundation model features, with further gains likely as model scale and projection techniques evolve.