FreqDINO: Frequency-Guided Ultrasound Segmentation
- FreqDINO is a frequency-guided segmentation framework that enhances boundary localization in ultrasound images using advanced transformer representations.
- The system integrates multi-scale frequency extraction and boundary-guided feature refinement to mitigate speckle noise and improve structural accuracy.
- Quantitative results show improved Dice scores and reduced Hausdorff distance, validating its effectiveness over baseline models.
FreqDINO is a frequency-guided segmentation framework designed for generalized, boundary-aware ultrasound image segmentation. It combines state-of-the-art visual transformer representations with explicitly frequency-driven mechanisms to enhance boundary localization and structural accuracy in challenging medical imaging scenarios. The method addresses modality-specific degradation, notably the speckle noise and boundary artifacts that impair performance when vision transformers pretrained on natural images are applied to ultrasound. Central to FreqDINO is the integration of multi-scale frequency extraction, boundary feature alignment, and frequency-guided boundary refinement within a unified deep learning architecture (Zhang et al., 12 Dec 2025).
1. Architectural Foundations and Motivations
FreqDINO builds upon the DINOv3 visual transformer, leveraging its strong feature extraction abilities while introducing domain-specific enhancements for ultrasound-specific boundary challenges. The motivation rests on the observation that models pretrained on natural images lack effective mechanisms to distinguish high-frequency boundary details from modality-specific noise, resulting in smoothed or imprecise segmentation borders. FreqDINO introduces frequency-guided modules—specifically, Multi-scale Frequency Extraction and Alignment (MFEA), the Frequency-Guided Boundary Refinement (FGBR) module, and a Multi-task Boundary-Guided Decoder (MBGD)—to explicitly enhance boundary perception and enforce structural consistency in the final segmentation (Zhang et al., 12 Dec 2025).
2. Multi-scale Frequency Extraction and Alignment (MFEA)
The MFEA component separates the backbone spatial features into a low-frequency structural component and multi-scale high-frequency boundary representations, enabling frequency-disentangled processing. The process is initiated by applying a Haar wavelet transform to the spatial feature map produced by the DINOv3 encoder and adapters. The wavelet decomposition produces four subbands: the low-frequency approximation LL and the detail subbands LH, HL, and HH. Fine-scale boundary features are obtained by concatenating the detail subbands and reducing them with a convolution, while coarse-scale features are generated via further down-/up-sampling, a second Haar transform, and the same reduction procedure. Both boundary feature tensors have shape $(B, C, H, W)$, where $B$ is the batch size, $C$ the channel width, and $H \times W$ the spatial resolution (Zhang et al., 12 Dec 2025).
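The single-level Haar decomposition underlying MFEA can be sketched in pure Python. This is a minimal illustration on a 2D array with the helper name chosen here for clarity; the actual model applies the transform per channel to batched feature tensors.

```python
def haar2d(x):
    """Single-level 2D Haar wavelet transform of a 2D list with even dims.

    Returns the four subbands (LL, LH, HL, HH), each half the input size.
    LL is the low-frequency approximation; LH/HL/HH carry the high-frequency
    detail that MFEA concatenates into boundary features.
    """
    h, w = len(x), len(x[0])
    half = 0.5  # product of the two 1/sqrt(2) row/column filter norms
    LL = [[0.0] * (w // 2) for _ in range(h // 2)]
    LH = [[0.0] * (w // 2) for _ in range(h // 2)]
    HL = [[0.0] * (w // 2) for _ in range(h // 2)]
    HH = [[0.0] * (w // 2) for _ in range(h // 2)]
    for i in range(0, h, 2):
        for j in range(0, w, 2):
            a, b = x[i][j], x[i][j + 1]
            c, d = x[i + 1][j], x[i + 1][j + 1]
            LL[i // 2][j // 2] = (a + b + c + d) * half  # approximation
            LH[i // 2][j // 2] = (a + b - c - d) * half  # vertical detail
            HL[i // 2][j // 2] = (a - b + c - d) * half  # horizontal detail
            HH[i // 2][j // 2] = (a - b - c + d) * half  # diagonal detail
    return LL, LH, HL, HH
```

Applying `haar2d` again to a downsampled LL band gives the coarser-scale detail subbands described above.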
3. Frequency-Guided Boundary Refinement (FGBR) Module
At the core of FreqDINO is the FGBR module, which exploits frequency-extracted features to enforce boundary sensitivity:
- Boundary Prototype Extraction: The two high-frequency maps are concatenated across channels to form a $2C$-channel boundary feature map. A stack of two convolutional layers with ReLU activations is applied (the first reducing the channel dimension, the second mapping to 64 channels), followed by global average pooling across spatial dimensions, yielding a batch of 64-dimensional boundary prototypes.
- Boundary-Guided Feature Refinement: The enhanced spatial features from MFEA are flattened for attention to shape $(B, HW, C)$ and projected to query vectors. The boundary prototype is linearly projected to obtain the key/value tensors for an 8-head scaled dot-product attention. The attention output is reshaped back to the spatial layout and added, weighted by a small residual scale, to the MFEA features, forming the refined feature map (Zhang et al., 12 Dec 2025).
The FGBR module thus fuses frequency-derived boundary statistics with spatial detail, directly influencing learned segmentation boundaries.
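The prototype-attention idea can be illustrated with a single-head, pure-Python sketch under simplifying assumptions (the paper uses 8-head attention over learned projections; the function names and the `gamma` residual scale here are illustrative). With a single boundary prototype serving as the only key/value, softmax attention returns the prototype's value vector at every spatial location, which is then blended back residually.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(v - m) for v in xs]
    z = sum(es)
    return [e / z for e in es]

def prototype_attention(queries, keys, values, gamma=0.1):
    """Scaled dot-product attention of spatial queries against boundary
    prototypes, with residual blending (illustrative single-head version).

    queries: list of C-dim vectors, one per spatial location
    keys/values: list of C-dim prototype vectors
    gamma: small residual scale applied to the attention output
    """
    d = len(keys[0])
    refined = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        w = softmax(scores)
        attn = [sum(wi * v[j] for wi, v in zip(w, values)) for j in range(d)]
        # Residual integration: keep the spatial detail, add scaled boundary cue.
        refined.append([qj + gamma * aj for qj, aj in zip(q, attn)])
    return refined
```

The residual form means the boundary statistics perturb, rather than replace, the spatial features, which matches the lightweight integration described above.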
4. Multi-Task Boundary-Guided Decoder (MBGD) and Integrated Pipeline
The refined feature map from FGBR enters the MBGD, which upscales the features and computes both semantic segmentation masks and explicit boundary maps:
- The decoder applies four transposed-convolution upsampling blocks ("UpBlocks") to produce a high-resolution feature map.
- A convolution produces preliminary boundary logits, which are transformed into a soft mask via a sigmoid and refined with a further convolution to give the final boundary output.
- The semantic mask head takes as input the concatenation of the high-resolution features and the boundary prediction, followed by a convolution that produces the segmentation mask.
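The boundary-first ordering of the decoder heads can be illustrated with a toy per-pixel version (pure Python; the function names are hypothetical and the real heads are convolutional, but the data flow is the same: boundary head first, its soft mask concatenated onto the features for the semantic head).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def decoder_heads(features, boundary_logits):
    """Toy per-pixel sketch of the MBGD head ordering.

    features: list of per-pixel feature vectors
    boundary_logits: one preliminary boundary logit per pixel
    Returns the soft boundary mask and the enriched semantic-head input.
    """
    boundary_soft = [sigmoid(b) for b in boundary_logits]  # soft boundary mask
    # Concatenate the boundary probability onto each pixel's features,
    # mirroring the channel-wise concat before the semantic mask head.
    semantic_input = [f + [p] for f, p in zip(features, boundary_soft)]
    return boundary_soft, semantic_input
```

Conditioning the semantic head on the boundary prediction is what lets the boundary supervision shape the final mask, rather than the two heads being trained independently.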
The pipeline sequence is: Input → DINOv3 encoder → MFEA → FGBR → MBGD → semantic & boundary predictions (Zhang et al., 12 Dec 2025).
5. Quantitative Performance and Ablation
Experimental results underscore the contribution of the FGBR module within FreqDINO. On ultrasound segmentation benchmarks:
- The base DINOv3 + adapters yields Dice = 82.35%, HD = 47.59 mm.
- Adding MFEA alone improves to Dice = 84.17%, HD = 44.59 mm.
- Adding FGBR atop MFEA further yields Dice = 85.13% (+0.96), HD = 43.02 mm (–1.57 mm).
- The full FreqDINO (MFEA + FGBR + MBGD) records Dice = 86.52%, HD = 39.63 mm.
This demonstrates that FGBR provides a measurable boost in boundary accuracy and overall segmentation agreement relative to frequency feature extraction alone (Zhang et al., 12 Dec 2025).
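For reference, the two metrics reported above can be computed as follows. This is a pure-Python sketch of the standard definitions (Dice over binary masks, symmetric Hausdorff distance over boundary point sets), not code from the paper; in practice the Hausdorff distance would be scaled by pixel spacing to yield millimetres.

```python
import math

def dice(a, b):
    """Dice coefficient between two binary masks given as flat 0/1 lists."""
    inter = sum(x * y for x, y in zip(a, b))
    total = sum(a) + sum(b)
    return 2.0 * inter / total if total else 1.0

def hausdorff(pts_a, pts_b):
    """Symmetric Hausdorff distance between two 2D point sets,
    e.g. predicted and reference boundary pixels."""
    def directed(src, dst):
        return max(min(math.dist(p, q) for q in dst) for p in src)
    return max(directed(pts_a, pts_b), directed(pts_b, pts_a))
```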
6. Comparative and Related Approaches
FreqDINO's FGBR concept is related to the Frequency-Guided Boundary Refinement mechanisms appearing across scientific domains, with notable analogs:
- In axisymmetric droplet simulations, a signal processing approach uses Fourier-domain envelope analysis of curvature to guide mesh refinement, delivering robust and parametric grid adaptation for capturing singularity formation (Koga, 2019).
- Temporal action detection in video leverages frequency decoupling to suppress low-frequency background and amplify atomic (high-frequency) segment boundaries, with analogous modules for frequency-guided action boundary localization (Zhu et al., 1 Apr 2025).
A plausible implication is that the frequency-guided signal processing paradigm is establishing a methodological connection between computational physics, video understanding, and medical image analysis, where boundary localization under noise and class imbalance is critical.
7. Implementation Considerations and Reproducibility
Implementation of FreqDINO's FGBR should adhere to the specifications described: a minimal prototype extractor (two convolutions with ReLU), standard multi-head attention with 8 heads and $128$ dimensions per head, and lightweight residual integration with a small residual scale. The architecture relies on standard PyTorch MultiheadAttention primitives and basic convolutional units. The code for FreqDINO is available at https://github.com/MingLang-FD/FreqDINO (Zhang et al., 12 Dec 2025).
In summary, FreqDINO combines frequency decomposition, boundary prototype learning, and attention-driven feature refinement to deliver state-of-the-art segmentation, particularly excelling in boundary-sensitive, high-noise imaging contexts characteristic of ultrasound.