Self-Supervised Ultrasound-Video Segmentation with Feature Prediction and 3D Localised Loss (2507.18424v1)

Published 24 Jul 2025 in cs.CV

Abstract: Acquiring and annotating large datasets in ultrasound imaging is challenging due to low contrast, high noise, and susceptibility to artefacts. This process requires significant time and clinical expertise. Self-supervised learning (SSL) offers a promising solution by leveraging unlabelled data to learn useful representations, enabling improved segmentation performance when annotated data is limited. Recent state-of-the-art developments in SSL for video data include V-JEPA, a framework solely based on feature prediction, avoiding pixel level reconstruction or negative samples. We hypothesise that V-JEPA is well-suited to ultrasound imaging, as it is less sensitive to noisy pixel-level detail while effectively leveraging temporal information. To the best of our knowledge, this is the first study to adopt V-JEPA for ultrasound video data. Similar to other patch-based masking SSL techniques such as VideoMAE, V-JEPA is well-suited to ViT-based models. However, ViTs can underperform on small medical datasets due to lack of inductive biases, limited spatial locality and absence of hierarchical feature learning. To improve locality understanding, we propose a novel 3D localisation auxiliary task to improve locality in ViT representations during V-JEPA pre-training. Our results show V-JEPA with our auxiliary task improves segmentation performance significantly across various frozen encoder configurations, with gains up to 3.4% using 100% and up to 8.35% using only 10% of the training data.

Summary

  • The paper applies V-JEPA to ultrasound video and adds a novel 3D localisation auxiliary task that improves the spatial and temporal locality of ViT representations for segmentation.
  • Experiments on the CAMUS dataset show a DSC gain of up to 8.35% using only 10% of the training data, highlighting the robustness of the method under data scarcity.
  • The work advances self-supervised learning in medical imaging by alleviating annotation challenges and enhancing ViT representations for ultrasound analysis.

Introduction

The paper "Self-Supervised Ultrasound-Video Segmentation with Feature Prediction and 3D Localised Loss" (2507.18424) discusses the challenges of acquiring and annotating large ultrasound datasets, which stem from low contrast, high noise, and susceptibility to artefacts. It highlights self-supervised learning (SSL) as a way to leverage unlabelled data and improve segmentation performance when annotated data is scarce. The paper adopts the existing V-JEPA framework, which is based on feature prediction rather than pixel-level reconstruction or negative sampling, and argues that this makes it well suited to ultrasound video, where pixel-level detail is noisy.

Methodology

The paper adopts the V-JEPA framework for ultrasound video segmentation. V-JEPA learns abstract representations by predicting the latent features of masked regions, which avoids the negative sampling used in contrastive SSL and the pixel reconstruction used in generative SSL. To address the limitations of Vision Transformers (ViTs) on small medical datasets, the paper introduces a novel 3D localisation auxiliary task. This task increases spatial and temporal sensitivity during V-JEPA pre-training, improving the locality of ViT representations.

Figure 1: Block Diagram of our 3D localisation auxiliary task incorporated in the V-JEPA SSL framework.
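The masked latent feature prediction objective described above can be illustrated with a minimal sketch. It assumes the setup from the published V-JEPA description (an L1 regression between predicted and target features of masked patches, with the target encoder kept as an exponential moving average of the online encoder); the summary here does not state these details, and the function names `l1_feature_loss` and `ema_update` are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def l1_feature_loss(predicted: Sequence[Vector], target: Sequence[Vector]) -> float:
    """Mean absolute error between predicted and target features.

    Each entry is the feature vector of one masked patch token. In a
    JEPA-style setup the targets are treated as constants (no gradient).
    """
    total, count = 0.0, 0
    for pred_vec, target_vec in zip(predicted, target):
        for p, t in zip(pred_vec, target_vec):
            total += abs(p - t)
            count += 1
    return total / count


def ema_update(target_weights: Vector, online_weights: Vector, momentum: float = 0.99) -> Vector:
    """Move the target encoder weights slowly towards the online encoder weights."""
    return [momentum * t + (1.0 - momentum) * o
            for t, o in zip(target_weights, online_weights)]


# Toy usage: two masked tokens with 2-dimensional features.
predicted = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.5, 2.0], [2.0, 4.0]]
loss = l1_feature_loss(predicted, target)
```

The key point is that the loss is computed in feature space, so speckle noise in individual pixels never has to be reconstructed.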

The auxiliary task calculates relative temporal, vertical, and horizontal distances between randomly sampled patch embeddings, enhancing ViT's spatial understanding during pre-training. The combined loss function integrates both JEPA and local loss, weighted by a hyperparameter λ, which dictates the importance of spatial localisation in the overall learning process.
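As a rough illustration of the auxiliary task, the sketch below samples random pairs of patch tokens from a (frames, rows, columns) token grid, derives their relative temporal, vertical, and horizontal offsets, and compares them with the output of a prediction head. Several details are assumptions rather than the paper's stated design: offsets are signed and normalised by the grid size, the error is an L1 distance, the head is passed in as a hypothetical callable, and the two losses are combined additively as JEPA loss + λ × local loss.

```python
import random
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[int, int, int]          # (t, y, x) index of a patch token
Offset = Tuple[float, float, float]   # relative (temporal, vertical, horizontal) offset


def token_coords(grid: Coord) -> List[Coord]:
    """Enumerate token positions of a (frames, rows, cols) grid in raster order."""
    frames, rows, cols = grid
    return [(t, y, x) for t in range(frames) for y in range(rows) for x in range(cols)]


def relative_offset(a: Coord, b: Coord, grid: Coord) -> Offset:
    """Signed offset from a to b, normalised per axis by the grid size (assumption)."""
    dt, dy, dx = ((bi - ai) / size for ai, bi, size in zip(a, b, grid))
    return (dt, dy, dx)


def sample_pairs(n_tokens: int, n_pairs: int, rng: random.Random) -> List[Tuple[int, int]]:
    """Draw random index pairs of patch tokens."""
    return [(rng.randrange(n_tokens), rng.randrange(n_tokens)) for _ in range(n_pairs)]


def local_loss(
    embeddings: Sequence[Sequence[float]],
    grid: Coord,
    pairs: Sequence[Tuple[int, int]],
    head: Callable[[Sequence[float], Sequence[float]], Sequence[float]],
) -> float:
    """Mean L1 error between predicted and true relative offsets over sampled pairs."""
    coords = token_coords(grid)
    total = 0.0
    for i, j in pairs:
        target = relative_offset(coords[i], coords[j], grid)
        pred = head(embeddings[i], embeddings[j])  # hypothetical prediction head
        total += sum(abs(p - q) for p, q in zip(pred, target)) / 3.0
    return total / len(pairs)


def combined_loss(jepa_loss: float, loc_loss: float, lam: float) -> float:
    """Additive weighting of the two terms (assumed form)."""
    return jepa_loss + lam * loc_loss
```

A larger λ pushes the encoder to keep positional information recoverable from its embeddings, while λ = 0 reduces the objective to plain V-JEPA pre-training.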

Experiments and Results

The experiments conducted on the CAMUS dataset demonstrate the effectiveness of the proposed approach. Various metrics such as Dice Similarity Coefficient (DSC), Jaccard Index (JI), precision, and recall are used to evaluate segmentation performance. The results indicate that integrating the local loss task with V-JEPA significantly improves segmentation accuracy, especially under scenarios with limited training data.
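For reference, the four metrics named above can be computed from true positive, false positive, and false negative pixel counts. The following is a minimal sketch for a single binary mask given as a flat sequence of 0/1 values; the function names are hypothetical, the paper's own evaluation code is not shown in this summary, and returning 0.0 when a denominator is zero is a convention chosen only for this example.

```python
from typing import Dict, Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int]:
    """Count true positives, false positives and false negatives over all pixels."""
    tp = sum(1 for p, t in zip(pred, truth) if p and t)
    fp = sum(1 for p, t in zip(pred, truth) if p and not t)
    fn = sum(1 for p, t in zip(pred, truth) if not p and t)
    return tp, fp, fn


def ratio(numerator: float, denominator: float) -> float:
    """Safe division; 0.0 for an empty denominator (convention for this sketch)."""
    return numerator / denominator if denominator else 0.0


def segmentation_metrics(pred: Sequence[int], truth: Sequence[int]) -> Dict[str, float]:
    """Dice Similarity Coefficient, Jaccard Index, precision and recall."""
    tp, fp, fn = confusion_counts(pred, truth)
    return {
        "dsc": ratio(2 * tp, 2 * tp + fp + fn),
        "ji": ratio(tp, tp + fp + fn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }


# Toy usage: 4-pixel masks with one TP, one FP and one FN.
metrics = segmentation_metrics([1, 1, 0, 0], [1, 0, 1, 0])
```

DSC and JI are monotonically related (DSC = 2·JI / (1 + JI)), so they always rank models identically on a single mask; precision and recall show whether errors come from over- or under-segmentation.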

Comparative analysis shows that V-JEPA with the localisation task yields better results than VideoMAE and supervised learning models. With frozen encoders, the reported gains reach up to 3.4% when 100% of the training data is used and up to 8.35% in DSC when only 10% is used, emphasising the robustness of the method in limited-data scenarios.

Implications and Future Work

This research has notable implications for the practical application of SSL in medical imaging, particularly in settings where data acquisition is challenging. The proposed 3D localisation auxiliary task presents a promising direction for enhancing the representation learning capability of ViTs, making them more suitable for medical datasets that typically suffer from small sample sizes.

Future work could explore the application of this method across diverse ultrasound video datasets. Additionally, integrating complementary approaches like hierarchical transformers or pyramid vision transformers could enhance performance further. These strategies can help address the inherent shortcomings of ViTs, such as limited spatial locality and lack of hierarchical feature learning, potentially making them more effective in broader medical imaging contexts.

Conclusion

The paper demonstrates the potential of self-supervised learning, particularly the V-JEPA framework, for ultrasound video segmentation. By addressing the spatial locality limitations of ViT models through a 3D localisation auxiliary task, it improves segmentation with frozen encoders, most clearly when annotated data is limited. Validation on further ultrasound datasets and combination with hierarchical architectures remain open directions.
