- The paper proposes the first self-supervised method for learning representations from ultrasound video-speech data using inherent inter-modal correlation.
- The method uses cross-modal contrastive learning and affinity-aware self-paced learning with VAD and frame order prediction to model video-speech correspondence.
- Learned representations effectively transfer to downstream tasks, showing strong performance on standard plane detection and eye-gaze saliency prediction.
The paper "Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound" (2008.06607) introduces a self-supervised method for learning representations from multi-modal ultrasound video-speech data, leveraging the inherent correlation between the visual and auditory streams without relying on manual annotations.
Proposed Methodology
The core of the approach lies in exploiting the synchronicity between ultrasound video and the sonographer's narrative speech. The underlying assumption is that a robust model can be trained to identify this correlation while simultaneously capturing relevant anatomical features present in the ultrasound imagery. The proposed framework models the correspondence between video and audio through two key components: cross-modal contrastive learning and affinity-aware self-paced learning.
Cross-Modal Contrastive Learning
This component aims to bring the representations of corresponding video-speech pairs closer together in a shared latent space, while pushing apart the representations of non-corresponding (negative) pairs. This is achieved by:
- Projecting video and speech features into a shared latent space using separate encoder networks.
- Applying a contrastive loss function that minimizes the distance between embeddings of positive pairs (i.e., synchronized video and speech) and pushes embeddings of negative pairs at least a margin apart. The loss can be expressed as:

$$\mathcal{L} = \sum_{i=1}^{N} \Big[ d\big(f(v_i), g(s_i)\big) + \max\big(0,\; m - d(f(v_i), g(s_j))\big) \Big]$$

where $v_i$ and $s_i$ denote the $i$-th video and speech samples; $f$ and $g$ are the video and speech encoders; $d$ is a distance metric (e.g., cosine distance); $m$ is the margin; and $s_j$ is a negative speech sample paired with video $v_i$.
- Employing a hard negative mining strategy to select challenging negative samples that are semantically similar to the positive samples; hard positive mining is also performed so that the model learns from the most difficult positive pairs (a minimal sketch of the loss with in-batch hard negative mining follows this list).
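The paper does not include reference code, so the following PyTorch sketch only illustrates the margin-based loss above combined with in-batch hard negative mining. The function name, margin value, cosine distance, and batch-wise mining strategy are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_video_speech_loss(video_emb: torch.Tensor,
                                  speech_emb: torch.Tensor,
                                  margin: float = 0.5) -> torch.Tensor:
    """Margin-based cross-modal contrastive loss with in-batch hard negative mining.

    video_emb, speech_emb: (N, D) outputs of the video encoder f and the speech
    encoder g for N synchronized clips (row i of each tensor forms a positive pair).
    Embeddings are L2-normalised so cosine distance d(x, y) = 1 - x.y lies in [0, 2].
    """
    video_emb = F.normalize(video_emb, dim=1)
    speech_emb = F.normalize(speech_emb, dim=1)

    # Pairwise cosine distances between every video and every speech clip.
    dist = 1.0 - video_emb @ speech_emb.t()              # (N, N)

    # Positive term: distance between each synchronized pair d(f(v_i), g(s_i)).
    pos = dist.diagonal()

    # Hard negative mining: for each video, keep only the closest
    # non-matching speech clip in the batch.
    eye = torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    hard_neg = dist.masked_fill(eye, float("inf")).min(dim=1).values

    # Pull positives together; push the hardest negative at least `margin` away.
    return (pos + F.relu(margin - hard_neg)).mean()
```

Hard positive mining could additionally re-weight or re-sample the matched pairs with the largest positive distances; that step is omitted here for brevity.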
Affinity-Aware Self-Paced Learning
This component addresses the challenges posed by background noise and irrelevant speech segments present in the audio data. It incorporates an energy-based voice activity detection (VAD) algorithm to divide the multi-modal data into different affinity levels, namely high-affinity and low-affinity segments. The self-paced learning scheme then prioritizes learning from high-affinity segments, where the correlation between video and speech is stronger. For low-affinity segments, the video data is leveraged by introducing an auxiliary pretext task of frame order prediction. In this task, the model is trained to predict the correct order of shuffled frames within a video clip, thereby encouraging the model to learn meaningful representations from the video data even in the absence of strong audio-visual correlation.
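A minimal sketch of how such an affinity split and the auxiliary frame order task could be wired up is shown below. The energy threshold, frame length, speech-ratio cutoff, and the idea of classifying over a fixed set of permutations are assumptions for illustration; the paper only specifies an energy-based VAD and a frame order prediction pretext task.

```python
import numpy as np
import torch.nn as nn


def energy_vad(waveform: np.ndarray, sample_rate: int,
               frame_ms: int = 25, energy_thresh: float = 1e-3) -> np.ndarray:
    """Energy-based VAD: boolean speech/non-speech flag per audio frame.

    Frame length and threshold are illustrative defaults.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1) > energy_thresh


def affinity_level(vad_flags: np.ndarray, speech_ratio: float = 0.5) -> str:
    """Label a clip high-affinity if a sufficient fraction of its audio is speech."""
    return "high" if vad_flags.mean() >= speech_ratio else "low"


class FrameOrderPredictor(nn.Module):
    """Auxiliary pretext head for low-affinity clips: given a clip whose frames
    were shuffled by one of `n_permutations` fixed permutations, predict which
    permutation was applied."""

    def __init__(self, video_encoder: nn.Module, feat_dim: int, n_permutations: int):
        super().__init__()
        self.video_encoder = video_encoder
        self.classifier = nn.Linear(feat_dim, n_permutations)

    def forward(self, shuffled_clip):
        return self.classifier(self.video_encoder(shuffled_clip))
```

High-affinity clips would be routed to the cross-modal contrastive loss, while low-affinity clips would contribute only through the frame order prediction objective.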
Experimental Evaluation
The learned representations are evaluated on two ultrasound-related downstream tasks to assess their generalization ability:
- Standard Plane Detection: This task involves classifying fetal ultrasound scans into 14 categories of standard planes (e.g., heart four-chamber view, brain transventricular plane). The dataset consists of fetal ultrasound scans with expert annotations of the standard planes.
- Eye-Gaze Saliency Prediction: This task aims to predict a 2D saliency map of the sonographer's eye-gaze during ultrasound examination. The same fetal ultrasound dataset is used, which also includes eye-gaze data recorded from sonographers. The prediction is framed as a regression problem, where the model maps the learned representations to a 2D eye-gaze saliency map (a sketch of attaching such task-specific heads to the pretrained encoder follows this list).
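As a rough illustration of the transfer setup, the pretrained video encoder can be reused with lightweight task-specific heads. The head architectures, feature dimension, and saliency map size below are assumptions, not the paper's actual fine-tuning configuration.

```python
import torch.nn as nn


class StandardPlaneClassifier(nn.Module):
    """Fine-tuning head for 14-way standard plane detection on top of the
    pretrained video encoder."""

    def __init__(self, video_encoder: nn.Module, feat_dim: int, n_classes: int = 14):
        super().__init__()
        self.encoder = video_encoder
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):
        return self.head(self.encoder(frames))           # class logits


class GazeSaliencyRegressor(nn.Module):
    """Regression head that decodes encoder features into a 2D eye-gaze
    saliency map (the decoder and map size are illustrative)."""

    def __init__(self, video_encoder: nn.Module, feat_dim: int, map_size: int = 64):
        super().__init__()
        self.encoder = video_encoder
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, map_size * map_size),
            nn.Unflatten(1, (1, map_size, map_size)),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        return self.decoder(self.encoder(frames))        # (N, 1, map_size, map_size)
```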
The experiments demonstrate that the proposed approach learns strong representations that transfer well to both downstream tasks.
Contributions
The main contributions of the paper are:
- Proposing the first self-supervised video-speech representation learning approach for ultrasound data, addressing the scarcity of annotated data in medical imaging.
- Introducing cross-modal contrastive learning and affinity-aware self-paced learning to effectively model the correspondence between video and speech in ultrasound data.
- Demonstrating the effectiveness of the proposed approach on two clinically relevant downstream tasks, showcasing its potential for improving ultrasound image analysis and interpretation.
In summary, this paper presents a novel self-supervised learning framework for ultrasound data that leverages the correlation between video and speech modalities. The proposed cross-modal contrastive learning and affinity-aware self-paced learning schemes enable the model to learn robust representations without manual annotations. The experimental results on standard plane detection and eye-gaze prediction demonstrate the effectiveness of the proposed approach for downstream tasks.