- The paper proposes the first self-supervised method for learning representations from ultrasound video-speech data using inherent inter-modal correlation.
- The method uses cross-modal contrastive learning and affinity-aware self-paced learning with VAD and frame order prediction to model video-speech correspondence.
- Learned representations effectively transfer to downstream tasks, showing strong performance on standard plane detection and eye-gaze saliency prediction.
The paper "Self-supervised Contrastive Video-Speech Representation Learning for Ultrasound" (2008.06607) introduces a self-supervised method for learning representations from multi-modal ultrasound video-speech data, leveraging the inherent correlation between the visual and auditory streams without relying on manual annotations.
Proposed Methodology
The core of the approach lies in exploiting the synchronicity between ultrasound video and the sonographer's narrative speech. The underlying assumption is that a robust model can be trained to identify this correlation while simultaneously capturing relevant anatomical features present in the ultrasound imagery. The proposed framework models the correspondence between video and audio through two key components: cross-modal contrastive learning and affinity-aware self-paced learning.
Cross-Modal Contrastive Learning
This component aims to bring the representations of corresponding video-speech pairs closer together in a shared latent space, while pushing apart the representations of non-corresponding (negative) pairs. This is achieved by:
- Projecting video and speech features into a shared latent space using separate encoder networks.
- Applying a contrastive loss function that minimizes the distance between embeddings of positive pairs (i.e., synchronized video and speech) and pushes embeddings of negative pairs at least a margin apart. The loss can be expressed as:

$$\mathcal{L} = \sum_{i=1}^{N} \Big[ d\big(f(v_i), g(s_i)\big) + \max\big(0,\; m - d(f(v_i), g(s_j))\big) \Big]$$

where $v_i$ and $s_i$ denote the $i$-th video and speech samples; $f$ and $g$ are the video and speech encoders; $d$ is a distance metric (e.g., cosine distance); $m$ is the margin; and $s_j$ is a negative speech sample paired with video $v_i$.
- Employing a hard negative mining strategy to select challenging negative samples that are semantically similar to the positive samples; hard positive mining is also performed so that the model learns from the most difficult positive pairs (a minimal sketch of the loss with in-batch hard negative mining follows this list).
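The paper does not include reference code, so the following PyTorch sketch only illustrates the margin-based loss above combined with in-batch hard negative mining. The function name, margin value, cosine distance, and batch-wise mining strategy are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F


def contrastive_video_speech_loss(video_emb: torch.Tensor,
                                  speech_emb: torch.Tensor,
                                  margin: float = 0.5) -> torch.Tensor:
    """Margin-based cross-modal contrastive loss with in-batch hard negative mining.

    video_emb, speech_emb: (N, D) outputs of the video encoder f and the speech
    encoder g for N synchronized clips (row i of each tensor forms a positive pair).
    Embeddings are L2-normalised so cosine distance d(x, y) = 1 - x.y lies in [0, 2].
    """
    video_emb = F.normalize(video_emb, dim=1)
    speech_emb = F.normalize(speech_emb, dim=1)

    # Pairwise cosine distances between every video and every speech clip.
    dist = 1.0 - video_emb @ speech_emb.t()              # (N, N)

    # Positive term: distance between each synchronized pair d(f(v_i), g(s_i)).
    pos = dist.diagonal()

    # Hard negative mining: for each video, keep only the closest
    # non-matching speech clip in the batch.
    eye = torch.eye(dist.size(0), dtype=torch.bool, device=dist.device)
    hard_neg = dist.masked_fill(eye, float("inf")).min(dim=1).values

    # Pull positives together; push the hardest negative at least `margin` away.
    return (pos + F.relu(margin - hard_neg)).mean()
```

Hard positive mining could additionally re-weight or re-sample the matched pairs with the largest positive distances; that step is omitted here for brevity.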
Affinity-Aware Self-Paced Learning
This component addresses the challenges posed by background noise and irrelevant speech segments present in the audio data. It incorporates an energy-based voice activity detection (VAD) algorithm to divide the multi-modal data into different affinity levels, namely high-affinity and low-affinity segments. The self-paced learning scheme then prioritizes learning from high-affinity segments, where the correlation between video and speech is stronger. For low-affinity segments, the video data is leveraged by introducing an auxiliary pretext task of frame order prediction. In this task, the model is trained to predict the correct order of shuffled frames within a video clip, thereby encouraging the model to learn meaningful representations from the video data even in the absence of strong audio-visual correlation.
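A minimal sketch of how such an affinity split and the auxiliary frame order task could be wired up is shown below. The energy threshold, frame length, speech-ratio cutoff, and the idea of classifying over a fixed set of permutations are assumptions for illustration; the paper only specifies an energy-based VAD and a frame order prediction pretext task.

```python
import numpy as np
import torch.nn as nn


def energy_vad(waveform: np.ndarray, sample_rate: int,
               frame_ms: int = 25, energy_thresh: float = 1e-3) -> np.ndarray:
    """Energy-based VAD: boolean speech/non-speech flag per audio frame.

    Frame length and threshold are illustrative defaults.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(waveform) // frame_len
    frames = waveform[: n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1) > energy_thresh


def affinity_level(vad_flags: np.ndarray, speech_ratio: float = 0.5) -> str:
    """Label a clip high-affinity if a sufficient fraction of its audio is speech."""
    return "high" if vad_flags.mean() >= speech_ratio else "low"


class FrameOrderPredictor(nn.Module):
    """Auxiliary pretext head for low-affinity clips: given a clip whose frames
    were shuffled by one of `n_permutations` fixed permutations, predict which
    permutation was applied."""

    def __init__(self, video_encoder: nn.Module, feat_dim: int, n_permutations: int):
        super().__init__()
        self.video_encoder = video_encoder
        self.classifier = nn.Linear(feat_dim, n_permutations)

    def forward(self, shuffled_clip):
        return self.classifier(self.video_encoder(shuffled_clip))
```

High-affinity clips would be routed to the cross-modal contrastive loss, while low-affinity clips would contribute only through the frame order prediction objective.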
Experimental Evaluation
The learned representations are evaluated on two ultrasound-related downstream tasks to assess their generalization ability:
- Standard Plane Detection: This task involves classifying fetal ultrasound scans into 14 categories of standard planes (e.g., heart four-chamber view, brain transventricular plane). The dataset consists of fetal ultrasound scans with expert annotations of the standard planes.
- Eye-Gaze Saliency Prediction: This task aims to predict a 2D saliency map of the sonographer's eye-gaze during ultrasound examination. The same fetal ultrasound dataset is used, which also includes eye-gaze data recorded from sonographers. The prediction is framed as a regression problem, where the model maps the learned representations to a 2D eye-gaze saliency map (a sketch of attaching such task-specific heads to the pretrained encoder follows this list).
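As a rough illustration of the transfer setup, the pretrained video encoder can be reused with lightweight task-specific heads. The head architectures, feature dimension, and saliency map size below are assumptions, not the paper's actual fine-tuning configuration.

```python
import torch.nn as nn


class StandardPlaneClassifier(nn.Module):
    """Fine-tuning head for 14-way standard plane detection on top of the
    pretrained video encoder."""

    def __init__(self, video_encoder: nn.Module, feat_dim: int, n_classes: int = 14):
        super().__init__()
        self.encoder = video_encoder
        self.head = nn.Linear(feat_dim, n_classes)

    def forward(self, frames):
        return self.head(self.encoder(frames))           # class logits


class GazeSaliencyRegressor(nn.Module):
    """Regression head that decodes encoder features into a 2D eye-gaze
    saliency map (the decoder and map size are illustrative)."""

    def __init__(self, video_encoder: nn.Module, feat_dim: int, map_size: int = 64):
        super().__init__()
        self.encoder = video_encoder
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, map_size * map_size),
            nn.Unflatten(1, (1, map_size, map_size)),
            nn.Sigmoid(),
        )

    def forward(self, frames):
        return self.decoder(self.encoder(frames))        # (N, 1, map_size, map_size)
```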
The experiments demonstrate that the proposed approach learns strong representations that transfer well to both downstream tasks.
Contributions
The main contributions of the paper are:
- Proposing the first self-supervised video-speech representation learning approach for ultrasound data, addressing the scarcity of annotated data in medical imaging.
- Introducing cross-modal contrastive learning and affinity-aware self-paced learning to effectively model the correspondence between video and speech in ultrasound data.
- Demonstrating the effectiveness of the proposed approach on two clinically relevant downstream tasks, showcasing its potential for improving ultrasound image analysis and interpretation.
In summary, this paper presents a novel self-supervised learning framework for ultrasound data that leverages the correlation between video and speech modalities. The proposed cross-modal contrastive learning and affinity-aware self-paced learning schemes enable the model to learn robust representations without manual annotations. The experimental results on standard plane detection and eye-gaze prediction demonstrate the effectiveness of the proposed approach for downstream tasks.