Two-Stage Triplet Loss Training with Curriculum Augmentation for Audio-Visual Retrieval (2310.13451v1)

Published 20 Oct 2023 in cs.SD, cs.CV, cs.IR, cs.MM, and eess.AS

Abstract: Cross-modal retrieval models leverage triplet loss optimization to learn robust embedding spaces. However, existing methods often train these models in a single pass, overlooking the distinction between semi-hard and hard triplets during optimization; failing to separate the two leads to suboptimal performance. In this paper, we introduce a novel approach rooted in curriculum learning to address this problem. We propose a two-stage training paradigm that guides the model's learning process from semi-hard to hard triplets. In the first stage, the model is trained on a set of semi-hard triplets, starting from a low-loss base. In the second stage, we augment the embeddings using an interpolation technique that synthesizes potential hard negatives, alleviating the high loss caused by a scarcity of hard triplets. Our approach then applies hard triplet mining in the augmented embedding space to further optimize the model. Extensive experiments on two audio-visual datasets show an improvement of approximately 9.8% in average Mean Average Precision (MAP) over the current state-of-the-art method, MSNSCA, for the Audio-Visual Cross-Modal Retrieval (AV-CMR) task on the AVE dataset, demonstrating the effectiveness of the proposed method.
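The two mining criteria and the interpolation step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the margin value, the interpolation weights, and the function names are assumptions, and the standard definitions of semi-hard and hard negatives (as in FaceNet, ref. 14) and linear embedding interpolation (as in Embedding Expansion, ref. 9) are used.

```python
import numpy as np

def pairwise_sq_dist(X, Y):
    # Squared Euclidean distance between every row of X and every row of Y.
    return ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)

def mine_triplets(anchors, positives, negatives, margin, stage="semi-hard"):
    """Return (anchor_idx, negative_idx) pairs selected by difficulty.

    semi-hard (stage 1): d(a,p) < d(a,n) < d(a,p) + margin
    hard      (stage 2): d(a,n) < d(a,p)
    """
    d_ap = ((anchors - positives) ** 2).sum(-1)   # shape (N,)
    d_an = pairwise_sq_dist(anchors, negatives)   # shape (N, M)
    if stage == "semi-hard":
        mask = (d_an > d_ap[:, None]) & (d_an < d_ap[:, None] + margin)
    else:  # "hard"
        mask = d_an < d_ap[:, None]
    return np.argwhere(mask)

def interpolate_negatives(positives, negatives, n_points=3):
    """Synthesize candidate hard negatives on the line segments between
    positive and negative embeddings (embedding-expansion style)."""
    ts = np.linspace(0.25, 0.75, n_points)[:, None, None]   # interior weights
    mixed = ts * positives[None, :, :] + (1.0 - ts) * negatives[None, :, :]
    return mixed.reshape(-1, positives.shape[1])
```

In stage 1 only semi-hard pairs feed the triplet loss; in stage 2 the pool of negatives is enlarged with `interpolate_negatives` before mining with `stage="hard"`.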

References (27)
  1. Deep canonical correlation analysis. In ICML, volume 28 of Proceedings of Machine Learning Research, pages 1247–1255, Atlanta, Georgia, USA, 17–19 Jun 2013.
  2. Curriculum learning. In Proceedings of the 26th annual international conference on machine learning, pages 41–48, 2009.
  3. Contrastive curriculum learning for sequential user behavior modeling via data augmentation. In CIKM, pages 3737–3746, 2021.
  4. On the power of curriculum learning in training deep networks. In International conference on machine learning, pages 2535–2544. PMLR, 2019.
  5. Visual spatio-temporal relation-enhanced network for cross-modal text-video retrieval. arXiv preprint arXiv:2110.15609, 2021.
  6. Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12):2639–2664, 2004.
  7. Curricularface: Adaptive curriculum learning loss for deep face recognition. In CVPR, June 2020.
  8. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  9. Embedding expansion: Augmentation in embedding space for deep metric learning. In CVPR, pages 7255–7264, 2020.
  10. Kernel and nonlinear canonical correlation analysis. Int. J. Neural Syst., 10(5):365–377, 2000.
  11. Competence-based multimodal curriculum learning for medical report generation. arXiv preprint arXiv:2206.14579, 2022.
  12. Learning transferable visual models from natural language supervision. In Marina Meila and Tong Zhang, editors, Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR, 18–24 Jul 2021.
  13. Cluster canonical correlation analysis. In Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, AISTATS 2014, Reykjavik, Iceland, April 22-25, 2014, pages 823–831. JMLR.org, 2014.
  14. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 815–823, 2015.
  15. Audio-visual event localization in unconstrained videos. In Vittorio Ferrari, Martial Hebert, Cristian Sminchisescu, and Yair Weiss, editors, Computer Vision - ECCV 2018 - 15th European Conference, Munich, Germany, September 8-14, 2018, Proceedings, Part II, volume 11206 of Lecture Notes in Computer Science, pages 252–268. Springer, 2018.
  16. Adversarial cross-modal retrieval. In Proceedings of the 25th ACM International Conference on Multimedia, MM ’17, pages 154–162, New York, NY, USA, 2017. Association for Computing Machinery.
  17. Videoadviser: Video knowledge distillation for multimodal transfer learning. IEEE Access, 2023.
  18. Category-based deep cca for fine-grained venue discovery from multimodal data. IEEE Transactions on Neural Networks and Learning Systems, 30(4):1250–1258, 2018.
  19. Complete cross-triplet loss in label space for audio-visual cross-modal retrieval. In 2022 IEEE International Symposium on Multimedia (ISM), pages 1–9. IEEE, 2022.
  20. Learning explicit and implicit dual common subspaces for audio-visual cross-modal retrieval. ACM Transactions on Multimedia Computing, Communications and Applications, 19(2s):1–23, 2023.
  21. Audio-visual embedding for cross-modal music video retrieval through supervised deep CCA. In 2018 IEEE International Symposium on Multimedia, ISM 2018, Taichung, Taiwan, December 10-12, 2018, pages 143–150. IEEE Computer Society, 2018.
  22. Deep triplet neural networks with cluster-cca for audio-visual cross-modal retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(3):1–23, 2020.
  23. Musictm-dataset for joint representation learning among sheet music, lyrics, and musical audio. In National Conference on Sound and Music Technology, pages 78–89, Singapore, 2020. Springer.
  24. Multi-scale network with shared cross-attention for audio–visual correlation learning. Neural Computing and Applications, pages 1–15, 2023.
  25. Deep supervised cross-modal retrieval. In IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16-20, 2019, pages 10394–10403. Computer Vision Foundation / IEEE, 2019.
  26. Dual-path convolutional image-text embeddings with instance loss. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 16(2):1–23, 2020.
  27. Visual to sound: Generating natural sound for videos in the wild. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2018, Salt Lake City, UT, USA, June 18-22, 2018, pages 3550–3558. Computer Vision Foundation / IEEE Computer Society, 2018.
Authors (2)
  1. Donghuo Zeng (22 papers)
  2. Kazushi Ikeda (19 papers)
Citations (1)