Revisiting Deep Audio-Text Retrieval Through the Lens of Transportation (2405.10084v1)
Abstract: The Learning-to-match (LTM) framework is an effective inverse optimal transport approach for learning the underlying ground metric between two sources of data, which then facilitates matching. However, the conventional LTM framework faces scalability challenges: it requires the entire dataset each time the parameters of the ground metric are updated. To adapt LTM to the deep learning setting, we introduce the mini-batch Learning-to-match (m-LTM) framework for audio-text retrieval, which leverages mini-batch subsampling and a Mahalanobis-enhanced family of ground metrics. Moreover, to cope with misaligned training data in practice, we propose a variant based on partial optimal transport that mitigates the harm of misaligned pairs in the training data. We conduct extensive experiments on audio-text matching with three datasets: AudioCaps, Clotho, and ESC-50. The results demonstrate that our method learns a rich and expressive joint embedding space and achieves state-of-the-art performance. Beyond this, the m-LTM framework narrows the modality gap between audio and text embeddings, surpassing both triplet and contrastive losses on zero-shot sound event detection with ESC-50. Notably, combining m-LTM with partial optimal transport is more noise-tolerant than contrastive loss under varying noise ratios in the AudioCaps training data. Our code is available at https://github.com/v-manhlt3/m-LTM-Audio-Text-Retrieval
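
To make the core idea concrete, here is a minimal sketch of a single m-LTM-style mini-batch step: an entropic optimal transport plan is computed over the batch with a learnable Mahalanobis ground metric, and the plan is pushed toward the ground-truth pairing. This is an illustrative assumption of how such a step could look, not the authors' implementation; the names `mahalanobis_cost` and `sinkhorn`, the factor `L`, and all hyperparameters are hypothetical.

```python
# Minimal sketch of an m-LTM-style mini-batch step (illustrative, not
# the authors' code): entropic OT with a learnable Mahalanobis metric.
import torch

def mahalanobis_cost(audio, text, L):
    """Pairwise cost C[i, j] = ||L (a_i - t_j)||^2, i.e. the squared
    Mahalanobis distance with M = L^T L (PSD by construction)."""
    diff = audio.unsqueeze(1) - text.unsqueeze(0)  # (n, m, d)
    proj = diff @ L.T                              # (n, m, d)
    return (proj ** 2).sum(dim=-1)                 # (n, m)

def sinkhorn(C, eps=0.05, n_iters=100):
    """Entropic OT plan for uniform marginals via Sinkhorn iterations."""
    n, m = C.shape
    K = torch.exp(-C / eps)                        # Gibbs kernel
    a = torch.full((n,), 1.0 / n)
    b = torch.full((m,), 1.0 / m)
    u = torch.ones(n)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u.unsqueeze(1) * K * v.unsqueeze(0)     # plan; entries sum to 1

# One mini-batch step: the i-th audio clip is paired with the i-th
# caption, so the target plan is (1/n) * identity. Gradients flow
# through the unrolled Sinkhorn loop into the metric factor L.
n, d = 8, 32
audio = torch.randn(n, d)                          # stand-ins for encoder outputs
text = torch.randn(n, d)
L = torch.eye(d, requires_grad=True)

C = mahalanobis_cost(audio, text, L)
C = C / C.detach().max()                           # rescale costs for numerical stability
plan = sinkhorn(C)
target = torch.eye(n) / n                          # ground-truth batch matching
loss = -(target * (plan + 1e-8).log()).sum()       # KL(target || plan) up to a constant
loss.backward()
```

In the noisy-data variant described in the abstract, the full-mass `sinkhorn` step above would be replaced by partial optimal transport, which transports only a fraction of the total mass so that misaligned audio-text pairs in the batch can be left unmatched rather than forced into the plan.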