
On the Effect of Data-Augmentation on Local Embedding Properties in the Contrastive Learning of Music Audio Representations (2401.08889v1)

Published 17 Jan 2024 in cs.SD, cs.IR, cs.LG, cs.MM, and eess.AS

Abstract: Audio embeddings are crucial tools for understanding large catalogs of music. Typically, embeddings are evaluated on the basis of the performance they provide in a wide range of downstream tasks; however, few studies have investigated the local properties of the embedding spaces themselves, which are important for nearest neighbor algorithms commonly used in music search and recommendation. In this work we show that when learning audio representations on music datasets via contrastive learning, musical properties that are typically homogeneous within a track (e.g., key and tempo) are reflected in the locality of neighborhoods in the resulting embedding space. By applying appropriate data augmentation strategies, the localisation of such properties can not only be reduced but the localisation of other attributes increased. For example, the locality of features such as pitch and tempo, which are less relevant to non-expert listeners, may be mitigated while improving the locality of more salient features such as genre and mood, achieving state-of-the-art performance in nearest neighbor retrieval accuracy. Similarly, we show that the optimal selection of data augmentation strategies for contrastive learning of music audio embeddings is dependent on the downstream task, highlighting this as an important embedding design decision.
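To make the mechanism concrete, below is a minimal sketch (not the authors' implementation) of the two ingredients the abstract refers to: a SimCLR-style NT-Xent contrastive loss computed over two augmented "views" of the same audio excerpts, and a simple k-nearest-neighbour locality measure for a track-level property such as key, tempo class, or genre. The function names, the choice of PyTorch, and the hyperparameters (temperature, k) are illustrative assumptions; the augmentations themselves (e.g., pitch shifting or time stretching) would be applied upstream when producing the two views.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_a, z_b, temperature=0.1):
    # SimCLR-style contrastive loss over two augmented views of the same
    # audio excerpts. z_a, z_b: (batch, dim) embeddings of view A / view B.
    batch = z_a.shape[0]
    z = F.normalize(torch.cat([z_a, z_b], dim=0), dim=1)        # (2B, dim)
    sim = (z @ z.t()) / temperature                             # scaled cosine similarities
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float("-inf"))                  # drop self-similarities
    # Positive pairs: row i of view A matches row i of view B, and vice versa.
    targets = torch.cat([torch.arange(batch) + batch,
                         torch.arange(batch)]).to(z.device)
    return F.cross_entropy(sim, targets)

def knn_label_agreement(embeddings, labels, k=10):
    # Fraction of each item's k nearest neighbours that share its label
    # (e.g. key, tempo class, genre, mood) -- a simple proxy for how
    # strongly that property is localised in the embedding space.
    x = F.normalize(embeddings, dim=1)
    sim = x @ x.t()
    sim.fill_diagonal_(float("-inf"))                           # exclude the item itself
    knn = sim.topk(k, dim=1).indices                            # (N, k) neighbour indices
    agree = (labels[knn] == labels.unsqueeze(1)).float()
    return agree.mean().item()
```

Under this sketch, "applying a pitch-shift augmentation" means the two views fed to nt_xent_loss differ in pitch, so the encoder is discouraged from localising key in its neighbourhoods; computing knn_label_agreement for key versus genre labels before and after such training would quantify the trade-off the abstract describes.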

Authors (5)
  1. Matthew C. McCallum (5 papers)
  2. Matthew E. P. Davies (14 papers)
  3. Florian Henkel (12 papers)
  4. Jaehun Kim (17 papers)
  5. Samuel E. Sandberg (3 papers)
Citations (4)

