
Video-to-Music Recommendation using Temporal Alignment of Segments (2306.07187v1)

Published 12 Jun 2023 in cs.MM, cs.IR, cs.LG, cs.SD, and eess.AS

Abstract: We study cross-modal recommendation of music tracks to be used as soundtracks for videos. This problem is known as the music supervision task. We build on a self-supervised system that learns a content association between music and video. In addition to the adequacy of content, adequacy of structure is crucial in music supervision to obtain relevant recommendations. We propose a novel approach to significantly improve the system's performance using structure-aware recommendation. The core idea is to consider not the full audio-video clips, but rather shorter segments for training and inference. We find that using semantic segments and ranking the tracks according to sequence alignment costs significantly improves the results. We investigate the impact of different ranking metrics and segmentation methods.
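
The ranking idea in the abstract, scoring each candidate track by the cost of aligning its segment sequence against the video's segment sequence, can be sketched in code. The following is a minimal illustration, not the authors' implementation: it assumes segment embeddings are already computed (the paper learns them with a self-supervised system), uses a Needleman-Wunsch-style global alignment as one plausible choice of alignment cost, and all names (`alignment_cost`, `rank_tracks`, the `gap` penalty) are hypothetical.

```python
import numpy as np

def alignment_cost(video_segs: np.ndarray, music_segs: np.ndarray,
                   gap: float = 1.0) -> float:
    """Global alignment cost between two sequences of L2-normalized
    segment embeddings. Substitution cost is cosine distance; `gap`
    is a hypothetical penalty for skipping a segment on either side."""
    # Pairwise cosine distances between all (video, music) segment pairs.
    dist = 1.0 - video_segs @ music_segs.T
    n, m = dist.shape
    dp = np.zeros((n + 1, m + 1))
    dp[1:, 0] = gap * np.arange(1, n + 1)
    dp[0, 1:] = gap * np.arange(1, m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i, j] = min(dp[i - 1, j - 1] + dist[i - 1, j - 1],  # align segment i with j
                           dp[i - 1, j] + gap,                     # skip a video segment
                           dp[i, j - 1] + gap)                     # skip a music segment
    return float(dp[n, m])

def rank_tracks(video_segs, music_library):
    """Rank candidate tracks by ascending alignment cost (best first)."""
    return sorted(((name, alignment_cost(video_segs, segs))
                   for name, segs in music_library.items()),
                  key=lambda pair: pair[1])

# Toy usage with random unit-norm embeddings standing in for learned ones.
rng = np.random.default_rng(0)
def embed(k, d=64):
    x = rng.normal(size=(k, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

video = embed(5)                                   # 5 video segments
library = {"track_a": embed(6), "track_b": embed(4)}
print(rank_tracks(video, library))                 # lowest-cost track first
```

A lower alignment cost indicates a better temporal fit between the video's structure and the track's structure; content adequacy would come from the learned embedding space itself, which this sketch stubs out with random vectors.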
