Visually-Guided Sound Source Separation with Audio-Visual Predictive Coding (2306.10684v1)

Published 19 Jun 2023 in cs.SD, cs.CV, cs.MM, and eess.AS

Abstract: The framework of visually-guided sound source separation generally consists of three parts: visual feature extraction, multimodal feature fusion, and sound signal processing. An ongoing trend in this field has been to tailor the visual feature extractor for informative visual guidance and to devise a separate module for feature fusion, while using U-Net by default for sound analysis. However, such a divide-and-conquer paradigm is parameter-inefficient and may yield suboptimal performance, since jointly optimizing and harmonizing the various model components is challenging. By contrast, this paper presents a novel approach, dubbed audio-visual predictive coding (AVPC), that tackles this task in a parameter-efficient and more effective manner. The AVPC network features a simple ResNet-based video analysis network for deriving semantic visual features, and a predictive coding-based sound separation network that extracts audio features, fuses multimodal information, and predicts sound separation masks within a single architecture. By iteratively minimizing the prediction error between features, AVPC integrates audio and visual information recursively, leading to progressively improved performance. In addition, we develop an effective self-supervised learning strategy for AVPC via co-predicting two audio-visual representations of the same sound source. Extensive evaluations demonstrate that AVPC outperforms several baselines in separating musical instrument sounds while significantly reducing model size. Code is available at: https://github.com/zjsong/Audio-Visual-Predictive-Coding.
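Below is a minimal, hypothetical sketch of the iterative fusion idea described in the abstract: audio features extracted from the mixture spectrogram are repeatedly corrected by the error between predicted and given visual features, and the refined representation is mapped to a time-frequency separation mask. All module names, feature sizes, and the specific update rule here are illustrative assumptions, not the authors' actual AVPC implementation (see the linked repository for that).

```python
# Toy sketch of predictive-coding-style audio-visual fusion (assumed design, not the paper's code).
import torch
import torch.nn as nn


class ToyAVPC(nn.Module):
    def __init__(self, feat_dim=64, steps=4):
        super().__init__()
        self.steps = steps
        # Audio encoder: mixture spectrogram -> feature map (stand-in for the sound separation network).
        self.audio_enc = nn.Sequential(
            nn.Conv2d(1, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1),
        )
        # Predicts the visual feature from the current audio representation.
        self.predictor = nn.Conv2d(feat_dim, feat_dim, 1)
        # Feeds the prediction error back to correct the audio representation.
        self.corrector = nn.Conv2d(feat_dim, feat_dim, 1)
        # Maps the fused representation to a time-frequency separation mask.
        self.mask_head = nn.Conv2d(feat_dim, 1, 1)

    def forward(self, mix_spec, vis_feat):
        # mix_spec: (B, 1, F, T) mixture magnitude spectrogram
        # vis_feat: (B, D) semantic visual feature (e.g., from a ResNet), assumed precomputed
        r = self.audio_enc(mix_spec)                 # initial audio representation
        v = vis_feat[:, :, None, None].expand_as(r)  # broadcast visual guidance over time-frequency
        for _ in range(self.steps):
            err = v - self.predictor(r)              # prediction error between modalities
            r = r + self.corrector(err)              # recursive correction of the audio features
        return torch.sigmoid(self.mask_head(r))      # separation mask in [0, 1]


if __name__ == "__main__":
    model = ToyAVPC()
    mix = torch.randn(2, 1, 256, 64)   # batch of mixture spectrograms
    vis = torch.randn(2, 64)           # batch of visual features
    mask = model(mix, vis)
    separated = mask * mix             # masked spectrogram of the target source
    print(mask.shape, separated.shape)
```

In a full pipeline, the masked mixture spectrogram would then be inverted back to a waveform (for example, using the mixture phase with an inverse STFT), and the iterative correction loop would be trained end to end together with the visual network.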

Authors (2)
  1. Zengjie Song (7 papers)
  2. Zhaoxiang Zhang (161 papers)
Citations (1)