
Aligning Audio-Visual Joint Representations with an Agentic Workflow (2410.23230v2)

Published 30 Oct 2024 in cs.CV, cs.AI, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: Visual content and accompanying audio signals naturally form a joint representation that improves audio-visual (AV) applications. While prior studies develop various AV representation learning frameworks, the importance of AV data alignment is usually underestimated when pursuing high-quality representations. We observe that an audio signal may contain background noise interference, and that audio and video streams may be non-synchronized. Such loose data alignment limits representation quality and degrades application performance. In this paper, we propose to improve AV joint representations from a data-centric perspective by aligning audio signals to visual data. The alignment is conducted in an agentic workflow controlled by an LLM-based assistant named AVAgent. For each input AV data pair, AVAgent uses a multimodal LLM to convert the audio and visual data into language descriptions separately (i.e., tool use). AVAgent then reasons about whether the paired data is well aligned and plans to edit the audio signal if needed (i.e., planning). The audio editing is executed by predefined actions that filter noise or augment data. Moreover, we use a VLM to evaluate how well the modified audio signal matches the visual content and to provide feedback to AVAgent (i.e., reflection). The tool-use, planning, and reflection steps operate cyclically, forming an agentic workflow in which audio signals are gradually aligned to the visual content. Existing methods can thus directly leverage the AV data aligned by our agentic workflow to improve AV joint representations. Experimental results demonstrate the state-of-the-art performance of the proposed approach against previous baselines on diverse downstream tasks.
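The abstract outlines a cyclic tool-use / planning / reflection loop. As a rough illustration, here is a minimal Python sketch of one plausible shape of that loop. Every function name and heuristic below (describe_with_mllm, plan_edit, apply_edit, score_alignment, the ACTIONS list) is a hypothetical stand-in for a stubbed model call, not the authors' actual implementation or API.

```python
from dataclasses import dataclass


@dataclass
class AVPair:
    audio: bytes  # raw audio waveform (placeholder representation)
    video: bytes  # raw video frames (placeholder representation)


# Predefined audio-editing actions the planner chooses from; the abstract
# mentions noise filtering and data augmentation.
ACTIONS = ("denoise", "augment", "none")


def describe_with_mllm(pair: AVPair) -> tuple[str, str]:
    """Tool use: a multimodal LLM converts audio and video into separate
    language descriptions (stubbed with fixed strings here)."""
    audio_desc = "a dog barking over heavy traffic noise"  # placeholder
    video_desc = "a dog barking in a quiet backyard"       # placeholder
    return audio_desc, video_desc


def plan_edit(audio_desc: str, video_desc: str) -> str:
    """Planning: the LLM assistant judges whether the pair is aligned and
    picks one action from ACTIONS (stubbed as a keyword check)."""
    return "denoise" if "noise" in audio_desc else "none"


def apply_edit(pair: AVPair, action: str) -> AVPair:
    """Execute the chosen predefined audio edit (identity stub)."""
    return pair


def score_alignment(pair: AVPair) -> float:
    """Reflection: a VLM rates how well the edited audio matches the
    visual content (fixed placeholder score here)."""
    return 0.9


def align(pair: AVPair, max_rounds: int = 3, threshold: float = 0.8) -> AVPair:
    """Cycle tool use -> planning -> editing -> reflection until the VLM
    judges the pair aligned or the round budget runs out."""
    for _ in range(max_rounds):
        audio_desc, video_desc = describe_with_mllm(pair)
        action = plan_edit(audio_desc, video_desc)
        if action == "none":  # planner sees no misalignment
            break
        pair = apply_edit(pair, action)
        if score_alignment(pair) >= threshold:
            break  # reflection confirms alignment
    return pair
```

Per the abstract, the pairs produced by such a loop would then feed unchanged into any existing AV representation learning method in place of the raw data.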
