Foley Sound Synthesis at the DCASE 2023 Challenge (2304.12521v4)
Abstract: The addition of Foley sound effects during post-production is a common technique used to enhance the perceived acoustic properties of multimedia content. Traditionally, Foley sound has been produced by human Foley artists, who manually record and mix sounds. Recent advances in sound synthesis and generative models, however, have spurred interest in machine-assisted or automatic Foley synthesis. To promote further research in this area, we organized DCASE 2023 Challenge Task 7: Foley Sound Synthesis. The challenge aims to provide a standardized evaluation framework that is both rigorous and efficient, so that different Foley synthesis systems can be compared fairly. We received 17 submissions and performed both objective and subjective evaluation to rank them according to three criteria: audio quality, fit-to-category, and diversity. Through this challenge, we hope to encourage active participation from the research community and to advance the state of the art in automatic Foley synthesis. In this technical report, we provide a detailed overview of the challenge, including the task definition, dataset, baseline, evaluation scheme and criteria, challenge results, and a discussion.
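The objective evaluation described here is based on Fréchet Audio Distance (FAD), which models embeddings of reference and generated audio as multivariate Gaussians and measures the distance between the two distributions. The snippet below is a minimal sketch of that computation, not the official challenge code: the `frechet_distance` helper, the 128-dimensional embedding size, and the random toy inputs are assumptions made purely for illustration.

```python
import numpy as np
from scipy import linalg


def frechet_distance(emb_ref, emb_gen):
    """Fréchet distance between two sets of audio embeddings.

    Each input is an (n_clips, emb_dim) array, e.g. features from a
    pretrained audio classifier. Both sets are modeled as multivariate
    Gaussians; the distance compares their means and covariances.
    """
    mu_r, sigma_r = emb_ref.mean(axis=0), np.cov(emb_ref, rowvar=False)
    mu_g, sigma_g = emb_gen.mean(axis=0), np.cov(emb_gen, rowvar=False)

    diff = mu_r - mu_g
    # Matrix square root of the covariance product; numerical error can
    # introduce a tiny imaginary component, so keep only the real part.
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real

    return diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean)


# Toy usage with random stand-ins for 128-dimensional embeddings.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 128))
fake = rng.normal(loc=0.1, size=(200, 128))
print(f"FAD-style score: {frechet_distance(real, fake):.3f}")
```

In an actual evaluation, the embeddings would be extracted from real and synthesized clips of the same sound category with a pretrained audio model, rather than drawn at random; lower scores indicate that the generated audio is statistically closer to the reference set.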
Authors:
- Keunwoo Choi
- Jaekwon Im
- Laurie Heller
- Brian McFee
- Keisuke Imoto
- Yuki Okamoto
- Mathieu Lagrange
- Shinnosuke Takamichi