T-FOLEY: A Controllable Waveform-Domain Diffusion Model for Temporal-Event-Guided Foley Sound Synthesis (2401.09294v1)
Abstract: Foley sound, audio content inserted synchronously with video, plays a critical role in the user experience of multimedia content. Recently, there has been active research on Foley sound synthesis, leveraging advances in deep generative models. However, such work has mainly focused on replicating a single sound class or a textual sound description, neglecting temporal information, which is crucial for practical applications of Foley sound. We present T-Foley, a Temporal-event-guided waveform generation model for Foley sound synthesis. T-Foley generates high-quality audio from two conditions: the sound class and a temporal event feature. For temporal conditioning, we devise a temporal event feature and a novel conditioning technique named Block-FiLM. T-Foley achieves superior performance on both objective and subjective evaluation metrics and generates Foley sound well synchronized with the temporal events. Additionally, we showcase T-Foley's practical applications, particularly in scenarios that use vocal mimicry for temporal event control. A demo is available on our companion website.
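The abstract names two technical ingredients: a temporal event feature and the Block-FiLM conditioning technique. The PyTorch module below is a minimal, hedged sketch of how such conditioning could look, assuming a frame-level RMS energy envelope as the temporal event feature and block-wise FiLM in which scale/shift parameters are held constant within coarse temporal blocks; the names `rms_envelope` and `BlockFiLM`, the frame/hop sizes, and the block count are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def rms_envelope(wav: torch.Tensor, frame: int = 512, hop: int = 128) -> torch.Tensor:
    """Frame-level RMS energy of a waveform: one plausible temporal event
    feature (an assumption, not necessarily the paper's exact definition).
    wav: (batch, samples) -> (batch, frames)."""
    frames = wav.unfold(-1, frame, hop)               # (batch, frames, frame)
    return frames.pow(2).mean(dim=-1).clamp_min(1e-8).sqrt()


class BlockFiLM(nn.Module):
    """Hypothetical block-wise FiLM: predict one (scale, shift) pair per
    temporal block from the condition, then stretch the pairs across the
    feature map's time axis."""

    def __init__(self, cond_dim: int, channels: int, num_blocks: int = 16):
        super().__init__()
        self.num_blocks = num_blocks
        self.proj = nn.Linear(cond_dim, 2 * channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x:    (batch, channels, time) feature map inside the generator
        # cond: (batch, cond_frames, cond_dim) temporal event feature
        b, c, t = x.shape
        # Pool the condition into num_blocks coarse temporal segments.
        pooled = F.adaptive_avg_pool1d(cond.transpose(1, 2), self.num_blocks)
        gamma_beta = self.proj(pooled.transpose(1, 2))  # (b, blocks, 2c)
        gamma, beta = gamma_beta.chunk(2, dim=-1)       # each (b, blocks, c)
        # Hold each (scale, shift) constant within its block by nearest-
        # neighbour upsampling to the feature time axis.
        gamma = F.interpolate(gamma.transpose(1, 2), size=t, mode="nearest")
        beta = F.interpolate(beta.transpose(1, 2), size=t, mode="nearest")
        return gamma * x + beta


if __name__ == "__main__":
    wav = torch.randn(2, 16000)                       # 1 s at 16 kHz
    event = rms_envelope(wav).unsqueeze(-1)           # (2, frames, 1)
    film = BlockFiLM(cond_dim=1, channels=64, num_blocks=16)
    feats = torch.randn(2, 64, 4000)                  # intermediate features
    out = film(feats, event)
    print(out.shape)                                  # torch.Size([2, 64, 4000])
```

A block-wise scheme like this sits between a single global FiLM (one scale/shift per feature map) and fully per-frame modulation, trading temporal resolution of the control signal against parameter count and smoothness.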