- The paper introduces GACELA, a conditional GAN integrating five parallel discriminators to address long audio inpainting, evaluated on gaps ranging from 375 ms to 1500 ms.
- It leverages context encoding and multi-scale discrimination, enabling restoration of missing audio segments that respects both short- and long-term temporal dependencies.
- Listening tests demonstrated reduced perceptual artifacts, confirming the method's effectiveness for real-world audio restoration applications.
Overview of GACELA: A Generative Adversarial Context Encoder for Long Audio Inpainting
The paper presents GACELA, a generative adversarial network (GAN) architecture designed to inpaint long audio segments. The method targets the restoration of missing audio over gaps ranging from hundreds of milliseconds up to a few seconds, a regime in which inpainting is considerably harder than for short gaps. Unlike previous methods, which either focused on short gaps or replaced missing data by replicating available signal patterns, GACELA combines a conditional GAN with context encoding and multiple discriminators operating at diverse time scales.
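To make this setup concrete, below is a minimal sketch of one adversarial training step for gap inpainting, assuming a generator conditioned on the pre- and post-gap context plus a latent code, and a bank of discriminators that judge the gap embedded in its context. The hinge losses, function signatures, and waveform-level tensors are illustrative assumptions, not GACELA's published training procedure.

```python
import torch
import torch.nn.functional as F

def train_step(generator, discriminators, opt_g, opt_d,
               context_before, context_after, real_gap, latent_dim=128):
    """One adversarial update for gap inpainting (hinge losses assumed)."""
    z = torch.randn(real_gap.size(0), latent_dim)   # latent code: multiple plausible fills
    fake_gap = generator(context_before, context_after, z)

    def in_context(gap):
        # Each critic judges the gap embedded in its surrounding context.
        return torch.cat([context_before, gap, context_after], dim=-1)

    # Discriminator update: real excerpts vs. excerpts containing a generated gap.
    d_loss = sum(F.relu(1.0 - d(in_context(real_gap))).mean()
                 + F.relu(1.0 + d(in_context(fake_gap.detach()))).mean()
                 for d in discriminators)
    opt_d.zero_grad()
    d_loss.backward()
    opt_d.step()

    # Generator update: try to fool every discriminator simultaneously.
    g_loss = sum(-d(in_context(fake_gap)).mean() for d in discriminators)
    opt_g.zero_grad()
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()
```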
Key Contributions
- Multi-Scale Discrimination: The paper emphasizes the value of multiple discriminators operating across different time scales. GACELA's architecture incorporates five parallel discriminators, each with a receptive field at a different temporal resolution. This provides a more holistic assessment of the audio and encourages the generator to account for both short- and long-term temporal dependencies (a sketch of such a discriminator bank follows this list).
- Conditioned Generative Model: GACELA is formulated as a conditional GAN, conditioned both on the context surrounding a gap and on a latent variable that captures the inherent multi-modality of the inpainting task: the same gap admits many plausible reconstructions. This conditioning also enables user-defined inpainting, tailoring the restoration to specific needs (see the generator sketch after this list).
- Evaluation With Listening Tests: The restored audio was evaluated through listening tests in which participants rated the perceptual quality of the inpainted segments. With gap durations between 375 ms and 1500 ms, the inpainting was often detectable, but the rated severity of the artifacts was considerably reduced, indicating GACELA's effectiveness across musical signals of varying complexity.
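As a rough illustration of the multi-scale idea, the sketch below builds five convolutional critics that each see the input at a different temporal resolution. Factor-of-two average pooling is only one way to vary the receptive field; the layer configuration and the exact inputs of GACELA's discriminators differ from this simplified, waveform-level version.

```python
import torch.nn as nn

class ScaleDiscriminator(nn.Module):
    """1-D convolutional critic scoring an audio excerpt at one time scale."""
    def __init__(self, downsample: int):
        super().__init__()
        self.pool = nn.AvgPool1d(downsample) if downsample > 1 else nn.Identity()
        self.net = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(64, 1, kernel_size=3, padding=1),
        )

    def forward(self, x):                              # x: (batch, 1, samples)
        return self.net(self.pool(x)).mean(dim=(1, 2))  # one score per item

# Five critics, each seeing the signal at a coarser time resolution than the last,
# so both short-term detail and long-term structure influence the generator.
discriminators = nn.ModuleList(ScaleDiscriminator(2 ** k) for k in range(5))
```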
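The conditioning described in the second bullet can be pictured as follows: encoders summarize the audio before and after the gap, and a decoder maps those summaries together with a latent code to the missing segment. All module names, layer sizes, and the waveform-level representation are illustrative assumptions rather than GACELA's published architecture.

```python
import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Summarizes the audio on one side of the gap as a fixed-size vector."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.Conv1d(channels, channels, kernel_size=15, stride=4, padding=7),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1),           # pool over time
        )

    def forward(self, x):                      # x: (batch, 1, samples)
        return self.net(x).squeeze(-1)         # (batch, channels)

class GapGenerator(nn.Module):
    """Maps (pre-gap context, post-gap context, latent code) to the missing segment."""
    def __init__(self, gap_samples: int, channels: int = 64, latent_dim: int = 128):
        super().__init__()
        self.pre, self.post = ContextEncoder(channels), ContextEncoder(channels)
        self.decode = nn.Sequential(
            nn.Linear(2 * channels + latent_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, gap_samples),
            nn.Tanh(),                         # waveform range [-1, 1]
        )

    def forward(self, before, after, z):
        h = torch.cat([self.pre(before), self.post(after), z], dim=-1)
        return self.decode(h).unsqueeze(1)     # (batch, 1, gap_samples)

# Different draws of z yield different, equally plausible reconstructions
# for the same surrounding context.
```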
Implications and Future Directions
The development of GACELA has significant implications for both practical applications and theoretical advancements in audio signal processing:
- Practical Applications: The method could be highly beneficial in scenarios such as music streaming, live audio communication, and the restoration of vintage recordings. GACELA's ability to handle long gaps offers a more realistic solution for real-world audio corruption, where methods designed for short gaps may not suffice.
- Improving Audio Synthesis: The application of GANs to the problem of audio inpainting represents a critical fusion of deep learning techniques with audio processing, suggesting the potential for GANs to be a valuable tool in audio synthesis tasks beyond inpainting.
- Theoretical Advancements: The paper opens up avenues for enhancing GAN architectures through the integration of auditory models like Audlet frames, aiming to better mimic human audio perception in machine learning frameworks.
Conclusion
GACELA is a notable advancement in long-gap audio inpainting, addressing multi-modality and temporal dependencies through its GAN-based architecture. The encouraging results from listening tests suggest that GACELA can produce perceptually coherent audio in realistic situations of large-scale data loss. Future work could extend GACELA to longer gap durations and incorporate more sophisticated auditory models to further raise artifact-detection thresholds and improve audio quality.