- The paper introduces AV-GAS, a novel Audio-Visual Generation and Separation model that can generate either a single composite image or separate per-class images from complex, mixed-audio soundscapes.
- AV-GAS utilizes an Audio-Visual Separator to distinguish audio sources and an Image Generator leveraging BigGAN to synthesize visuals, trained on VGGSound data.
- The model achieves notable improvements in capturing represented classes (7% CRS, 4% R@2*) compared to state-of-the-art methods and has potential applications in surveillance and multimedia creation.
Overview of Seeing Soundscapes: Audio-Visual Generation and Separation from Soundscapes Using Audio-Visual Separator
This paper introduces the Audio-Visual Generation and Separation model (AV-GAS), designed to overcome a limitation of existing audio-to-image methods: conventional models generate images only from single-class audio inputs. This research addresses a more challenging yet practical scenario, generating images from soundscapes, i.e., mixed audio containing multiple classes. Given mixed audio, AV-GAS can generate a single composite image, and it can also perform audio-visual separation by producing a distinct image for each class present in the mixture.
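To make the two operating modes concrete, the sketch below shows a hypothetical PyTorch interface for a model with this behaviour; the class name, argument names, and tensor shapes are assumptions for illustration, not the authors' code.

```python
import torch
import torch.nn as nn


class AVGASSketch(nn.Module):
    """Illustrative-only sketch of the two generation modes described above.
    Module names and shapes are assumptions, not the paper's implementation."""

    def __init__(self, separator: nn.Module, generator: nn.Module):
        super().__init__()
        self.separator = separator  # mixed audio -> per-class audio-visual features
        self.generator = generator  # feature vector -> image

    def forward(self, mixed_audio: torch.Tensor, separate: bool = False):
        # Assume the separator yields one feature per source class: (B, K, D).
        av_features = self.separator(mixed_audio)
        if separate:
            # Audio-visual separation mode: one image per class in the mixture.
            return [self.generator(av_features[:, k]) for k in range(av_features.size(1))]
        # Composite mode: a single image conditioned on all classes in the mixture.
        return self.generator(av_features.mean(dim=1))
```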
Methodology
The core contribution of the paper is a novel architecture comprising two integrated modules: the audio-visual separator and the image generator.
- Audio-Visual Separator: It uses a ResNet-18 backbone to separate mixed audio inputs into distinct audio-visual feature representations. The separator aligns these features with the corresponding ground-truth audio and image embeddings through contrastive learning, using an InfoNCE loss for both audio-to-audio (A2A) and audio-to-visual (A2V) alignment (a minimal sketch of this dual alignment follows the list).
- Image Generator: This generative component produces images either as a single composite or as separate visuals for each class in the mixed audio. Built on the BigGAN architecture following the ICGAN (instance-conditioned GAN) approach, the generator combines, or selectively uses, the separated audio-visual features to synthesize the desired images.
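The separator's dual alignment can be illustrated with a standard InfoNCE formulation. The sketch below is a minimal PyTorch rendering under assumed tensor shapes, temperature, and loss weighting; it is not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def info_nce(queries: torch.Tensor, keys: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Standard InfoNCE: matching (query, key) pairs share a batch index; all other
    keys in the batch act as negatives. Both inputs have shape (B, D)."""
    queries = F.normalize(queries, dim=-1)
    keys = F.normalize(keys, dim=-1)
    logits = queries @ keys.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(queries.size(0), device=queries.device)
    return F.cross_entropy(logits, targets)


def separator_alignment_loss(sep_audio_feat: torch.Tensor,
                             gt_audio_emb: torch.Tensor,
                             gt_image_emb: torch.Tensor,
                             weight_a2v: float = 1.0) -> torch.Tensor:
    """Combine audio-to-audio (A2A) and audio-to-visual (A2V) alignment terms.
    The weighting and temperature are assumptions, not values from the paper."""
    loss_a2a = info_nce(sep_audio_feat, gt_audio_emb)  # align with ground-truth audio embedding
    loss_a2v = info_nce(sep_audio_feat, gt_image_emb)  # align with ground-truth image embedding
    return loss_a2a + weight_a2v * loss_a2v
```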
The model is trained on audio-visual samples drawn from the VGGSound dataset, simulating realistic soundscapes by mixing audio from different combinations of sound classes.
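Such mixed-audio training samples can be simulated by summing two single-class clips and converting the result to a spectrogram. The snippet below is a minimal sketch of this kind of preprocessing, assuming torchaudio and arbitrary mel/SNR settings; the paper's exact mixing procedure may differ.

```python
import torch
import torchaudio


def make_soundscape(wav_a: torch.Tensor, wav_b: torch.Tensor,
                    snr_db: float = 0.0, sample_rate: int = 16000) -> torch.Tensor:
    """Mix two single-class clips into one soundscape and return a log-mel spectrogram.
    The mixing ratio, sample rate, and mel settings are illustrative assumptions."""
    # Trim to a common length so the two waveforms can be summed.
    n = min(wav_a.size(-1), wav_b.size(-1))
    wav_a, wav_b = wav_a[..., :n], wav_b[..., :n]

    # Scale the second clip to the requested signal-to-noise ratio before mixing.
    power_a = wav_a.pow(2).mean()
    power_b = wav_b.pow(2).mean().clamp_min(1e-8)
    gain = torch.sqrt(power_a / (power_b * 10 ** (snr_db / 10)))
    mixture = wav_a + gain * wav_b

    # Log-mel spectrogram as a typical audio-network input representation.
    mel = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=64)(mixture)
    return torch.log(mel + 1e-6)
```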
Quantitative and Qualitative Assessment
The effectiveness of AV-GAS is rigorously evaluated with several key metrics:
- Class Representation Score (CRS) and R@2*: These metrics are tailored to measure how well all classes represented in the mixed audio are captured, and they are central to validating the paper's claims. AV-GAS achieves a 7% improvement in CRS and a 4% higher R@2* compared with existing state-of-the-art models (a hedged sketch of such a metric follows the list).
- Inception Score (IS) and Fréchet Inception Distance (FID): These indicators primarily assess the quality of generated images; the paper reports competitive performance on both, indicating robust image-synthesis fidelity alongside the primary task-focused gains.
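The summary does not reproduce the exact formulas, but a class-representation-style metric can be approximated by checking whether a pretrained image classifier recognises every source class in a generated image. The sketch below is one plausible proxy under that assumption; it should not be read as the paper's official CRS or R@2* definition.

```python
import torch


@torch.no_grad()
def class_representation_fraction(images: torch.Tensor,
                                  target_class_sets,
                                  classifier: torch.nn.Module,
                                  k: int = 2) -> float:
    """Rough proxy for a class-representation-style metric: the fraction of generated
    images whose top-k predicted classes cover every class in the source mixture.
    This is an illustrative assumption, not the paper's exact CRS or R@2* metric."""
    logits = classifier(images)            # (N, num_classes)
    topk = logits.topk(k, dim=-1).indices  # (N, k) predicted class indices
    hits = 0
    for preds, targets in zip(topk.tolist(), target_class_sets):
        if set(targets).issubset(set(preds)):
            hits += 1
    return hits / len(target_class_sets)
```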
Qualitative results show that the model generates plausible scenes in both realistic and unrealistic contexts, demonstrating its versatility and the contextual understanding it derives from complex soundscapes.
Implications and Future Directions
The research shifts the frontier of audio-visual synthesis by enabling machines to interpret and generate visual scenes from complex auditory inputs, a step toward more integrative AI perception systems. Practically, these advances could enhance applications in surveillance, multimedia creation, and real-time virtual environments, enriching interactivity and contextual representation.
The paper proposes extending the model to handle more than two audio streams simultaneously, aiming for generalizability beyond the current background/foreground setup. Further refinement of the embedding-placement strategy is expected to improve multi-source audio processing, akin to tasks in music source separation.
AV-GAS thus marks an important step for generative models, establishing a path toward nuanced audio-visual interpretation and setting a benchmark for subsequent research in this growing interdisciplinary domain.