Visual Echoes: A Simple Unified Transformer for Audio-Visual Generation (2405.14598v2)

Published 23 May 2024 in cs.CV, cs.LG, cs.MM, cs.SD, and eess.AS

Abstract: In recent years, with the realistic generation results and a wide range of personalized applications, diffusion-based generative models gain huge attention in both visual and audio generation areas. Compared to the considerable advancements of text2image or text2audio generation, research in audio2visual or visual2audio generation has been relatively slow. The recent audio-visual generation methods usually resort to huge LLM or composable diffusion models. Instead of designing another giant model for audio-visual generation, in this paper we take a step back showing a simple and lightweight generative transformer, which is not fully investigated in multi-modal generation, can achieve excellent results on image2audio generation. The transformer operates in the discrete audio and visual Vector-Quantized GAN space, and is trained in the mask denoising manner. After training, the classifier-free guidance could be deployed off-the-shelf achieving better performance, without any extra training or modification. Since the transformer model is modality symmetrical, it could also be directly deployed for audio2image generation and co-generation. In the experiments, we show that our simple method surpasses recent image2audio generation methods. Generated audio samples can be found at https://docs.google.com/presentation/d/1ZtC0SeblKkut4XJcRaDsSTuCRIXB3ypxmSi7HTY3IyQ/


Summary

  • The paper presents a compact transformer model for image-to-audio generation that achieves an FAD of 1.29, outperforming baseline techniques.
  • It employs pretrained VQGANs for dual-mode tokenization and a full-attention transformer with mask denoising and classifier-free guidance to optimize performance.
  • Its versatility in handling audio-to-image and joint audio-image co-generation signals significant potential for real-time, cross-modal AI applications.

Exploring Lightweight Multimodal Generative AI with Transformers

Introduction

Generative AI has seen impressive advances in creating content from various data modalities, particularly text and images. Cross-modal generation, however, such as converting images to audio or vice versa, has not advanced at the same pace. This paper addresses that gap by proposing a simple generative transformer for multimodal generation, focusing on image-to-audio (image2audio) generation.

Key Contributions

Here are the primary takeaways from this research:

  • Compact Model: Unlike some existing methods requiring gigantic models, this paper demonstrates that a lightweight generative transformer can effectively perform image2audio generation.
  • Versatility: The model can also handle audio-to-image (audio2image) generation and joint audio-image co-generation without retraining.
  • Performance: The proposed method surpasses other state-of-the-art techniques in image2audio generation, even outperforming models based on diffusion or autoregressive transformers.

The Proposed Method

Data Preparation and Tokenization

The researchers used the VGGSound dataset, which contains paired audio and visual data from videos. The data preparation includes:

  1. Sampling Frames: Uniformly sampling 10 frames per video.
  2. Mel Spectrograms: Converting 10-second audio clips to mel spectrograms (a minimal conversion sketch follows this list).
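
As a concrete illustration of the second step, the sketch below converts a 10-second clip into a log-mel spectrogram with librosa. The sampling rate, FFT size, hop length, and number of mel bins are illustrative assumptions, not the paper's reported configuration.

```python
import librosa
import numpy as np

def audio_to_logmel(path, sr=16000, n_mels=128):
    """Load a 10-second clip and convert it to a log-mel spectrogram.

    The hyperparameters (sr, n_fft, hop_length, n_mels) are assumed values
    for illustration, not the paper's exact settings.
    """
    y, sr = librosa.load(path, sr=sr, duration=10.0)
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=n_mels
    )
    # Convert power to decibels so the dynamic range suits the audio VQGAN.
    return librosa.power_to_db(mel, ref=np.max)
```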

For tokenization, they employed pretrained Vector-Quantized GANs (VQGANs) for both images and audio, translating these modalities into discrete token sequences. This allows the transformer to operate in a compact latent space, significantly speeding up training.
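
To make the tokenization step concrete, here is a minimal sketch of how frozen, pretrained VQGANs could map a batch of frames and a log-mel spectrogram into one discrete token sequence. The encode_to_indices method is a hypothetical interface standing in for whatever the actual VQGAN checkpoints expose; only the overall flow (encode, quantize to codebook indices, concatenate) follows the description above.

```python
import torch

@torch.no_grad()
def tokenize_pair(image_vqgan, audio_vqgan, frames, logmel):
    """Map RGB frames and a log-mel spectrogram to one discrete token sequence.

    image_vqgan / audio_vqgan: frozen, pretrained VQGAN modules assumed to
    expose encode_to_indices (a hypothetical method) that runs the encoder,
    quantizes to codebook entries, and returns the indices as (B, N).
    """
    img_tokens = image_vqgan.encode_to_indices(frames)   # (B, N_img)
    aud_tokens = audio_vqgan.encode_to_indices(logmel)   # (B, N_aud)
    # One concatenated sequence lets a single transformer attend across
    # both modalities.
    return torch.cat([img_tokens, aud_tokens], dim=1)
```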

Generative Transformer

The backbone of the model is a full-attention transformer divided into encoder and decoder parts, inspired by the Vision Transformer (ViT). The training involves a mask denoising task, where a certain ratio of tokens is masked, and the model learns to predict these masked tokens.
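
Below is a minimal sketch of one mask-denoising training step, assuming a transformer that maps a (B, N) token sequence to per-position logits over the codebook vocabulary. The reserved mask id and the uniform masking-ratio draw are assumptions; the paper may use a different masking schedule.

```python
import torch
import torch.nn.functional as F

def mask_denoise_step(model, tokens, optimizer, mask_id):
    """One mask-denoising step on a discrete audio+image token sequence.

    model: transformer mapping (B, N) token ids to (B, N, vocab) logits.
    mask_id: a reserved id outside the VQGAN codebooks used as the [MASK]
    token (an assumption about how masking is represented).
    """
    B, N = tokens.shape
    # Draw a masking ratio per example; a uniform draw is used here for
    # simplicity, though the actual schedule may differ.
    ratio = torch.rand(B, 1, device=tokens.device)
    mask = torch.rand(B, N, device=tokens.device) < ratio
    corrupted = tokens.masked_fill(mask, mask_id)

    logits = model(corrupted)                        # (B, N, vocab)
    # The loss is computed only at the masked positions.
    loss = F.cross_entropy(logits[mask], tokens[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```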

Inference and Classifier-Free Guidance

During inference, the model uses an iterative unmasking strategy to generate content, either for image2audio, audio2image, or co-generation. Furthermore, they incorporated classifier-free guidance (CFG) without additional training, enhancing performance.
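
The sketch below shows the general shape of this inference loop for the image2audio direction: all target (audio) tokens start masked, conditional and unconditional logits are blended for classifier-free guidance, and the most confident predictions are fixed each round. The model interface, the guidance formula, and the per-step unmasking schedule are generic assumptions rather than the paper's exact procedure.

```python
import torch

@torch.no_grad()
def generate_with_cfg(model, cond_tokens, num_target, mask_id,
                      steps=8, guidance=3.0):
    """Iteratively unmask target tokens, steered by classifier-free guidance.

    cond_tokens: (B, N_cond) tokens of the conditioning modality (e.g. image
    tokens for image2audio). The unconditional branch simply replaces the
    condition with all-mask tokens, so CFG needs no extra training.
    """
    B, device = cond_tokens.size(0), cond_tokens.device
    target = torch.full((B, num_target), mask_id, dtype=torch.long, device=device)
    null_cond = torch.full_like(cond_tokens, mask_id)

    n_left = num_target
    for step in range(steps):
        logits_c = model(torch.cat([cond_tokens, target], dim=1))[:, -num_target:]
        logits_u = model(torch.cat([null_cond, target], dim=1))[:, -num_target:]
        # Classifier-free guidance: push the logits away from the
        # unconditional prediction.
        logits = logits_u + guidance * (logits_c - logits_u)

        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)                          # (B, num_target)
        still_masked = target.eq(mask_id)
        conf = conf.masked_fill(~still_masked, float("-inf"))   # keep fixed tokens

        # Fix an equal share of the most confident masked positions per step.
        k = n_left // (steps - step)
        idx = conf.topk(k, dim=-1).indices
        target.scatter_(1, idx, pred.gather(1, idx))
        n_left -= k
    return target
```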

Experimental Results

Image2Audio Generation

The paper compares the proposed method with state-of-the-art techniques such as Im2Wav and DiffFoley. The main evaluation metrics are Fréchet Audio Distance (FAD), Fréchet Distance (FD), and Inception Score (IS). Here's a snapshot of the results:

  • FAD: Lower is better; the proposed method with CFG achieves an FAD of 1.29, beating the baselines.
  • FD: The model also leads with an FD of 14.79 (lower is better).
  • IS: Higher is better; with CFG the model reaches an Inception Score of 12.06, indicating good quality and diversity.

Visual Results and General Insights

Although the model primarily targets image2audio generation, it also performs well on audio2image tasks. Guided inference helps the generated audio align closely with the visual input.

Implications and Future Work

The paper's key outcome is that a simple transformer-based approach can handle complex cross-modal generation tasks effectively. This opens several avenues for future research:

  1. Improved VQGANs: Utilizing more robust VQGANs could further enhance the quality of generated images and audio.
  2. Broader Applications: The versatility of this approach could extend to other cross-modal tasks like text-to-image or text-to-audio.
  3. Efficiency: The lightweight nature of the proposed method makes it suitable for real-time applications, potentially transforming areas like augmented reality and interactive media experiences.

Conclusion

This paper highlights a simple yet effective approach to multimodal generation using a generative transformer. The strong performance and lightweight design of the proposed model make it a promising candidate for future research and practical applications in multimodal AI.