- The paper presents a compact transformer model that achieves strong image-to-audio generation, reaching an FAD of 1.29 and outperforming baseline techniques.
- It employs pretrained VQGANs for dual-mode tokenization and a full-attention transformer trained with a mask-denoising objective, with classifier-free guidance applied at inference.
- The same model also handles audio-to-image generation and joint audio-image co-generation, pointing toward real-time, cross-modal AI applications.
Exploring Lightweight Multimodal Generative AI with Transformers
Introduction
Generative AI has seen impressive advances in creating content from various data modalities, particularly text and images. However, cross-modal generation, such as converting images to audio or vice versa, has not kept pace. This paper addresses that gap by proposing a simple generative transformer model for multimodal generation tasks, focusing on image-to-audio (image2audio) generation.
Key Contributions
Here are the primary takeaways from this research:
- Compact Model: Unlike existing methods that rely on very large models, this paper demonstrates that a lightweight generative transformer can effectively perform image2audio generation.
- Versatility: The model can also handle audio-to-image (audio2image) generation and joint audio-image co-generation without retraining.
- Performance: The proposed method surpasses other state-of-the-art techniques in image2audio generation, even outperforming models based on diffusion or autoregressive transformers.
The Proposed Method
Data Preparation and Tokenization
The researchers used the VGGSound dataset, which contains paired audio and visual data from videos. The data preparation includes the following steps (a rough preprocessing sketch follows the list):
- Sampling Frames: Uniformly sampling 10 frames per video.
- Mel Spectrograms: Converting 10-second audio clips to mel spectrograms.
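As a concrete illustration of this preprocessing, here is a minimal sketch using OpenCV and librosa. The frame resolution, sample rate, and mel parameters below are illustrative assumptions, not the paper's exact settings.

```python
# Illustrative preprocessing sketch; frame size, sample rate, and mel
# parameters are assumptions, not the paper's reported configuration.
import cv2          # pip install opencv-python
import librosa      # pip install librosa
import numpy as np

def sample_frames(video_path, num_frames=10, size=(256, 256)):
    """Uniformly sample `num_frames` RGB frames from a video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame, size))
    cap.release()
    return np.stack(frames)  # (num_frames, H, W, 3)

def audio_to_mel(audio_path, sr=16000, duration=10.0, n_mels=80):
    """Convert a fixed-length audio clip to a log-mel spectrogram."""
    wav, _ = librosa.load(audio_path, sr=sr, duration=duration)
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_mels=n_mels)
    return librosa.power_to_db(mel)  # (n_mels, time)
```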
For tokenization, they employed pretrained Vector-Quantized GANs (VQGANs) for both images and audio, translating these modalities into discrete token sequences. This allows the transformer to operate in a compact latent space, significantly speeding up training.
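A minimal sketch of how VQ tokenization turns an image or mel spectrogram into a discrete token sequence is shown below; the encoder interface and codebook shape are assumptions for illustration rather than the paper's actual VQGAN checkpoints.

```python
# Hedged sketch of VQ tokenization: a pretrained VQGAN encoder maps an input
# (image or mel spectrogram) to a latent grid, and each latent vector is
# replaced by the index of its nearest codebook entry.
import torch

@torch.no_grad()
def tokenize(x, encoder, codebook):
    """
    x:        (B, C, H, W) image batch or (B, 1, n_mels, T) spectrogram batch
    encoder:  pretrained VQGAN encoder producing (B, D, h, w) latents
    codebook: (K, D) learned embedding table
    returns:  (B, h*w) discrete token indices
    """
    z = encoder(x)                                # (B, D, h, w)
    B, D, h, w = z.shape
    z = z.permute(0, 2, 3, 1).reshape(-1, D)      # (B*h*w, D)
    dists = torch.cdist(z, codebook)              # nearest-neighbour lookup
    tokens = dists.argmin(dim=-1)                 # (B*h*w,)
    return tokens.view(B, h * w)
```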
Generative Transformer
The backbone of the model is a full-attention transformer divided into encoder and decoder parts, inspired by the Vision Transformer (ViT). The training involves a mask denoising task, where a certain ratio of tokens is masked, and the model learns to predict these masked tokens.
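A hedged sketch of one mask-denoising training step, here shown for image-conditioned audio tokens, is below. The mask-token id, the fixed mask ratio, and the `model(src=..., tgt=...)` interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # assumed reserved id for the [MASK] token

def mask_denoising_step(model, image_tokens, audio_tokens, mask_ratio=0.5):
    """
    Randomly mask a fraction of the target (audio) tokens and train the
    transformer to predict the originals, conditioned on the image tokens.
    """
    B, L = audio_tokens.shape
    mask = torch.rand(B, L, device=audio_tokens.device) < mask_ratio
    corrupted = audio_tokens.masked_fill(mask, MASK_ID)

    # encoder consumes image tokens, decoder reconstructs the audio tokens
    logits = model(src=image_tokens, tgt=corrupted)   # (B, L, vocab_size)

    # loss is computed only on the masked positions
    return F.cross_entropy(logits[mask], audio_tokens[mask])
```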
Inference and Classifier-Free Guidance
During inference, the model generates content through an iterative unmasking strategy, whether for image2audio, audio2image, or co-generation. The authors also apply classifier-free guidance (CFG) at inference time, without any additional training, which further improves quality.
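A minimal sketch of guided iterative unmasking is shown below. The decoding schedule, guidance scale, and the `null_cond` placeholder for the unconditional pass are assumptions; the guidance step follows the standard CFG recipe of extrapolating conditional logits away from unconditional ones.

```python
import torch

@torch.no_grad()
def generate_audio_tokens(model, image_tokens, seq_len,
                          steps=12, guidance_scale=3.0,
                          mask_id=0, null_cond=None):
    """Iterative unmasking with classifier-free guidance (illustrative)."""
    B = image_tokens.shape[0]
    device = image_tokens.device
    tokens = torch.full((B, seq_len), mask_id, dtype=torch.long, device=device)

    for step in range(steps):
        # CFG: push logits away from the unconditional prediction
        cond = model(src=image_tokens, tgt=tokens)    # (B, seq_len, V)
        uncond = model(src=null_cond, tgt=tokens)     # null/empty condition
        logits = uncond + guidance_scale * (cond - uncond)

        probs = logits.softmax(dim=-1)
        confidence, candidates = probs.max(dim=-1)    # (B, seq_len)

        # only still-masked positions are eligible to be revealed this step
        masked = tokens == mask_id
        if not masked.any():
            break
        confidence = confidence.masked_fill(~masked, float("-inf"))

        # reveal the top-k most confident masked positions
        remaining = int(masked.sum(dim=-1).min())
        k = max(1, remaining // (steps - step))
        topk = confidence.topk(k, dim=-1).indices     # (B, k)
        tokens.scatter_(1, topk, candidates.gather(1, topk))

    return tokens
```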
Experimental Results
Image2Audio Generation
The paper compares the proposed method with state-of-the-art techniques such as Im2Wav and DiffFoley. The main evaluation metrics are Fréchet Audio Distance (FAD), Fréchet Distance (FD), and Inception Score (IS); a sketch of the underlying Fréchet distance computation follows the list. Here is a snapshot of the results:
- FAD: Lower FAD indicates better generation quality. The proposed method with CFG achieves an FAD of 1.29, better than the baselines.
- FD: The new model also leads with an FD of 14.79.
- IS: Achieving a high Inception Score of 12.06 with CFG, the model demonstrates good quality and diversity.
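For reference, FAD and FD both fit Gaussians to embeddings of real and generated samples (FAD conventionally uses VGGish audio embeddings) and compare the two distributions with the Fréchet distance. A generic sketch is below, with the choice of embedding model left open.

```python
# Fréchet distance between Gaussian fits of two embedding sets; the embedding
# extractor (e.g. VGGish for FAD) is assumed to have been applied already.
import numpy as np
from scipy import linalg

def frechet_distance(real_emb, gen_emb):
    """real_emb, gen_emb: (N, D) arrays of embeddings."""
    mu_r, mu_g = real_emb.mean(axis=0), gen_emb.mean(axis=0)
    cov_r = np.cov(real_emb, rowvar=False)
    cov_g = np.cov(gen_emb, rowvar=False)

    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    return diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean)
```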
Visual Results and General Insights
Although the model primarily targets image2audio generation, it also performs well on audio2image tasks. Guided inference helps the generated audio stay semantically aligned with the visual input.
Implications and Future Work
The paper's key outcome is that a simple transformer-based approach can handle complex cross-modal generation tasks effectively. This opens several avenues for future research:
- Improved VQGANs: Utilizing more robust VQGANs could further enhance the quality of generated images and audio.
- Broader Applications: The versatility of this approach could extend to other cross-modal tasks like text-to-image or text-to-audio.
- Efficiency: The lightweight nature of the proposed method makes it suitable for real-time applications, potentially transforming areas like augmented reality and interactive media experiences.
Conclusion
This paper demonstrates a simple and effective approach to multimodal generation tasks using a generative transformer. The strong performance and lightweight design of the proposed model make it a promising candidate for future research and practical applications in multimodal AI.