- The paper introduces CoDi, a model that extends latent diffusion to enable flexible multimodal outputs by aligning text, image, video, and audio.
- It employs a bridging alignment strategy based on contrastive learning to efficiently align representations across modalities.
- The model reports competitive FID scores for image generation and strong FAD scores for audio, while supporting zero-shot inference across diverse input-output combinations.
Composable Diffusion: An Innovative Approach to Any-to-Any Multimodal Generation
The paper introduces Composable Diffusion (CoDi), a generative model capable of any-to-any multimodal generation: producing any combination of output modalities (text, image, video, audio) from any combination of input modalities. Unlike traditional models, which typically map one predefined input modality to one predefined output modality (e.g., text-to-image), CoDi handles arbitrary combinations of inputs and outputs and can generate multiple modalities simultaneously.
Core Methodological Advances
- Composable Diffusion Structure: CoDi builds on latent diffusion models (LDMs), training an independent LDM for each modality (text, image, video, and audio) to preserve high-quality single-modality generation. It then composes these models by adding cross-attention between their generative processes, enabling joint multimodal outputs.
- Bridging Alignment Strategy: The paper introduces a strategy for aligning modalities at both the input-conditioning and diffusion stages. Text serves as the bridging modality because paired text data is widely available for the other modalities. Contrastive learning aligns each modality's embeddings with the text embeddings, so the number of alignment objectives grows linearly with the number of modalities rather than quadratically with the number of modality pairs (a minimal loss sketch follows this list).
- Multimodal Conditioning and Generation: CoDi projects all input prompts into a shared, aligned embedding space, so conditions from multiple inputs can be combined by weighted interpolation of their embeddings. This enables zero-shot inference on input combinations that were never seen together during training (see the conditioning sketch after this list).
- Environment Encoder Alignment: To support joint multimodal generation, each LDM's latents are projected by an environment encoder, and these encoders are aligned across modalities with contrastive learning. During joint denoising, each modality's diffuser cross-attends to the others' environment embeddings, which is what makes any-to-any generation composable (see the latent cross-attention sketch below).
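To make the bridging alignment concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss that pulls one modality's embeddings toward their paired text embeddings. The function name, tensor shapes, and temperature value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def contrastive_bridge_loss(modality_emb: torch.Tensor,
                            text_emb: torch.Tensor,
                            temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE-style loss between a batch of modality embeddings
    (e.g., audio or video) and their paired text embeddings (the bridge)."""
    # Normalize so that dot products are cosine similarities.
    m = F.normalize(modality_emb, dim=-1)   # (B, D)
    t = F.normalize(text_emb, dim=-1)       # (B, D)
    logits = m @ t.T / temperature          # (B, B) pairwise similarities
    targets = torch.arange(m.size(0), device=m.device)  # matched pairs on the diagonal
    loss_m2t = F.cross_entropy(logits, targets)    # modality -> text
    loss_t2m = F.cross_entropy(logits.T, targets)  # text -> modality
    return 0.5 * (loss_m2t + loss_t2m)

# Aligning every modality to text means the number of such objectives grows
# linearly with the number of modalities, not quadratically with modality pairs.
```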
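The conditioning mechanism can be sketched as a simple convex combination of aligned prompt embeddings. The embedding dimension, example inputs, and weighting scheme below are assumptions for illustration; the key point is that alignment makes the embeddings directly interpolable.

```python
import torch

def interpolate_conditions(embeddings: list[torch.Tensor],
                           weights: list[float]) -> torch.Tensor:
    """Combine aligned prompt embeddings (one per input modality) into a single
    conditioning vector via a convex combination."""
    assert len(embeddings) == len(weights)
    assert abs(sum(weights) - 1.0) < 1e-6, "weights should sum to 1"
    condition = torch.zeros_like(embeddings[0])
    for emb, w in zip(embeddings, weights):
        condition = condition + w * emb
    return condition

# Example: condition generation on both a text prompt and an audio clip.
text_emb  = torch.randn(1, 768)  # output of the text prompt encoder (dimension assumed)
audio_emb = torch.randn(1, 768)  # output of the audio prompt encoder, aligned to text
cond = interpolate_conditions([text_emb, audio_emb], [0.5, 0.5])
# `cond` is then supplied as the cross-attention context of the target modality's LDM.
```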
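Joint generation can likewise be sketched as a cross-attention block in which the latent being denoised attends to a co-generated modality's environment embedding. Module names, dimensions, and the residual update are assumptions rather than CoDi's exact architecture.

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """One modality's denoising latent attends to another modality's
    environment embedding during joint generation."""
    def __init__(self, latent_dim: int = 320, env_dim: int = 256, heads: int = 8):
        super().__init__()
        self.env_proj = nn.Linear(env_dim, latent_dim)  # map environment tokens into the attention space
        self.attn = nn.MultiheadAttention(latent_dim, heads, batch_first=True)

    def forward(self, latent_tokens: torch.Tensor, env_tokens: torch.Tensor) -> torch.Tensor:
        # latent_tokens: (B, N, latent_dim) flattened latent of the modality being denoised
        # env_tokens:    (B, M, env_dim) environment embedding of the co-generated modality
        ctx = self.env_proj(env_tokens)
        attended, _ = self.attn(latent_tokens, ctx, ctx)
        return latent_tokens + attended  # residual update keeps the original latent signal

# Because the environment encoders are contrastively aligned, any pair (or group)
# of modalities can exchange information this way without pairwise retraining.
```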
Experimental Outcomes
CoDi demonstrates strong performance across a diverse range of multimodal tasks. Key results include:
- Image Generation Performance: CoDi's image generation quality, measured by FID, is competitive with other state-of-the-art models.
- Audio Generation: CoDi outperforms comparison models on audio generation tasks, as measured by benchmarks such as FAD (Fréchet Audio Distance).
- Multimodal Evaluation: CoDi generates synchronized outputs, such as temporally aligned video and audio, while maintaining consistent quality across varying combinations of inputs.
Implications in AI Development
The implications of CoDi's development are manifold:
- Practical Applications: The model's flexibility and efficiency have potential applications in immersive human-computer interactions, enabling coherent and synchronized content delivery across multiple modalities.
- Advancements in Generative Models: CoDi's approach could inform future research in generative AI, particularly in the development of versatile, integrated multimodal systems.
- Challenges and Future Work: While the paper advances multimodal generation, it also highlights open challenges, such as maintaining alignment as the set of modalities grows and scaling to larger numbers of input-output combinations.
Conclusion
CoDi represents a significant advance in generative AI by overcoming the limitations of modality-specific models. The research not only pushes the technical development of multimodal AI systems forward but also opens the door to more natural, comprehensive machine understanding of multi-sensory experiences. While the model exhibits promising capabilities, future work could explore refined alignment strategies and broader applications, while attending to ethical concerns such as bias and misinformation that accompany powerful generative systems.