
OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows (2412.01169v1)

Published 2 Dec 2024 in cs.MM, cs.CV, cs.SD, and eess.AS

Abstract: We introduce OmniFlow, a novel generative model designed for any-to-any generation tasks such as text-to-image, text-to-audio, and audio-to-image synthesis. OmniFlow advances the rectified flow (RF) framework used in text-to-image models to handle the joint distribution of multiple modalities. It outperforms previous any-to-any models on a wide range of tasks, such as text-to-image and text-to-audio synthesis. Our work offers three key contributions: First, we extend RF to a multi-modal setting and introduce a novel guidance mechanism, enabling users to flexibly control the alignment between different modalities in the generated outputs. Second, we propose a novel architecture that extends the text-to-image MMDiT architecture of Stable Diffusion 3 and enables audio and text generation. The extended modules can be efficiently pretrained individually and merged with the vanilla text-to-image MMDiT for fine-tuning. Lastly, we conduct a comprehensive study on the design choices of rectified flow transformers for large-scale audio and text generation, providing valuable insights into optimizing performance across diverse modalities. The Code will be available at https://github.com/jacklishufan/OmniFlows.

An Examination of OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows

The paper presents OmniFlow, a generative model aimed at advancing the any-to-any generation space, targeting text-to-image, text-to-audio, audio-to-image, and more complex multi-modal input and output tasks. In contrast to prior models that often specialize in single-modality generation due to high computational demands and data constraints, OmniFlow models the joint distribution of multiple modalities through a modular, unified approach.

Key Contributions

OmniFlow stems from the Rectified Flow (RF) framework known for text-to-image generation and expands it into a multi-modal domain. The authors outline three primary contributions in advancing any-to-any generation models:

  1. Multi-Modal Rectified Flow (MMRF): The paper extends RF to accommodate a multi-modal setting, introducing a novel guidance mechanism that permits flexible control over the alignment between diverse modalities in generated outputs. This is crucial as it allows the model to handle varying input and output combinations without compromising on fidelity or realism.
  2. Modular Architecture Design: OmniFlow employs a novel architecture that extends the MMDiT architecture of Stable Diffusion 3, lifting its capabilities from text-to-image to any-to-any generation tasks. This modular design allows modality-specific components to be pretrained individually before being merged and fine-tuned, substantially reducing computational overhead.
  3. Comprehensive Architectural Study: Extensive evaluation of the design choices in rectified flow transformers for multi-modal generation is presented, providing insights into optimizing model performance across modalities while ensuring a cohesive output.
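To make the first contribution concrete, recall that a rectified flow trains a velocity network against the constant displacement along a straight noise-to-data path. The toy sketch below (an illustration, not OmniFlow's actual code; the per-modality dictionary and shared timestep are simplifying assumptions) shows the interpolation and regression target that a multi-modal extension would apply to each modality's latent:

```python
import numpy as np

def rf_interpolate(x0, x1, t):
    """Straight-line path used by rectified flow: x_t = (1 - t) * x0 + t * x1."""
    return (1.0 - t) * x0 + t * x1

def rf_target_velocity(x0, x1):
    """The velocity network regresses onto the constant displacement x1 - x0."""
    return x1 - x0

# Toy "joint" sample: one latent per modality, each paired with Gaussian noise.
rng = np.random.default_rng(0)
latents = {m: rng.normal(size=8) for m in ("image", "audio", "text")}
noises = {m: rng.normal(size=8) for m in ("image", "audio", "text")}

t = 0.3  # a shared timestep here; per-modality timesteps are also conceivable
x_t = {m: rf_interpolate(noises[m], latents[m], t) for m in latents}
v_target = {m: rf_target_velocity(noises[m], latents[m]) for m in latents}
```

In the single-modality case this reduces to the standard text-to-image rectified flow objective; the multi-modal setting simply carries one such noisy latent and target per modality through a shared transformer.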

Performance and Evaluation

The empirical analysis in the paper highlights that OmniFlow not only outperforms existing any-to-any models across a variety of benchmarks but also maintains competitive performance compared to state-of-the-art single-task models. Notably, OmniFlow achieves competitive FID and CLIP scores in text-to-image synthesis on the MSCOCO-30K and GenEval benchmarks.

OmniFlow also demonstrates superior alignment in text-to-image generation, reflected in higher CLIP scores compared to generalist models like UniDiffuser and CoDi. Additionally, in text-to-audio tasks, OmniFlow exhibits strong results in both FAD and CLAP evaluations on the AudioCaps dataset, surpassing many leading models in qualitative and quantitative measures.
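For readers unfamiliar with the alignment metric cited above: CLIP score, as commonly reported, is the scaled cosine similarity between L2-normalized CLIP embeddings of an image and its prompt. A minimal sketch (the embeddings here are placeholders; in practice they come from a pretrained CLIP model):

```python
import numpy as np

def clip_score(image_emb, text_emb, scale=100.0):
    """Scaled cosine similarity between L2-normalized image and text
    embeddings, clamped at zero as in common CLIP-score implementations."""
    i = image_emb / np.linalg.norm(image_emb)
    t = text_emb / np.linalg.norm(text_emb)
    return scale * max(float(i @ t), 0.0)
```

Higher scores indicate tighter image-text alignment, which is why the metric is used to compare generalist models on prompt fidelity.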

Implications and Future Directions

The introduction of OmniFlow offers several practical and theoretical implications for the future of multi-modal generative models. Practically, its modular design represents a shift towards more computationally viable solutions in generative modeling, especially when handling multiple modality inputs and outputs. Theoretically, it elucidates new insights into integrating rectified flows with diffusion transformers and opens avenues for developing even more generalized models capable of efficiently managing vastly different input-output modality pairs.

Moreover, its ability to flexibly guide the generation process by modulating the influence of input modalities at different stages suggests potential downstream applications in customization-based generation tasks, paving the way for enhanced user interactivity in AI-driven content creation tools.
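The modulation idea is analogous to classifier-free guidance: blend the model's unconditional and modality-conditioned velocity predictions, with a per-modality weight controlling how strongly that input steers generation. The sketch below is a schematic analogue, not OmniFlow's exact mechanism; `v_u` and `v_c` stand in for network outputs:

```python
import numpy as np

def guided_velocity(v_uncond, v_cond, alpha):
    """Classifier-free-style guidance: alpha = 0 ignores the condition,
    alpha = 1 uses it as-is, alpha > 1 extrapolates toward it."""
    return v_uncond + alpha * (v_cond - v_uncond)

v_u = np.array([0.1, -0.2, 0.3])  # hypothetical unconditional velocity
v_c = np.array([0.4, 0.1, 0.0])   # hypothetical velocity conditioned on text
v_guided = guided_velocity(v_u, v_c, alpha=2.0)  # strengthen text alignment
```

Varying `alpha` per modality and per timestep is what would let a user dial the influence of, say, an audio input up or down during generation.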

Conclusion

OmniFlow represents an important step toward realizing versatile multi-modal generative models, addressing some of the significant challenges in any-to-any generation. Its modular framework, paired with insightful evaluations, positions it as both a powerful tool and a foundation upon which future models can be built and optimized. Its release as an open-source model will likely serve as a catalyst for further exploration and innovation within the domain of multi-modal generative AI. The comprehensive nature of this research not only situates OmniFlow as a state-of-the-art model but also underscores the evolution of generative modeling strategies towards greater inclusivity of modality diversity.

Authors (7)
  1. Shufan Li
  2. Konstantinos Kallidromitis
  3. Akash Gokul
  4. Zichun Liao
  5. Yusuke Kato
  6. Kazuki Kozuka
  7. Aditya Grover