Generalization of text-to-image flow-matching recipes to audio and text modalities
Determine whether the training configurations for flow-based generative models that are effective in text-to-image synthesis—specifically the use of a logit-normal distribution over timesteps and related design variants—directly generalize to multi-modal generative modeling tasks involving audio and text modalities within the rectified flow framework.
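To make the configuration under study concrete, the sketch below shows logit-normal timestep sampling combined with the standard rectified-flow linear interpolation and constant velocity target. This is a minimal illustration under common conventions (u ~ N(mean, std), t = sigmoid(u); x_t = (1 - t) x0 + t x1; v = x1 - x0); the function names and default parameters are illustrative, not taken from the OmniFlow paper.

```python
import numpy as np

def sample_logit_normal_t(n, mean=0.0, std=1.0, rng=None):
    """Draw n timesteps in (0, 1) from a logit-normal distribution:
    u ~ N(mean, std), t = sigmoid(u). mean/std are the location/scale
    knobs that text-to-image recipes tune."""
    rng = np.random.default_rng(rng)
    u = rng.normal(mean, std, size=n)
    return 1.0 / (1.0 + np.exp(-u))

def rectified_flow_pair(x0, x1, t):
    """Rectified-flow training pair: the linear interpolation
    x_t = (1 - t) * x0 + t * x1 and the constant target velocity
    v = x1 - x0 that the model is regressed against."""
    # Broadcast per-sample t over the remaining data dimensions.
    t = t.reshape(-1, *([1] * (x0.ndim - 1)))
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

# Example: a batch of 4 two-dimensional samples.
rng = np.random.default_rng(0)
x0 = rng.normal(size=(4, 2))   # noise endpoint
x1 = rng.normal(size=(4, 2))   # data endpoint
t = sample_logit_normal_t(4, mean=0.0, std=1.0, rng=0)
x_t, v_target = rectified_flow_pair(x0, x1, t)
```

The open question is whether the (mean, std) choices validated for images remain good defaults when x1 lives in an audio or text latent space.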
References
They also explored a logit-normal distribution of timestep t for text-to-image generation. We explore all these variants in the context of multi-modal generation, particularly for audio and text, as it is unclear if the results from text-to-image domain can be directly generalized.
— OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows
(arXiv:2412.01169, Li et al., 2 Dec 2024), Section 2.1 (Flow-Based Generative Models)