Generalization of text-to-image flow-matching recipes to audio and text modalities

Determine whether the training configurations that have proven effective for flow-based text-to-image synthesis, in particular the logit-normal distribution over timesteps and related objective variants, generalize directly to multi-modal generative modeling tasks involving audio and text within the rectified flow framework.

Background

The paper reviews a unified formulation for several diffusion and flow-matching objectives and notes that recent text-to-image models, such as Stable Diffusion 3, explored specific timestep distributions (e.g., logit-normal) that improved performance in that domain.

Given the shift from single-modality text-to-image generation to multi-modal settings that include audio and text, the authors highlight uncertainty about whether those successful text-to-image design choices transfer directly, motivating a systematic exploration of objective variants and schedules for audio and text generation.
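To make the design choice in question concrete, the sketch below shows a rectified-flow training step that samples timesteps from a logit-normal distribution, following the convention popularized by Stable Diffusion 3 (interpolate x_t = (1 - t) x_0 + t eps and regress the constant velocity eps - x_0). This is a minimal illustration assuming PyTorch; the function names, the model call signature, and the default mean/std are illustrative assumptions, not taken from the OmniFlow paper.

```python
# Minimal sketch (not the paper's code): logit-normal timestep sampling for a
# rectified-flow training step. Convention assumed: x_t = (1 - t) * x_0 + t * eps,
# with the network regressing the velocity target eps - x_0.
import torch

def sample_logit_normal_t(batch_size, mean=0.0, std=1.0, device="cpu"):
    """Draw t in (0, 1) as t = sigmoid(u) with u ~ Normal(mean, std)."""
    u = torch.randn(batch_size, device=device) * std + mean
    return torch.sigmoid(u)

def rectified_flow_loss(model, x0, cond=None, t_mean=0.0, t_std=1.0):
    """One flow-matching training step with logit-normally distributed timesteps.

    `model(x_t, t, cond)` is an assumed signature: it should return a velocity
    prediction with the same shape as x0.
    """
    eps = torch.randn_like(x0)                      # noise endpoint of the straight path
    t = sample_logit_normal_t(x0.shape[0], t_mean, t_std, x0.device)
    t_b = t.view(-1, *([1] * (x0.dim() - 1)))       # broadcast t over data dimensions
    x_t = (1.0 - t_b) * x0 + t_b * eps              # linear (rectified-flow) interpolation
    v_target = eps - x0                             # constant velocity along the path
    v_pred = model(x_t, t, cond)                    # network predicts the velocity field
    return torch.mean((v_pred - v_target) ** 2)
```

Whether the concentration of timestep mass induced by this sampler (which emphasizes mid-trajectory noise levels) remains a good choice when x0 is an audio or text latent rather than an image latent is exactly the open question the authors raise.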

References

They also explored a logit-normal distribution of timestep t for text-to-image generation. We explore all these variants in the context of multi-modal generation, particularly for audio and text, as it is unclear if the results from text-to-image domain can be directly generalized.

Li et al., "OmniFlow: Any-to-Any Generation with Multi-Modal Rectified Flows," arXiv:2412.01169, 2 Dec 2024, Section 2.1 (Flow-Based Generative Models).