Direct comparison with Tango-2 (DPO)

Determine how EzAudio performs relative to Tango-2, a diffusion-based text-to-audio model optimized with Direct Preference Optimization (DPO), through a direct empirical comparison of the two systems.

Background

In its evaluation against state-of-the-art text-to-audio models, the EzAudio paper compares against Tango-1/AF, AudioLDM-1/2, and Make-An-Audio-1/2, but explicitly omits Tango-2. Tango-2 applies Direct Preference Optimization (DPO) to align diffusion-based text-to-audio generation with preference data.

The authors note that EzAudio does not use DPO and therefore leave a comparison with Tango-2 to future work. Such a comparison would clarify how EzAudio fares against a DPO-aligned baseline and, under matched conditions, how much DPO contributes to text-audio alignment and audio quality.
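One way to make "matched conditions" concrete is to score both systems on the same prompts and compare them prompt by prompt. The sketch below is illustrative only and is not a protocol from either paper: the function name paired_bootstrap is hypothetical, the per-prompt scores (for example, a text-audio similarity score per generated clip) are assumed to be computed elsewhere, and the percentile bootstrap is simply one common choice for putting an interval on a paired difference.

```python
import random
from statistics import mean


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, seed=0):
    """Return (mean difference A - B, lower bound, upper bound).

    scores_a and scores_b are per-prompt scores for two systems, aligned so
    that index i refers to the same prompt in both lists (an assumption of
    this sketch). The bounds form a 95% percentile bootstrap interval.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("scores must be non-empty and aligned prompt by prompt")

    # Pair the scores so that prompt difficulty cancels out of the comparison.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(diffs)

    # Resample prompts with replacement and record the mean difference each time.
    resampled = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))

    lower = resampled[int(0.025 * n_resamples)]
    upper = resampled[int(0.975 * n_resamples) - 1]
    return mean(diffs), lower, upper
```

Called with per-prompt scores for EzAudio as scores_a and for Tango-2 as scores_b, an interval that excludes zero would suggest a consistent gap on that metric. It would not by itself separate the effect of DPO from other differences between the two systems.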

References

"Since our model does not use Direct Preference Optimization (DPO), we leave a comparison with Tango-2 for future work."

EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion Transformer (2409.10819 - Hai et al., 17 Sep 2024) in Experiments, Subsection "Comparison with State-of-the-art" (footnote attached to "Tango")