- The paper introduces a distillation method that mimics CFG-enhanced predictions to lower computational overhead.
- The paper integrates reinforcement learning with a diversity reward, using negative cosine similarity to promote varied content.
- The paper employs model weight interpolation to balance quality and diversity, achieving superior trade-offs in music generation.
Diversity-Rewarded CFG Distillation: A Review
"Diversity-Rewarded CFG Distillation" explores advancements in generative model training, focusing on music generation, with the primary aim of optimizing the quality-diversity trade-off during content generation. The paper presents innovative finetuning techniques to distill classifier-free guidance (CFG) into model weights while addressing the computational overhead and diversity reduction commonly associated with CFG.
Background and Objectives
Generative models in creative domains such as music often rely on inference-time strategies like CFG to enhance output fidelity. Although CFG is effective at improving alignment between generated content and user prompts, it doubles inference cost, since each decoding step requires both a conditional and an unconditional forward pass, and it reduces output diversity. This paper proposes a novel finetuning procedure, diversity-rewarded CFG distillation, that targets both limitations: it preserves CFG's quality benefits while promoting diverse outputs through reinforcement learning (RL).
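To make the cost concrete, here is a minimal sketch of one CFG decoding step for a token-based generator, written in PyTorch. The `model` interface, the `prompt_emb`/`null_emb` arguments, and the guidance scale `gamma` are illustrative assumptions rather than the paper's actual API; the point is that every decoding step needs two forward passes.

```python
import torch

@torch.no_grad()
def cfg_next_token_logits(model, tokens, prompt_emb, null_emb, gamma=3.0):
    """One CFG decoding step (illustrative interface, not the paper's code).

    The two forward passes per step, one conditioned on the prompt and one
    unconditional, are what double the inference cost of CFG.
    """
    cond_logits = model(tokens, conditioning=prompt_emb)    # with the prompt
    uncond_logits = model(tokens, conditioning=null_emb)    # prompt dropped
    # Extrapolate away from the unconditional prediction toward the
    # conditional one; gamma > 1 strengthens prompt adherence.
    return uncond_logits + gamma * (cond_logits - uncond_logits)
```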
Methodology
The primary contributions lie in three core areas, each illustrated with a short code sketch after the list:
- CFG Distillation for Quality: The paper introduces a distillation objective that trains the model to mimic CFG-augmented predictions. By minimizing the KL divergence between the CFG-augmented teacher distribution and the student model's own predictions, the method removes the dependence on CFG at inference and thus its computational overhead.
- Reinforcement Learning for Diversity: A diversity reward embedded in an RL framework encourages variation across generations. Diversity is measured by embedding generations for the same prompt and computing the negative cosine similarity between the embeddings, which rewards varied content for a single prompt without sacrificing quality.
- Model Merging for Quality-Diversity Trade-off: By linearly interpolating between the weights of a quality-focused model and a diversity-focused model, the methodology allows the quality-diversity balance to be adjusted at deployment time. This weight-based merging strategy is informed by recent insights into linear mode connectivity.
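As a rough illustration of the first contribution, the sketch below assumes logit-space CFG with guidance scale `gamma` and expresses the distillation loss as a KL divergence between the CFG-augmented teacher distribution and the student; the function name, tensor shapes, and default `gamma` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def cfg_distillation_loss(student_logits, cond_logits, uncond_logits, gamma=3.0):
    """Distill CFG-augmented predictions into the student (illustrative).

    All logits have shape (batch, seq_len, vocab); cond/uncond logits come
    from the frozen base model with and without the prompt.
    """
    # CFG-augmented teacher in logit space.
    teacher_logits = uncond_logits + gamma * (cond_logits - uncond_logits)
    teacher_logprobs = F.log_softmax(teacher_logits, dim=-1)
    student_logprobs = F.log_softmax(student_logits, dim=-1)
    # KL(teacher || student); "batchmean" divides the summed KL by batch size.
    return F.kl_div(student_logprobs, teacher_logprobs,
                    reduction="batchmean", log_target=True)
```

Once trained with such an objective, the student produces CFG-quality predictions from a single forward pass.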
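The diversity reward of the second contribution can be sketched as the average negative cosine similarity over all pairs of embedded generations for one prompt; the embedding model is abstracted away, and the function below is an illustrative approximation rather than the paper's exact reward.

```python
import torch
import torch.nn.functional as F

def diversity_reward(embeddings):
    """Negative mean pairwise cosine similarity (illustrative sketch).

    embeddings: (n_generations, dim) tensor, one row per generation for the
    same prompt, e.g. from a pretrained audio/music embedding model.
    Assumes n_generations >= 2.
    """
    z = F.normalize(embeddings, dim=-1)
    sim = z @ z.T                          # pairwise cosine similarities
    n = z.shape[0]
    # Average over off-diagonal pairs (exclude self-similarity).
    mean_sim = (sim.sum() - sim.diagonal().sum()) / (n * (n - 1))
    return -mean_sim                       # higher reward = more diverse set
```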
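Finally, the model merging step amounts to linear interpolation of two finetuned checkpoints; below is a minimal sketch over PyTorch-style state dicts, where `alpha` is the deployment-time knob (the names and the assumption of identical architectures are illustrative).

```python
def merge_weights(quality_state_dict, diversity_state_dict, alpha=0.5):
    """Linearly interpolate two checkpoints with identical architectures.

    alpha=1.0 recovers the quality-focused model, alpha=0.0 the
    diversity-focused one; intermediate values trade the two off at
    deployment time without any retraining.
    """
    return {
        name: alpha * quality_state_dict[name]
              + (1.0 - alpha) * diversity_state_dict[name]
        for name in quality_state_dict
    }
```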
Experimental Results
The method was evaluated on text-to-music generation with a model based on MusicLM. The findings indicate that the proposed strategy surpasses standard CFG on the quality-diversity Pareto front: CFG distillation improves quality while the RL diversity reward maintains variety. Human evaluations corroborated these results, rating the outputs higher on the quality-diversity trade-off than those of CFG-augmented models.
Quality-diversity trade-off plots show that linearly interpolating model weights yields a robust front of solutions that surpasses CFG alone. Notably, model merging delivered strong improvements, uncovering operating points that balance quality and diversity more effectively.
Implications and Future Directions
Practically, this research offers an approach for creative AI applications where balancing fidelity and variety is essential. Theoretically, it contributes to understanding how distillation and RL can complement each other to improve generative model performance. Future work could extend the methodology to other generative domains (e.g., text, image) and test alternative diversity measures or more nuanced model merging strategies.
Conclusion
"Diversity-Rewarded CFG Distillation" successfully addresses key constraints in generative models using CFG by introducing a method that retains quality and promotes diversity without the computational overhead traditionally required at inference. The approach presents promising avenues for deploying AI in creative tasks, underscoring the interplay between distillation, reinforcement learning, and model merging to achieve optimized generative performance.