Overview of "Modifying LLM Post-Training for Diverse Creative Writing"
The paper "Modifying LLM Post-Training for Diverse Creative Writing" explores refining LLMs to enhance both the quality and diversity of creative text generation. Traditional post-training approaches, such as Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO), often focus primarily on improving the quality of LLM outputs. This frequently results in reduced output diversity, thereby limiting the models' utility in tasks that benefit from creative variance. The authors address this limitation by introducing a training methodology that incorporates deviation measures—quantifying the uniqueness of each training instance relative to others sharing the same prompt—into the optimization process.
The paper proposes diversified variants of DPO and of Odds Ratio Preference Optimization (ORPO), factoring the deviations into the training objectives to enhance diversity while minimally impacting quality. Through experiments and evaluations, the authors show that the resulting models produce diverse yet high-quality creative content, matching human-written datasets in diversity while maintaining quality comparable to leading instruction-tuned models such as GPT-4o and DeepSeek-R1.
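How a deviation score can enter the objective is sketched below as a per-example weight on a standard DPO loss. The multiplicative weighting, the `beta` value, and the use of the chosen response's deviation are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def ddpo_loss(policy_chosen_logps, policy_rejected_logps,
              ref_chosen_logps, ref_rejected_logps,
              deviation, beta=0.1):
    """Deviation-weighted DPO loss (sketch).

    All log-prob tensors have shape (batch,) and hold the summed
    log-likelihood of the chosen / rejected response under the policy
    and the frozen reference model. `deviation` holds each chosen
    response's deviation score (e.g. from deviation_scores above).
    """
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # Standard DPO term: push the chosen response's implicit reward above the rejected one's.
    per_example = -F.logsigmoid(beta * (chosen_ratio - rejected_ratio))
    # Assumed diversification: scale each example's loss by its deviation,
    # so unusual-but-preferred responses contribute more to the gradient.
    return (deviation * per_example).mean()

# Toy usage with random log-probs for a batch of 4 preference pairs.
b = 4
loss = ddpo_loss(torch.randn(b), torch.randn(b),
                 torch.randn(b), torch.randn(b),
                 deviation=torch.rand(b))
print(loss)
```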
Technical Contributions
- Diversified Optimization Techniques: The research introduces Diversified Direct Preference Optimization (DDPO) and Diversified Odds Ratio Preference Optimization (DORPO), which incorporate semantic and style deviations into standard post-training objectives. By using these deviations to weight the loss, the authors emphasize learning from less typical instances that still exhibit high quality.
- Evaluation Metrics: The paper measures semantic and style diversity with embedding-based metrics (a sketch of the mean pairwise-distance computation appears after this list), alongside a reward model trained on transformed user-vote scores to assess writing quality.
- Empirical Validation: Models trained with DDPO and DORPO showed higher semantic and style diversity than conventionally post-trained models without a considerable drop in writing quality, as supported by both automated metrics and human evaluations. Human raters judged outputs from DDPO-trained models to be both higher quality and more diverse than those of non-diversified baselines such as GPT-4o.
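The following is a minimal sketch of the mean pairwise-distance diversity metric referenced above, again assuming cosine distance over output embeddings; the paper's embedding models and distance choice may differ.

```python
import numpy as np

def mean_pairwise_distance(embeddings: np.ndarray) -> float:
    """Corpus-level diversity: mean cosine distance over all output pairs.

    embeddings: (n, d) array of embeddings of a model's outputs for a
    prompt; semantic and style embeddings yield the two reported metrics.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = 1.0 - normed @ normed.T
    iu = np.triu_indices(len(embeddings), k=1)   # each unordered pair once
    return float(dists[iu].mean())
```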
Numerical Results and Implications
- The enhanced models matched the natural diversity found in human-crafted datasets yet sustained output quality on par with contemporary, high-quality models.
- DDPO and DORPO models demonstrated higher diversity scores (measured as mean pairwise distance between output embeddings) without substantive compromises on quality dimensions such as logical coherence and task relevance, which were tested in a controlled human evaluation phase.
These results indicate that integrating deviation measures allows LLMs to explore creative spaces more broadly without significant losses in coherence or human-like output characteristics. The practical implication is a training method that offers end users a broader array of creative outputs, closer to the diversity of human-written text.
Future Directions
Integrating deviation signals into LLM training invites future work that further refines the balance between diversity and quality. Possible avenues include extending these strategies to other domains where diversity is equally crucial, such as dialogue systems, educational technologies, and cultural narratives. Additionally, exploring reward models that better capture the subjective aspects of creative writing could further improve the robustness of both quality and diversity evaluations.
In conclusion, this paper contributes a substantial methodological advance toward artificial creativity, paving the way for future work that unlocks LLMs' full potential in creative writing and other areas requiring balanced output diversity. Such advances should strengthen AI's role as a collaborative tool in creative processes across numerous domains.