RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing

Published 26 Aug 2025 in cs.AI and cs.CL | (2508.18642v2)

Abstract: LLMs are extensively utilized in creative writing applications. Creative writing requires a balance between subjective writing quality (e.g., literariness and emotional expression) and objective constraint following (e.g., format requirements and word limits). Existing methods find it difficult to balance these two aspects: single reward strategies fail to improve both abilities simultaneously, while fixed-weight mixed-reward methods lack the ability to adapt to different writing scenarios. To address this problem, we propose Reinforcement Learning with Mixed Rewards (RLMR), utilizing a dynamically mixed reward system from a writing reward model evaluating subjective writing quality and a constraint verification model assessing objective constraint following. The constraint following reward weight is adjusted dynamically according to the writing quality within sampled groups, ensuring that samples violating constraints get negative advantage in GRPO and thus penalized during training, which is the key innovation of this proposed method. We conduct automated and manual evaluations across diverse model families from 8B to 72B parameters. Additionally, we construct a real-world writing benchmark named WriteEval for comprehensive evaluation. Results illustrate that our method achieves consistent improvements in both instruction following (IFEval from 83.36% to 86.65%) and writing quality (72.75% win rate in manual expert pairwise evaluations on WriteEval). To the best of our knowledge, RLMR is the first work to combine subjective preferences with objective verification in online RL training, providing an effective solution for multi-dimensional creative writing optimization.

Authors (9)

Summary

The paper presents RLMR, a novel reinforcement learning framework that dynamically balances subjective writing quality with constraint adherence.
RLMR builds on Group Relative Policy Optimization (GRPO), adjusting the constraint penalty within each sampled group so that constraint-violating responses receive negative advantages while writing quality is still rewarded.
Experiments across models from 8B to 72B parameters show improved instruction following and overall performance over fixed-weight baselines.

Reinforcement Learning with Mixed Rewards for Creative Writing

Introduction

The study titled "RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing" (2508.18642) addresses the challenges encountered when deploying LLMs in creative writing. Creative writing demands balancing subjective qualities like literariness and emotional expression with objective constraints like format requirements and word limits. Single-reward strategies are inadequate because they cannot improve both abilities at once, and fixed-weight mixed-reward strategies fall short because they cannot adapt to different writing scenarios. The paper introduces Reinforcement Learning with Mixed Rewards (RLMR), which dynamically mixes the signal from a writing reward model (subjective quality) with the signal from a constraint verification model (objective constraint following).

Figure 1: Comparison of single reward strategy versus our mixed RLMR approach. Given a task requiring an advertising slogan starting with "X" using no more than 15 words, Response A follows constraints but scores lower (8.9), while Response B violates constraints but scores higher (9.0). Single reward strategies incorrectly prefer Response B, while our RLMR combines writing quality and instruction following signals to correctly identify Response A as superior through dynamic penalty adjustments.

RLMR adjusts the weight of the constraint-following reward dynamically, based on the writing quality scores within each sampled group, so that samples violating constraints receive a negative advantage and are penalized during training. The method builds on Group Relative Policy Optimization (GRPO) for online reinforcement learning, in which each response is scored relative to the other responses sampled for the same prompt.
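For reference, GRPO's group-relative advantage is commonly written as shown here. The notation ($G$ for the group size, $R_i$ for the final mixed reward of response $o_i$) is assumed for illustration rather than taken from the paper.

$$
A_i = \frac{R_i - \operatorname{mean}(R_1, \dots, R_G)}{\operatorname{std}(R_1, \dots, R_G)}
$$

Because the standard deviation is positive whenever the rewards differ, the sign of $A_i$ depends only on whether $R_i$ lies above or below the group mean. Guaranteeing a negative advantage for a violating sample therefore amounts to pushing its mixed reward below that mean.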

RLMR Framework

The RLMR framework integrates subjective and objective evaluations through two distinct models: a writing reward model and a constraint verification model. The writing reward model, trained on human preference pairs, assesses subjective elements such as literary expression and coherence. The constraint verification model checks adherence to the prescribed requirements and combines the individual checks by logical conjunction, so a response counts as compliant only when every constraint is satisfied.
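The conjunction over constraints can be illustrated with a minimal sketch. It uses simple rule-based checks as a stand-in for the paper's constraint verification model, whose internals are not described here; the helper names `starts_with`, `max_words`, and `satisfies_all` are hypothetical, and the two constraints are borrowed from the slogan task in Figure 1.

```python
from typing import Callable, List

# A constraint is any function that maps a response to True (satisfied) or False.
Constraint = Callable[[str], bool]


def starts_with(prefix: str) -> Constraint:
    """Hypothetical check: the response must begin with the given prefix."""
    return lambda text: text.strip().startswith(prefix)


def max_words(limit: int) -> Constraint:
    """Hypothetical check: the response must contain at most `limit` words."""
    return lambda text: len(text.split()) <= limit


def satisfies_all(text: str, constraints: List[Constraint]) -> bool:
    """Logical conjunction: compliant only if every single constraint holds."""
    return all(check(text) for check in constraints)


# Constraints modeled on the Figure 1 task: start with "X", use at most 15 words.
slogan_constraints = [starts_with("X"), max_words(15)]
short_ok = satisfies_all("X marks the spot for fresh flavor.", slogan_constraints)
wrong_start = satisfies_all("Fresh flavor, marked with an X.", slogan_constraints)
```

A binary result like this is what makes the signal "objective": a response either passes every check or it does not, regardless of how well it is written.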

Figure 2: Overview of our Dynamic Mixed-Reward GRPO Framework. The policy model generates responses ( $o_1, o_2, o_3$ ) evaluated by both writing quality (Writing RM) and constraint compliance (Validator).

To overcome the limitations of fixed-weight strategies, RLMR computes the penalty for constraint violations dynamically within each sampled group. This ensures that samples violating constraints receive negative advantages during training, so the policy is optimized for both writing quality and instruction following. The penalty term $\delta$ is recomputed per group, so it is only as large as needed to enforce constraint adherence and does not otherwise override the writing quality signal.
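As a rough sketch of how such a penalty could be computed, assume the mixed reward is the writing score minus $\delta$ for violating samples only. This additive form, the `margin` parameter, and all function names are assumptions for illustration, not the paper's formula. With $n$ samples of which $k$ violate constraints, a violating sample with writing score $r_i$ falls below the group mean when $\delta > (r_i - \bar{r}) \cdot n / (n - k)$, so the code applies that bound to the highest-scoring violator.

```python
from statistics import mean, pstdev
from typing import List


def dynamic_penalty(quality: List[float], compliant: List[bool],
                    margin: float = 0.05) -> float:
    """Smallest additive penalty (plus a margin) that puts every violating
    sample below the group mean of the mixed rewards. Illustrative only."""
    n = len(quality)
    violators = [q for q, ok in zip(quality, compliant) if not ok]
    k = len(violators)
    # No violators: nothing to penalize. All violators: a uniform shift
    # cannot change group-relative advantages, so no penalty can help.
    if k == 0 or k == n:
        return 0.0
    needed = (max(violators) - mean(quality)) * n / (n - k)
    return max(0.0, needed + margin)


def mixed_rewards(quality: List[float], compliant: List[bool],
                  delta: float) -> List[float]:
    """Subtract the penalty from violating samples only."""
    return [q if ok else q - delta for q, ok in zip(quality, compliant)]


def grpo_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Group-relative advantage: reward standardized within the group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Scores from the Figure 1 example: A complies (8.9), B violates (9.0).
quality = [8.9, 9.0]
compliant = [True, False]
delta = dynamic_penalty(quality, compliant)
advantages = grpo_advantages(mixed_rewards(quality, compliant, delta))
```

With the Figure 1 scores, Response A ends up with a positive advantage and Response B with a negative one, even though B has the higher raw writing score. A fixed penalty smaller than the required bound would leave B above the group mean, which is the failure mode the dynamic adjustment is meant to avoid.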

Experimental Evaluation

Experiments across multiple model families and scales evaluate the RLMR framework with both automated and manual evaluations.

Automated Evaluation:

The framework was tested on models ranging from 8B to 72B parameters. On IFEval, instruction following improved from 83.36% to 86.65%. On both IFEval and WriteEval, a real-world writing benchmark constructed by the authors, RLMR outperformed the single-reward and fixed-weight baselines, indicating that it balances writing quality with constraint adherence better than either.

Manual Evaluation:

In human evaluation, RLMR scored higher than the baselines in Instruction Following, Content Quality, and Overall Performance. In expert pairwise comparisons on WriteEval, it reached a 72.75% win rate for writing quality.

Figure 3: Pairwise comparison results for Overall Performance. "Win" indicates the left method outperforms the right method; "tie" indicates comparable performance; "lose" indicates the left method underperforms. The red dashed line represents equal performance (50%). RLMR demonstrates significant advantages over both baseline methods.
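To make the win/tie/lose reading concrete, here is a minimal sketch of how pairwise judgments could be tallied into rates. The function name `pairwise_rates` and the toy labels are hypothetical and are not data from the paper.

```python
from collections import Counter
from typing import Dict, List


def pairwise_rates(judgments: List[str]) -> Dict[str, float]:
    """Fraction of comparisons labeled win, tie, or lose for the left method."""
    if not judgments:
        raise ValueError("at least one judgment is required")
    counts = Counter(judgments)
    total = len(judgments)
    return {label: counts[label] / total for label in ("win", "tie", "lose")}


# Toy labels for illustration only; each entry is one annotator verdict.
toy_judgments = ["win", "win", "tie", "lose", "win"]
rates = pairwise_rates(toy_judgments)
```

Under this reading, a win rate above 0.5 places the left method past the equal-performance line in the figure.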

Implications and Future Directions

The study provides a viable framework for optimizing LLMs in creative writing tasks by attending to both subjective quality and objective constraint adherence. It suggests potential applications beyond creative writing, including dialogue systems and code generation, where combining multi-dimensional reward signals could offer similar benefits.

Future research could extend the dynamic reward system to other multi-signal environments and study how best to balance subjective preferences against objective constraints in other application domains.

Conclusion

The "RLMR: Reinforcement Learning with Mixed Rewards for Creative Writing" research introduces a dynamic mixed-reward framework for creative writing optimization. By adjusting the constraint penalty within each sampled group, RLMR balances subjective and objective evaluation criteria and improves both instruction following and writing quality over single-reward and fixed-weight baselines. The authors describe it as the first work to combine subjective preferences with objective verification in online RL training.
