- The paper demonstrates that both LLM-as-a-Judge and multi-agent refined rewards significantly boost creative writing in SLMs, with the judge method achieving state-of-the-art excellence rates.
- It details a dual methodology: a reward model trained on preference data curated via multi-agent debate, and a principle-guided adversarial-reflective mechanism that produces discrete binary rewards, with both signals driving policy optimization via GRPO.
- Empirical results, including human-model agreement rates up to 87%, underscore improved training efficiency and reduced dependency on human annotations for creative tasks.
Enhancing Creative Writing in Small LLMs: Comparative Analysis of LLM-as-a-Judge and Multi-Agent Refined Rewards
Introduction
The paper "Igniting Creative Writing in Small LLMs: LLM-as-a-Judge versus Multi-Agent Refined Rewards" (2508.21476) addresses the challenge of augmenting creative writing capabilities in Small LLMs (SLMs), specifically focusing on the generation of Chinese greetings. The work systematically compares two AI-driven reward strategies within a Reinforcement Learning from AI Feedback (RLAIF) framework: (1) a multi-agent system for preference data curation and reward model (RM) training, and (2) a principle-guided LLM-as-a-Judge paradigm, optimized via adversarial training and reflection. The paper demonstrates that both approaches substantially improve creative output over baselines, with the LLM-as-a-Judge method yielding superior generation quality, training efficiency, and reduced reliance on human-annotated data.
Methodological Frameworks
Multi-Agent Rejection Sampling and Reward Model Training
The multi-agent framework operationalizes collaborative evaluation by decomposing the assessment process into specialized agents: Retrieval, Positive, Negative, Judge, and Reflect. The Retrieval Agent provides contextual grounding via high-quality prompt-response pairs. The Positive and Negative Agents conduct adversarial debate, surfacing strengths and weaknesses of candidate responses. The Judge Agent synthesizes these perspectives, while the Reflect Agent performs error analysis and rectification, ensuring logical consistency and completeness.
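As a rough illustration of how such a debate-then-judge pipeline might be orchestrated, the sketch below wires the five roles together around a generic `call_llm` helper; the helper and all prompt wording are hypothetical stand-ins, not the paper's implementation.

```python
# Minimal sketch of a debate-style evaluation pipeline. `call_llm` and the prompt
# wording are hypothetical placeholders, not the paper's actual agents or prompts.
def call_llm(prompt: str) -> str:
    """Placeholder for any chat-completion backend."""
    raise NotImplementedError

def evaluate_candidate(query: str, response: str, retrieved_examples: list[str]) -> str:
    context = "\n".join(retrieved_examples)                         # Retrieval Agent: contextual grounding
    pros = call_llm(f"Context:\n{context}\nQuery: {query}\nResponse: {response}\n"
                    f"List the strengths of this response.")        # Positive Agent
    cons = call_llm(f"Context:\n{context}\nQuery: {query}\nResponse: {response}\n"
                    f"List the weaknesses of this response.")       # Negative Agent
    verdict = call_llm(f"Strengths:\n{pros}\nWeaknesses:\n{cons}\n"
                       f"Synthesize a final preference judgment.")  # Judge Agent
    return call_llm(f"Judgment:\n{verdict}\n"
                    f"Check for logical gaps or omissions and revise if needed.")  # Reflect Agent
```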
This multi-agent system generates high-fidelity preference data, which is used to train a scalar reward model via LoRA-based fine-tuning on a backbone LLM. The reward model is optimized using the Bradley–Terry loss, enabling it to predict nuanced preferences for creative writing tasks.
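Concretely, the Bradley–Terry objective reduces to a pairwise logistic loss on the scalar rewards assigned to the chosen and rejected responses of each preference pair. A minimal PyTorch sketch of just the loss (the reward head and training loop are omitted):

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: -log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with scalar rewards the reward model might assign to three preference pairs.
chosen = torch.tensor([1.2, 0.4, 0.9])
rejected = torch.tensor([0.3, 0.5, -0.1])
loss = bradley_terry_loss(chosen, rejected)
```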
Figure 1: Two distinct reward signals: Signal 1 from a multi-agent system yielding a reward model, and Signal 2 from adversarial interaction and reflection; both drive policy optimization with GRPO.
Principle-Guided LLM-as-a-Judge with Adversarial Optimization
The LLM-as-a-Judge paradigm leverages a powerful LLM as a direct reward provider, guided by explicit creative writing principles. The reward function is optimized through adversarial training involving a Generator (producing challenging "bad" responses) and a Detector (discriminating response quality). A Reflector module further enhances the Detector's reliability by providing supervised feedback on misclassifications, grounding the learning process with true-labeled data.
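The loop can be summarized schematically as below; each callable abstracts one of the roles just described, and the composition is an illustrative reading of the mechanism rather than the paper's code.

```python
from typing import Callable

def adversarial_reflective_round(
    eval_prompt: str,
    labeled_pool: list[tuple[str, int]],                   # (response, true label), 1 = good, 0 = bad
    generate_bad: Callable[[str], list[str]],              # Generator: challenging "bad" responses
    detect: Callable[[str, str], int],                     # Detector: binary verdict under the current prompt
    reflect: Callable[[list[tuple[str, int, int]]], str],  # Reflector: feedback on misclassifications
    revise: Callable[[str, str], str],                     # update the principle-guided evaluation prompt
) -> str:
    """One illustrative round: probe the Detector, collect its mistakes, refine the prompt."""
    candidates = [(r, 0) for r in generate_bad(eval_prompt)] + labeled_pool
    mistakes = []
    for text, gold in candidates:
        pred = detect(eval_prompt, text)
        if pred != gold:
            mistakes.append((text, gold, pred))
    return revise(eval_prompt, reflect(mistakes))
```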
This approach produces a discrete binary reward signal, which is directly used for policy optimization via the GRPO algorithm. The adversarial-reflective loop iteratively refines the evaluation prompt and the Detector's discriminative capacity.
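Assuming the standard GRPO formulation, the binary verdicts for a group of completions sampled from the same prompt are converted into group-relative advantages by standardizing within the group; a sketch of that step (the clipped policy-gradient update itself is omitted):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantages for rewards of shape (num_prompts, group_size)."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

# Example: eight completions for one prompt, each given a pass/fail verdict by the judge.
verdicts = torch.tensor([[1., 0., 1., 1., 0., 1., 0., 1.]])
advantages = grpo_advantages(verdicts)  # positive for passing completions, negative otherwise
```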
Experimental Design and Evaluation
Task and Dataset Construction
The paper focuses on the culturally rich domain of Chinese greetings, leveraging datasets for retrieval augmentation, reward model training, policy optimization, and final evaluation. The evaluation rubric encompasses five weighted dimensions: Language Quality, Creativity, Emotional Resonance, Cultural Appropriateness, and Content Richness. Human experts, all native Chinese speakers with graduate-level education, provide rigorous multi-dimensional assessments.
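To make the weighted rubric concrete, the snippet below aggregates per-dimension scores into a single value; the dimension names follow the paper, but the weights and the 1-5 scale are illustrative assumptions.

```python
# Hypothetical weights for illustration only; the paper's actual weighting is not reproduced here.
RUBRIC_WEIGHTS = {
    "language_quality": 0.25,
    "creativity": 0.25,
    "emotional_resonance": 0.20,
    "cultural_appropriateness": 0.15,
    "content_richness": 0.15,
}

def weighted_rubric_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (e.g. on a 1-5 scale) into one weighted score."""
    assert set(scores) == set(RUBRIC_WEIGHTS)
    return sum(RUBRIC_WEIGHTS[dim] * value for dim, value in scores.items())
```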
Automated and Human Evaluation Alignment
Empirical results indicate strong alignment between automated evaluation frameworks and human judgments, with agreement rates consistently exceeding 70% and peaking at 87% for the multi-agent system.
Figure 2: Agreement rate comparison between models and human evaluators under two evaluation frameworks.
Comparative Results and Ablation Analysis
Reward Model + RL vs. LLM-as-a-Judge + RL
Both reward strategies significantly outperform SFT-only baselines and mainstream LLMs in generating high-quality greetings. The SFT+RM+RL pipeline achieves notable gains in excellence rates across evaluation metrics. However, the LLM-as-a-Judge + RL approach achieves state-of-the-art performance, with excellence rates of 92.4%, 96.6%, and 95.0% across three metrics, surpassing GPT-4o, Ernie-4.5, and DeepSeek-V3.
Training efficiency is markedly higher for LLM-as-a-Judge, which circumvents the complexity and resource demands of multi-agent data curation. The adversarial-reflective optimization yields a robust, principle-guided evaluation prompt, streamlining reward signal generation.
Figure 3: Training metrics of LLM-as-a-Judge + RL, illustrating stable and effective policy optimization.
Ablation Study
Ablation experiments confirm the criticality of each agent in the multi-agent framework. Removal of debate agents leads to catastrophic drops in recall and precision, underscoring the necessity of adversarial debate for balanced assessment. The Reflect Agent is pivotal for error correction and reliability in both frameworks, with its absence causing significant declines in accuracy and F1-score.
Qualitative Analysis
Case studies illustrate the nuanced evaluation process, with Positive and Negative Agents articulating advantages and disadvantages of candidate responses. Prompts for each agent are meticulously designed to elicit comprehensive assessments.
Figure 4: Example of positive and negative agent outputs for a given query and response.
Implementation Considerations
The reward model is implemented using Llama Factory and LoRA, trained on preference pairs with the Bradley–Terry loss. GRPO-based policy optimization is conducted on Qwen2.5-7B-Instruct, with hyperparameters tuned for stability and efficiency. Experiments utilize four NVIDIA A100 GPUs (80 GB each), a hardware footprint modest enough to support real-world deployment.
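For readers who prefer a code-level view, a PEFT-style LoRA configuration in the same spirit is sketched below; the paper itself uses Llama Factory, and the rank, alpha, dropout, and target modules shown here are assumptions rather than its reported hyperparameters.

```python
from peft import LoraConfig, TaskType

# Illustrative LoRA setup for a scalar reward head; all hyperparameter values are
# assumed for this sketch, not the settings reported in the paper.
reward_lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,  # sequence-level scalar output for the reward model
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
)
```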
Implications and Future Directions
The demonstrated efficacy of principle-guided LLM-as-a-Judge for creative writing in SLMs has several practical and theoretical implications:
- Scalability: The reduced dependency on human annotation and streamlined reward signal generation facilitate broader deployment of SLMs in resource-constrained environments.
- Generalizability: While the paper focuses on Chinese greetings, the frameworks are extensible to other creative domains and languages, pending further validation.
- Bias and Subjectivity: The explicit definition of creative principles and multi-agent debate mitigate, but do not eliminate, the risk of embedding cultural or societal biases. Future work should explore adaptive principle sets and deeper reflection mechanisms.
- Model Size and Efficiency: The approaches are validated on 7B-parameter SLMs; exploration of their efficacy on smaller or larger models is warranted.
Conclusion
This paper provides a rigorous comparative analysis of two AI-driven reward strategies for enhancing creative writing in SLMs. Both the multi-agent refined RM and the principle-guided LLM-as-a-Judge approaches yield substantial improvements over baselines, with the latter offering superior generation quality, training efficiency, and scalability. The strong alignment between automated and human evaluations substantiates the viability of these methods for subjective generative tasks. The findings pave the way for efficient, high-quality creative text generation in compact LLMs, with broad applicability across domains and languages.