UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning

Published 7 Apr 2026 in cs.AI | (2604.05517v1)

Abstract: A fundamental challenge in creative writing lies in reconciling the inherent tension between maintaining global coherence in long-form narratives and preserving local expressiveness in short-form texts. While long-context generation necessitates explicit macroscopic planning, short-form creativity often demands spontaneous, constraint-free expression. Existing alignment paradigms, however, typically employ static reward signals and rely heavily on high-quality supervised data, which is costly and difficult to scale. To address this, we propose UniCreative, a unified reference-free reinforcement learning framework. We first introduce AC-GenRM, an adaptive constraint-aware reward model that dynamically synthesizes query-specific criteria to provide fine-grained preference judgments. Leveraging these signals, we propose ACPO, a policy optimization algorithm that aligns models with human preferences across both content quality and structural paradigms without supervised fine-tuning and ground-truth references. Empirical results demonstrate that AC-GenRM aligns closely with expert evaluations, while ACPO significantly enhances performance across diverse writing tasks. Crucially, our analysis reveals an emergent meta-cognitive ability: the model learns to autonomously differentiate between tasks requiring rigorous planning and those favoring direct generation, validating the effectiveness of our direct alignment approach.

Summary

  • The paper introduces a novel reference-free RL approach that unifies long-form planning with short-form spontaneous generation.
  • It employs an adaptive reward model (AC-GenRM) and group-based optimization to enhance both structural coherence and creative expressiveness.
  • Empirical results show significant gains in narrative structuring and expressive output, highlighting emergent meta-cognitive capabilities.

UniCreative: Unifying Paradigms for Creative Text Generation via Reference-Free RL

Introduction

The paper "UniCreative: Unifying Long-form Logic and Short-form Sparkle via Reference-Free Reinforcement Learning" (2604.05517) addresses a core challenge in creative text generation: reconciling the divergent requirements of long-form logical coherence and short-form expressive spontaneity. Existing LLM alignment is limited by its reliance on static pipelines and high-quality supervised reference data, which are costly to scale and poorly suited to tasks with inherently subjective evaluation signals. The authors propose a unified, reference-free reinforcement learning (RL) framework in which the model adaptively selects between Plan-then-Write and Direct Generation paradigms, and they report emergent meta-cognitive capabilities as a result.

Figure 1: Examples of UniCreative generations. The long-form task (left) follows a Plan-then-Write procedure, while the short-form task (right) employs direct generation without intermediate planning.

Methodological Framework

Dual-Mode Creative Generation

UniCreative decomposes creative writing into two regimes: tasks requiring macroscopic planning for long-range consistency and those favoring high-entropy, stochastic expressiveness. The system introduces an explicit switch between them, using a Plan-then-Write mode for narratives that require hierarchical reasoning and Direct Generation for tasks where planning induces over-determination and linguistic homogenization.

Figure 2: The UniCreative architecture selects between planning or direct modes per task and optimizes via ACPO using feedback from the generative reward model.

Adaptive Constraint-Aware Reward Modeling

Central to UniCreative is AC-GenRM, a generative reward model performing dynamic criteria synthesis and debiased pairwise judging. Unlike prior static, scalar reward frameworks, AC-GenRM synthesizes query-specific evaluation dimensions $C_x$ based on each prompt and trains with symmetrical data augmentation to enforce position-invariant quality discrimination. This alleviates known biases in LLM judges and enhances alignment with expert creative standards.
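To make the idea of position-invariant judging concrete, here is a minimal sketch of symmetric augmentation for pairwise preference data. It assumes that "symmetrical data augmentation" means adding a mirrored copy of each comparison with the two responses swapped and the label flipped; the source does not spell out the exact procedure. The names `PairwiseExample`, `symmetric_augment`, and `debiased_preference` are hypothetical, and the order-swapped check at judging time is an illustrative addition rather than something the paper is confirmed to do.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PairwiseExample:
    """One comparison shown to a judge (hypothetical schema)."""
    prompt: str
    response_a: str
    response_b: str
    preferred: str  # "A" or "B"


def symmetric_augment(examples: List[PairwiseExample]) -> List[PairwiseExample]:
    """Return each example followed by its position-swapped mirror."""
    augmented = []
    for ex in examples:
        augmented.append(ex)
        flipped = "B" if ex.preferred == "A" else "A"
        # Swap the slots and flip the label so the same response still wins.
        augmented.append(
            PairwiseExample(ex.prompt, ex.response_b, ex.response_a, flipped)
        )
    return augmented


def debiased_preference(
    judge: Callable[[str, str, str], str], prompt: str, first: str, second: str
) -> str:
    """Query the judge in both orders; report a tie if the verdicts conflict.

    `judge(prompt, a, b)` is assumed to return "A" or "B" for the slot it prefers.
    """
    forward = judge(prompt, first, second)
    backward = judge(prompt, second, first)
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    return "tie"  # the judge followed position rather than content
```

A judge that always picks slot "A" yields "tie" under this check, which is the behavior one would want a position-invariant reward model to avoid.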

Reference-Free Reinforcement Learning: ACPO

The Adaptive Constraint Preference Optimization (ACPO) method eschews SFT and external ground-truth references. Instead, it leverages group-based internal rollouts, combining three orthogonal RL signals:

  • Relative rewards from self-play driven by AC-GenRM judgments
  • Paradigm-aware structural penalties to enforce contextually appropriate cognitive regimes
  • Adaptive length regularization to prevent mode collapse—either content starvation in long-form or verbosity in short-form
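The three signals listed above could be combined into a single scalar reward roughly as follows. This is a sketch under stated assumptions, not the paper's formula: the additive form, the penalty weights, the `plan_expected` flag, the length band `[lo, hi]`, and whitespace token counting are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled response within a group (hypothetical schema)."""
    text: str
    used_plan: bool   # whether the rollout emitted an explicit plan first
    win_rate: float   # share of pairwise comparisons won in its group, in [0, 1]


def structural_penalty(used_plan: bool, plan_expected: bool, weight: float = 0.5) -> float:
    """Penalize a rollout whose paradigm does not match the one assumed for the task."""
    return -weight if used_plan != plan_expected else 0.0


def length_penalty(n_tokens: int, lo: int, hi: int, weight: float = 0.5) -> float:
    """Zero inside [lo, hi]; linear in the relative deviation outside, capped at `weight`.

    Assumes lo > 0 and hi >= lo.
    """
    if n_tokens < lo:
        gap = (lo - n_tokens) / lo    # too short: content starvation
    elif n_tokens > hi:
        gap = (n_tokens - hi) / hi    # too long: verbosity
    else:
        return 0.0
    return -weight * min(gap, 1.0)


def total_reward(rollout: Rollout, plan_expected: bool, lo: int, hi: int) -> float:
    """Sum the relative reward and both penalties (assumed additive combination)."""
    n_tokens = len(rollout.text.split())  # whitespace split as a crude length proxy
    return (
        rollout.win_rate
        + structural_penalty(rollout.used_plan, plan_expected)
        + length_penalty(n_tokens, lo, hi)
    )
```

Keeping the penalties separate from the judged win rate makes each term easy to ablate, which is one plausible reason to structure a reward this way.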

Policy updates use Group Relative Policy Optimization (GRPO), normalizing reward signals within groups to reduce variance and stabilize optimization without auxiliary value models.
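The group normalization step can be illustrated with a short sketch. It shows the standard GRPO-style advantage, where each rollout's reward is centered on its group mean and scaled by the group standard deviation, so no learned value model is needed as a baseline. The function name, the use of the population standard deviation, and the `eps` constant are assumptions for illustration; the paper's exact variant may differ.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps avoids division by zero when every rollout receives the same reward.
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, `group_relative_advantages([1.0, 2.0, 3.0])` gives a negative advantage to the first rollout, zero to the second, and a positive one to the third; a group with identical rewards yields all zeros and therefore no gradient signal.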

Empirical Evaluation

Long-Form Writing and Reasoning

On WritingBench, UniCreative shows consistent improvements over a range of open-source and proprietary LLMs. For instance, Qwen3-8B-Thinking + RL achieves a competitive average score (82.42), rivalling Claude-Sonnet-3.7 and DeepSeek-R1, and surpasses much larger instruction-tuned baselines (e.g., Llama-3.3-70B-Instruct, Qwen-2.5-72B) by margins exceeding 20–30 points. Notably, RL alignment improves compliance with structurally intensive requirements (style, format, length), overcoming the structural drift and topic degradation typical of base autoregressive models.

Short-Form Creativity and Over-Determination

On the Blessing short-form benchmark, reference-free RL enables the model to bypass rigid planning when it is detrimental, with Qwen3-8B-Thinking + RL attaining 93.6% “excellent” ratings, matching Claude-Sonnet-4.5. The RL-optimized models gain 17–26% over “thinking” baselines, supporting the claim that adaptive switching restores the linguistic diversity and emotional resonance suppressed by SFT and monolithic planning.

Figure 3: Over-determination, where inappropriate application of planning impairs short-form creative generation.

Figure 4: Structural collapse arises in long-form generation without explicit macroscopic planning, leading to incoherent narratives.

Reward Model Evaluation

AC-GenRM, the learned critic, reaches a strong agreement rate (80.7%) with expert LLM judges, outperforming major discriminative baselines and minimizing verbosity bias. The architecture’s ability to generate transparent, human-interpretable evaluation rationales (Figure 5) represents a substantive advance in explainable reward modeling.

Figure 5: AC-GenRM dynamically generates task-adaptive criteria for each evaluation instance.
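For readers unfamiliar with the metric, the agreement rate reported above can be read as the fraction of comparison pairs on which the reward model and the expert judge pick the same winner. The sketch below assumes that definition; the function name and the "A"/"B" label encoding are hypothetical, and the paper may handle ties or multiple judges differently.

```python
from typing import Sequence


def agreement_rate(model_verdicts: Sequence[str], expert_verdicts: Sequence[str]) -> float:
    """Fraction of pairs where the reward model and the expert choose the same response."""
    if len(model_verdicts) != len(expert_verdicts):
        raise ValueError("verdict lists must have the same length")
    if not model_verdicts:
        raise ValueError("at least one comparison is required")
    matches = sum(m == e for m, e in zip(model_verdicts, expert_verdicts))
    return matches / len(model_verdicts)
```

With verdicts `["A", "B", "A", "A"]` from the model and `["A", "B", "B", "A"]` from the expert, three of four pairs match, so the rate is 0.75.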

Emergent Meta-Cognition

A salient contribution is the emergence of robust task regime discrimination in models with adequate capacity. On Mode Discrimination Benchmarks, models at the 4B and 8B scale attain up to 96% accuracy in selecting the optimal generation strategy per prompt, a property absent in sub-2B models due to parameter bottlenecks. This empirical trend suggests that meta-cognitive inference—the ability to map latent prompt complexity to the correct computational pathway—emerges as a function of model size and is not encoded by SFT or static RL alignment alone.

Analysis and Implications

The results have far-reaching theoretical and practical implications. The framework operationalizes a move from static, reference-anchored alignment to dynamic, reward-driven self-organization—aligning with the “reward modeling as reasoning” trend in RLHF research. By decoupling the creative feedback signal from expensive, often ill-defined ground-truth annotations, UniCreative dramatically improves scaling for open-ended, subjective tasks.

On the practical front, the reference-free dual-mode RL paradigm is well-positioned for large-scale, cost-effective model alignment in domains where ground-truth is ambiguous (ideation, artistic writing, “sparkle-first” applications). The transparency of AC-GenRM criteria generation offers the additional benefit of interpretability, enhancing trust in deployed generative systems.

However, significant model capacity is required for stable emergent behavior; lightweight models fail to balance structural logic and linguistic vibrancy. In addition, the authors acknowledge limitations on intermediate-length (“gray area”) tasks and note the high computational overhead of ultra-long-context RL training, which poses challenges for practitioners with limited hardware.

Future Directions

Potential extensions include:

  • Soft or hierarchical planning strategies for intermediate regimes
  • Reducing RL sample and compute requirements via more efficient group-based updates
  • Fusing dynamic reward modeling with cross-domain planning for broader creative task generalization
  • Exploiting meta-cognitive signals for prompt-adaptive LLMs in diverse, open-ended tasks

Conclusion

UniCreative formalizes, for the first time, a scalable RL-based approach to unifying long-form structural reasoning and short-form spontaneity in creative text generation, without reliance on SFT or ground-truth completions. The joint adoption of AC-GenRM and ACPO enables interpretability, RL curriculum alignment, and an emergent capacity for meta-cognitive task differentiation. These innovations provide a new foundation for scalable, adaptive LLM alignment in subjective, high-entropy creative domains.
