LitBench: Creative Writing Benchmark
- LitBench is a standardized benchmark and dataset designed for evaluating creative writing through human-annotated, debiased story comparisons from Reddit.
- It systematically compares zero-shot LLM judges and trained reward models, with the best reward models reaching up to 78% agreement with human literary preferences.
- The benchmark mitigates biases such as length, temporal, and popularity effects to ensure reliable and scalable evaluation of open-ended narrative quality.
LitBench is a standardized benchmark and dataset designed for the reliable evaluation and development of creative writing assessment metrics, specifically targeting open-ended story generation by LLMs. Creative writing tasks challenge LLMs and evaluation frameworks because they lack ground-truth references: multiple valid narratives can be generated for the same prompt, and subjective factors such as narrative flair, emotional impact, and reader engagement are paramount. LitBench addresses these challenges by assembling a large-scale, debiased, human-preference-annotated dataset derived from Reddit’s writing community, and by rigorously benchmarking both zero-shot LLM judges and trained reward models for alignment with human literary judgment (2507.00769).
1. Motivation and Context
Creative writing evaluation presents intrinsic difficulties absent in other language generation tasks such as mathematical reasoning or code synthesis. Unlike domains with explicit verification criteria, story quality is subjective—contingent on creativity, structure, and reader appeal. Manual human evaluation is expensive, inconsistent, and unscalable, while automatic metrics (e.g., BLEU, ROUGE) are known to be insufficient for open-ended narratives. As a result, practitioners have resorted to employing LLMs themselves as zero-shot “judges,” yet these methods often lack empirical reliability and may inherit biases (e.g., length bias, style preference). LitBench is introduced to systematically address these evaluation gaps by providing:
- A curated, debiased, and large-scale dataset tailored to creative writing.
- A held-out human-labeled test set enabling reliable benchmarking.
- Analysis and open-sourcing of both LLM-based and learned (reward model) evaluators.
2. Dataset Construction and Properties
LitBench consists of two principal components: a held-out test set for benchmarking and a larger training corpus for training reward models.
Held-Out Test Set
- 2,480 pairwise, debiased, human-labeled story comparisons, covering 3,543 unique stories (average length: 550 words).
- Sourced entirely from Reddit’s r/WritingPrompts subreddit (post-2023, ensuring no overlap with LLM pretraining datasets).
- Each pair consists of two stories written for the same prompt, with preference annotation derived from Reddit upvotes, supplemented by stringent filters (sketched in code after this list):
  - Only stories with at least 10 upvotes are retained.
  - The upvote differential between the two stories must be at least 25%.
  - The later-posted story must have the higher upvote count, mitigating exposure/time bias.
- Length-agnostic pairing: pairs are histogram-balanced so that chosen stories are not systematically longer than rejected ones.
- Quality control removes stories longer than 2048 tokens or shorter than 50 words.
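A minimal sketch of these pair-level filters, assuming hypothetical field names (`upvotes`, `created_utc`, `text`), a `count_tokens` callable, and a reading of the 25% differential as relative to the lower-scoring story; the released LitBench pipeline may differ in detail:

```python
# Illustrative sketch of the pair-level filters described above.
# Field names (upvotes, created_utc, text) and count_tokens are hypothetical.

def keep_pair(chosen: dict, rejected: dict, count_tokens) -> bool:
    """Return True if a (chosen, rejected) story pair passes LitBench-style filters."""
    # Popularity floor: both stories need at least 10 upvotes.
    if min(chosen["upvotes"], rejected["upvotes"]) < 10:
        return False
    # Upvote differential of at least 25%, read here as relative to the rejected story.
    if chosen["upvotes"] < 1.25 * rejected["upvotes"]:
        return False
    # Temporal filter: the chosen (higher-upvote) story must have been posted later,
    # so its win cannot be explained by earlier exposure.
    if chosen["created_utc"] <= rejected["created_utc"]:
        return False
    # Length bounds: drop stories longer than 2048 tokens or shorter than 50 words.
    for story in (chosen, rejected):
        if len(story["text"].split()) < 50 or count_tokens(story["text"]) > 2048:
            return False
    return True
```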
Training Corpus
- 43,827 pairwise preference labels on 50,309 unique stories, primarily sampled from 2014–2022 Reddit data and licensed under MIT terms for open research.
This construction provides both a robust foundation for reward-model training and a rigorous, debiased held-out set for evaluation.
3. Evaluation Methodologies
LitBench supports and systematically compares several approaches for automatic evaluation of creative writing:
Zero-Shot LLM Judges
- State-of-the-art LLMs (from Anthropic, OpenAI, DeepSeek, etc.) are prompted as neutral judges on unseen story pairs, receiving explicit instructions to assess creative quality.
- To counter prompt and order biases, each judge is run on both pairwise orderings with carefully designed prompt templates (a minimal sketch of this two-order protocol follows).
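A minimal sketch of the two-ordering protocol, assuming a hypothetical `query_judge` wrapper around an LLM chat API that returns "A" or "B"; the prompt wording and tie handling are illustrative, not the exact LitBench templates:

```python
# Illustrative two-ordering judging protocol. query_judge is a hypothetical
# wrapper around an LLM chat API that returns the single letter "A" or "B".

JUDGE_PROMPT = (
    "You are judging two short stories written for the same prompt.\n"
    "Prompt: {prompt}\n\nStory A:\n{a}\n\nStory B:\n{b}\n\n"
    "Which story is better creative writing? Answer with 'A' or 'B' only."
)

def judge_pair(prompt: str, story_1: str, story_2: str, query_judge):
    """Query the judge with both orderings; return 1 or 2 for a consistent winner,
    or None if the verdict flips with presentation order (positional bias)."""
    first = query_judge(JUDGE_PROMPT.format(prompt=prompt, a=story_1, b=story_2))
    second = query_judge(JUDGE_PROMPT.format(prompt=prompt, a=story_2, b=story_1))
    if first == "A" and second == "B":
        return 1  # story_1 preferred in both orderings
    if first == "B" and second == "A":
        return 2  # story_2 preferred in both orderings
    return None   # inconsistent across orderings; treat as a tie downstream
```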
Reward Model Training Strategies
- Bradley-Terry Reward Models: Discriminative models independently score each story in a pair and are fine-tuned so that the chosen/preferred story receives the higher score, minimizing the Bradley-Terry loss -log σ(r(chosen) - r(rejected)) (a minimal training sketch follows this list).
- Generative Reward Models: The model receives the prompt, both stories (A and B), and predicts the preferred label (optionally using a chain-of-thought (CoT) rationale as an intermediate step).
- GenRM (plain) predicts preference directly.
- GenRM-CoT generates a rationale before outputting preference.
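A minimal sketch of Bradley-Terry reward-model training, assuming a `reward_model` interface that maps a (prompt, story) pair to a scalar score; the interface and training-loop details are assumptions, not the released code:

```python
# Bradley-Terry pairwise loss for a discriminative reward model. The
# reward_model(prompts, stories) -> scalar-per-example interface shown in the
# usage comment is an assumption; any scalar-head LM fine-tuning setup fits.
import torch
import torch.nn.functional as F

def bradley_terry_loss(score_chosen: torch.Tensor,
                       score_rejected: torch.Tensor) -> torch.Tensor:
    """-log sigmoid(r_chosen - r_rejected), averaged over the batch."""
    return -F.logsigmoid(score_chosen - score_rejected).mean()

# Usage inside a training step (both score tensors have shape [batch_size]):
#   loss = bradley_terry_loss(reward_model(prompts, chosen_stories),
#                             reward_model(prompts, rejected_stories))
#   loss.backward(); optimizer.step()
```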
Human Validation
- 64 new stories produced by GPT-4.1/4o for 40 prompts are ranked via reward model scores and then compared in an online study in which 46 human annotators (10–13 per story pair) express preferences. Agreement rates measure the models' extrapolative alignment with human literary taste.
4. Key Results and Benchmark Outcomes
Zero-Shot LLM Performance
- The best-performing zero-shot judge, Anthropic Claude-3.7-Sonnet, attains 73% agreement with human annotators on the LitBench held-out test set.
- Other major LLMs, including GPT-4.1 and DeepSeek-R1, reach 70–71%.
- Open-source LLMs of moderate size (Llama-3.1-8B, Qwen-2.5-7B, Gemma-7B) achieve just 56–60%—barely outperforming chance—highlighting the limits of direct zero-shot deployment for nuanced literary quality assessment.
Reward Model Performance
- Both the Bradley-Terry and generative reward models (based on Llama-3.1-8B and Qwen-2.5-7B) attain 78% agreement with human preferences on the held-out test set.
- Notably, chain-of-thought rationales decrease accuracy in the creative writing setting (to 72%), in contrast with code and math evaluation, where CoT generally boosts performance.
- In a direct comparison on newly generated stories, stories preferred by the reward model are chosen by human annotators 57% of the time, versus 41% for reward-model-rejected stories; zero-shot judges yield only chance-level agreement on the same comparison.
Statistical Results Table
| Method | Agreement w/ Human (%) |
|---|---|
| Claude-3.7-Sonnet (OTS) | 73 |
| GPT-4.1 (OTS) | 70 |
| DeepSeek-R1 (OTS) | 71 |
| Bradley-Terry RM (8B) | 78 |
| Generative RM (7B) | 78 |
| GenRM-CoT | 72 |
| Llama-3.1-8B (OTS) | 58 |
| Qwen-2.5-7B (OTS) | 60 |

OTS = off-the-shelf (zero-shot judge); RM = reward model.
5. Bias Mitigation and Data Quality
Evaluation and training datasets are debiased on multiple axes:
- Length bias: Winners and losers are matched in length via histogram balancing to prevent systematic reward of verbosity (one possible balancing procedure is sketched after this list).
- Temporal bias: Chosen stories must be posted later than rejected ones to diminish exposure advantage.
- Popularity filtering: Only comparisons with strong upvote differentials are retained to improve signal-to-noise for community preference as a proxy for literary merit.
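Histogram balancing can be implemented in several ways; one plausible sketch is to subsample pairs so that, within each length bin, pairs where the chosen story is longer and pairs where it is shorter occur equally often. The bin width, binning key, and balancing rule below are assumptions for illustration, not the documented LitBench procedure:

```python
# One possible length-histogram balancing scheme (bin width, binning key, and
# the equal-count rule are assumptions, not the documented LitBench procedure).
import random
from collections import defaultdict

def balance_by_length(pairs, bin_width=100, seed=0):
    """pairs: list of (chosen_text, rejected_text) tuples.
    Returns a subset in which, per word-count bin of the chosen story, pairs
    where the chosen story is longer and pairs where it is shorter are equally
    frequent, so length alone no longer predicts the preference label."""
    rng = random.Random(seed)
    bins = defaultdict(lambda: {"chosen_longer": [], "rejected_longer": []})
    for chosen, rejected in pairs:
        key = len(chosen.split()) // bin_width
        side = "chosen_longer" if len(chosen.split()) >= len(rejected.split()) else "rejected_longer"
        bins[key][side].append((chosen, rejected))
    balanced = []
    for groups in bins.values():
        k = min(len(groups["chosen_longer"]), len(groups["rejected_longer"]))
        balanced += rng.sample(groups["chosen_longer"], k) + rng.sample(groups["rejected_longer"], k)
    return balanced
```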
Qualitative analysis indicates that paired stories are matched in basic fluency and spelling; selection pivots more on features such as emotional impact, original plot device, or clarity (“clever twist” or “emotional punchline” versus “confusing/dry”).
6. Implications, Applications, and Limitations
Implications
- Open-source LLMs at the scales tested (7–8B) are currently insufficient for zero-shot creative writing evaluation at human-level reliability.
- Reward models fine-tuned on carefully filtered, debiased human preference data outperform off-the-shelf LLM judges.
- Chain-of-thought rationales may not improve (and could degrade) creative writing evaluation performance, contrasting with other domains.
- Careful curation—including length and time debiasing, as well as removal of low-signal pairs—is vital for reward model robustness and generalization.
Applications
- LitBench serves as a gold-standard, publicly available test set for evaluating creative-writing reward models.
- Supports training and benchmarking of creative writing verifiers and optimization loops such as RLHF for generative models in the literary space.
- Enables further research into biases and evaluation mechanisms for open-ended text generation.
Limitations and Cautions
- Community upvotes as a proxy for literary quality may imperfectly capture expertise or subtle literary distinction; possible cultural and topical biases remain.
- A plausible implication is that reward model performance on newly generated (LLM-written) stories demonstrates extrapolative power, but generalization to entirely new genres or culture-specific preferences may require future augmentation of data sources.
7. Resources and Accessibility
- Dataset and trained models: https://huggingface.co/collections/SAA-Lab/litbench-68267b5da3aafe58f9e43461
- Benchmark code and scripts: https://github.com/drfein/LitBench/tree/main
- Training data is MIT-licensed; test sets released as Reddit comment IDs for copyright compliance.
- The full methodology, ablation studies, and additional statistical results are detailed in (2507.00769).
LitBench represents a foundational advance in creative writing evaluation for AI, furnishing a robust, vetted resource for systematic measurement, model development, and alignment research in open-ended language generation.