DeepWriting-20K: Deep Reasoning Dataset

Updated 9 September 2025
  • The paper introduces the Reverse-Engineered Reasoning (REER) paradigm, which reverse-engineers reasoning steps from high-quality solutions to improve coherence in open-ended generation tasks.
  • DeepWriting-20K is a comprehensive dataset covering 25 diverse categories, including creative, academic, and functional writing, to support multi-domain applications.
  • Empirical results show that models like DeepWriter-8B trained on DeepWriting-20K achieve state-of-the-art performance on benchmarks for long-context and creative writing.

DeepWriting-20K is a curated dataset comprising 20,000 deep reasoning trajectories designed to address the challenges of open-ended generation tasks, with a pronounced emphasis on creative writing. It is constructed using the Reverse-Engineered Reasoning (REER) paradigm—a computational approach that infers latent, stepwise reasoning processes by searching backward from high-quality solutions—to equip LLMs with structured, human-like reasoning capabilities. DeepWriting-20K spans a diverse set of open-ended tasks and genres, providing a foundation for training models that achieve both technical and creative excellence in natural language generation.

1. Scope and Composition of DeepWriting-20K

The dataset is sourced from a broad spectrum of open-ended tasks, prioritizing both diversity and depth within each category. The initial (query, solution) pairs are carefully collected to represent various real-world writing situations and genres:

  • Ordinary-life question-answering
  • Academic writing
  • Functional writing
  • Creative writing, with special attention to literature and arts, including creative storytelling and essay writing

In total, DeepWriting-20K encompasses 25 manually nominated categories. This granularity ensures that the dataset addresses not only creative aspects but also the structured, technical reasoning required for disciplines such as academic prose and functional exposition. The category design provides a foundation for LLMs to generalize across both technical and artistic domains.
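The released schema is not reproduced here; purely as illustration, a single record pairing a query and solution with its synthesized trajectory might be laid out as follows (all field names and values are hypothetical, not taken from the dataset itself):

```python
# Hypothetical layout of a single DeepWriting-20K-style record.
# All field names and values are illustrative; the paper does not
# specify the released schema.
example = {
    "category": "creative_writing/short_story",  # one of the 25 nominated categories
    "source": "r/WritingPrompts",                # e.g., Reddit, Project Gutenberg, WildChat
    "query": "Write a short story about a lighthouse keeper who ...",
    "trajectory": "Step 1: establish the narrator's voice. Step 2: ...",  # REER-synthesized z
    "solution": "The lighthouse keeper had not spoken aloud in years. ...",  # high-quality y
}
```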

2. Reverse-Engineered Reasoning (REER) Curation Methodology

The central methodological innovation is the use of the Reverse-Engineered Reasoning paradigm. REER differs fundamentally from dominant approaches like reinforcement learning (RL) and instruction distillation. While RL is hindered by unclear reward signals in open-ended tasks and distillation incurs high computational costs, REER extracts reasoning processes from known, high-quality outputs.

The curation workflow proceeds as follows:

  • Collection of high-quality (query, solution) pairs:
    • Sourced from public writing platforms (e.g., r/WritingPrompts, using upvotes as a quality proxy)
    • Extracted from public-domain literature (classics from Project Gutenberg, with queries derived from initial paragraphs)
    • Incorporated from public instruction tuning resources (notably WildChat and LongWriter6K)
  • Synthesis of deep reasoning trajectories (a sketch of this search loop appears after this list):
    • For each pair, a gradient-free local search algorithm iteratively modifies an initial, imperfect reasoning sequence segment by segment
    • The objective is to minimize the perplexity $\text{PPL}(y \mid x, z)$ of generating the correct solution $y$, given the query $x$ and a candidate trajectory $z$
    • The optimization formula is given by:

    $$z^* = \arg\min_{z \in \mathcal{Z}} \text{PPL}(y \mid x, z)$$

  • Filtering and refinement (a minimal interpretation of both filters is sketched after this list):
    • End-of-Thinking Filtering removes trajectories lacking proper conclusions
    • Repetition Filtering (n-gram based) removes degenerative, repetitive loops
  • Data assembly and blending:
    • The resulting 20,000 trajectories are blended with other public reasoning-process datasets (spanning mathematics, coding, science) to maintain balance and avoid overfitting to any single task domain
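For concreteness, the sketch below shows one way such a perplexity-guided, segment-wise search could be implemented, where $\text{PPL}(y \mid x, z) = \exp\big(-\frac{1}{|y|}\sum_{t=1}^{|y|}\log p_\theta(y_t \mid x, z, y_{<t})\big)$ is computed under a scoring LM. The choice of Qwen3-8B as the scorer, the `<think>` prompt layout, and the `propose_revisions` helper (an LLM-backed proposer of alternative segments) are all assumptions made for illustration, not details reported by the paper:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype=torch.bfloat16)
model.eval()


def ppl_of_solution(query: str, trajectory: str, solution: str) -> float:
    """PPL(y | x, z): perplexity of the solution tokens only, conditioned
    on the query and a candidate reasoning trajectory."""
    prefix = f"{query}\n<think>\n{trajectory}\n</think>\n"  # prompt layout is an assumption
    prefix_len = tokenizer(prefix, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prefix + solution, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Token at position t is predicted from positions < t, so shift by one.
    log_probs = torch.log_softmax(logits[0, :-1].float(), dim=-1)
    targets = full_ids[0, 1:]
    token_lp = log_probs[torch.arange(targets.shape[0]), targets]
    # Keep only log p(y_t | x, z, y_<t); the prefix/solution token
    # boundary is approximate if the tokenizer merges across it.
    solution_lp = token_lp[prefix_len - 1:]
    return math.exp(-solution_lp.mean().item())


def reer_local_search(query, solution, segments, propose_revisions, n_iters=3):
    """Gradient-free, segment-wise local search: greedily accept any
    revised segment that lowers PPL(y | x, z)."""
    best = list(segments)
    best_ppl = ppl_of_solution(query, "\n".join(best), solution)
    for _ in range(n_iters):
        for i in range(len(best)):
            # `propose_revisions` is a hypothetical LLM-backed helper that
            # suggests alternative wordings for segment i.
            for candidate in propose_revisions(best, i):
                trial = best[:i] + [candidate] + best[i + 1:]
                trial_ppl = ppl_of_solution(query, "\n".join(trial), solution)
                if trial_ppl < best_ppl:
                    best, best_ppl = trial, trial_ppl
    return best, best_ppl
```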
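The two filters are described only at the level of their names; a minimal Python interpretation might look as follows, where the end-of-thinking marker string, the n-gram order, and the repetition threshold are assumed values rather than reported settings:

```python
def has_conclusion(trajectory: str, end_marker: str = "</think>") -> bool:
    """End-of-Thinking Filtering: keep only trajectories that reach a
    proper closing marker (the marker string is an assumption)."""
    return end_marker in trajectory


def repetition_ratio(trajectory: str, n: int = 4) -> float:
    """Fraction of duplicated n-grams; values near 1.0 indicate a
    degenerative repetition loop."""
    tokens = trajectory.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return 1.0 - len(set(ngrams)) / len(ngrams)


def keep(trajectory: str, max_repetition: float = 0.2) -> bool:
    """Apply both filters; the 0.2 threshold is an assumed value."""
    return has_conclusion(trajectory) and repetition_ratio(trajectory) <= max_repetition
```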

3. Structural and Functional Characteristics

DeepWriting-20K exhibits several distinguishing structural and functional features:

  • Absence of RL and distillation: It captures deep reasoning for open-ended generative tasks without recourse to reinforcement learning or expensive teacher-model distillation
  • Systematic, gradient-free local search: The use of perplexity as a proxy metric during iterative refinement ensures that intermediate "thinking" steps in trajectories are both granular and directly linked to output quality
  • Breadth and diversity: The broad categorical span, particularly in creative and literary subdomains, supports generalization to diverse writing styles and requirements
  • Intermediate thinking induction: By providing detailed, stepwise reasoning traces, it imparts a strong inductive bias to models, supporting the maintenance of coherence and logical structure in long-context text generation

A plausible implication is that the dataset's structure is particularly suited to research questions involving long-range planning and creativity in LLMs.
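One plausible way to exploit these traces during supervised fine-tuning is to serialize each (query, trajectory, solution) triple so that the model learns to emit its reasoning before the final text; the chat format and `<think>` delimiters below are assumed conventions, not details reported by the paper:

```python
def to_sft_sample(query: str, trajectory: str, solution: str) -> dict:
    """Serialize an (x, z, y) triple into a chat-style fine-tuning sample
    in which the reasoning trace precedes the final text."""
    return {
        "messages": [
            {"role": "user", "content": query},
            {
                "role": "assistant",
                "content": f"<think>\n{trajectory}\n</think>\n{solution}",
            },
        ]
    }
```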

4. Empirical Results: DeepWriter-8B Training and Evaluation

The DeepWriter-8B model was fine-tuned from the Qwen3-8B base model using DeepWriting-20K in conjunction with other public thinking datasets. Empirical evaluations demonstrate considerable performance advantages:

| Model | LongBench-Write Score | WritingBench (Δ over baseline) | HelloBench-Creative |
|---|---|---|---|
| DeepWriter-8B | 91.28 | +18 average | Near GPT-4o |
| GPT-4o | 83.1 | N/A | High |
| Claude 3.5 | 89.3 | N/A | High |
| LongWriter-8B | Reference baseline | Reference baseline | Lower |

Key performance findings:

  • Superior open-source baseline performance: DeepWriter-8B exceeds LongWriter-8B by an average of over 18 points on WritingBench tasks
  • Competitive with proprietary models: On benchmarks such as HelloBench (creative subset), DeepWriter-8B is nearly on par with GPT-4o
  • Exceeding state-of-the-art in targeted metrics: On LongBench-Write, the model achieves a score of 91.28, outperforming both GPT-4o (83.1) and Claude 3.5 (89.3)

This suggests that training on structured, deep reasoning traces confers pronounced advantages in long-range coherence and creativity, partially bridging the gap between open-source and proprietary LLMs.

5. Significance, Limitations, and Research Context

DeepWriting-20K represents one of the first large-scale, open datasets to focus explicitly on human-like deep reasoning for open-ended and creative tasks, without dependence on RL or teacher model distillation. Its methodological approach—backward synthesis using perplexity as an objective—provides a scalable alternative to existing paradigms.

Key advantages include:

  • Data efficiency: The approach offers scalable, gradient-free synthesis, reducing computational overhead compared to large-scale RL or multi-stage distillation
  • Quality-linked structure: Intermediate reasoning steps are evaluated and refined directly by their impact on solution coherence
  • Open-ended task support: Extensive task and genre coverage underpins the dataset's utility in research settings demanding both technical structure and creativity

Overall, DeepWriting-20K provides the foundation for new research directions in modeling human-like reasoning and planning for generative LLMs, as evidenced by quantitative gains in recent benchmarks. A plausible implication is that this dataset may serve as a reference point for subsequent datasets and modeling efforts focused on creativity, long-context reasoning, and stepwise thought processes in natural language generation (Wang et al., 7 Sep 2025).
