DeepWriter-8B: Fact-Grounded Writing AI
- DeepWriter-8B is an 8-billion-parameter fact-grounded multimodal writing assistant that integrates offline hierarchical knowledge retrieval, deep reasoning, and reflective composition for domain-accurate long-form content.
- It leverages both encoder–decoder and decoder-only architectures to deliver focused, domain-specific outputs and robust open-ended generation.
- Employing multimodal retrieval and iterative reflective drafting, DeepWriter-8B enhances citation accuracy and long-range narrative coherence in both creative and specialized tasks.
DeepWriter-8B is an 8-billion-parameter, fact-grounded multimodal writing assistant that integrates offline hierarchical knowledge retrieval, deep reasoning trajectory learning, and reflective composition to produce coherent, domain-accurate long-form content. It is developed in two variants: an encoder–decoder model for domain-specific writing over offline corpora and a decoder-only model for open-ended generation via reverse-engineered reasoning. DeepWriter-8B demonstrates robust performance in specialized domains and creative tasks, outperforming several open-source and proprietary baselines (Mao et al., 14 Jul 2025, Wang et al., 7 Sep 2025).
1. Model Architectures and Pipeline Designs
DeepWriter-8B comprises two core implementations: a modular encoder–decoder (for knowledge-grounded writing) (Mao et al., 14 Jul 2025) and a decoder-only reasoning-focused model (for open-ended generation) (Wang et al., 7 Sep 2025).
Encoder–Decoder Structure
- Backbone: 48-layer Transformer, hidden size 6,144, 48 attention heads (6.8 B parameters), shared encoder–decoder parameters.
- Task Decomposition & Outline Generator: 400 M parameters, 2-layer adapter stack on encoder, 128-dim classification, 256-dim span-selection.
- Multimodal Retriever & Reranker: 500 M parameters, GME-based dual-tower encoders (text and vision), cross-modal reranking head (2 Transformer layers, hidden size 2,048).
- Section Composer with Reflection: 300 M parameters, adapters every 4 Transformer layers for draft generation, 2-layer MLP discriminator for reflective critique.
Decoder-Only Structure
- Backbone: 32-layer Transformer (Qwen3-8B-Base), hidden size 4096, 32 attention heads, feed-forward inner dimension 16,384, layer-norm, Rotary Position Embeddings, full fine-tune with no adapters or LoRA modules (Wang et al., 7 Sep 2025).
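For reference, the reported backbone hyperparameters of the two variants can be summarized as plain Python dataclasses; the class and field names below are illustrative, not taken from any released code.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class BackboneConfig:
    num_layers: int
    hidden_size: int
    num_heads: int
    ffn_inner: Optional[int] = None      # not reported for the encoder–decoder variant
    shared_enc_dec_params: bool = False

# Encoder–decoder variant (~6.8B backbone parameters, shared encoder/decoder weights)
ENC_DEC = BackboneConfig(num_layers=48, hidden_size=6144, num_heads=48,
                         shared_enc_dec_params=True)

# Decoder-only variant (Qwen3-8B-Base backbone: RoPE, full fine-tune, no adapters/LoRA)
DEC_ONLY = BackboneConfig(num_layers=32, hidden_size=4096, num_heads=32,
                          ffn_inner=16384)
```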
Pipeline Overview
Query flows through rewriting → outline generation → hierarchical retrieval → section drafting → reflection module → final assembly (document, insertion of multimodal elements, and citations) (Mao et al., 14 Jul 2025). For open-ended reasoning-driven generation, the model first emits a <think> block that encodes deep reasoning before the final <answer> output (Wang et al., 7 Sep 2025).
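A minimal orchestration sketch of this pipeline follows; every component interface (rewriter, outliner, retriever, composer, assembler) is an assumed abstraction for illustration, not the released implementation.

```python
class DeepWriterPipeline:
    """Illustrative orchestration only; component interfaces are assumptions."""

    def __init__(self, rewriter, outliner, retriever, composer, assembler):
        self.rewriter, self.outliner = rewriter, outliner
        self.retriever, self.composer, self.assembler = retriever, composer, assembler

    def run(self, query: str) -> str:
        refined = self.rewriter(query)                  # query rewriting
        outline = self.outliner(refined)                # task decomposition → section outline
        sections, citations = [], []
        for topic in outline:
            hits = self.retriever(topic)                # hierarchical doc → page → chunk retrieval
            draft = self.composer.draft(topic, hits, sections)
            draft = self.composer.reflect(draft, hits)  # reflective critique-and-revise loop
            sections.append(draft)
            citations.extend(h.source for h in hits)
        return self.assembler(sections, citations)      # final assembly + multimodal insertion
```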
2. Offline Knowledge Base and Hierarchical Retrieval
For domain-grounded writing, DeepWriter-8B relies on a curated offline corpus and a hierarchical representation:
Corpus Structure
- Document level: metadata (year, title, domain)
- Page level: each PDF page as a “page” item
- Chunk level: ~200-token text splits + captions for tables/images
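A compact sketch of this three-level corpus representation; the field names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Chunk:
    text: str                 # ~200-token split, or a table/image caption
    embedding: List[float]    # dense vector stored in the index
    modality: str = "text"    # "text", "table", or "image"

@dataclass
class Page:
    number: int
    chunks: List[Chunk] = field(default_factory=list)

@dataclass
class Document:
    title: str
    year: int
    domain: str
    pages: List[Page] = field(default_factory=list)
```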
Indexing and Embeddings
- Milvus stores embeddings for each chunk/page/document; inverted keyword index for auxiliary lookup.
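A minimal indexing sketch using the pymilvus MilvusClient API; the collection name, embedding dimension, and metadata fields are assumptions, and the auxiliary inverted keyword index is omitted.

```python
from pymilvus import MilvusClient

# Local-file Milvus instance; the collection layout below is an assumption,
# not the configuration described in the paper.
client = MilvusClient("deepwriter_kb.db")
client.create_collection(collection_name="chunks", dimension=768)

def index_chunk(chunk_id: int, vector: list, text: str, doc_id: int, page: int):
    """Store one chunk embedding plus metadata for hierarchical lookup."""
    client.insert(
        collection_name="chunks",
        data=[{"id": chunk_id, "vector": vector, "text": text,
               "doc_id": doc_id, "page": page}],
    )
```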
Hierarchical Scoring Functions
Ranking proceeds hierarchically: the top-$k$ documents are selected first, then the top-$k$ pages within them, then the top-$k$ chunks, with retrieval at each level scored against the query embedding $q$.
A pseudocode sketch of this cascade is given below. Advantages include robust consistency, fine-grained relevance, and high citation accuracy compared with web-search-based approaches (Mao et al., 14 Jul 2025).
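Assuming cosine scoring at every level; the cutoffs k_doc, k_page, k_chunk and the object attributes are illustrative, not the paper's reported settings.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def hierarchical_retrieve(q_emb, documents, k_doc=5, k_page=3, k_chunk=8):
    """Cascade: score documents, then pages within kept documents, then chunks."""
    docs = sorted(documents, key=lambda d: cosine(q_emb, d.embedding), reverse=True)[:k_doc]
    pages = [p for d in docs for p in d.pages]
    pages = sorted(pages, key=lambda p: cosine(q_emb, p.embedding), reverse=True)[:k_page]
    chunks = [c for p in pages for c in p.chunks]
    return sorted(chunks, key=lambda c: cosine(q_emb, c.embedding), reverse=True)[:k_chunk]
```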
3. Multimodal Retrieval, Reranking, and Fusion Algorithms
DeepWriter-8B incorporates both textual and visual modalities for information retrieval and document synthesis.
Retrieval
- Text tower: dense text embeddings matched by cosine similarity for chunk/page retrieval.
- Vision tower: Qwen2.5-VL for image/table embeddings.
Cross-Modal Reranking
- Two-layer Transformer reranker, optimizing a contrastive ranking loss of the form $\mathcal{L}_{\mathrm{rank}} = -\log \frac{\exp(s(q, c^{+}))}{\sum_{c \in \mathcal{C}} \exp(s(q, c))}$, where $s(\cdot, \cdot)$ is the learned scoring head over query–candidate pairs (see the sketch below).
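As a concrete illustration, the following PyTorch sketch implements a softmax-style contrastive ranking loss over one positive candidate and in-batch negatives; the pair-concatenation scoring head, temperature, and tensor shapes are assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def contrastive_rank_loss(query_emb: torch.Tensor,
                          cand_embs: torch.Tensor,
                          pos_index: int,
                          score_head: torch.nn.Module,
                          temperature: float = 0.05) -> torch.Tensor:
    """query_emb: (d,), cand_embs: (n, d); the candidate at pos_index is the positive."""
    pairs = torch.cat([query_emb.expand_as(cand_embs), cand_embs], dim=-1)  # (n, 2d)
    scores = score_head(pairs).squeeze(-1) / temperature                    # (n,)
    target = torch.tensor(pos_index)
    return F.cross_entropy(scores.unsqueeze(0), target.unsqueeze(0))
```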
Fusion Strategies
- Early fusion: modality embeddings are concatenated at each decoder layer, with cross-attention applied as $\mathrm{Attn}(Q, [K_{\mathrm{text}}; K_{\mathrm{vis}}], [V_{\mathrm{text}}; V_{\mathrm{vis}}])$, where $Q$ is the decoder state.
- Late fusion: a paragraph–visual relevance matrix $R$ determines the optimal placement of visuals alongside paragraphs (see the sketch below).
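The following PyTorch sketch illustrates both strategies under stated assumptions: the model dimension, head count, and the argmax placement rule are illustrative, not the paper's exact design.

```python
import torch
from torch import nn

class EarlyFusionCrossAttention(nn.Module):
    """Decoder-side cross-attention over concatenated text and vision memories."""
    def __init__(self, d_model: int = 2048, n_heads: int = 16):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, hidden, text_mem, vis_mem):
        memory = torch.cat([text_mem, vis_mem], dim=1)   # (B, T_text + T_vis, d)
        fused, _ = self.attn(hidden, memory, memory)     # queries come from the decoder state
        return fused

def place_visuals(par_embs: torch.Tensor, vis_embs: torch.Tensor) -> torch.Tensor:
    """Late fusion: relevance matrix R (paragraphs × visuals); each visual is
    placed after its highest-scoring paragraph."""
    R = par_embs @ vis_embs.T
    return R.argmax(dim=0)        # best paragraph index per visual
```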
This enables robust integration of tables and figures, elevating document quality and factuality (Mao et al., 14 Jul 2025).
4. Reflective Section Composition and Reverse-Engineered Reasoning
Section-Level Reflective Drafting
For specialized writing tasks:
- Drafting: Each outline section is drafted using top-ranked content and prior context.
- Reflection loop: a discriminator critiques each section draft against the retrieved facts; detected discrepancies trigger a targeted revision pass.
Formally, the loop applies Critic and Revise operators as in the snippet below; this feedback loop improves factuality, style, and citation quality (Mao et al., 14 Jul 2025).
```python
# Reflective critique-and-revise step for section draft d_j
errors = Critic(d_j, retrieved_content)   # discriminator checks the draft against retrieved facts
if errors:                                # errors ≠ ∅: factual or stylistic issues detected
    d_j_revised = Revise(d_j, errors)     # targeted revision of the flagged spans
else:
    d_j_revised = d_j                     # draft accepted unchanged
```
Reverse-Engineered Reasoning (REER)
For open-ended generation:
- REER constructs stepwise “thinking” trajectories for every (query, answer) pair using a gradient-free local search over candidate trajectory edits that minimizes perplexity (PPL), iteratively accepting an edit $z \rightarrow z'$ whenever $\mathrm{PPL}(y \mid x, z') < \mathrm{PPL}(y \mid x, z)$ for query $x$ and answer $y$.
Candidate prompts inject meta-structure and reflection tokens (e.g., “Hmm…”, “Maybe…”) to model human-like chains of thought.
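A simplified sketch of this search loop; the ppl and propose_edits helpers are hypothetical stand-ins for a scoring language model and an edit proposer.

```python
def reer_search(query: str, answer: str, init_traj: str,
                ppl, propose_edits, max_iters: int = 20) -> str:
    """Gradient-free hill climbing: keep any edited trajectory that lowers the
    perplexity of the gold answer conditioned on query + trajectory."""
    best_traj, best_ppl = init_traj, ppl(answer, query, init_traj)
    for _ in range(max_iters):
        improved = False
        for cand in propose_edits(best_traj):      # insert/replace reasoning steps
            cand_ppl = ppl(answer, query, cand)
            if cand_ppl < best_ppl:
                best_traj, best_ppl, improved = cand, cand_ppl, True
        if not improved:                           # local optimum reached
            break
    return best_traj
```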
The training regime relies on formatting with mandatory <think>...</think> and <answer>...</answer> blocks, explicitly separating reasoned planning from surface generation (Wang et al., 7 Sep 2025).
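A minimal sketch of how one training instance might be serialized under this format; the system-prompt wording is an assumption, and only the <think>/<answer> framing comes from the source.

```python
SYSTEM_PROMPT = "You are an expert writer. Think step by step before answering."

def format_example(query: str, trajectory: str, answer: str) -> str:
    """One training string in the assumed format."""
    return (
        f"{SYSTEM_PROMPT}\n"
        f"User: {query}\n"
        f"<think>{trajectory}</think>\n"
        f"<answer>{answer}</answer>"
    )
```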
5. Training Regimes, Datasets, and Optimization
Knowledge-Grounded Fine-Tuning
- Backbone initialized from Qwen2-7B.
- Adapters and heads fine-tuned on WTO WTR corpus (23 PDFs).
- Joint multimodal objectives: a weighted sum of the component training losses, optimized with AdamW (learning-rate schedule with 1k warmup steps and 10k decay steps), batch size 32, maximum sequence length 4,000 tokens.
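A sketch of the optimizer and warmup/decay schedule under these settings; the peak learning rate is left as a caller-supplied argument because its exact value is not reproduced here, and the linear shape of the warmup and decay is an assumption.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer(model, peak_lr, warmup_steps=1_000, decay_steps=10_000):
    """AdamW with linear warmup then linear decay (1k warmup / 10k decay)."""
    opt = AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)            # linear warmup to peak_lr
        remaining = max(0, decay_steps - (step - warmup_steps))
        return remaining / decay_steps                     # linear decay to zero

    return opt, LambdaLR(opt, lr_lambda)
```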
Reverse-Engineered Reasoning Training
- Fine-tuning 32-layer Qwen3-8B-Base on DeepWriting-20K (20,000 reverse-engineered thought traces) plus 17,000 public deep reasoning examples, for a total of 37,000 instances.
- Regime: 3 epochs, batch size 96, AdamW optimizer (β₁ = 0.9, β₂ = 0.999, weight decay 0.01).
- Objective: standard autoregressive cross-entropy over the system prompt, <think>, and <answer> blocks.
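A minimal sketch of this objective, assuming a Hugging Face-style causal-LM interface (the paper does not specify the training framework):

```python
def lm_loss(model, tokenizer, formatted_example: str, device="cuda"):
    """Autoregressive cross-entropy over the full formatted string
    (system prompt + <think> + <answer>); passing `labels` triggers the
    built-in shifted cross-entropy loss of a Hugging Face causal LM."""
    enc = tokenizer(formatted_example, return_tensors="pt", truncation=True).to(device)
    out = model(input_ids=enc["input_ids"],
                attention_mask=enc["attention_mask"],
                labels=enc["input_ids"])
    return out.loss
```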
This synthesis enables both strong factual grounding (offline corpus) and deep multi-step open-ended reasoning (REER dataset) (Mao et al., 14 Jul 2025, Wang et al., 7 Sep 2025).
6. Empirical Evaluations and Ablation Studies
Financial Domain Evaluation
On the WTO WTR dataset (real-world report topics, judged by Prometheus2-7B):
| System | Interest | Coherence | Relevance | Coverage | Citation Accuracy |
|---|---|---|---|---|---|
| Qwen-Plus | 3.5 | 3.7 | 3.5 | 3.8 | 78% |
| STORM | 3.8 | 3.9 | 3.6 | 3.9 | 81% |
| CO-STORM | 3.9 | 3.8 | 3.7 | 3.9 | 83% |
| DeepWriter-8B | 4.1 | 4.0 | 3.8 | 4.2 | 95% |
DeepWriter-8B achieves higher factual accuracy and quality than larger, web-search-based systems (Mao et al., 14 Jul 2025).
Open-Ended Generation Benchmarks
| Model | LB | HB-A | HB-B | WB-A | WB-B | WB-C | WB-D | WB-E | WB-F |
|---|---|---|---|---|---|---|---|---|---|
| GPT-4o | 83.1 | 83.7 | 87.6 | 74.40 | 73.42 | 74.38 | 77.91 | 75.86 | 78.08 |
| Claude 3.5 | 89.3 | 82.9 | 88.3 | 59.05 | 57.68 | 56.32 | 59.36 | 62.00 | 67.70 |
| Claude 3.7 | 97.8 | 83.9 | 93.2 | 78.24 | 77.93 | 76.51 | 79.37 | 79.26 | 80.88 |
| LongWriter-8B | 76.5 | 80.1 | 82.6 | 57.97 | 53.92 | 49.08 | 52.08 | 52.99 | 52.08 |
| DeepWriter-8B | 91.28 | 82.64 | 87.48 | 72.20 | 71.76 | 70.57 | 70.57 | 73.65 | 72.29 |
DeepWriter-8B achieves exceptional long-range coherence (LB), nearly matching or exceeding GPT-4o and Claude variants on narrative and professional writing tasks (Wang et al., 7 Sep 2025).
Ablation Results
- Removal of synthesized reasoning traces, iterative search, or reflection tokens reduces domain and creative performance, confirming the criticality of these innovations.
- Both long and short trajectory traces contribute distinct value across technical and artistic task domains.
7. Implementation, Inference, and Practical Notes
- Inference latency is approximately 8 ms/token on a single A100 GPU, yielding ~4 s for a typical 500-token response (Wang et al., 7 Sep 2025).
- Fine-tuning guidelines for DeepWriter-8B: 3 epochs, global batch size 96, with the optimizer regime described in Section 5.
- Prompts at inference require explicit separation: an expert-writer instruction followed by a mandatory <think> block and then an <answer> block, reliably triggering reflective planning (a sketch follows below).
- Offline KB enables robust domain-specific writing with reliable citations, while REER trajectories foster human-like, multi-step creative reasoning.
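A hypothetical prompt template illustrating the required separation; only the <think>/<answer> structure is prescribed by the source, the wording is assumed.

```python
INFERENCE_PROMPT = (
    "You are an expert writer. First plan the piece inside <think>...</think>, "
    "then write the final text inside <answer>...</answer>.\n"
    "User request: {request}\n"
    "<think>"
)

def build_prompt(request: str) -> str:
    """Assumed paraphrase of the expert-writer instruction plus mandatory tags."""
    return INFERENCE_PROMPT.format(request=request)
```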
This suggests that DeepWriter-8B, via modular pipeline design and reverse-engineered trajectory training, provides scalable, efficient, and verifiable text generation for both closed-domain and open-ended applications, rivaling much larger proprietary models and surpassing open-source 8 B baselines (Mao et al., 14 Jul 2025, Wang et al., 7 Sep 2025).