DeepWriter-8B: Fact-Grounded Writing AI

Updated 13 January 2026
  • DeepWriter-8B is an 8-billion-parameter fact-grounded multimodal writing assistant that integrates offline hierarchical knowledge retrieval, deep reasoning, and reflective composition for domain-accurate long-form content.
  • It leverages both encoder–decoder and decoder-only architectures to deliver focused, domain-specific outputs and robust open-ended generation.
  • Employing multimodal retrieval and iterative reflective drafting, DeepWriter-8B enhances citation accuracy and long-range narrative coherence in both creative and specialized tasks.

DeepWriter-8B is an 8-billion-parameter, fact-grounded multimodal writing assistant that integrates offline hierarchical knowledge retrieval, deep reasoning trajectory learning, and reflective composition to produce coherent, domain-accurate long-form content. Developed as both an encoder–decoder (for domain-specific writing with offline corpora) and decoder-only (for open-ended generation via reverse-engineered reasoning), DeepWriter-8B demonstrates robust performance in specialized domains and creative tasks, outperforming several open-source and proprietary baselines (Mao et al., 14 Jul 2025, Wang et al., 7 Sep 2025).

1. Model Architectures and Pipeline Designs

DeepWriter-8B comprises two core implementations: a modular encoder–decoder (for knowledge-grounded writing) (Mao et al., 14 Jul 2025) and a decoder-only reasoning-focused model (for open-ended generation) (Wang et al., 7 Sep 2025).

Encoder–Decoder Structure

  • Backbone: 48-layer Transformer, hidden size 6,144, 48 attention heads (6.8 B parameters), shared encoder–decoder parameters.
  • Task Decomposition & Outline Generator: 400 M parameters, 2-layer adapter stack on encoder, 128-dim classification, 256-dim span-selection.
  • Multimodal Retriever & Reranker: 500 M parameters, GME-based dual-tower encoders (text and vision), cross-modal reranking head (2 Transformer layers, hidden size 2,048).
  • Section Composer with Reflection: 300 M parameters, adapters every 4 Transformer layers for draft generation, 2-layer MLP discriminator for reflective critique.

Decoder-Only Structure

  • Backbone: 32-layer Transformer (Qwen3-8B-Base), hidden size 4096, 32 attention heads, feed-forward inner dimension 16,384, layer-norm, Rotary Position Embeddings, full fine-tune with no adapters or LoRA modules (Wang et al., 7 Sep 2025).
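
For quick reference, these backbone hyperparameters can be collected into a small config object; this is a minimal sketch with assumed field names, not the released code.

```python
from dataclasses import dataclass

# Illustrative config restating the decoder-only backbone described above;
# field names are assumptions, only the numeric values come from the text.
@dataclass
class DecoderOnlyConfig:
    n_layers: int = 32                    # 32-layer Transformer (Qwen3-8B-Base)
    hidden_size: int = 4096               # model width
    n_heads: int = 32                     # attention heads
    ffn_inner_dim: int = 16384            # feed-forward inner dimension
    positional_encoding: str = "rotary"   # Rotary Position Embeddings
    use_adapters: bool = False            # full fine-tune, no adapters or LoRA
```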

Pipeline Overview

Queries flow through rewriting → outline generation → hierarchical retrieval → section drafting → reflection → final assembly (document stitching, insertion of multimodal elements, and citations) (Mao et al., 14 Jul 2025). Open-ended, reasoning-driven generation first emits a <think> block that encodes deep reasoning before the final <answer> output (Wang et al., 7 Sep 2025).
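
A rough orchestration sketch of the knowledge-grounded pipeline is shown below; each stage is passed in as a callable because the actual components (query rewriter, outline generator, retriever, composer, reflector, assembler) are described in the text but not specified here, so all names are placeholders.

```python
# Hypothetical pipeline sketch: every stage function is a stand-in for the module
# described in the text, wired together in the order given above.
def run_pipeline(query, kb, rewrite, outline, retrieve, draft, reflect, assemble):
    query = rewrite(query)                               # query rewriting
    sections = []
    for section_title in outline(query):                 # outline generation
        evidence = retrieve(section_title, kb)            # hierarchical retrieval (doc -> page -> chunk)
        text = draft(section_title, evidence, sections)   # section drafting with prior context
        sections.append(reflect(text, evidence))          # reflective critique and revision
    return assemble(sections)                             # final assembly: multimodal elements, citations
```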

2. Offline Knowledge Base and Hierarchical Retrieval

For domain-grounded writing, DeepWriter-8B relies on a curated offline corpus $\mathcal{K}$ and a hierarchical representation:

Corpus Structure

  • Document level: metadata (year, title, domain)
  • Page level: each PDF page as a “page” item
  • Chunk level: ~200-token text splits + captions for tables/images

Indexing and Embeddings

  • Milvus stores embeddings for each chunk/page/document; inverted keyword index for auxiliary lookup.
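
The sketch below illustrates, under assumptions, how such a three-level representation might be built: documents hold pages, pages hold ~200-token chunks plus table/image captions. Whitespace splitting stands in for the real tokenizer, and the resulting chunk/page/document embeddings would then be stored in Milvus alongside the keyword index.

```python
# Rough sketch of the hierarchical corpus structure; whitespace tokenization and the
# dict layout are illustrative stand-ins, not DeepWriter-8B's actual preprocessing.
def build_hierarchy(documents, chunk_tokens=200):
    corpus = []
    for doc in documents:
        entry = {"title": doc["title"], "year": doc["year"],
                 "domain": doc["domain"], "pages": []}
        for page_text, captions in doc["pages"]:           # one item per PDF page
            words = page_text.split()
            chunks = [" ".join(words[i:i + chunk_tokens])   # ~200-token text splits
                      for i in range(0, len(words), chunk_tokens)]
            entry["pages"].append({"chunks": chunks + list(captions)})
        corpus.append(entry)
    return corpus
```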

Hierarchical Scoring Functions

$S_{chunk}(d_i) = \cos(q, d_i)$

$S_{page}(p_j) = \max_{i \in \mathrm{page}_j} S_{chunk}(d_i)$

$S_{doc}(D_k) = \max_{j \in D_k} S_{page}(p_j)$

Ranking proceeds from the top-$K$ documents to the top-$K_1$ pages to the top-$K_2$ chunks, with retrieval performed against each query embedding $q$.

Pseudocode: $\mathrm{rank\_chunks}(q) = \mathrm{topk}\{d_i : S_{chunk}(d_i)\}$. Compared with web-search-based approaches, advantages include robust consistency, fine-grained relevance, and high citation accuracy (Mao et al., 14 Jul 2025).
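
A minimal sketch of this hierarchical ranking, assuming pre-computed chunk embeddings grouped as documents → pages → chunks, is given below; the data layout and top-k values are illustrative, not the released implementation.

```python
import numpy as np

# S_chunk(d_i) = cos(q, d_i); S_page = max over its chunks; S_doc = max over its pages.
def cos(q, d):
    return float(np.dot(q, d) / (np.linalg.norm(q) * np.linalg.norm(d)))

def rank_chunks(q, docs, k_docs=5, k_pages=3, k_chunks=8):
    # Assumed layout: each document is a list of pages, each page a list of chunk vectors.
    page_score = lambda page: max(cos(q, c) for c in page)
    doc_score = lambda doc: max(page_score(p) for p in doc)

    top_docs = sorted(docs, key=doc_score, reverse=True)[:k_docs]      # top-K documents
    pages = [p for doc in top_docs for p in doc]
    top_pages = sorted(pages, key=page_score, reverse=True)[:k_pages]  # top-K1 pages
    chunks = [c for p in top_pages for c in p]
    return sorted(chunks, key=lambda c: cos(q, c), reverse=True)[:k_chunks]  # top-K2 chunks
```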

3. Multimodal Retrieval, Reranking, and Fusion Algorithms

DeepWriter-8B incorporates both textual and visual modalities for information retrieval and document synthesis.

Retrieval

  • Text tower: cosine similarity embedding for chunk/page retrieval.
  • Vision tower: Qwen2.5-VL for image/table embeddings.

Cross-Modal Reranking

  • Two-layer Transformer reranker, optimizing contrastive ranking loss

$L_{rank} = -\sum_{\mathrm{pos}} \log \frac{\exp(f(q, v^+))}{\exp(f(q, v^+)) + \sum_{v^-} \exp(f(q, v^-))}$

where $f(\cdot, \cdot)$ is the learned scoring head.
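
Because the positive score sits in the numerator of a softmax over the positive and all negatives, the loss reduces to a cross-entropy with the positive in class 0, as in this small PyTorch sketch (the score computation itself is assumed to happen elsewhere):

```python
import torch
import torch.nn.functional as F

# pos_score: scalar tensor f(q, v+); neg_scores: 1-D tensor of f(q, v-) values.
def rank_loss(pos_score, neg_scores):
    logits = torch.cat([pos_score.unsqueeze(0), neg_scores])  # positive placed in slot 0
    # -log( exp(f(q,v+)) / (exp(f(q,v+)) + sum_v- exp(f(q,v-))) )
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```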

Fusion Strategies

  • Early fusion: modality embeddings are concatenated at each decoder layer, with cross-attention applied as

$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V$

  • Late fusion: a relevance matrix $R_{k,i} = \cos(e(v_k), e(p_i))$ determines optimal placement of visuals alongside paragraphs.

This enables robust integration of tables and figures, elevating document quality and factuality (Mao et al., 14 Jul 2025).
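
A minimal late-fusion sketch is shown below: the relevance matrix $R_{k,i}$ is built from cosine similarities between visual and paragraph embeddings, and each visual is placed with its best-matching paragraph. The embedding inputs and the greedy placement rule are assumptions for illustration.

```python
import numpy as np

# Late fusion: R[k, i] = cos(e(v_k), e(p_i)); each visual goes to its argmax paragraph.
def place_visuals(visual_embs, paragraph_embs):
    V = np.asarray(visual_embs, dtype=float)
    P = np.asarray(paragraph_embs, dtype=float)
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    R = V @ P.T                                            # relevance matrix R_{k,i}
    return {k: int(np.argmax(R[k])) for k in range(R.shape[0])}  # visual index -> paragraph index
```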

4. Reflective Section Composition and Reverse-Engineered Reasoning

Section-Level Reflective Drafting

For specialized writing tasks:

  • Drafting: Each outline section $s_j$ is drafted using top-ranked content and prior context.
  • Reflection loop: The draft $d_j$ is critiqued against retrieved facts by a discriminator:

    # One reflection step: a critic checks the draft against the retrieved evidence,
    # and the draft is revised only when errors are found.
    errors = Critic(d_j, retrieved_content)
    if errors:
        d_j_revised = Revise(d_j, errors)
    else:
        d_j_revised = d_j

    Formally:

    $d_j^{(t+1)} = \begin{cases} \mathrm{Revise}\big(d_j^{(t)}, \mathrm{Critique}(d_j^{(t)}, \mathcal{R}_j)\big) & \text{if } \mathrm{Critique}(d_j^{(t)}, \mathcal{R}_j) \neq \varnothing \\ d_j^{(t)} & \text{otherwise} \end{cases}$

    This feedback loop improves factuality, style, and citation quality across revisions (Mao et al., 14 Jul 2025).

Reverse-Engineered Reasoning (REER)

For open-ended generation:

  • REER constructs a stepwise “thinking” trajectory $z^*$ for each (query, answer) pair via a gradient-free local search over candidate edits that iteratively minimizes perplexity (PPL):

$z^* = \arg\min_{z \in \mathcal{Z}} \mathrm{PPL}(y \mid x, z)$

$\mathrm{PPL}(y \mid x, z) = \exp\left[-\frac{1}{|y|} \sum_{i=1}^{|y|} \log p(y_i \mid x, z, y_{<i})\right]$

Candidate prompts inject meta-structure and reflection tokens (e.g., “Hmm…”, “Maybe…”) to model human-like chains of thought.
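
The sketch below shows one way such a gradient-free local search could look, assuming a `score_ppl` helper that returns PPL(y | x, z) from a scorer model and a `propose_edits` helper that yields candidate trajectory variants (e.g., with reflection tokens inserted); both helpers are hypothetical, and only the accept-if-PPL-drops loop mirrors the description above.

```python
import math

# PPL(y | x, z) = exp( -1/|y| * sum_i log p(y_i | x, z, y_<i) )
def ppl_from_logprobs(token_logprobs):
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def reer_search(x, y, z_init, score_ppl, propose_edits, n_iters=20):
    z_best, best = z_init, score_ppl(x, z_init, y)
    for _ in range(n_iters):
        improved = False
        for z_cand in propose_edits(z_best):    # local edits to the thinking trajectory
            cand = score_ppl(x, z_cand, y)      # how predictable the answer becomes
            if cand < best:                     # keep edits that lower perplexity
                z_best, best, improved = z_cand, cand, True
        if not improved:
            break                               # no improving edit found: local minimum
    return z_best                               # z* = argmin_z PPL(y | x, z)
```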

The training regime relies on formatting with mandatory <think>...</think> and <answer>...</answer> blocks, explicitly separating reasoned planning from surface generation (Wang et al., 7 Sep 2025).

5. Training Regimes, Datasets, and Optimization

Knowledge-Grounded Fine-Tuning

  • Backbone initialized from Qwen2-7B.
  • Adapters and heads fine-tuned on WTO WTR corpus (23 PDFs).
  • Joint multimodal objectives: $\mathcal{L} = \mathcal{L}_{LM} + \alpha \mathcal{L}_{rank} + \beta \mathcal{L}_{align}$ with $\alpha = 1.0$, $\beta = 0.5$; AdamW, learning rate $1 \times 10^{-5}$ (1k warmup steps, 10k decay steps), batch size 32, maximum sequence length 4,000 tokens.
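
As a trivial illustration of how the weighted objective combines (assuming the three component losses are computed elsewhere):

```python
# L = L_LM + alpha * L_rank + beta * L_align, with alpha = 1.0 and beta = 0.5 as reported.
def joint_loss(lm_loss, rank_loss, align_loss, alpha=1.0, beta=0.5):
    return lm_loss + alpha * rank_loss + beta * align_loss
```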

Reverse-Engineered Reasoning Training

  • Fine-tuning 32-layer Qwen3-8B-Base on DeepWriting-20K (20,000 reverse-engineered thought traces) plus 17,000 public deep reasoning examples, for a total of 37,000 instances.
  • Regime: 3 epochs, learning rate $2 \times 10^{-5}$, batch size 96, AdamW ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0.01).
  • Objective: standard autoregressive cross-entropy over system prompt, <think>, and <answer> blocks.
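
A rough sketch of how a training instance might be serialized is given below; the prompt wording and field names are assumptions, and only the mandatory <think>/<answer> tag structure comes from the text.

```python
# Hypothetical serialization of one REER training example; the autoregressive
# cross-entropy is applied over the system prompt, <think>, and <answer> spans.
def format_example(system_prompt, query, trajectory, answer):
    return (
        f"{system_prompt}\n"
        f"User: {query}\n"
        f"Assistant: <think>{trajectory}</think>\n"
        f"<answer>{answer}</answer>"
    )
```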

This synthesis enables both strong factual grounding (offline corpus) and deep multi-step open-ended reasoning (REER dataset) (Mao et al., 14 Jul 2025, Wang et al., 7 Sep 2025).

6. Empirical Evaluations and Ablation Studies

Financial Domain Evaluation

On the WTO WTR dataset (real-world report topics, judged by Prometheus2-7B):

| System        | Interest | Coherence | Relevance | Coverage | Citation Accuracy |
|---------------|----------|-----------|-----------|----------|-------------------|
| Qwen-Plus     | 3.5      | 3.7       | 3.5       | 3.8      | 78%               |
| STORM         | 3.8      | 3.9       | 3.6       | 3.9      | 81%               |
| CO-STORM      | 3.9      | 3.8       | 3.7       | 3.9      | 83%               |
| DeepWriter-8B | 4.1      | 4.0       | 3.8       | 4.2      | 95%               |

DeepWriter-8B achieves higher factual accuracy and quality than larger, web-search-based systems (Mao et al., 14 Jul 2025).

Open-Ended Generation Benchmarks

| Model         | LB    | HB-A  | HB-B  | WB-A  | WB-B  | WB-C  | WB-D  | WB-E  | WB-F  |
|---------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
| GPT-4o        | 83.1  | 83.7  | 87.6  | 74.40 | 73.42 | 74.38 | 77.91 | 75.86 | 78.08 |
| Claude 3.5    | 89.3  | 82.9  | 88.3  | 59.05 | 57.68 | 56.32 | 59.36 | 62.00 | 67.70 |
| Claude 3.7    | 97.8  | 83.9  | 93.2  | 78.24 | 77.93 | 76.51 | 79.37 | 79.26 | 80.88 |
| LongWriter-8B | 76.5  | 80.1  | 82.6  | 57.97 | 53.92 | 49.08 | 52.08 | 52.99 | 52.08 |
| DeepWriter-8B | 91.28 | 82.64 | 87.48 | 72.20 | 71.76 | 70.57 | 70.57 | 73.65 | 72.29 |

DeepWriter-8B achieves exceptional long-range coherence (LB), nearly matching or exceeding GPT-4o and Claude variants on narrative and professional writing tasks (Wang et al., 7 Sep 2025).

Ablation Results

  • Removing the synthesized reasoning traces, the iterative search, or the reflection tokens reduces both domain and creative performance, confirming that each of these components is critical.
  • Both long and short trajectory traces contribute distinct value across technical and artistic task domains.

7. Implementation, Inference, and Practical Notes

  • Inference latency is approximately 8 ms/token on a single A100 GPU, yielding ~4 s for a typical 500-token response (Wang et al., 7 Sep 2025).
  • Fine-tuning guidelines for DeepWriter-8B: 3 epochs, learning rate $2 \times 10^{-5}$, global batch size 96.
  • Prompts at inference require explicit separation: an expert-writer instruction followed by mandatory <think> then <answer> blocks, which reliably triggers reflective planning (see the sketch after this list).
  • Offline KB enables robust domain-specific writing with reliable citation, while REER trajectories foster human-like, multi-step creative reasoning.
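
A hypothetical prompt template along these lines is sketched below; the exact wording used by DeepWriter-8B is not reproduced here, only the required instruction-then-<think>-then-<answer> structure.

```python
# Illustrative inference prompt; wording is an assumption, the tag layout follows the text.
PROMPT_TEMPLATE = (
    "You are an expert writer.\n"
    "First plan the piece step by step inside a <think> block, "
    "then write the final text inside an <answer> block.\n\n"
    "Task: {task}\n"
    "<think>"
)

def build_prompt(task: str) -> str:
    return PROMPT_TEMPLATE.format(task=task)
```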

This suggests that DeepWriter-8B, via modular pipeline design and reverse-engineered trajectory training, provides scalable, efficient, and verifiable text generation for both closed-domain and open-ended applications, rivaling much larger proprietary models and surpassing open-source 8 B baselines (Mao et al., 14 Jul 2025, Wang et al., 7 Sep 2025).
