PaperWritingBench: Modular Document Generation
- PaperWritingBench is a suite of modular frameworks that benchmark, automate, and refine long-form document planning using hierarchical memory and task decomposition.
- It integrates agent architectures like AgentLite and DualGraph to structure reasoning, enable recursive delegation, and preserve context across multi-step scholarly workflows.
- The suite's evaluation protocols, modular design, and explicit memory-driven planning offer actionable insights for improving automated scholarly document creation.
PaperWritingBench refers to a suite of task-oriented, modular, and extensible frameworks, models, and memory architectures designed to benchmark, automate, and scrutinize agentic long-form document planning and generation. These systems span diverse media and task domains—including scientific survey outline drafting, deep research report writing, scientific workflow automation, story and video structuring, and visual document decomposition—emphasizing experimental rigor and fine-grained evaluation in the generation of complex scholarly artifacts.
1. Agent Architectures for Document Planning and Reasoning
PaperWritingBench comprises several paradigms for agentic reasoning and document assembly, with a recurrent focus on hierarchical, role-driven, memory-augmented agents.
- AgentLite implements an extensible agent core that separates prompt generation, memory, actions, and LLM API logic. It distinguishes Individual Agents (single-step reasoning/execution) from Manager Agents (handling task decomposition and multi-agent orchestration), with a TaskPackage model formalizing instruction handoff and memory flow. Reasoning strategies (Chain-of-Thought, ReAct, Reflection) are abstracted as swappable Action classes, making the architectural insertion of new strategies or tools modular (Liu et al., 2024).
- DualGraph defines two co-evolving memory substrates: the Outline Graph (OG), encoding document structure and section-level evidence/citations, and the Knowledge Graph (KG), encoding semantic entity–relation facts extracted from evidence banks. Roles and communication protocols between agent components are rigorously specified, with KG-driven bridges enabling targeted query generation and refinement of document drafts (Shi et al., 14 Feb 2026).
- SurveyForge's Outline Agent operates via retrieval-augmented, hierarchical LLM prompting, leveraging both a Research Paper DB and a Survey Outline DB. Modular interfaces manage memory bundles per outline node, facilitating transfer of localized context to downstream content agents (Yan et al., 6 Mar 2025).
- SlideAgent and VSENet apply the same decomposition principle to complex multi-modal or multi-page documents, splitting reasoning across specialized, hierarchical agents that target the global, page, and element levels (Jin et al., 30 Oct 2025, Lv et al., 2022).
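The dual-memory idea described for DualGraph can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `OutlineNode`, `KnowledgeGraph`, and `unsupported_sections` are hypothetical and are not taken from the DualGraph paper or code. The sketch only shows how a structural memory (sections with attached evidence) and a semantic memory (entity-relation triples) can be kept separate and queried for gaps.

```python
from dataclasses import dataclass, field


@dataclass
class OutlineNode:
    """One section in the outline graph, with the evidence attached to it."""
    title: str
    evidence_ids: list[str] = field(default_factory=list)
    children: list["OutlineNode"] = field(default_factory=list)


@dataclass
class KnowledgeGraph:
    """Entity-relation facts stored as (head, relation, tail) triples."""
    triples: set[tuple[str, str, str]] = field(default_factory=set)

    def add(self, head: str, relation: str, tail: str) -> None:
        self.triples.add((head, relation, tail))

    def neighbors(self, entity: str) -> set[str]:
        """Entities directly linked to `entity`, in either direction."""
        outgoing = {t for h, _, t in self.triples if h == entity}
        incoming = {h for h, _, t in self.triples if t == entity}
        return outgoing | incoming


def unsupported_sections(node: OutlineNode) -> list[str]:
    """Depth-first list of section titles that have no evidence yet."""
    missing = [] if node.evidence_ids else [node.title]
    for child in node.children:
        missing.extend(unsupported_sections(child))
    return missing


# Illustrative data only.
outline = OutlineNode("Survey", evidence_ids=["e0"], children=[
    OutlineNode("Background", evidence_ids=["e1"]),
    OutlineNode("Methods"),
])
kg = KnowledgeGraph()
kg.add("GNN", "uses", "message passing")
kg.add("GCN", "is_a", "GNN")

print(unsupported_sections(outline))  # ['Methods']
print(sorted(kg.neighbors("GNN")))    # ['GCN', 'message passing']
```

In a pipeline of this shape, the list of unsupported sections would be one natural input to query generation, while the triple store supplies related entities to search for.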
2. Task Decomposition and Modular Workflow Design
A central tenet across PaperWritingBench systems is explicit task decomposition at runtime, typically realized through a manager-worker (or manager–team) architecture:
- Manager Agents initiate decomposition by invoking LLM-based planning routines that split global tasks (e.g., “write a survey on Graph Neural Networks”) into well-ordered subtasks, each captured as a TaskPackage and delegated to specialized subagents or action handlers.
- Recursive Delegation is standard: ManagerAgents may call other ManagerAgents, permitting deep or arbitrarily nested hierarchies reflecting complex real-world scholarly workflows (Liu et al., 2024).
- Explicit Data & Control Flow is traced through action–observation chains, persistent subtask packages, and memory updates, ensuring all context required for reproducibility and auditability is captured.
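The manager-worker pattern with recursive delegation can be sketched as follows. This is not AgentLite's API: the classes `TaskPackage`, `Worker`, and `Manager` below are hypothetical stand-ins written for this example, and the planner is a fixed function where a real system would call an LLM.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TaskPackage:
    """Hypothetical handoff record: an instruction, its result, and a trace."""
    instruction: str
    result: str = ""
    trace: list[str] = field(default_factory=list)


class Worker:
    """Executes a single subtask without further decomposition."""

    def __init__(self, name: str, handler: Callable[[str], str]):
        self.name = name
        self.handler = handler

    def run(self, task: TaskPackage) -> TaskPackage:
        task.result = self.handler(task.instruction)
        task.trace.append(f"{self.name}: {task.instruction}")
        return task


class Manager:
    """Splits a task with `planner` and delegates the subtasks in order.

    Team members may be Workers or other Managers, which gives recursion.
    """

    def __init__(self, name, planner, team):
        self.name = name
        self.planner = planner  # instruction -> [(team_index, sub_instruction)]
        self.team = team

    def run(self, task: TaskPackage) -> TaskPackage:
        task.trace.append(f"{self.name}: decompose '{task.instruction}'")
        parts = []
        for member_index, sub_instruction in self.planner(task.instruction):
            sub = self.team[member_index].run(TaskPackage(sub_instruction))
            task.trace.extend(sub.trace)  # keep the full audit trail
            parts.append(sub.result)
        task.result = "\n".join(parts)
        return task


# Stub handlers and planners stand in for LLM calls.
searcher = Worker("searcher", lambda s: f"[sources] {s}")
writer = Worker("writer", lambda s: f"[draft] {s}")
section_mgr = Manager(
    "section_mgr",
    lambda s: [(0, f"find papers for {s}"), (1, f"write {s}")],
    [searcher, writer],
)
top = Manager(
    "top",
    lambda s: [(0, "Introduction"), (0, "Methods")],
    [section_mgr],
)

done = top.run(TaskPackage("survey on Graph Neural Networks"))
print(done.result)
print("\n".join(done.trace))
```

Because each subtask returns its own trace and the manager appends it to the parent package, the top-level package ends up holding the complete action history, which is the property the auditability point above relies on.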
3. Reasoning Strategies and Memory-Driven Planning
Outline and document agents in PaperWritingBench incorporate diverse reasoning and memory paradigms:
- Action-Based Reasoning: Reasoning patterns such as Chain-of-Thought, ReAct (interleaving thinking and tool use), and Reflection/self-critique are implemented directly as selectable Action subclasses. The agent’s action list defines its supported reasoning repertoire.
- Memory Inclusion in Prompting: Past action–observation pairs, decomposition traces, and evidence citations are persistently stored and automatically included in the prompt context, ensuring continuity of reasoning and enabling retrieval-augmented generation. In DualGraph, the explicit differentiation between structural (OG) and epistemic (KG) memory directly informs targeted gap identification and search query formulation.
- Graph-Driven Exploration: In DualGraph, search continues until OG- and KG-driven early-stop criteria are met, and scoring functions (e.g., Score_enrich for knowledge edges, composite relevance scores) drive query and refinement prioritization (Shi et al., 14 Feb 2026).
- Hierarchical Context Preservation: Detailed outliners (e.g., DOC) generate tree-structured plans, with discriminators and control mechanisms ensuring the drafting phase respects the established plan at all levels (Yang et al., 2022).
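Memory inclusion in prompting can be sketched as a function that renders stored action-observation pairs into the next prompt. The names `Step`, `Memory`, and `build_prompt`, the prompt layout, and the `max_steps` truncation are all assumptions made for this example; the cited systems may format and truncate history differently.

```python
from dataclasses import dataclass, field


@dataclass
class Step:
    action: str       # name of the action that was taken
    argument: str     # input passed to the action
    observation: str  # what came back


@dataclass
class Memory:
    steps: list[Step] = field(default_factory=list)

    def record(self, action: str, argument: str, observation: str) -> None:
        self.steps.append(Step(action, argument, observation))


def build_prompt(instruction: str, action_names: list[str], memory: Memory,
                 max_steps: int = 5) -> str:
    """Assemble the next prompt from the task, allowed actions, and history.

    Only the last `max_steps` steps are included; `max_steps` must be positive.
    """
    if max_steps <= 0:
        raise ValueError("max_steps must be positive")
    lines = [
        f"Task: {instruction}",
        "Available actions: " + ", ".join(action_names),
        "History:",
    ]
    recent = memory.steps[-max_steps:]
    if not recent:
        lines.append("(none)")
    for i, step in enumerate(recent, start=1):
        lines.append(f"{i}. {step.action}[{step.argument}] -> {step.observation}")
    lines.append("Next action:")
    return "\n".join(lines)


memory = Memory()
memory.record("Think", "plan outline", "three sections needed")
memory.record("Search", "GNN surveys", "12 candidate papers")
print(build_prompt("write a survey on GNNs", ["Think", "Search", "Finish"], memory))
```

The list of action names plays the role of the agent's reasoning repertoire: adding a reflection step in this sketch means adding one more name and one more handler, without touching the prompt builder.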
4. Evaluation Protocols and Benchmarks
PaperWritingBench emphasizes multidimensional, competitive evaluation strategies specific to scholarly writing:
- SurveyBench (SurveyForge) uses 100 human-written survey papers for win-rate evaluation; outline evaluation employs the SAM-O metric aggregating topic uniqueness, structure, clarity, and logic (Yan et al., 6 Mar 2025).
- DeepResearch Bench, DeepResearchGym, and DeepConsult serve as standard benchmarks for DualGraph. Metrics include RACE (report quality: comprehensiveness, insight, instruction-following, readability), citation accuracy, and effective citations per task. LLM-as-judge protocols are used for the comparative analysis (Shi et al., 14 Feb 2026).
- Ablation and Human Preference Studies indicate that the architectural choices (KG memory, reflection, decomposition) matter; for instance, DualGraph achieves a RACE score of 53.08 (matching or slightly exceeding Gemini-2.5-Pro), and SurveyForge outlines win 74–75% of human and LLM-based pairwise comparisons against previous methods.
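Pairwise win rates like those reported above reduce to simple counting. The sketch below shows one way to compute a win rate from a list of judge verdicts. The function name, the verdict labels, and the choice to count ties as half a win are assumptions for illustration; the source does not state how the cited benchmarks handle ties.

```python
from collections import Counter


def win_rate(judgments: list[str], system: str = "A",
             count_ties_as_half: bool = True) -> float:
    """Fraction of pairwise comparisons won by `system`.

    Each judgment is the label of the winning system, or "tie".
    """
    if not judgments:
        raise ValueError("no judgments to aggregate")
    counts = Counter(judgments)
    wins = counts[system]
    ties = counts["tie"] if count_ties_as_half else 0
    return (wins + 0.5 * ties) / len(judgments)


# Illustrative verdicts only, not data from any cited benchmark.
judgments = ["A", "A", "B", "tie", "A", "B", "A", "A"]
print(win_rate(judgments))                            # 0.6875
print(win_rate(judgments, count_ties_as_half=False))  # 0.625
```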
5. Practical Implementations and Code Patterns
All major PaperWritingBench systems provide end-to-end pseudocode or runnable code snippets:
- AgentLite usage typically entails defining Action classes (e.g., Tool calls, Think, Reflect), instantiating BaseAgent with desired actions and roles, and wrapping with a ManagerAgent for subtask decomposition and orchestration. Runtime execution is idiomatic: `manager.run(main_instruction)` triggers the full pipeline, with stepwise handoff, memory updating, and output assembly (Liu et al., 2024).
- SurveyForge pseudocode formalizes outline generation as recursive, retrieval-augmented LLM prompting. Candidates are filtered/scored by explicit heuristics (coverage, coherence, depth), and memory passing between agents is localized to subtree contexts of the outline (Yan et al., 6 Mar 2025).
- Graph-based APIs (DualGraph, El Agente Gráfico) formalize the mapping from computational object graphs to memory graphs. Python class structures map bijectively to OWL classes, with runtime state updates, audit logs, and tool orchestration reflected canonically in external knowledge graphs (Bai et al., 19 Feb 2026).
6. Design Insights, Limitations, and Best Practices
- Separation of Structure and Knowledge: DualGraph demonstrates that disentangling outline structure from accumulated knowledge ensures scalable, targeted exploration and prevents lost-context failures typical of linear "search-then-generate" agents (Shi et al., 14 Feb 2026).
- Memory-Driven Reasoning: Memory objects, whether persistent action-observation tuples or semantic graphs, are actively incorporated into many pipeline stages, supporting rational exploration signals (structural holes, bridge edges, weakly supported claims) and reproducibility of agentic trajectories.
- Modularity and Extensibility: AgentLite and related toolkits achieve extensibility through clean code separation: new reasoning modes, memory architectures, or external tools are integrated without altering core agent logic, and new agents or reasoning strategies simply subclass BaseAction or ManagerAgent (Liu et al., 2024).
- Robustness Considerations: Empirical assessments recognize that brittle or noisy intermediate plans degrade downstream generation quality; for example, two-stage hierarchical generation models are sensitive to outline quality (Drissi et al., 2018). Agent design must therefore account for noise robustness, end-to-end retraining protocols, and principled early-stopping.
- Domain Adaptation: In SurveyForge and SlideAgent, adaptation to domain-specific document types (e.g., legal, medical, visual slides) is via retrieval of representative outline/section exemplars or layout augmenters, not by hard-coding.
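Exemplar-based domain adaptation can be illustrated with a toy retriever that picks the stored outlines most similar to a new topic. Token-overlap (Jaccard) similarity is swapped in here purely for simplicity; the source does not say which retriever SurveyForge or SlideAgent use, and the function names and exemplar data are hypothetical.

```python
import re


def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve_exemplars(topic: str, exemplars: dict[str, list[str]],
                       k: int = 2) -> list[str]:
    """Return the titles of the k exemplar outlines closest to `topic`."""
    query = tokens(topic)
    ranked = sorted(exemplars,
                    key=lambda title: jaccard(query, tokens(title)),
                    reverse=True)
    return ranked[:k]


# Illustrative exemplar store: title -> section headings.
exemplars = {
    "Graph neural networks for drug discovery":
        ["Introduction", "Molecular graphs", "Models", "Benchmarks"],
    "A survey of contract law analysis with NLP":
        ["Introduction", "Legal corpora", "Tasks", "Open problems"],
    "Graph neural networks: methods and applications":
        ["Introduction", "Taxonomy", "Applications", "Future work"],
}

best = retrieve_exemplars("graph neural networks for medical imaging", exemplars)
print(best)
```

The retrieved section headings would then be passed to the outline agent as in-context examples, so that adapting to a new document type means adding exemplars rather than changing code.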
7. Representative Implementations
The following table summarizes key frameworks and unique features found in PaperWritingBench-related works:
| Framework | Architectural Principle | Key Innovations/Features |
|---|---|---|
| AgentLite (Liu et al., 2024) | Modular agent, action-unified reasoning | Manager/Individual agents; actions=reasoning steps |
| DualGraph (Shi et al., 14 Feb 2026) | Outline/Knowledge dual memory | Iterative co-evolution, KG-driven queries, OG citations |
| SurveyForge (Yan et al., 6 Mar 2025) | Outline-first retrieval-driven agent | Human-written outline DB, heuristic reranking, SAM-O |
| El Agente Gráfico (Bai et al., 19 Feb 2026) | Type-safe execution, graph mapping | Python↔OWL mapping, provenance, token-efficient context |
| DOC (Yang et al., 2022) | Detailed outline, controlled drafting | Outliner + controller, FUDGE token-level constraints |
| SlideAgent (Jin et al., 30 Oct 2025) | Hierarchical multimodal reasoning | Global/page/element agents, query-agnostic knowledge |
| VSENet (Lv et al., 2022) | Span-rewrite, visual-text fusion | BERT+visual gated fusion, CRF+LaserTagger, DuVOG corpus |
All code and agent orchestration patterns are fully specified within each system’s original documentation and pseudocode, enabling rapid reproduction and extension for academic benchmarking or applied research scenarios.