CogGen: A Cognitively Inspired Recursive Framework for Deep Research Report Generation

Published 18 Apr 2026 in cs.MA | (2604.17072v1)

Abstract: The autonomous synthesis of deep research reports represents a critical frontier for LLMs, demanding sophisticated information orchestration and non-linear narrative logic. Current approaches rely on rigid predefined linear workflows, which cause error accumulation, preclude global restructuring from subsequent insights, and ultimately limit in-depth multimodal fusion and report quality. We propose CogGen, a Cognitively inspired recursive framework for deep research report Generation. Leveraging a Hierarchical Recursive Architecture to simulate cognitive writing, CogGen enables flexible planning and global restructuring. To extend this recursivity to multimodal content, we introduce Abstract Visual Representation (AVR): a concise intent-driven language that iteratively refines visual-text layouts without pixel-level regeneration overhead. We further present CLEF, a Cognitive Load Evaluation Framework, and curate a new benchmark from Our World in Data (OWID). Extensive experiments show CogGen achieves state-of-the-art results among open-source systems, generating reports comparable to professional analysts' outputs and surpassing Gemini Deep Research. Our code and dataset are available at https://github.com/NJUNLP/CogGen.


Authors (5)

Summary

The paper introduces CogGen, a recursive architecture that enhances research report synthesis by integrating text and visual elements.
It outlines a novel method using Planner, Writer, and Reviewer agents to dynamically restructure content and prevent error propagation.
Experimental results on the OWID and WildSeek datasets show state-of-the-art performance, achieving superior depth, organization, and alignment.


Motivation and Problem Statement

The autonomous generation of research reports with deep analytical structure and multimodal evidence remains a significant challenge for LLM-based systems. Existing systems are limited by rigid, predefined, linear workflows that preclude recursive restructuring, leading to error propagation, brittle global logic, and superficial or disjoint integration of textual and visual content. Because they lack non-linear writing and revision at both the macro (whole-report) and micro (section) levels, these systems fail to reflect the recursive, iterative, and integrative strategies that human experts use.

Furthermore, current approaches to multimodal synthesis typically decouple text and visualization pipelines, resulting in poor alignment and limited synergy. Visual elements are often generated in isolation from the accompanying text, degrading the overall coherence and cognitive efficacy of the produced report.

CogGen Framework Design

CogGen addresses these deficiencies through a cognitively inspired, hierarchical recursive architecture. The framework operationalizes writing theories such as the Cognitive Process Theory (Flower & Hayes), incorporating both macro-level and micro-level recursive refinement. The architecture consists of three specialized agent modules:

Planner Agent ($A_p$): Conducts information retrieval and constructs a mutable global outline and knowledge base. It dynamically adapts the structure in response to feedback and evolving content.
Writer Agent ($A_w$): Generates report sections (including both text and high-level visual intent) in parallel under the context of the current plan.
Reviewer Agent ($A_r$): Executes both real-time monitoring and post-hoc critique, emitting feedback that drives further structural or content revisions. A revision is accepted only if it clears an improvement threshold, so report quality changes monotonically.
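The reviewer gate described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Draft` type, the `gated_refine` function, the `threshold` and `max_rounds` parameters, and the toy `add_section` and `coverage` callables are hypothetical and are not taken from the CogGen code base.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Draft:
    """Hypothetical stand-in for a report state (outline only)."""
    outline: List[str] = field(default_factory=list)


def gated_refine(
    draft: Draft,
    revise: Callable[[Draft], Draft],
    score: Callable[[Draft], float],
    threshold: float = 0.01,   # assumed minimum gain for acceptance
    max_rounds: int = 5,       # assumed cap on revision rounds
) -> Tuple[Draft, List[float]]:
    """Keep a revision only if it beats the current score by more than threshold."""
    best, best_score = draft, score(draft)
    history = [best_score]
    for _ in range(max_rounds):
        candidate = revise(best)
        candidate_score = score(candidate)
        if candidate_score - best_score > threshold:
            best, best_score = candidate, candidate_score
            history.append(best_score)
        else:
            break  # the gate rejects the revision, so refinement stops
    return best, history


# Toy planner/writer step: append one section per round.
def add_section(d: Draft) -> Draft:
    return Draft(d.outline + [f"Section {len(d.outline)}"])


# Toy reviewer score: saturates once the outline has three entries.
def coverage(d: Draft) -> float:
    return min(len(d.outline), 3) / 3


final, history = gated_refine(Draft(["Introduction"]), add_section, coverage)
```

Because only accepted revisions are recorded, `history` is strictly increasing, which is one way to read the monotonic improvement requirement.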

The macro-cognitive loop orchestrates global outline and content rewriting, supporting 'backwards restructuring' wherein downstream insights can trigger upstream modifications. At the micro level, parallel section writing threads each execute their own search–replan–write–review cycles, isolated in local caches to prevent contamination of unrelated sections and to sidestep the oscillation commonly seen in naïve recursive edits.
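A rough sketch of the micro-level isolation idea follows, assuming each section worker copies the shared knowledge base into a private cache and that new findings are merged only after all workers finish. The function names (`write_section`, `write_report`) and the merge policy are illustrative assumptions, not the framework's actual interface.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple


def write_section(title: str, shared_kb: Dict[str, str]) -> Tuple[str, str, Dict[str, str]]:
    """Draft one section against a private copy of the knowledge base."""
    local_cache = dict(shared_kb)  # private copy; the shared dict is never mutated here
    local_cache[f"note:{title}"] = f"evidence gathered for {title}"
    text = f"{title}: drafted from {len(local_cache)} knowledge items"
    new_items = {k: v for k, v in local_cache.items() if k not in shared_kb}
    return title, text, new_items


def write_report(outline: List[str], shared_kb: Dict[str, str]) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run section workers in parallel, then merge their findings in one deferred step."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = list(pool.map(lambda t: write_section(t, shared_kb), outline))
    sections: Dict[str, str] = {}
    merged_kb = dict(shared_kb)
    for title, text, new_items in results:
        sections[title] = text
        merged_kb.update(new_items)  # deferred update after all workers return
    return sections, merged_kb


kb = {"source:1": "placeholder fact"}
sections, merged_kb = write_report(["Background", "Trends"], kb)
```

Each worker sees the same snapshot of the knowledge base, so one section's intermediate notes cannot leak into another section's draft.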

Abstract Visual Representation (AVR)

A central contribution is the introduction of the Abstract Visual Representation (AVR) for integrating multimodal content. Instead of entangling narrative flow with direct chart generation or visual code, the Writer Agent specifies visual requirements in a high-level, semantic, compact schema (e.g., chart type, axes, purpose, and data reference), offloading the rendering to specialized agents after global content stabilization. This mitigates dual-task interference and fosters strong text-visual semantic alignment with minimal cognitive burden during authoring.
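To make the idea concrete, an AVR entry might look like the following sketch. The field names mirror the examples listed above (chart type, axes, purpose, data reference), but the exact schema, the `AVR` class name, and the JSON serialization are assumptions for illustration; the paper's actual format may differ.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AVR:
    """Hypothetical visual-intent record emitted by the Writer Agent."""
    chart_type: str   # e.g. "line" or "bar"
    x_axis: str       # label of the horizontal axis
    y_axis: str       # label of the vertical axis
    purpose: str      # what the figure should convey to the reader
    data_ref: str     # key of the supporting data in the knowledge base

    def to_json(self) -> str:
        return json.dumps(asdict(self), sort_keys=True)


spec = AVR(
    chart_type="line",
    x_axis="Year",
    y_axis="Value",
    purpose="Show the trend discussed in this section",
    data_ref="kb:example_series",  # placeholder key, not a real dataset
)
restored = AVR(**json.loads(spec.to_json()))
```

The record carries no rendering code, so it can be revised cheaply alongside the text and rendered only once the content has stabilized.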

Rendering and Verification Pipeline

The Render Agent translates AVR schemas into executable visualization code using declarative libraries (e.g., ECharts, Mermaid), with final assets rendered in a headless browser. Post-render verification audits rendered visuals against knowledge base data to suppress hallucinations, a mechanism made efficient by the decoupling properties of AVR.
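The sketch below shows one way such a pipeline could translate a visual-intent record into an ECharts option object and then audit it against the knowledge base. The dictionary-based AVR, the function names, and the sample data are all made up for illustration; headless-browser rendering is omitted, and the real Render Agent generates code with an LLM rather than a fixed template.

```python
from typing import Dict, List, Tuple

# Hypothetical knowledge base: data_ref -> list of (label, value) rows.
KnowledgeBase = Dict[str, List[Tuple[str, float]]]


def render_echarts_option(avr: dict, kb: KnowledgeBase) -> dict:
    """Build an ECharts option dict from an AVR-like record."""
    rows = kb[avr["data_ref"]]
    return {
        "title": {"text": avr["purpose"]},
        "xAxis": {"type": "category", "name": avr["x_axis"],
                  "data": [label for label, _ in rows]},
        "yAxis": {"type": "value", "name": avr["y_axis"]},
        "series": [{"type": avr["chart_type"],
                    "data": [value for _, value in rows]}],
    }


def verify_option(option: dict, avr: dict, kb: KnowledgeBase) -> bool:
    """Check that the plotted labels and values match the knowledge base rows."""
    labels = option["xAxis"]["data"]
    values = option["series"][0]["data"]
    return list(zip(labels, values)) == kb[avr["data_ref"]]


kb: KnowledgeBase = {"kb:example_series": [("2000", 1.0), ("2010", 2.0), ("2020", 3.0)]}
avr = {"chart_type": "line", "x_axis": "Year", "y_axis": "Value",
       "purpose": "Illustrative trend", "data_ref": "kb:example_series"}

option = render_echarts_option(avr, kb)
tampered = {**option, "series": [{"type": "line", "data": [1.0, 2.0, 9.9]}]}
```

Here `verify_option(option, avr, kb)` passes while `verify_option(tampered, avr, kb)` fails, since the check only needs the `data_ref` link that the AVR record preserves.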

Evaluation Methodology

CogGen is evaluated using the newly curated OWID dataset (derived from the Our World in Data repository) of professional multimodal research reports and the WildSeek open-domain report generation benchmark. The Cognitive Load Evaluation Framework (CLEF), grounded in Cognitive Load Theory and the Cognitive Theory of Multimedia Learning, is introduced as a principled evaluation metric. CLEF assesses reports across five orthogonal axes: Organization, Depth, Relevance, Alignment, and Synergy, operationalizing 11 of Mayer’s 14 educational multimedia principles.
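As a hedged illustration of how per-axis scores could be combined into a single figure, the sketch below takes an unweighted mean over the five axes. The aggregation rule, the score scale, and the sample values are assumptions; this summary does not state how CLEF computes its average.

```python
from statistics import fmean
from typing import Dict

AXES = ("organization", "depth", "relevance", "alignment", "synergy")


def average_clef(scores: Dict[str, float]) -> float:
    """Unweighted mean over the five CLEF axes (assumed aggregation rule)."""
    missing = [axis for axis in AXES if axis not in scores]
    if missing:
        raise ValueError(f"missing axes: {missing}")
    return fmean(scores[axis] for axis in AXES)


# Placeholder scores, not results from the paper.
example = {"organization": 0.5, "depth": 0.4, "relevance": 0.6,
           "alignment": 0.5, "synergy": 0.5}
overall = average_clef(example)
```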

Baselines include STORM, Co-STORM, WriteHere, and Multimodal DeepResearcher, as well as commercial outputs from Gemini Deep Research.

Experimental Results

CogGen demonstrates state-of-the-art performance relative to both open-source and commercial baselines:

On the OWID dataset, CogGen comes within 0.0005 of the human analyst benchmark on average CLEF score (0.4992 vs. 0.4997) and outperforms all other models, especially in content depth and synergy.
On the WildSeek dataset, CogGen surpasses Gemini Deep Research across all evaluation dimensions, with particularly strong margins in visual-text alignment and multimodal synergy.
In ablation studies, removing the recursive review mechanism or employing a post-hoc (two-stage) visual integration strategy results in marked declines in global organization, depth, and alignment, confirming the necessity of recursive planning and synchronized multimodal reasoning.
Human evaluations (blinded, pairwise) yield a 75% win rate over Gemini Deep Research and up to 95% win rate over open-source baselines on critical report quality dimensions.
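For reference, a pairwise win rate of this kind can be computed as the share of comparisons a system wins. The sketch below assumes each judgment is recorded as the label of the preferred system and that ties, if any, count as non-wins; both the recording format and the tally are made up and do not reflect the paper's sample.

```python
from collections import Counter
from typing import Iterable


def win_rate(outcomes: Iterable[str], system: str) -> float:
    """Fraction of pairwise judgments in which `system` was preferred."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no judgments recorded")
    return counts[system] / total


# Made-up tally for illustration only.
judgments = ["coggen", "coggen", "baseline", "coggen"]
rate = win_rate(judgments, "coggen")
```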

In all settings, CogGen exhibits superior factual precision in citation analysis and claim-level verification compared to both open and commercial baselines.

Architectural and Theoretical Implications

CogGen's recursive, hierarchical approach enables non-linear global restructuring and retroactive correction, essential for expert-grade report synthesis and long-form reasoning tasks. The framework’s parallelized deferred update policy and reviewer-gated feedback yield empirically validated stability and high planning-to-writing revision ratios, avoiding the complexity explosion typical of serial backtracking methods. The AVR design, by decoupling visual intent from rendering, not only fosters high-quality semantic integration but also facilitates post-render data verification.

The findings demonstrate that cognitive architectures incorporating macro- and micro-level recursion, review gating, and multimodal offloading can elevate LLM-driven research agents beyond fixed linear execution. Furthermore, such architectures align well with foundational models of human writing and learning, bridging the mechanistic gap between algorithmic systems and expert human writers.

Practical Limitations and Outlook

CogGen incurs significant computational overhead relative to linear baselines. The recursive mechanisms and synchronous multimodal planning contribute to this cost, but most of it comes from the retrieval-intensive pre-processing pipeline. While report depth and factuality benefit from full-document summarization, lighter-weight retrieval and summarization models may be needed to make deployment practical.

Current rendering is restricted to declarative visualization libraries to ensure robustness, so expressiveness lags behind bespoke, hand-crafted visualization capabilities. Future system iterations could integrate more sophisticated template generation for imperative and interactive visuals, as well as learned heuristics for efficient review gating.

From a theoretical standpoint, CogGen provides a blueprint for future research on agentic LLM systems capable of expert-level reasoning, planning, and synthesis—suggesting a path forward not just for research report automation, but for increasingly general AI writing and knowledge integration tasks demanding recursive, multimodal, and human-like cognitive strategies.

Conclusion

CogGen represents a substantial advancement in report generation architectures, blending cognitive theory, hierarchical recursion, and multimodal intent decoupling. The experimental evidence indicates that recursive, cognitively inspired agent frameworks with explicit review and alignment mechanisms are essential for robust deep research synthesis, supporting both practical deployment in professional domains and informing the next generation of autonomous research agents.
