Deep-Reporter: Deep Research for Grounded Multimodal Long-Form Generation

Published 12 Apr 2026 in cs.CL, cs.AI, and cs.IR | (2604.10741v1)

Abstract: Recent agentic search frameworks enable deep research via iterative planning and retrieval, reducing hallucinations and enhancing factual grounding. However, they remain text-centric, overlooking the multimodal evidence that characterizes real-world expert reports. We introduce a pressing task: multimodal long-form generation. Accordingly, we propose Deep-Reporter, a unified agentic framework for grounded multimodal long-form generation. It orchestrates: (i) Agentic Multimodal Search and Filtering to retrieve and filter textual passages and information-dense visuals; (ii) Checklist-Guided Incremental Synthesis to ensure coherent image-text integration and optimal citation placement; and (iii) Recurrent Context Management to balance long-range coherence with local fluency. We develop a rigorous curation pipeline producing 8K high-quality agentic traces for model optimization. We further introduce M²LongBench, a comprehensive testbed comprising 247 research tasks across 9 domains and a stable multimodal sandbox. Extensive experiments demonstrate that long-form multimodal generation is a challenging task, especially in multimodal selection and integration, and effective post-training can bridge the gap.


Authors (8)

Summary

The paper introduces an agentic multimodal pipeline that combines planning, evidence retrieval, and incremental synthesis to generate cohesive, evidence-rich reports.
The paper demonstrates significant improvements in narrative coherence and citation integration through recurrent context management and relevance-aware filtering.
The paper validates its approach on the M²LongBench benchmark, showing marked gains in evidence selection precision and overall report quality.

Deep-Reporter: Agentic Multimodal Long-Form Generation Framework

Introduction

Deep-Reporter addresses the limitations of text-centric long-form generation models, introducing a unified agentic framework for multimodal long-form composition. The approach orchestrates planning, targeted retrieval of both text and visual evidence, and structured synthesis, thereby grounding lengthy, expert-grade reports in verifiable, information-dense multimodal sources. This direction responds to the emerging need for automated generation pipelines capable of replicating the evidence integration and cross-modal reasoning that characterize high-quality research artifacts and professional reports.

Figure 1: Comparison of paradigms in long-form report generation. Deep-Reporter (c) achieves high coherence and factuality by retrieving and integrating real-world visual evidence, overcoming the fragmentation and hallucinations of prior approaches.

Methodology

Agentic Multimodal Pipeline

Deep-Reporter decomposes the challenging multimodal generation task into three coordinated agentic components: Planning, Agentic Multimodal Search and Filtering, and Checklist-Guided Incremental Synthesis with Recurrent Context Management.

Sectional Planning with Dual-Granularity Checklists: A Planner decomposes the report into a hierarchical structure of sections and semantic anchors (fine- and coarse-grained), generating explicit checklists for each section to ensure comprehensive narrative coverage and factual rigor.
Agentic Multimodal Search & Filtering: For each section, Deep-Reporter executes dual-stream agentic search—narrative queries for text, and visual queries for charts, figures, and diagrams—against a multimodal retrieval backend, followed by rigorous relevance-aware filtering to isolate evidence that directly supports checklist items.
Incremental Synthesis & Recurrent Context Management: A Reporter agent incrementally composes the narrative, leveraging a recurrent context mechanism that propagates both a compressed global summary and verbatim section-level history. This design maintains long-range report coherence and smooth local transitions across extensive contexts, mitigating context window overflow and cross-section drift. Visual evidence is consistently interleaved into the narrative via citation-based transcription, enabling fine control over image placement and supporting evidence traceability.
Figure 2: Deep-Reporter architecture. Multi-agent orchestration for planning, multimodal search/filtering, and incremental report writing. Right: the data synthesis pipeline curates expert traces for post-training open-weight models.
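
The section-by-section loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `RecurrentContext`, `write_report`, `search`, `keep`, `write`, and `summarize` are hypothetical, and the verbatim window size (`keep_verbatim`) is an invented parameter. The sketch only shows how a compressed global summary and a short verbatim history might be carried from one section to the next.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Sequence, Tuple


@dataclass
class RecurrentContext:
    """Hypothetical container: compressed global summary plus verbatim recent sections."""
    global_summary: str = ""
    recent_sections: List[str] = field(default_factory=list)
    keep_verbatim: int = 1  # assumed window size, not taken from the paper

    def update(self, new_section: str, summarize: Callable[[str, str], str]) -> None:
        # Keep the newest section verbatim to support smooth local transitions.
        self.recent_sections.append(new_section)
        # Fold sections that leave the verbatim window into the global summary.
        while len(self.recent_sections) > self.keep_verbatim:
            evicted = self.recent_sections.pop(0)
            self.global_summary = summarize(self.global_summary, evicted)

    def render(self) -> str:
        history = "\n\n".join(self.recent_sections)
        return f"[SUMMARY]\n{self.global_summary}\n[RECENT]\n{history}"


def write_report(
    sections: Sequence[Tuple[str, List[str]]],  # (title, checklist) pairs from a planner
    search: Callable[[str, List[str]], List[str]],
    keep: Callable[[str, List[str]], bool],
    write: Callable[[str, List[str], List[str], str], str],
    summarize: Callable[[str, str], str],
    keep_verbatim: int = 1,
) -> Tuple[List[str], RecurrentContext]:
    """Search, filter, and write one section at a time, carrying context forward."""
    ctx = RecurrentContext(keep_verbatim=keep_verbatim)
    report: List[str] = []
    for title, checklist in sections:
        candidates = search(title, checklist)                     # text and visual evidence
        evidence = [e for e in candidates if keep(e, checklist)]  # relevance-aware filtering
        text = write(title, checklist, evidence, ctx.render())    # incremental synthesis
        report.append(text)
        ctx.update(text, summarize)
    return report, ctx
```

Because the prompt for each section contains only the summary and the last few sections, its size stays bounded regardless of report length, which is the property the recurrent mechanism is meant to provide.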

Dataset Construction and Training

To activate agentic multimodal reasoning capabilities lacking in existing LLMs, an expert-in-the-loop curation pipeline was developed to produce 8,000 high-quality agentic traces. The process involved:

Diverse domain/task sampling and expert refinement of checklists and outlines.
Autonomous execution/trace distillation of the agentic framework to capture interleaved multimodal search and synthesis logic, including retrieval, evidence selection, and visual citation strategies.
Strict quality control, employing strong LLMs/VLMs for automatic trace filtration, verification of visual evidence alignment, detection/removal of hallucinations, and manual expert reviews, yielding a high-precision supervision corpus.
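
A quality-control stage of this kind could be organized as an ordered chain of checks, as in the sketch below. The trace fields (`cited_images`, `retrieved_images`) and the two rule-based checks are hypothetical stand-ins for the LLM/VLM verifiers and expert review mentioned above; the paper's actual trace schema and filters are not given here.

```python
from typing import Callable, Dict, List, Tuple

Trace = Dict[str, list]
Check = Tuple[str, Callable[[Trace], bool]]


def filter_traces(traces: List[Trace], checks: List[Check]) -> Tuple[List[Trace], Dict[str, int]]:
    """Apply checks in order; a trace is dropped at its first failing check."""
    kept: List[Trace] = []
    rejected: Dict[str, int] = {}
    for trace in traces:
        failed = next((name for name, passes in checks if not passes(trace)), None)
        if failed is None:
            kept.append(trace)
        else:
            rejected[failed] = rejected.get(failed, 0) + 1
    return kept, rejected


# Hypothetical rule-based checks standing in for model-based verification.
checks: List[Check] = [
    ("has_visual_citation", lambda t: len(t["cited_images"]) > 0),
    ("citations_were_retrieved", lambda t: set(t["cited_images"]) <= set(t["retrieved_images"])),
]
```

Recording the reason for each rejection makes it easier to see which check removes the most traces before any manual review.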

This dataset enables effective post-training of open-weight models, through both SFT and DPO, with SFT yielding superior multimodal action conditioning.

M²LongBench: A Rigorous Multimodal Benchmark

Evaluating Deep-Reporter requires a testbed capable of holistic multimodal evaluation. M²LongBench was designed as a comprehensive, domain-diverse benchmark comprising 247 tasks with ground-truth blueprints, each annotated with exhaustive structural checklists and an average of 102 ground-truth images and 168 text chunks as evidence.

The core features are:

Massive Multimodal Sandbox (95K images, 108M text chunks) for scale, transparency, and extensibility.
Unified assessment protocols for text and visual retrieval, section-level integration, and outcome-level quality.
Expert-refined silver reports as references, supporting relative quality normalization and domain-specific breakdowns.
Open web transferability experiments, confirming high overlap (>83%) between M²LongBench and live web evidence and indicating that the findings generalize to unconstrained information environments.
Figure 3: Demonstration of the sandbox construction pipeline integrating structured, cross-modal evidence at scale.

Experimental Results

Agentic Multimodal Retrieval and Filtering

Extensive ablation studies revealed that raw retrieval alone does not distinguish between competitive agentic frameworks; performance gains are centralized in the filtering and selection stages. Notably:

Deep-Reporter’s searcher-filter design boosts precision at the evidence filtering stage: it compresses large candidate sets (∼156 per task) into a concentrated set of high-value evidence (∼81) and raises selection precision (Qwen3-8B image selection: 8.3 → 45.0 after SFT).
Post-training (SFT) significantly amplifies image selection/placement capabilities, which are otherwise almost absent in base or DPO-only variants.
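
To make the notion of selection precision concrete, the sketch below computes it before and after a threshold-based relevance filter. It assumes precision is the percentage of selected items that appear in the ground-truth evidence set; the paper's exact metric definition is not given here, and the function names, toy scores, and threshold are all invented for illustration.

```python
from typing import Callable, Iterable, List, Set


def selection_precision(selected: Iterable[str], gold: Set[str]) -> float:
    """Percentage of selected items that appear in the ground-truth evidence set."""
    chosen = set(selected)
    if not chosen:
        return 0.0
    return 100.0 * len(chosen & gold) / len(chosen)


def relevance_filter(candidates: List[str], score: Callable[[str], float], threshold: float) -> List[str]:
    """Keep candidates whose (hypothetical) relevance score meets the threshold."""
    return [c for c in candidates if score(c) >= threshold]


# Toy data, invented for illustration only.
candidates = ["img_a", "img_b", "img_c", "img_d", "img_e", "img_f"]
gold = {"img_a", "img_b"}
toy_scores = {"img_a": 0.9, "img_b": 0.8, "img_c": 0.2, "img_d": 0.1, "img_e": 0.7, "img_f": 0.3}

kept = relevance_filter(candidates, lambda c: toy_scores[c], threshold=0.5)
before = selection_precision(candidates, gold)
after = selection_precision(kept, gold)
print(f"precision before filtering: {before:.1f}, after: {after:.1f}")
```

In this toy case the filter halves the candidate set and doubles precision, mirroring in miniature how gains can come from selection even when raw retrieval is unchanged.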

Multimodal Long-Form Generation Quality

Quantitative evaluation demonstrated that Deep-Reporter’s agentic orchestration yields severalfold improvements over strong baselines:

Output length and narrative structure metrics: Naïve RAG outputs are truncated (∼2k tokens) and shallow (overall 5.2), while Deep-Reporter produces highly coherent multimodal reports of more than 4k tokens (overall 27.2 base, up to 37.9 after SFT).
Sectional and citation-level multimodal integration: Section content quality rises from 6.4 (baseline) to 41.7 (Deep-Reporter + Qwen3-32B + SFT), with robust improvement in citation richness, coherence, placement, and clarity.
Ablations indicate that relevance-aware filtering and explicit recurrent context management are complementary—removing either disproportionately harms context-sensitive (full report) or evidence-grounding (section content) metrics.
Figure 4: Overall generation metrics vs. output length, highlighting Deep-Reporter’s capacity for high-quality, extended multimodal reports.

Figure 5: Model performance evaluation: holistic and stage-wise breakdowns, including subject-specific retrieval across modalities and pipeline stages.

Figure 6: Subject/domain-specific generation quality, showing significant cross-domain variability and metric-specific performance patterns.

Training Objective Analysis

SFT consistently outperforms DPO for this regime, demonstrating that sparse multimodal actions (visual citation/insertion) benefit from dense, token-level supervision over preference-based, trajectory-level optimization.
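
The contrast between the two kinds of supervision can be seen in the standard forms of the objectives, shown below for reference. These are the textbook SFT and DPO losses, not necessarily the exact formulations or hyperparameters used in the paper; here x is the input context, y an expert trace, (y_w, y_l) a preferred/rejected pair, π_ref a frozen reference policy, and β a temperature.

```latex
\mathcal{L}_{\mathrm{SFT}}(\theta)
  = -\,\mathbb{E}_{(x, y) \sim \mathcal{D}}
    \left[ \sum_{t=1}^{|y|} \log \pi_\theta\!\left(y_t \mid x, y_{<t}\right) \right]

\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}_{\mathrm{pref}}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```

The SFT loss contributes a separate term for every token, including the few tokens that encode a visual citation, whereas the DPO loss compares whole sequences through a single scalar margin per pair. This is consistent with the observation that rare multimodal actions receive a clearer learning signal under SFT.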

Efficiency and Practical Considerations

Filter-based relevance estimation incurs significant inference latency (∼10 minutes per report, with filtering as the dominant cost). However, SFT-improved models retain inference efficiency, and the quality-latency trade-off can be tuned by adjusting retrieval and filter budgets or by deploying lightweight classifiers in place of the filter.
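
One simple way to expose such a budget is to cap how many retrieved candidates reach the expensive filter, as sketched below. Both functions are hypothetical: the ranking by retrieval score and the batch-based latency model are illustrative assumptions, not measurements or mechanisms reported in the paper.

```python
from typing import List, Tuple


def apply_filter_budget(candidates: List[Tuple[str, float]], budget: int) -> List[str]:
    """Keep the top-`budget` candidates by retrieval score before running the costly filter."""
    if budget <= 0:
        return []
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:budget]]


def estimated_filter_seconds(n_candidates: int, seconds_per_call: float, parallelism: int = 1) -> float:
    """Rough latency model: filter calls run in batches of `parallelism` (must be >= 1)."""
    batches = -(-n_candidates // parallelism)  # ceiling division
    return batches * seconds_per_call
```

Lowering the budget reduces the number of filter calls linearly, at the risk of discarding relevant evidence that retrieval ranked poorly.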

Implications and Future Directions

Deep-Reporter’s results underscore the importance of fine-grained agentic architectures for grounded multimodal research automation. Architectures that lack explicit evidence selection (filtering) and contextual memory (recurrent management) suffer from severe context degradation and cannot perform reliable multimodal synthesis. The strong reliance on structured action supervision (SFT > DPO) suggests that RL-based approaches require dense reward-signal engineering for compositional, cross-modal tasks. The M²LongBench design will likely serve as a template for future multimodal research and evaluation platforms, especially where unified access and transparency are required.

On the practical axis, Deep-Reporter highlights the importance of modular pipelines for deployable research assistants: model components (searcher, filter, reporter) and sandbox infrastructure can be incrementally scaled/updated for new domains, languages, or evidence modalities. Potential future developments include online RL optimization, multilingual capability extension, and tighter integration with knowledge graphs or domain-specific retrieval APIs.

Conclusion

Deep-Reporter advances the frontier of grounded, agentic multimodal report generation, establishing new best practices for both system design and evaluation. The framework delivers robust, highly factual long-form narratives tightly interleaved with visual evidence, with empirical superiority demonstrated on a new, rigorous multimodal benchmark. The modular and extensible design of both Deep-Reporter and M²LongBench provides a reproducible foundation for future research in agentic, evidence-driven content creation systems.