Papers
Topics
Authors
Recent
Search
2000 character limit reached

BOOKAGENT: Orchestrating Safety-Aware Visual Narratives via Multi-Agent Cognitive Calibration

Published 17 Apr 2026 in cs.CV | (2604.16541v1)

Abstract: Recent advancements in Large Generative Models (LGMs) have revolutionized multi-modal generation. However, generating illustrated storybooks remains an open challenge, where prior works mainly decompose this task into separate stages, and thus, holistic multi-modal grounding remains limited. Besides, while safety alignment is studied for text- or image-only generation, existing works rarely integrate child-specific safety constraints into narrative planning and sequence-level multi-modal verification. To address these limitations, we propose BookAgent, a safety-aware multi-agent collaboration framework designed for high-quality, safety-aware visual narratives. Different from prior story visualization models that assume a fixed storyline sequence, BookAgent targets end-to-end storybook synthesis from a user draft by jointly planning, scripting, illustrating, and globally repairing inconsistencies. To ensure precise multi-modal grounding, BookAgent dynamically calibrates page-level alignment between textual scripts and visual layouts. Furthermore, BookAgent calibrates holistic consistency from the temporal dimension, by verifying-then-rectifying global inconsistencies in character identity and storytelling logic. Extensive experiments demonstrate that BookAgent significantly outperforms current methods in narrative coherence, visual consistency, and safety compliance, offering a robust paradigm for reliable agents in complex multi-modal creation. The implementation will be publicly released at https://github.com/bogao-code/BookAgent/tree/main.

Summary

  • The paper introduces BookAgent, an end-to-end multi-agent framework that ensures safety-aware, consistent visual narrative generation.
  • It employs specialized agents for value-aligned storyboarding, iterative cross-modal refinement, and temporal cognitive calibration to maintain narrative and visual consistency.
  • Empirical results demonstrate superior image-text and character consistency scores compared to baselines, along with robust safety adherence in multi-page story synthesis.

BookAgent: Multi-Agent Cognitive Calibration for Safety-Aware Visual Story Generation

Motivation and Problem Statement

Illustrated storybook synthesis is a multifaceted generative task, requiring tight multimodal coupling between textual narrative and visual rendering while maintaining strong long-horizon consistency and explicit child-safety. Existing approaches either (1) decouple textual and visual streams with stage-wise or autoregressive pipelines, resulting in brittle multimodal alignment and semantic drift, or (2) treat safety as an afterthought, insufficiently integrating child-centric guardrails into the generative process. The "BOOKAGENT" framework (2604.16541) directly addresses these limitations by formulating storybook synthesis as an end-to-end, closed-loop collaboration between specialized agents, each targeting a critical aspect of consistency, verification, and safety. Figure 1

Figure 1: Comparison between single-pass baseline (top) and BookAgent (bottom) on multi-page story synthesis, highlighting BookAgent's maintenance of character identity, temporal order, and visual consistency.

Architecture: Orchestrated Multi-Agent Workflow

BookAgent adopts a hierarchical, role-specialized architecture built on large multi-modal LLM backbones (Gemini 3.0 and Nano-Banana) to implement collaborative reasoning and generation cycles. The pipeline is decomposed into three core mechanisms:

  1. Value-Aligned Storyboarding (VAS): Intake and auditing of user draft for explicit safety compliance, script refinement to align with page/character constraints, and extraction of canonical character sheets to serve as identity anchors.
  2. Iterative Cross-modal Refinement (ICR): Per-page multimodal generation with a generate–verify–revise loop. Image/text consistency (frame-level), identity anchoring, and multimodal safety are orchestrated by specialized agent roles (Frame Director, Identity Director, etc.). Local critiques are aggregated and rolled into prompt updates to ensure progressive error correction at generation time.
  3. Temporal Cognitive Calibration (TCC): Post-sequence global verification for cross-page consistency in attributes, narrative, and object logic. Selective re-synthesis and repair are triggered for pages violating long-horizon constraints, informed by structured feedback on detected inconsistencies. Figure 2

    Figure 2: Schematic overview of BookAgent, presenting the closed-loop, multi-agent architecture with VAS, ICR, and TCC mechanisms.

The framework enforces hard safety constraints via dual text and image auditors at each stage and is instantiated to maximize multimodal attribute faithfulness, identity consistency, and sequence coherence under explicit safety requirements.

Empirical Evaluation: Robustness, Consistency, and Safety

Qualitative and Quantitative Benchmarking

BookAgent was benchmarked on a suite of 16 multi-page stories (5–20 pages each) designed with compositional constraints involving spatial, symbolic, and procedural invariants across pages. Comparative baselines include StoryGPT-V, MovieAgent, and StoryGen. The evaluation protocol spans semantic image-text alignment, cross-frame character/object continuity, and strict safety adherence.

Notably, BookAgent achieves significantly higher scores in cross-frame consistency and semantic alignment:

  • Image-Text Consistency: 4.6 (BookAgent) vs. 3.1 (StoryGPT-V)
  • Character Consistency: 4.7 (BookAgent) vs. 2.4 (StoryGPT-V)
  • Safety: 4.8 (BookAgent) vs. 4.5 (StoryGPT-V), outperforming all baselines.

Qualitative analyses (see below) reveal that baselines are susceptible to attribute drift, inconsistent narration, and safety violations, while BookAgent robustly preserves logical bindings, object trajectories, and visual semantics over long horizons. Figure 3

Figure 3: Visual comparison on character/object consistency in the Milo story, showing appearance drift in baselines and stable identity maintenance by BookAgent.

Figure 4

Figure 4: Handling hard attribute constraints (Rowan); only BookAgent reliably enforces symbolic logic and object counts across all frames.

User Study

A structured user study targeting parents of children aged 4–10 confirms BookAgent's superiority, with preference ratings substantially above competing methods, attributed to improved story coherence and visual engagement. Figure 5

Figure 5: User study results reporting mean preference scores, with BookAgent outperforming alternatives across the board.

Ablation Analysis

Ablation experiments quantify the contributions of VAS, ICR, and TCC:

  • Removing ICR: Degrades image-text consistency, confirming that static feed-forward generation is insufficient for robust grounding.
  • Removing TCC: Severely impacts cross-frame character consistency, underlining the need for explicit temporal verification and repair.
  • Excluding VAS: Results in reduced safety compliance and increased presence of inappropriate content, demonstrating the impact of pre-generation auditing. Figure 6

    Figure 6: Visualizations from ablation showing error correction enabled by ICR and TCC; red denotes inconsistency, green denotes repaired consistency.

Long-Horizon Stress Testing

In expert-level tests (see below), BookAgent maintains invariance on complex, interdependent constraints (e.g., attribute bindings, object locations, ticket counts), successfully detecting and correcting intentional rule violations throughout extended illustrated narratives. Figure 7

Figure 7: Consecutive panels from the long narrative stress test, illustrating BookAgent's strict maintenance and repair of rule-based invariants across dense narrative progression.

Structured Feedback and Interactive Deployment

The core generate-verify-correct loop is operationalized via real-time structured feedback. The system iteratively diagnoses and annotates fine-grained errors, including attribute drift, semantic mismatch, and prop continuity violations, activating targeted repair routines where necessary. Figure 8

Figure 8: Sampling of structured feedback highlighting cross-page inconsistencies not locally detectable at generation time.

A production-quality web interface exposes interactive story definition, style controls, and fine-grained generation configuration, supporting variable-length stories and advanced multimodal parameterization. Figure 9

Figure 9

Figure 9: Left: Iterative global repair during generation, with reference-guided consistency enforcement. Right: User interface for parameterized control over the synthesis process.

Limitations and Implications

BookAgent marks a significant advancement in pushing agent-based multimodal generation toward robust, safety-aware, long-horizon storytelling. However, several challenges persist:

  • Model Dependency: Overall performance is fundamentally bounded by the underlying multimodal foundation models. Low-level failures and rare misunderstandings still arise.
  • Scalability: Maintaining consistent identity for a large and dynamic cast of entities presents memory and reasoning bottlenecks, especially as the number of tracked objects increases with story length.
  • Efficiency: The iterative validation and repair paradigm increases inference overhead compared to feed-forward pipelines; thus, online or interactive deployments for very long stories are computationally expensive.

Practically, BookAgent provides a strong reference paradigm for safe, reliable storybook generation suitable for sensitive domains such as child education and entertainment. Theoretically, its approach to co-optimization and closed-loop cognitive calibration can lay the foundation for robust agentic control in broader multimodal AI applications—extensible to procedural animation, interactive fiction, or educational content generation.

Conclusion

BookAgent demonstrates that orchestrated multi-agent collaboration, with explicit cognitive calibration for safety and consistency, is effective for synthesizing coherent, safe visual narratives under complex constraints. The integration of planning, iterative multimodal refinement, and temporal global auditing critically improves long-horizon compositionality and safety compared to prior methods. Future directions include scalable multi-entity consistency, adaptive verification under computational constraints, and the extension of cognitive calibration paradigms to other domains of reliable, explainable multi-agent AI.

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

No one has generated a whiteboard explanation for this paper yet.

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 1 tweet with 0 likes about this paper.