StoryBox: Collaborative Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation Using Large Language Models (2510.11618v1)

Published 13 Oct 2025 in cs.CL and cs.MA

Abstract: Human writers often begin their stories with an overarching mental scene, where they envision the interactions between characters and their environment. Inspired by this creative process, we propose a novel approach to long-form story generation, termed hybrid bottom-up long-form story generation, using multi-agent simulations. In our method, agents interact within a dynamic sandbox environment, where their behaviors and interactions with one another and the environment generate emergent events. These events form the foundation for the story, enabling organic character development and plot progression. Unlike traditional top-down approaches that impose rigid structures, our hybrid bottom-up approach allows for the natural unfolding of events, fostering more spontaneous and engaging storytelling. The system is capable of generating stories exceeding 10,000 words while maintaining coherence and consistency, addressing some of the key challenges faced by current story generation models. We achieve state-of-the-art performance across several metrics. This approach offers a scalable and innovative solution for creating dynamic, immersive long-form stories that evolve organically from agent-driven interactions.

Summary

The paper introduces a hybrid bottom-up system using multi-agent simulation to generate long-form narratives.
It employs a dynamic event summarization mechanism that transforms emergent agent actions into coherent story arcs.
Empirical evaluation demonstrates superior performance in plot, language, and character consistency over baseline models.

StoryBox: Multi-Agent Simulation for Hybrid Bottom-Up Long-Form Story Generation

Introduction and Motivation

StoryBox introduces a hybrid bottom-up approach to long-form story generation, leveraging collaborative multi-agent simulation within a dynamic sandbox environment. The system is motivated by the observation that human writers often conceptualize stories as evolving scenes with interacting characters, rather than as rigidly pre-planned outlines. StoryBox operationalizes this intuition by simulating agent interactions that generate emergent events, which are then synthesized into coherent, extended narratives by a dedicated Storyteller Agent. This paradigm contrasts with traditional top-down methods, which often struggle to maintain narrative coherence and character consistency over long-form outputs.

Figure 1: The timeline of the multi-agent sandbox simulation, where agent interactions with each other and their environment trigger emergent events that drive dynamic, hybrid bottom-up story generation.

System Architecture

The StoryBox framework consists of two principal components: a multi-agent sandbox simulation and a Storyteller Agent. The simulation models agents (characters) with rich persona attributes, daily plans, and probabilistic abnormal behaviors, situated within a hierarchically structured environment. Agent actions—move, chat, or none—are recorded as events with detailed contextual metadata. The environment is modeled as a tree structure, supporting scalable and flexible world-building beyond the constraints of tile-based systems.

Figure 2: Overview of the system framework for long-form story generation, including the Persona Scratch Information for defining character settings, the sandbox where agent interactions generate events, and the Storyteller Agent that uses these events to craft a complete story.

Agent and Environment Modeling

Each agent is initialized with a "Persona Scratch Information" profile, encompassing static (e.g., name, age, innate traits) and dynamic (e.g., current state, daily plan requirements) attributes. The abnormal behavior attribute, governed by a tunable "Abnormal Factor," injects stochasticity and narrative tension by allowing agents to deviate from routine. The environment is hierarchically organized (World → Region → Zone → Area → Object), with each node optionally described, enabling agents to perceive and interact with their surroundings at varying granularities.
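As a minimal sketch of the two structures described above, the five-level environment tree and a persona with an Abnormal Factor could be represented as follows. The class names, field names, and the `acts_abnormally` helper are illustrative assumptions, not the paper's actual schema; the sketch only assumes that each node sits exactly one level below its parent and that the Abnormal Factor is a probability in [0, 1].

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field

# The five hierarchy levels named in the text, from root to leaf.
LEVELS = ("World", "Region", "Zone", "Area", "Object")


@dataclass
class Node:
    """One node of the environment tree (hypothetical representation)."""
    name: str
    level: str
    description: str = ""  # descriptions are optional, as in the text
    children: list[Node] = field(default_factory=list)

    def add(self, name: str, description: str = "") -> Node:
        """Attach a child exactly one level below this node."""
        idx = LEVELS.index(self.level)
        if idx + 1 >= len(LEVELS):
            raise ValueError("Object nodes are leaves and cannot have children")
        child = Node(name, LEVELS[idx + 1], description)
        self.children.append(child)
        return child

    def paths(self, prefix: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
        """Return the root-to-node path of this node and every descendant."""
        here = prefix + (self.name,)
        out = [here]
        for child in self.children:
            out.extend(child.paths(here))
        return out


@dataclass
class Persona:
    """Static and dynamic attributes of an agent (assumed field names)."""
    name: str
    age: int
    innate_traits: list[str]
    current_state: str = "idle"
    abnormal_factor: float = 0.1  # assumed to be a probability in [0, 1]

    def acts_abnormally(self, rng: random.Random) -> bool:
        """Sample whether the agent deviates from its routine this step."""
        return rng.random() < self.abnormal_factor
```

Because locations are addressed by path rather than by grid coordinate, adding a new region or object is a single `add` call and does not require resizing a map.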

Figure 3: Overview of environment modeling using a tree-like structure with five hierarchical levels, enabling a flexible and expansive environment.

Event Recording and Summarization

All agent actions are logged as events, each with a unique ID, temporal bounds, participants, location, and both a brief description and a detailed contextualization. This event log forms the substrate for subsequent story synthesis. To address LLM context window limitations, events are summarized hierarchically: first by character and day, then via a dynamic windowing mechanism that adaptively chunks events for further abstraction.
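The two summarization stages can be sketched as below. The `Event` fields follow the description above, but their names are assumptions. The windowing function is one plausible reading of "dynamic windowing": a greedy chunker that closes a window once a token budget would be exceeded. The paper's actual mechanism may differ, and whitespace word counts stand in for a real tokenizer.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Event:
    """One logged agent action (field names are illustrative)."""
    event_id: int
    start: datetime
    end: datetime
    participants: list[str]
    location: str
    brief: str
    detail: str


def group_by_character_day(events: list[Event]) -> dict[tuple[str, date], list[Event]]:
    """Stage 1: bucket chronologically ordered events by (character, day)."""
    groups: dict[tuple[str, date], list[Event]] = defaultdict(list)
    for event in sorted(events, key=lambda e: e.start):
        for who in event.participants:
            groups[(who, event.start.date())].append(event)
    return dict(groups)


def dynamic_windows(texts: list[str], budget: int) -> list[list[str]]:
    """Stage 2 (assumed greedy variant): pack summaries into windows whose
    approximate token count stays within `budget`. A single oversized text
    still gets its own window rather than being dropped."""
    windows: list[list[str]] = []
    current: list[str] = []
    used = 0
    for text in texts:
        cost = len(text.split())  # crude stand-in for a tokenizer
        if current and used + cost > budget:
            windows.append(current)
            current, used = [], 0
        current.append(text)
        used += cost
    if current:
        windows.append(current)
    return windows
```

Each window would then be passed to the LLM for a further round of abstraction, so the number of windows adapts to how eventful a given stretch of the simulation was.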

Hybrid Bottom-Up Story Generation

The Storyteller Agent orchestrates the transformation of sandbox events into a long-form narrative. The process is iterative and information-retrieval-driven, combining bottom-up event selection with top-down narrative structuring.

Figure 4: Overview of the Storyteller Agent workflow for generating a long-form story using a hybrid bottom-up approach, from sandbox events to iterative story generation.

Workflow

Event Summarization: Chronologically ordered events are summarized by character and day, then further abstracted using dynamic windowing to fit within LLM context constraints.
Story Information Generation: The system first determines the story type (genre), then iteratively generates and refines the title, background, themes, chapter titles, conflicts, and major plot points, guided by both event summaries and user-specified hyperparameters.
Iterative Story Generation: For each chapter, relevant story and sandbox information is retrieved using both keyword and embedding-based search. Chapters are generated in multiple passes, with summaries of previous chapters incorporated to maintain coherence. The process continues until the full narrative is synthesized.
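The retrieval step in the workflow above combines keyword and embedding-based search. The sketch below shows one way such a hybrid score could be formed: a weighted sum of keyword overlap and cosine similarity. The weighting parameter `alpha`, the function names, and the scoring formula are assumptions. In particular, `toy_embed` is a hashed bag-of-words placeholder used only to keep the example self-contained; a real system would call an embedding model instead.

```python
import math
import zlib


def tokenize(text: str) -> list[str]:
    """Lowercase whitespace tokens with surrounding punctuation stripped."""
    tokens = (w.strip(".,!?;:\"'").lower() for w in text.split())
    return [t for t in tokens if t]


def keyword_score(query: str, doc: str) -> float:
    """Fraction of distinct query tokens that also appear in the document."""
    q, d = set(tokenize(query)), set(tokenize(doc))
    return len(q & d) / len(q) if q else 0.0


def toy_embed(text: str, dim: int = 64) -> list[float]:
    """Placeholder embedding: hashed bag-of-words (NOT a learned model)."""
    vec = [0.0] * dim
    for tok in tokenize(text):
        vec[zlib.crc32(tok.encode("utf-8")) % dim] += 1.0
    return vec


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query: str, docs: list[str], k: int = 3, alpha: float = 0.5) -> list[str]:
    """Rank docs by alpha * keyword score + (1 - alpha) * embedding similarity."""
    q_vec = toy_embed(query)
    scored = []
    for doc in docs:
        score = alpha * keyword_score(query, doc) + (1 - alpha) * cosine(q_vec, toy_embed(doc))
        scored.append((score, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]
```

Setting `alpha=1.0` reduces this to pure keyword matching and `alpha=0.0` to pure embedding similarity, which makes the contribution of each signal easy to inspect.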

This hybrid approach ensures that the emergent, agent-driven events ground the narrative, while the Storyteller Agent imposes global structure and thematic consistency.
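To make the control flow of the iterative stage concrete, here is a minimal sketch of a chapter loop in which each chapter is written in several passes and a summary of every finished chapter is fed into later prompts. The `llm` and `retrieve` callables, the prompt wording, and the interpretation of "multiple passes" as successive continuations are all assumptions made for illustration, not the paper's prompts or API.

```python
from typing import Callable


def generate_story(
    chapter_titles: list[str],
    retrieve: Callable[[str], list[str]],
    llm: Callable[[str], str],
    passes: int = 2,
) -> list[str]:
    """Generate chapters one by one, carrying summaries forward for coherence."""
    chapters: list[str] = []
    summaries: list[str] = []
    for title in chapter_titles:
        # Grounding: pull story and sandbox information relevant to this chapter.
        context = "\n".join(retrieve(title))
        parts: list[str] = []
        for _ in range(passes):
            prompt = (
                f"Chapter: {title}\n"
                f"Previous chapter summaries:\n{' '.join(summaries)}\n"
                f"Retrieved context:\n{context}\n"
                f"Text so far:\n{' '.join(parts)}\n"
                "Continue the chapter."
            )
            parts.append(llm(prompt))
        chapter = "\n\n".join(parts)
        chapters.append(chapter)
        # Keep only a short summary so later prompts stay within context limits.
        summaries.append(llm(f"Summarize this chapter briefly:\n{chapter}"))
    return chapters
```

Passing `llm` and `retrieve` as plain callables keeps the loop independent of any particular model provider or search backend, and allows it to be exercised with stubs.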

Experimental Evaluation

Datasets and Baselines

StoryBox is evaluated on a custom dataset of 20 diverse story settings, each with detailed premises, settings, and character profiles. Baselines include:

Vanilla LLMs: Direct generation with GPT-4o and DeepSeek-V3.
Structured Frameworks: Re$^3$ and DOC-V2, which use hierarchical planning and control.
Multi-Agent Simulations: IBSEN, which employs director-actor agent collaboration.

Metrics

Evaluation combines human and LLM-based pairwise comparisons across six dimensions: Plot, Creativity, Character Development, Language Use, Conflict Quality, and Overall. Additional metrics include Character Behavior Consistency (sandbox-specific) and Average Word Count.
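The summary does not state how pairwise judgments are aggregated into per-dimension results, so the following is only one simple possibility: a per-dimension win rate, computed as wins divided by comparisons in which a method took part. The function name and the `(dimension, winner, loser)` record format are assumptions.

```python
from collections import defaultdict

# The six evaluation dimensions listed in the text.
DIMENSIONS = (
    "Plot",
    "Creativity",
    "Character Development",
    "Language Use",
    "Conflict Quality",
    "Overall",
)


def win_rates(judgments: list[tuple[str, str, str]]) -> dict[str, dict[str, float]]:
    """Aggregate (dimension, winner, loser) records into per-dimension win rates."""
    wins: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    games: dict[str, dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for dimension, winner, loser in judgments:
        if dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {dimension}")
        wins[dimension][winner] += 1
        games[dimension][winner] += 1
        games[dimension][loser] += 1
    return {
        dimension: {method: wins[dimension][method] / n for method, n in played.items()}
        for dimension, played in games.items()
    }
```

For example, three made-up Plot judgments in which a hypothetical method A beats method B twice and loses once give A a Plot win rate of 2/3 and B a win rate of 1/3.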

Results

Figure 5: Comparative performance of different methods across multiple evaluation dimensions: Plot, Creativity, Character Development, Language Use, Conflict Quality, and Overall. Subfigure (a) presents rankings based on LLM-based evaluation, while subfigure (b) shows rankings from human evaluation. We also include the sandbox-specific metric Character Behavior Consistency, along with the Average Word Count for each method.

StoryBox achieves the highest scores across all major metrics in both automatic and human evaluations. Notably:

Plot and Character Development: StoryBox outperforms all baselines, with IBSEN as the closest competitor, indicating the efficacy of simulation-based approaches for modeling character dynamics.
Language Use and Conflict Quality: The integration of environmental and contextual cues enables richer, more vivid narratives and structured tension.
Character Behavior Consistency: StoryBox slightly surpasses IBSEN, indicating marginally closer alignment between agent actions and persona definitions.
Average Word Count: StoryBox produces stories averaging ~12,000 words, longer than those of all baselines.

Simulation Duration and Ablation Studies

Figure 6: Effect of simulation duration on story generation performance.

Increasing the simulation duration (the number of in-game days) improves Character Development and Conflict Quality up to about 7 days, after which returns diminish and computational cost escalates. Plot, Creativity, and Language Use are less sensitive to simulation length.

Figure 7: Performance comparison of StoryBox without different components.

Ablation studies reveal that:

Removing object descriptions degrades Language Use.
Disabling abnormal behaviors sharply reduces Creativity, Character Development, and Conflict Quality.
Omitting the dynamic context window impairs Plot coherence.

Each component thus contributes to a distinct aspect of story quality, and removing any one of them degrades overall system performance.

Implementation Considerations

Simulation Efficiency: The current sequential agent update scheme is a bottleneck; parallelization is non-trivial due to inter-agent dependencies.
Resource Requirements: A 6-agent, 7-day simulation with GPT-4o mini and local embedding models (on a single NVIDIA RTX 3090) requires ~4 hours end-to-end.
Scalability: The tree-based environment model supports virtually unbounded world expansion, but event summarization and retrieval must be carefully managed to avoid context overflow.
Evaluation: Human evaluation remains costly and subjective; improved automatic metrics for narrative quality are needed.
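The inter-agent dependency behind the sequential-update bottleneck can be illustrated with a toy example. In the sketch below, agents are updated one after another within a tick, so a later agent observes the effects of earlier agents' actions; changing the update order changes the outcome, which is why naive parallelization is not a drop-in replacement. The `step` function, the `demo_policy`, and the agent and location names are hypothetical, and only the three action types (move, chat, none) come from the text.

```python
from typing import Callable, Optional

Action = tuple[str, Optional[str]]  # (kind, target), kind in {"move", "chat", "none"}


def step(
    order: list[str],
    positions: dict[str, str],
    policy: Callable[[str, dict[str, str]], Action],
) -> list[tuple[str, str, Optional[str]]]:
    """Run one tick, updating agents sequentially in the given order."""
    log = []
    for name in order:
        kind, target = policy(name, positions)
        if kind == "move" and target is not None:
            positions[name] = target  # visible to every agent updated after this one
        log.append((name, kind, target))
    return log


def demo_policy(name: str, positions: dict[str, str]) -> Action:
    """Toy policy: Ava walks to the tavern; others chat with whoever is nearby."""
    if name == "Ava":
        return ("move", "tavern")
    here = positions[name]
    nearby = [n for n, loc in positions.items() if n != name and loc == here]
    return ("chat", nearby[0]) if nearby else ("none", None)
```

With Ava starting at the harbor and Ben at the tavern, the order Ava then Ben yields a move followed by a chat, whereas the order Ben then Ava yields no chat at all, because Ben acts before Ava has arrived.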

Implications and Future Directions

StoryBox demonstrates that hybrid bottom-up story generation, grounded in multi-agent simulation, can produce long-form narratives with superior coherence, character consistency, and narrative depth compared to both vanilla LLMs and structured planning frameworks. The approach is extensible to interactive storytelling, game narrative generation, and simulation-based social science research.

Theoretically, the work suggests that emergent event-driven modeling, when combined with LLM-based narrative synthesis, can overcome the context and planning limitations of current LLMs for extended text generation. Practically, the modularity of the system allows for integration with more advanced agent architectures, richer world models, and user-in-the-loop customization.

Future research should address simulation parallelization, more robust event abstraction, and the development of scalable, reference-free evaluation metrics that better align with human narrative preferences.

Conclusion

StoryBox establishes a new paradigm for long-form story generation by integrating collaborative multi-agent simulation with hybrid bottom-up narrative synthesis. Empirical results substantiate its superiority over existing methods in both quantitative and qualitative dimensions. The framework's extensibility and demonstrated performance position it as a strong foundation for future advances in AI-driven narrative generation and simulation-based creative systems.