
AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation (2506.10540v1)

Published 12 Jun 2025 in cs.MA and cs.CV

Abstract: Despite rapid advancements in video generation models, generating coherent storytelling videos that span multiple scenes and characters remains challenging. Current methods often rigidly convert pre-generated keyframes into fixed-length clips, resulting in disjointed narratives and pacing issues. Furthermore, the inherent instability of video generation models means that even a single low-quality clip can significantly degrade the entire output animation's logical coherence and visual continuity. To overcome these obstacles, we introduce AniMaker, a multi-agent framework enabling efficient multi-candidate clip generation and storytelling-aware clip selection, thus creating globally consistent and story-coherent animation solely from text input. The framework is structured around specialized agents, including the Director Agent for storyboard generation, the Photography Agent for video clip generation, the Reviewer Agent for evaluation, and the Post-Production Agent for editing and voiceover. Central to AniMaker's approach are two key technical components: MCTS-Gen in Photography Agent, an efficient Monte Carlo Tree Search (MCTS)-inspired strategy that intelligently navigates the candidate space to generate high-potential clips while optimizing resource usage; and AniEval in Reviewer Agent, the first framework specifically designed for multi-shot animation evaluation, which assesses critical aspects such as story-level consistency, action completion, and animation-specific features by considering each clip in the context of its preceding and succeeding clips. Experiments demonstrate that AniMaker achieves superior quality as measured by popular metrics including VBench and our proposed AniEval framework, while significantly improving the efficiency of multi-candidate generation, pushing AI-generated storytelling animation closer to production standards.

Summary

  • The paper introduces AniMaker, a framework that uses multiple agents and MCTS-driven clip generation to transform text into coherent animated narratives.
  • The paper details a novel MCTS-Gen strategy that efficiently balances exploration and exploitation to generate high-quality video clips.
  • The paper demonstrates improved narrative and visual coherence through the AniEval evaluation framework and superior performance on standard metrics.

AniMaker: Automated Multi-Agent Animated Storytelling with MCTS-Driven Clip Generation

Introduction

The "AniMaker" framework presents a sophisticated approach to generating storytelling animations directly from textual inputs. It is designed to address the challenges inherent in creating coherent, long-form animated videos that involve multiple scenes and characters. Traditional methods have struggled to maintain narrative coherence and visual continuity because they often rely on rigid keyframe-to-clip conversion, producing disjointed narratives. AniMaker proposes a novel solution: a multi-agent system that automates high-quality animated storytelling end to end.

Architecture Overview

AniMaker’s architecture is composed of several specialized agents: the Director Agent, Photography Agent, Reviewer Agent, and Post-Production Agent. These agents collaborate to transform text into fully realized animations, each fulfilling specific roles in the animation production pipeline.

  • Director Agent: Responsible for converting text prompts into detailed scripts and storyboards, incorporating character and background management.
  • Photography Agent: Utilizes the MCTS-Gen strategy to efficiently explore and generate video clip candidates, balancing exploration and resource usage.
  • Reviewer Agent: Evaluates candidate clips using the AniEval system to ensure story and visual consistency.
  • Post-Production Agent: Compiles selected clips into the final animation, adding voiceovers and synchronizing audio.

    Figure 1: The overall architecture of our AniMaker framework. Given a story input, Director Agent creates detailed scripts and storyboards with reference images. Photography Agent generates candidate video clips using MCTS-Gen, which optimizes exploration-exploitation balance. Reviewer Agent evaluates clips with our AniEval assessment system. Post-production Agent assembles selected clips, adds voiceovers, and synchronizes audio with subtitles. This multi-agent system enables fully automated, high-quality animated storytelling.
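The four-stage pipeline described above can be sketched as a simple orchestration loop. This is a hypothetical illustration only: the paper does not specify agent interfaces, so the `Shot`/`Clip` types, the stub `director`, and the pluggable `generate`/`score` callables below are all assumptions made for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List

# Illustrative data types; the paper's actual interfaces are not specified.
@dataclass
class Shot:
    description: str          # storyboard text for one shot
    reference_image: str = "" # path to a reference frame, if any

@dataclass
class Clip:
    shot: Shot
    video_path: str
    score: float = 0.0

def director(story: str) -> List[Shot]:
    """Stub Director Agent: split the story into per-shot storyboard entries."""
    return [Shot(description=s.strip()) for s in story.split(".") if s.strip()]

def photography(shot: Shot, generate: Callable[[Shot], Clip], n: int) -> List[Clip]:
    """Stub Photography Agent: n candidates per shot (MCTS-Gen would prune this)."""
    return [generate(shot) for _ in range(n)]

def reviewer(candidates: List[Clip], score: Callable[[Clip], float]) -> Clip:
    """Stub Reviewer Agent: keep the highest-scoring candidate."""
    for c in candidates:
        c.score = score(c)
    return max(candidates, key=lambda c: c.score)

def post_production(clips: List[Clip]) -> List[str]:
    """Stub Post-Production Agent: assemble clips in story order (audio omitted)."""
    return [c.video_path for c in clips]

def animaker(story: str, generate, score, n_candidates: int = 3) -> List[str]:
    shots = director(story)
    selected = [reviewer(photography(s, generate, n_candidates), score)
                for s in shots]
    return post_production(selected)
```

The key design point the paper emphasizes is the middle of this loop: candidate generation and selection are not fixed-N, but driven by MCTS-Gen and AniEval, discussed below.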

MCTS-Driven Clip Generation

The core of the AniMaker framework is the MCTS-Gen strategy employed by the Photography Agent. This Monte Carlo Tree Search-inspired method effectively navigates the vast candidate space of video generation by focusing on promising paths while significantly reducing computational overhead. MCTS-Gen optimizes both the exploration of diverse clips and the evaluation of high-potential sequences, ensuring coherent video outputs.

Figure 2: Illustration of our MCTS-Gen strategy for efficient Best-of-N Sampling.
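The exploration-exploitation trade-off at the heart of MCTS-Gen can be illustrated with a much-simplified sketch. Note the hedging: the paper's actual method searches a tree of clip sequences; the version below flattens that into a UCB1 bandit that allocates a fixed generation budget across shots, so that shots with low or noisy scores receive more candidates. The `ucb1` constant, the per-shot statistics, and the flat (non-tree) structure are all simplifying assumptions.

```python
import math

def ucb1(mean: float, n: int, total: int, c: float = 1.4) -> float:
    """Standard UCB1 value: exploit high means, explore rarely-sampled arms."""
    if n == 0:
        return float("inf")  # force at least one sample per arm
    return mean + c * math.sqrt(math.log(total) / n)

def mcts_gen(shots, generate, score, budget: int):
    """Spend a fixed generation budget across shots, bandit-style.

    Instead of a fixed N candidates per shot, each generation step goes to
    the shot with the highest UCB value. This is a flat stand-in for the
    paper's tree search over clip sequences.
    """
    stats = {i: {"best": None, "best_score": float("-inf"), "sum": 0.0, "n": 0}
             for i in range(len(shots))}
    for t in range(1, budget + 1):
        # Selection: pick the shot whose UCB value is highest.
        i = max(stats, key=lambda k: ucb1(
            stats[k]["sum"] / stats[k]["n"] if stats[k]["n"] else 0.0,
            stats[k]["n"], t))
        # Expansion + evaluation: generate one candidate and score it.
        clip = generate(shots[i])
        s = score(clip)
        # Backpropagation: update running statistics for this shot.
        st = stats[i]
        st["sum"] += s
        st["n"] += 1
        if s > st["best_score"]:
            st["best"], st["best_score"] = clip, s
    return [stats[i]["best"] for i in range(len(shots))]
```

In the real framework the Reviewer Agent's AniEval scores play the role of `score`, and candidate clips condition on their predecessors, which is what makes the tree structure (rather than this flat bandit) necessary.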

AniEval Framework

AniEval advances beyond existing evaluation metrics by specifically targeting multi-shot storytelling animation. It provides a context-aware assessment that includes various dimensions such as story consistency, action completion, and animation-specific features. AniEval evaluates each clip in the context of its neighbors, optimizing the selection process for creating a coherent narrative flow.
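The neighbor-aware scoring idea can be sketched as follows. This is an illustrative reconstruction, not the paper's formula: the 0.6/0.4 weighting and the `clip_quality`/`transition` decomposition are hypothetical, standing in for AniEval's actual dimensions (story-level consistency, action completion, animation-specific features).

```python
def anieval_score(clips, clip_quality, transition):
    """Context-aware scoring: mix each clip's own quality with how well it
    connects to its preceding and succeeding clips (illustrative weights)."""
    scores = []
    for i, clip in enumerate(clips):
        s = clip_quality(clip)  # e.g. action completion, animation features
        ctx = []
        if i > 0:
            ctx.append(transition(clips[i - 1], clip))  # fit with predecessor
        if i < len(clips) - 1:
            ctx.append(transition(clip, clips[i + 1]))  # fit with successor
        if ctx:
            # Hypothetical blend of intrinsic quality and contextual fit.
            s = 0.6 * s + 0.4 * sum(ctx) / len(ctx)
        scores.append(s)
    return scores
```

The important contrast with clip-level metrics such as VBench is the `ctx` term: a clip that looks good in isolation can still score poorly if it breaks continuity with its neighbors.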

Experimental Results

AniMaker demonstrates superior performance across several standard and novel evaluation metrics, including VBench and the newly proposed AniEval framework, achieving consistent top-tier results in visual and narrative coherence relative to existing models specialized in visual narration and video generation.

Scene Image and Video Generation

In scene image generation, AniMaker outperforms other models with a notable improvement in text-to-image similarity metrics. For video generation, AniMaker achieves superior scores in metrics of visual appeal and character consistency, indicating its effectiveness in maintaining narrative fidelity.

Conclusion

AniMaker provides a robust framework for fully automated storytelling animation, from text input to final video production. By integrating advanced MCTS-based generation and a comprehensive evaluation system, it significantly enhances narrative and visual coherence in animated storytelling. The framework's modular design allows for future advancements as new models become available, thus steadily narrowing the gap between AI-generated content and production-quality animation.
