
AniEval: Automated Animation Evaluation

Updated 21 December 2025
  • AniEval is a multi-shot–aware evaluation framework that assesses the quality of generated animations in story-driven video synthesis.
  • It integrates domain-specific metrics spanning four quality domains (overall video quality, text-video alignment, video consistency, and motion quality) to ensure both narrative coherence and visual integrity.
  • By coupling with a Monte Carlo Tree Search–driven generation kernel, AniEval prunes low-quality animation paths to maintain global storytelling consistency.

AniEval is a fully automated, multi-shot–aware evaluation framework purpose-built for assessing the quality of generated animation in the context of story-driven video synthesis. Originating as the core of the Reviewer Agent within the AniMaker framework, AniEval is designed to surpass conventional single-shot video metrics by providing fine-grained, context-sensitive scoring of animation clips both in isolation and within temporal context. Its integration with a Monte Carlo Tree Search–driven generation kernel enforces global narrative consistency and visual coherence across entire animated sequences (Shi et al., 12 Jun 2025).

1. Architecture and Role within AniMaker

AniEval is implemented as the core evaluation engine of the Reviewer Agent in AniMaker’s multi-agent pipeline. Its central task is to assign composite quality scores to candidate video clips generated by the Photography Agent using MCTS-Gen, a Monte Carlo Tree Search–inspired candidate exploration algorithm. The returned AniEval scores directly inform MCTS-Gen’s selection and back-propagation stages, pruning poor-quality branches early in the search and promoting globally consistent storytelling. This tight coupling between evaluation and generation sets AniEval apart from standard ex post metrics by enabling real-time, context-sensitive feedback during the animation synthesis loop.
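
This coupling can be made concrete with a short sketch. The following is a minimal, hypothetical illustration of how per-clip AniEval scores could drive selection and back-propagation in an MCTS-style candidate search; `Node`, `ucb`, and `ani_eval_score` are illustrative names, not the published AniMaker/MCTS-Gen API.

```python
# Hypothetical sketch: an evaluator feeding an MCTS-style search loop.
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    clip_id: str
    score_sum: float = 0.0      # accumulated evaluator scores from rollouts
    visits: int = 0
    children: list = field(default_factory=list)

def ucb(node: Node, parent_visits: int, c: float = 1.4) -> float:
    """Upper-confidence bound: balances mean score against exploration."""
    if node.visits == 0:
        return float("inf")
    return node.score_sum / node.visits + c * math.sqrt(
        math.log(parent_visits) / node.visits
    )

def select_and_backpropagate(root: Node, ani_eval_score) -> None:
    # Selection: walk down the highest-UCB branch.
    path, node = [root], root
    while node.children:
        node = max(node.children, key=lambda n: ucb(n, node.visits))
        path.append(node)
    # Evaluation: score the leaf clip with the (context-sensitive) evaluator.
    score = ani_eval_score(node.clip_id)
    # Back-propagation: low scores depress a branch's mean, so it stops
    # being selected -- the pruning behavior described above.
    for n in path:
        n.visits += 1
        n.score_sum += score
```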

2. Pipeline Structure and Quality Domain Metrics

AniEval’s evaluation pipeline ingests, for a target animation clip $v_k$, a set of frames comprising: the last few frames of the previous clip $v_{k-1}$ (when available), all frames of $v_k$, and the first few frames of the subsequent clip $v_{k+1}$. Evaluation proceeds through four parallel domain-specific metric groups, followed by aggregation. The modules are:

  • Preprocessing: Key frame extraction (first, middle, and last frames), optical flow calculation, face and object detection, and action recognition (see the sketch after this list).
  • Metric Domains:
    • Overall Video Quality (OVQ): Measures aesthetic (VQA_A), technical (VQA_T), and per-frame quality (MusIQ).
    • Text-Video Alignment (TVA): Assesses semantic congruence using CLIP-based text-video consistency, BLIP-BLEU text-story consistency, object detection score, and object count score.
    • Video Consistency (VC): Enforces perceptual and identity continuity using DreamSim (LPIPS-style frame similarity), face consistency (via an anime-face–trained InceptionNext), warping error (framewise optical-flow $\ell_1$ penalty), and semantic consistency (CLIP embedding similarity).
    • Motion Quality (MQ): Quantifies action realism and completeness using automated action recognition (classifier agreement with prompts), action strength (mean optical flow), and Motion AC-Score (correlation between prompted motion amplitude and measured flow).
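
As referenced above, here is a minimal sketch of the preprocessing stage, assuming clips arrive as lists of HxWx3 uint8 NumPy frames and using OpenCV's Farneback estimator as a stand-in for whatever optical-flow model AniEval actually employs; all function names are illustrative.

```python
# A minimal preprocessing sketch: key frames, dense flow, action strength.
import cv2
import numpy as np

def key_frames(frames):
    """First, middle, and last frame of a clip."""
    return frames[0], frames[len(frames) // 2], frames[-1]

def dense_flow(f0, f1):
    """Farneback dense optical flow between two consecutive frames."""
    g0 = cv2.cvtColor(f0, cv2.COLOR_BGR2GRAY)
    g1 = cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY)
    return cv2.calcOpticalFlowFarneback(g0, g1, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)

def action_strength(frames):
    """Mean flow magnitude over the clip -- the MQ 'action strength' signal."""
    mags = []
    for f0, f1 in zip(frames[:-1], frames[1:]):
        flow = dense_flow(f0, f1)                      # shape (H, W, 2)
        mags.append(np.linalg.norm(flow, axis=-1).mean())
    return float(np.mean(mags))
```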

Each module output within a domain is min–max normalized and averaged to yield the domain score. The total AniEval score for a clip is given by an unweighted average of the four domain scores:

$$\text{TotalScore}(v_k) = \tfrac{1}{4}\,\text{OVQ} + \tfrac{1}{4}\,\text{TVA} + \tfrac{1}{4}\,\text{VC} + \tfrac{1}{4}\,\text{MQ}$$
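
A brief sketch of this aggregation, assuming the min–max bounds for each sub-metric come from a calibration set (an assumption; the normalization statistics themselves are not specified here):

```python
# Sketch: min-max normalize sub-metrics per domain, average to a domain
# score, then take the unweighted mean of the four domain scores.
import numpy as np

def minmax(x, lo, hi):
    return float(np.clip((x - lo) / (hi - lo + 1e-8), 0.0, 1.0))

def total_score(domains: dict[str, dict[str, tuple[float, float, float]]]) -> float:
    """domains maps domain name -> {submetric: (raw_value, lo, hi)}."""
    domain_scores = []
    for submetrics in domains.values():
        vals = [minmax(raw, lo, hi) for raw, lo, hi in submetrics.values()]
        domain_scores.append(np.mean(vals))
    # Unweighted average: 1/4 OVQ + 1/4 TVA + 1/4 VC + 1/4 MQ.
    return float(np.mean(domain_scores))
```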

AniEval Quality Domains and Sub-Metrics

| Domain | Sub-Metrics | Evaluated Aspect |
|--------|-------------|------------------|
| Overall Video Quality | VQA_A, VQA_T, MusIQ | Aesthetic & technical frame quality |
| Text-Video Alignment | CLIP Consistency, BLIP-BLEU, Detection-Score, Count-Score | Semantic and object consistency |
| Video Consistency | DreamSim, Face Consistency, Warping Error, Semantic Consistency | Temporal and perceptual coherence |
| Motion Quality | Action Recognition, Action Strength, Motion AC-Score | Fidelity and completeness of actions |

3. Formal Criteria and Metric Formulations

AniEval employs explicit, referenceable criteria across modules. Select formalizations include:

  • DreamSim: Perceptual frame distance via a learned LPIPS-style embedding network $\phi$:

$$\operatorname{DreamSim}(f_i, f_j) = \|\phi(f_i) - \phi(f_j)\|_2$$

  • Warping Error: Temporal pixel stability using optical flow $W$: warp $f_i$ onto $f_{i+1}$ and compute the mean $\ell_1$ pixel error.
  • CLIP Consistency: Text-video semantic alignment:

$$\operatorname{CLIP}(\text{framestack}, \text{prompt}) = \cos\bigl(\text{emb}_{\text{video}}, \text{emb}_{\text{text}}\bigr)$$

  • Face Consistency: Temporal identity preservation:

$$\text{FaceConsistency} = 1 - \underset{i < j}{\text{mean}}\; d_{\cos}(e_i, e_j)$$

where $d_{\cos}$ is the cosine distance between face embeddings $\{e_i\}$.

No additional optimization or learning is performed at evaluation; all modules function zero-shot given pretrained/fine-tuned sub-models.
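
The formulations above can be expressed compactly in code. This sketch assumes the embedding functions (`phi` for DreamSim, the CLIP embeddings, and the face embedder) are the pretrained sub-models, supplied by the caller; all function names are placeholders rather than AniEval's actual interfaces.

```python
# Sketches of the formal criteria: DreamSim, warping error, CLIP
# consistency, and face consistency, over NumPy arrays.
import numpy as np

def dreamsim(phi, f_i, f_j):
    """||phi(f_i) - phi(f_j)||_2 on perceptual embeddings."""
    return float(np.linalg.norm(phi(f_i) - phi(f_j)))

def warping_error(f_i_warped, f_j):
    """Mean l1 pixel error after warping f_i onto f_j via optical flow."""
    return float(np.abs(f_i_warped.astype(np.float32)
                        - f_j.astype(np.float32)).mean())

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def clip_consistency(emb_video, emb_text):
    """cos(emb_video, emb_text) for text-video alignment."""
    return cosine(emb_video, emb_text)

def face_consistency(face_embeddings):
    """1 - mean pairwise cosine distance across face embeddings {e_i}."""
    if len(face_embeddings) < 2:
        return 1.0  # a single face is trivially self-consistent
    dists = [1.0 - cosine(e_i, e_j)
             for idx, e_i in enumerate(face_embeddings)
             for e_j in face_embeddings[idx + 1:]]
    return 1.0 - float(np.mean(dists))
```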

4. Contextual and Temporal Modeling

AniEval is distinguished by its multi-shot temporal design. Unlike single-shot video metrics (e.g., VBench), AniEval explicitly incorporates preceding and succeeding clips into each evaluation:

  • Boundary Consistency: For VC domain metrics (DreamSim, Semantic Consistency), additional comparison pairs are formed between the final frame of $v_{k-1}$ and the first frame of $v_k$, and between the final frame of $v_k$ and the first frame of $v_{k+1}$ (see the sketch after this list).
  • Penalizing Discontinuity: Abrupt cuts, identity jitter, character teleportation, or object inconsistencies across scene boundaries are reflected in per-domain and aggregate scores.
  • Uniform Treatment: Internally, both within-clip and cross-clip frame pairs are processed identically and then pooled into the VC domain.
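
A minimal sketch of the pair construction just described, treating clips as frame lists; the function name is illustrative.

```python
# Sketch: within-clip pairs and cross-clip boundary pairs are built the
# same way and pooled into the VC domain for uniform scoring.
def vc_frame_pairs(clip, prev_clip=None, next_clip=None):
    # Within-clip consecutive pairs.
    pairs = list(zip(clip[:-1], clip[1:]))
    # Boundary pairs: last frame of v_{k-1} vs. first frame of v_k,
    # and last frame of v_k vs. first frame of v_{k+1}.
    if prev_clip is not None:
        pairs.append((prev_clip[-1], clip[0]))
    if next_clip is not None:
        pairs.append((clip[-1], next_clip[0]))
    return pairs  # each pair scored by DreamSim / semantic consistency
```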

This contextual awareness enables AniEval to measure properties critical to coherent animated storytelling, such as narrative flow and global character integrity.

5. Novel Sub-Models and Domain-Specific Adaptations

AniEval integrates several models and components designed or adapted specifically for animated, stylized content:

  • Face Consistency (Anime Face Embeddings): An InceptionNext model fine-tuned on an anime-face dataset robustly detects and embeds stylized faces, overcoming failures of generic detectors such as MTCNN.
  • Action Recognition: Automated classifiers, fine-tuned on synthetic animated clips, reliably recognize actions specified by generation prompts (e.g., “pick up,” “hop,” “run”), enabling objective motion quality assessment.
  • Object Count-Score: Quantifies the relative discrepancy $|\text{detected\_objects}_i - \text{reference\_count}_i| / \text{reference\_count}_i$ to penalize spurious or missing objects, especially across scene transitions (sketched below).
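
A small sketch of the Count-Score penalty as formulated above; mapping the relative error into a bounded [0, 1] score is an assumption for illustration.

```python
# Sketch: relative object-count error, mapped to a bounded score.
def count_score(detected: int, reference: int) -> float:
    """|detected - reference| / reference, as a [0, 1] score (1.0 = exact)."""
    if reference <= 0:
        return 0.0 if detected > 0 else 1.0  # assumed edge-case handling
    rel_err = abs(detected - reference) / reference
    return max(0.0, 1.0 - rel_err)
```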

These adaptations address failure modes unique to animated content, including stylization-induced detector confusion and action ambiguity.

6. Empirical Results and Impact on Generation

Quantitative validation on the AniMaker corpus demonstrates AniEval’s efficacy:

  • Score Gains Over Baselines: AniMaker achieves an AniEval TotalScore of 76.72, outperforming VideoGen-of-Thought by 14.6 percentage points and surpassing the closest competitor’s Video Consistency by 15.5%.
  • Ablation Results: Substituting VBench for AniEval in MCTS-Gen selection reduces the TotalScore by 4.6%; disabling Best-of-N selection within MCTS-Gen reduces it by 7.1%. This experimentally establishes AniEval’s importance for efficient pruning and optimal path selection in the animation generation process.
  • Human Judgement Alignment: AniEval scores align closely with human assessments of narrative consistency, fluidity, and scene transitions.

These results underscore AniEval’s central role in producing technically robust and narratively coherent AI-generated animation.

AniEval is contrasted with general-purpose video and multimedia evaluation metrics by its multi-domain, temporally contextual, and animation-focused construction. Unlike VBench, which scores video quality per clip without contextual integration, AniEval models narrative and temporal dependencies essential to long-form animated storytelling. All scoring occurs zero-shot given fixed sub-models, ensuring full automation and scalability for large-scale, multi-candidate generations (Shi et al., 12 Jun 2025).

A plausible implication is that modular, multi-domain scoring systems such as Tau-Eval for text anonymization and AlphaEval for financial model evaluation demonstrate the general utility of this design, but lack the temporal modeling and animation-specific fine-tuning that AniEval provides. AniEval thus represents a specialized adaptation of multi-criteria evaluation, tightly integrated with animation synthesis pipelines.
