CharacterBench: Model Evaluation Suite

Updated 8 December 2025
  • CharacterBench is a comprehensive evaluation suite for character-centric models, covering dialogue systems and 4D animation.
  • It utilizes large-scale, bilingual datasets and multidimensional probing queries to measure persona consistency and rendering quality.
  • The suite offers actionable insights with specialized metrics like Pearson correlation, PSNR, and TAM to drive model optimization.

CharacterBench is a suite of rigorous benchmarks designed to measure the fidelity, consistency, and controllability of character-centric generative models. It provides large-scale, multifaceted evaluation protocols for two distinct domains: character-based dialogue systems powered by LLMs and 4D character animation frameworks. CharacterBench enables robust model comparison, diagnostic analysis, and optimization by combining high-quality annotated datasets, precise probing queries, multidimensional scoring rubrics, and specialized automatic evaluation tools.

1. Scope and Motivation

The proliferation of character-driven AI—ranging from LLM-powered role-playing bots to dynamic 4D animated avatars—necessitates standardized benchmarks that reflect authentic usage scenarios and enable fine-grained diagnosis across diverse model families. Existing evaluation tools exhibit several methodological limitations: narrow coverage of character categories, reductive testing (single-choice questions, open-ended prompts), poor sensitivity to sparsely manifesting character features, and unstable or costly reliance on proprietary API judges. CharacterBench directly responds to these gaps by establishing: (a) exhaustive coverage of character types; (b) multidimensional evaluation aligned with interpersonal-interaction and creative rendering theory; (c) robust, cost-effective automatic scoring via bespoke models (Zhou et al., 16 Dec 2024, Gao et al., 10 Aug 2025).

2. Dataset Composition and Diversity

CharacterBench for Dialogue

CharacterBench constitutes the largest bilingual generative benchmark for LLM character customization, capturing 22,859 human-annotated samples from 13,162 multi-turn dialogues representing 3,956 distinct characters. Characters span 25 detailed categories grouped into four main domains: fictional heroes, historical figures, everyday professionals, and whimsical entities (e.g., sentient animals or trees). All data are provided in both Chinese and a human-verified English translation. Character profiles are leveraged to surface sparse and dense evaluation targets, e.g., persona traits, boundary knowledge, and behavioral attributes (Zhou et al., 16 Dec 2024).
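
For concreteness, the sketch below shows one plausible record layout for a single dialogue sample; the field names and types are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class CharacterSample:
    """Hypothetical layout of one CharacterBench dialogue sample (illustrative only)."""
    character_id: str
    category: str                    # one of the 25 detailed character categories
    profile: dict                    # persona traits, boundary knowledge, behavioral attributes
    dialogue: list                   # multi-turn context, e.g. [{"role": "user", "text": "..."}]
    query_type: str                  # "target-oriented" (sparse) or "target-free" (dense)
    dimension: str                   # e.g. "memory consistency", "morality stability"
    language: str = "zh"             # "zh" or human-verified "en" translation
    human_scores: list = field(default_factory=list)  # annotator ratings for this sample
```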

CharacterBench for 4D Animation

For controllable character animation, CharacterBench is built atop Character4D, a dataset containing 13,115 rigged, fully textured avatars sourced from VRoidHub, each paired with one of 40 Mixamo motion presets (dancing, singing, jumping, etc.). Characters are rendered at 768×768 px from 21 semicircular camera views (radius 2.5 m, FoV 40°). The test split comprises held-out Character4D samples for both static and dynamic evaluation, plus challenging out-of-distribution (OOD) exemplars including anime characters, real humans, and arbitrary 3D models (Gao et al., 10 Aug 2025).
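
The semicircular rig can be visualized with a short script; evenly spaced azimuths, a fixed camera height, and cameras aimed at the origin are assumptions here, since the exact Character4D camera parameters beyond radius and field of view are not reproduced in this summary.

```python
import numpy as np

def semicircle_cameras(n_views: int = 21, radius: float = 2.5, height: float = 0.0):
    """Place n_views camera centers evenly along a semicircle around the subject.

    Returns azimuth angles (radians) and (n_views, 3) camera positions; every
    camera is assumed to look at the origin, where the character stands.
    """
    azimuths = np.linspace(0.0, np.pi, n_views)            # 0..180 degree sweep
    positions = np.stack(
        [radius * np.cos(azimuths),                        # x
         np.full(n_views, height),                         # y (up axis)
         radius * np.sin(azimuths)],                       # z
        axis=1,
    )
    return azimuths, positions

azimuths, cams = semicircle_cameras()
print(cams.shape)  # (21, 3) camera centers at radius 2.5 m
```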

3. Task Definitions and Evaluation Procedures

Dialogue and Role-Playing

CharacterBench evaluates LLMs on their ability to maintain rich, customized personas across eleven granular dimensions grouped into six principal aspects: memory, knowledge, persona, emotion, morality, and believability. Sparse dimensions (e.g., memory consistency, fact accuracy, attribute consistency) are probed via target-oriented queries linked directly to character profiles; dense dimensions (e.g., morality stability, believability) are elicited via target-free prompts such as toxicity injection or user-derailing strategies. Dual-annotator protocols with tie-breaking oversight are employed for robust scoring; annotation scales vary by aspect, with up to five points for subjective dimensions (Zhou et al., 16 Dec 2024).
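
As an illustration of a dual-annotation protocol with tie-breaking, the sketch below averages two annotator scores when they agree closely and otherwise defers to a third adjudicator; the agreement threshold and escalation rule are assumptions, not the paper's exact procedure.

```python
def aggregate_scores(a1: int, a2: int, adjudicate=None) -> float:
    """Combine two annotator ratings; escalate disagreements to an adjudicator."""
    if abs(a1 - a2) <= 1:              # close enough: take the mean
        return (a1 + a2) / 2
    if adjudicate is None:
        raise ValueError("large disagreement requires a tie-breaking adjudicator")
    return float(adjudicate(a1, a2))   # third annotator settles the score

print(aggregate_scores(4, 5))                               # 4.5
print(aggregate_scores(2, 5, adjudicate=lambda x, y: 4))    # 4.0
```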

4D Animation and Rendering

Three distinct tasks are defined:

  • Static Novel-View Synthesis: Generate held-out views of a character in canonical A-pose from a single reference view.
  • Multi-View Video Synthesis: Input a character image and a 2D pose-sequence; output temporally coherent videos from multiple novel cameras, evaluated over a 9×9 grid (frames × views).
  • 4D Reconstruction and Rendering: Using synthesized multi-view videos as pseudo-ground-truth, optimize a continuous 4D representation for rendering additional novel views across frames (Gao et al., 10 Aug 2025).

4. Scoring Metrics and Automatic Evaluation

Dialogue Benchmark Metrics

Key metrics include Pearson correlation with human ratings, Spearman’s rho, and Kendall’s tau for ranking consistency across model pools. For automatic judgment, CharacterJudge—a specialized 7B LLM fine-tuned on 19,609 training samples—serves as a stable, cost-efficient alternative to GPT-4, yielding average bilingual correlations of 68%, outperforming GPT-4-based judges by over 20 percentage points. CharacterJudge supports both individual response scoring and model-level rank generation (Zhou et al., 16 Dec 2024).
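
A minimal sketch of computing these agreement statistics between automatic judge scores and human ratings with SciPy; the score values below are made up.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Hypothetical per-response scores from an automatic judge and from humans.
judge_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5]
human_scores = [5.0, 3.0, 4.5, 2.0, 4.5, 3.0]

r, _ = pearsonr(judge_scores, human_scores)      # linear agreement
rho, _ = spearmanr(judge_scores, human_scores)   # rank agreement
tau, _ = kendalltau(judge_scores, human_scores)  # pairwise ranking consistency

print(f"Pearson r={r:.3f}, Spearman rho={rho:.3f}, Kendall tau={tau:.3f}")
```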

Animation Benchmark Metrics

CharacterBench combines pixel-wise, perceptual, and distributional fidelity metrics:

  • Peak Signal-to-Noise Ratio (PSNR)
  • Structural Similarity Index (SSIM)
  • LPIPS (Learned Perceptual Image Patch Similarity)
  • CLIP-Score (cosine between CLIP embeddings)
  • Fréchet Inception Distance (FID/FVD for static and dynamic sequences)
  • Chamfer Distance (3D only)

FVD is further decomposed into slices (FVD-F, FVD-V, FVD-D) for localized diagnosis; FV4D aggregates full-sequence consistency (Gao et al., 10 Aug 2025).
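
The pixel-wise metrics can be averaged over the frames × views grid as in the sketch below, using scikit-image for PSNR/SSIM; perceptual, distributional, and 3D metrics require dedicated models or geometry and are omitted, and the array shapes and random data are illustrative assumptions.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def grid_fidelity(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Average PSNR/SSIM over a (frames, views, H, W, 3) grid of renders in [0, 1]."""
    psnrs, ssims = [], []
    n_frames, n_views = pred.shape[:2]
    for f in range(n_frames):
        for v in range(n_views):
            psnrs.append(peak_signal_noise_ratio(gt[f, v], pred[f, v], data_range=1.0))
            ssims.append(structural_similarity(gt[f, v], pred[f, v],
                                               channel_axis=-1, data_range=1.0))
    return float(np.mean(psnrs)), float(np.mean(ssims))

# Example on random data for a 9x9 (frames x views) grid of 64x64 renders.
pred = np.random.rand(9, 9, 64, 64, 3)
gt = np.random.rand(9, 9, 64, 64, 3)
print(grid_fidelity(pred, gt))
```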

Think–Act Matching for Multiversal Role-Play

In multiversal role-playing settings (“Beyond One World”), the Think–Act Matching (TAM) metric quantifies alignment between internal deliberation (“thinking”) and choices (“acting”) via maximum cosine similarity of embedded reason/action spans:

$$\text{TAM} = \max_{i,j} \frac{\langle \mathbf{e}(t_i), \mathbf{e}(a_j) \rangle}{\|\mathbf{e}(t_i)\|\,\|\mathbf{e}(a_j)\|}$$

TAM is interpreted as a proxy for the trustworthiness and calibration of a model's internal/external fidelity (Ngokpol et al., 16 Oct 2025).
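
A minimal NumPy sketch of the TAM computation, assuming the thinking spans t_i and acting spans a_j have already been embedded by an external sentence encoder:

```python
import numpy as np

def tam(thought_embs: np.ndarray, action_embs: np.ndarray) -> float:
    """Think-Act Matching: max cosine similarity over all (t_i, a_j) pairs.

    thought_embs: (T, d) embeddings of thinking spans;
    action_embs:  (A, d) embeddings of acting spans.
    """
    t = thought_embs / np.linalg.norm(thought_embs, axis=1, keepdims=True)
    a = action_embs / np.linalg.norm(action_embs, axis=1, keepdims=True)
    return float((t @ a.T).max())  # maximum over the T x A similarity matrix

# Toy example with random 384-dimensional embeddings.
rng = np.random.default_rng(0)
print(tam(rng.normal(size=(3, 384)), rng.normal(size=(2, 384))))
```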

5. Experimental Findings and Model Analysis

Dialogue Systems

CharacterJudge demonstrates robust generalizability: in-domain and out-of-domain character scores are comparable, implying strong transfer. Model rankings derived from CharacterBench achieve Spearman's ρ of 73.1%, surpassing earlier benchmarks such as CharacterEval (21.4%) and SocialBench (38.1%). Fine-tuning and DPO optimization using CharacterBench data yield consistent improvements in dialogic win rates (up to 8%) (Zhou et al., 16 Dec 2024).
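
As a sketch of how benchmark-derived preference pairs could feed DPO, the snippet below implements the standard DPO objective on (chosen, rejected) response pairs; pairing responses by CharacterBench/CharacterJudge scores and the value of β are assumptions, not the authors' released recipe.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss from summed per-response log-probabilities.

    The log-probs come from the trainable policy and a frozen reference model;
    chosen/rejected pairs are assumed to be ranked by benchmark judge scores.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy tensors standing in for summed log-probs of one preference pair.
print(dpo_loss(torch.tensor([-5.0]), torch.tensor([-7.0]),
               torch.tensor([-5.5]), torch.tensor([-6.5])))
```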

4D Animation

CharacterShot, evaluated on CharacterBench, substantially outperforms baseline methods on all metrics. Its dual-attention architecture and camera-prior tokens halve FVD compared to DiT-style baselines and nearly eliminate inter-view pose drift. Neighbor-constrained Gaussian splatting reduces FV4D by 15%, mitigating limb flicker and "popping" in high-motion settings. Most failure cases for prior methods occur in extreme side/rear views and complex limb crossings, whereas CharacterShot retains silhouette integrity and texture consistency. On OOD samples, CharacterShot achieves the highest subjective robustness in a 30-participant user study (Gao et al., 10 Aug 2025).

Multiversal Consistency and Reasoning Alignment

Benchmarks probing multiverse and time-variant character fidelity (e.g., “Beyond One World”) reveal persistent cross-version generalization deficits even in SOTA LLMs (accuracy drops of 5–25 points), and a pronounced reasoning–acting gap—models often excel at internal deliberation or action consistency, but rarely both. Chain-of-Thought (CoT) prompting differentially benefits weaker models (coherence) and hinders stronger models (factual hallucination). TAM highlights this misalignment, guiding future model calibration and human-in-the-loop optimization (Ngokpol et al., 16 Oct 2025).

6. Applications and Prospective Extensions

CharacterBench serves dual purposes as a testbed for research and a resource for training/fine-tuning models toward character fidelity objectives. Applications include:

  • Automated evaluation and ranking of LLMs for character-based dialogue.
  • Diagnostic assessment of sequential and 4D character rendering pipelines.
  • Direct optimization for reward and preference modeling (e.g., DPO with auxiliary judges).

Potential future directions include: expansion to additional character types and languages; longitudinal assessment of persona retention over extended dialogues; integration of multimodal character cues; incorporation of nuanced moral and legal compliance criteria; and robustness testing against adversarial or derailing user inputs (Zhou et al., 16 Dec 2024, Gao et al., 10 Aug 2025).

7. Summary Table: Major CharacterBench Benchmarks

Domain | Data Size / Classes | Key Metrics
LLM Character Dialogue | 22,859 samples, 3,956 characters | Pearson r, Spearman's ρ, CharacterJudge, fine-grained annotation
4D Animation | 13,115 avatars, 40 motions | PSNR, SSIM, LPIPS, CLIP-S, FID, FVD, FV4D
Multiversal Roleplay | 30 heroes × 3 versions | Canonical accuracy, reasoning fidelity, TAM

CharacterBench enables reproducible, granular evaluation for both language and animation models, highlighting current limitations and guiding future advances in character control, expressiveness, and consistency.
