- The paper introduces a dual-system cognitive framework that models both reactive and deliberative behaviors for lifelike avatar rendering.
- It leverages Multimodal Large Language Models for high-level planning and a diffusion transformer for low-level synthesis, ensuring semantic consistency.
- Empirical results demonstrate improved motion naturalness and robust generalization across diverse avatars and multi-agent scenarios.
Cognitive Simulation for Lifelike Avatars: The OmniHuman-1.5 Framework
Introduction and Motivation
OmniHuman-1.5 addresses a fundamental limitation in current video avatar models: the inability to generate semantically coherent, contextually appropriate, and expressive human-like behaviors. Existing approaches predominantly synchronize avatar motion with low-level cues such as audio rhythm, resulting in outputs that lack deeper semantic understanding of intent, emotion, or context. The paper frames this gap through the lens of dual-system cognitive theory, distinguishing between fast, reactive (System 1) and slow, deliberative (System 2) processes. The proposed solution is a dual-system simulation framework that explicitly models both reactive and deliberative behaviors, leveraging Multimodal LLMs (MLLMs) for high-level reasoning and a specialized Multimodal Diffusion Transformer (MMDiT) for low-level rendering.
Figure 1: The dual-system theory motivates the integration of reactive (System 1) and deliberative (System 2) processes for avatar behavior, enabling both context-aware gestures and low-level synchronization.
Dual-System Simulation Architecture
The core architectural innovation is the explicit separation and integration of System 1 and System 2 processes. System 2 is instantiated as an agentic reasoning module powered by MLLMs, which ingests multimodal inputs (audio, image, text) and produces a structured, high-level "schedule" of actions. This schedule is then used to guide System 1, implemented as an MMDiT network, which synthesizes the final video by fusing information from dedicated text, audio, and video branches.
Figure 2: The dual-system simulation framework: System 2 (MLLM agent) generates a high-level action schedule, which guides System 1 (MMDiT) in multimodal video synthesis. Key training strategies mitigate modality conflicts.
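To make the planning-to-rendering handoff concrete, below is a minimal, illustrative schema for the System 2 "schedule" that guides System 1. The field names and types are assumptions for illustration; the paper's exact structured format is not reproduced in this summary.

```python
from dataclasses import dataclass
from typing import List, Optional

# Illustrative schema (assumed, not the paper's exact format) for the
# high-level plan that System 2 hands to the System 1 renderer.

@dataclass
class ActionSegment:
    start_s: float   # segment start time in seconds
    end_s: float     # segment end time in seconds
    expression: str  # e.g. "warm smile", "furrowed brow"
    action: str      # e.g. "open-palm greeting gesture"

@dataclass
class AvatarSchedule:
    persona: str                 # inferred from the reference image and audio
    emotion: str                 # inferred global emotional tone
    intent: str                  # inferred communicative intent
    segments: List[ActionSegment]
    notes: Optional[str] = None

# Example instance: the kind of temporally segmented plan System 2 might produce.
example = AvatarSchedule(
    persona="friendly lecturer",
    emotion="enthusiastic",
    intent="explain a concept to the viewer",
    segments=[
        ActionSegment(0.0, 2.5, "bright smile", "open-palm greeting gesture"),
        ActionSegment(2.5, 6.0, "focused", "point toward an imagined diagram"),
    ],
)
```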
Agentic Reasoning and Planning
The agentic reasoning module operates in two stages:
- Analyzer MLLM: Receives the reference image, audio, and optional text prompt, and, via chain-of-thought prompting, infers persona, emotion, intent, and context, outputting a structured semantic representation.
- Planner MLLM: Consumes the Analyzer's output and the reference image, generating a temporally segmented action plan (schedule) that specifies expressions and actions for each video segment.
A reflection mechanism enables the Planner to revise its plan based on generated frames, correcting for semantic drift and ensuring logical consistency in long-form video synthesis.
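A rough sketch of the two-stage prompt chain follows, under stated assumptions: the prompt wording and the injected `mllm` callable are illustrative stand-ins, not the paper's actual prompts or interface.

```python
from typing import Callable

# Sketch of the Analyzer -> Planner chain. `mllm` is any text-in/text-out
# multimodal model wrapper supplied by the caller; all prompts are assumed
# for illustration only.

def analyze(mllm: Callable[..., str], image, audio, prompt: str = "") -> str:
    # Stage 1 (Analyzer): chain-of-thought inference of persona, emotion,
    # intent, and context, returned as a structured text summary.
    return mllm(
        image=image, audio=audio,
        text=("Think step by step. Describe the subject's persona, emotion, "
              "intent, and context, then output a structured summary.\n" + prompt),
    )

def plan(mllm: Callable[..., str], image, analysis: str) -> str:
    # Stage 2 (Planner): convert the analysis into a temporally segmented
    # schedule, one expression/action pair per video segment.
    return mllm(
        image=image,
        text=("Given this analysis, produce a time-segmented action plan "
              "(expression and action for each segment):\n" + analysis),
    )
```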
Multimodal Diffusion Rendering (MMDiT)
The MMDiT backbone is enhanced with several critical design choices (a minimal fusion sketch follows the list):
- Dedicated Audio Branch: Instead of cross-attention injection, audio features are processed in a branch architecturally symmetric to the video and text branches, enabling deep, iterative fusion via shared multi-head self-attention.
- Pseudo Last Frame Strategy: To avoid spurious correlations and motion artifacts from reference image conditioning, the model is trained with ground-truth first/last frames and, at inference, uses the user-provided reference image as a pseudo last frame with shifted positional encoding. This guides identity preservation without constraining motion diversity.
- Two-Stage Warm-Up: To prevent any single modality from dominating and to avoid overfitting, the audio branch is first warmed up separately; the full model is then trained jointly, initializing the text/video branches from pre-trained weights and the audio branch from its warmed-up weights.
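The sketch below illustrates the symmetric-fusion idea with a single shared self-attention over concatenated text, audio, and video tokens, plus the shifted-position intuition behind the pseudo last frame. Layer sizes, the block layout, and the positional offset are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class SymmetricFusionBlock(nn.Module):
    """Illustrative fusion block: each modality keeps its own normalization,
    but all tokens are mixed in one shared self-attention, mirroring the
    'architecturally symmetric branches with deep fusion' idea."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.norm_t = nn.LayerNorm(dim)  # text branch
        self.norm_a = nn.LayerNorm(dim)  # audio branch
        self.norm_v = nn.LayerNorm(dim)  # video branch
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, text_tok, audio_tok, video_tok):
        # Concatenate all modalities and attend jointly (shared self-attention).
        tokens = torch.cat(
            [self.norm_t(text_tok), self.norm_a(audio_tok), self.norm_v(video_tok)],
            dim=1,
        )
        fused, _ = self.attn(tokens, tokens, tokens)
        # Split back into per-branch streams for the next block.
        nt, na = text_tok.shape[1], audio_tok.shape[1]
        return fused[:, :nt], fused[:, nt:nt + na], fused[:, nt + na:]

# Usage with toy token counts (1 batch, 512-dim tokens).
block = SymmetricFusionBlock()
t, a, v = torch.randn(1, 8, 512), torch.randn(1, 32, 512), torch.randn(1, 64, 512)
t2, a2, v2 = block(t, a, v)  # per-branch shapes are preserved

# Pseudo last frame (illustrative): reference-image tokens are placed at a
# temporal index shifted past the generated clip (offset value is assumed),
# guiding identity without pinning the pose of any generated frame.
T, offset = 64, 4
frame_positions = torch.arange(T)                      # generated video frames
pseudo_last_position = torch.tensor([T - 1 + offset])  # shifted reference slot
```

Keeping per-branch normalization while sharing the attention step is one simple way to realize symmetric branches that still fuse deeply at every block, in contrast to injecting audio only through cross-attention.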
Empirical Evaluation and Ablation
The framework is evaluated on challenging benchmarks, including custom single- and multi-subject test sets with diverse human, AIGC, anime, and animal avatars, as well as standard datasets (CelebV-HQ, CyberHost). Metrics include FID, FVD, IQA, ASE, Sync-C, HKC, and HKV, complemented by comprehensive user studies.
Key findings:
- Agentic Reasoning: Ablation of the reasoning module leads to a marked increase in motion unnaturalness and a significant drop in user preference, despite minor changes in low-level metrics. The full model achieves a 20% reduction in perceived motion unnaturalness and a +0.29 GSB user preference score.
- Conditioning Architecture: The proposed symmetric fusion and pseudo last frame strategy yield superior motion dynamics (HKV), lip-sync, and visual quality, outperforming state-of-the-art baselines in both objective and subjective evaluations.
- Generalization: The model demonstrates robust performance on non-human avatars and multi-person scenes, with coordinated, context-aware behaviors and accurate conversational turn-taking.
Figure 3: The model generalizes to non-human avatars and multi-person scenes, maintaining contextually appropriate and coordinated behaviors.
Figure 4: User studies show strong preference for the proposed method over academic and proprietary baselines in both best-choice and pairwise GSB evaluations.
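For readers unfamiliar with GSB reporting, the helper below computes a pairwise preference score under the common Good/Same/Bad convention, (good - bad) / total votes; the paper's exact aggregation may differ, so treat this only as a way to interpret figures like "+0.29 GSB".

```python
# Assumed GSB convention: score = (good - bad) / (good + same + bad).

def gsb_score(good: int, same: int, bad: int) -> float:
    total = good + same + bad
    if total == 0:
        raise ValueError("no votes recorded")
    return (good - bad) / total

# Hypothetical tally: 52 "ours better", 25 "same", 23 "baseline better" -> +0.29
print(round(gsb_score(52, 25, 23), 2))
```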
Qualitative Analysis and Reflection
The reflection process is shown to correct logical inconsistencies in action planning, such as object disappearance or illogical transitions, by revising the schedule based on generated frames.
Figure 5: The reflection process enables the model to revise illogical plans, ensuring object and action consistency across video segments.
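A compressed sketch of how such a reflection loop might be wired is given below, assuming hypothetical `render_segment` and `revise_plan` callables; it illustrates the described mechanism rather than reproducing the authors' code.

```python
# Reflection loop sketch: after each segment is rendered, the Planner reviews
# the generated frames and may revise the remaining schedule (e.g. restoring a
# prop that vanished mid-scene). Both callables are illustrative stand-ins.

def generate_with_reflection(schedule, render_segment, revise_plan, max_rounds=1):
    clips = []
    for i, segment in enumerate(schedule):
        clip = render_segment(segment, context=clips)
        clips.append(clip)
        for _ in range(max_rounds):
            revised_tail = revise_plan(schedule[i + 1:], recent_frames=clip)
            if revised_tail == schedule[i + 1:]:
                break  # remaining plan is already consistent with the frames
            schedule = schedule[:i + 1] + revised_tail
    return clips
```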
Qualitative comparisons with strong baselines (e.g., OmniHuman-1) reveal that the proposed model produces actions with higher semantic alignment to speech and context, such as correctly depicting described activities and object interactions.
Figure 6: The model generates semantically consistent actions (e.g., makeup application, glowing crystal ball) in alignment with speech content, outperforming prior baselines.
Implications and Future Directions
The explicit modeling of dual-system cognition in avatar generation represents a significant step toward avatars that exhibit both reactive and deliberative behaviors. The integration of MLLM-based planning with diffusion-based rendering enables avatars to act in ways that are not only physically plausible but also contextually and semantically coherent. This paradigm is extensible to multi-agent, non-human, and interactive scenarios, and provides a foundation for future research in agentic video generation, controllable animation, and embodied AI.
Potential future directions include:
- Scaling agentic reasoning to more complex, open-ended tasks and environments.
- Integrating real-time user feedback for interactive avatar control.
- Extending the framework to multi-agent social simulations and collaborative behaviors.
- Investigating the trade-offs between explicit reasoning latents and textual guidance for fine-grained control.
Conclusion
OmniHuman-1.5 introduces a dual-system cognitive simulation framework for video avatar generation, combining MLLM-driven deliberative planning with a multimodal diffusion transformer for reactive rendering. The approach achieves state-of-the-art performance in both objective and subjective evaluations, with strong generalization to diverse avatars and scenarios. The explicit separation and integration of high-level reasoning and low-level synthesis set a new direction for lifelike, semantically coherent digital humans and open new avenues for research in cognitive simulation and agentic generative models.