Wan-Animate: Unified Character Animation and Replacement with Holistic Replication (2509.14055v1)

Published 17 Sep 2025 in cs.CV

Abstract: We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene's lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character's appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.

Summary

  • The paper presents a unified framework that integrates static image animation and character replacement through holistic replication.
  • It employs novel control signal decoupling for body motion and facial expressions, achieving high fidelity and temporal coherence.
  • Quantitative and human evaluations demonstrate superior identity consistency and environment adaptation compared to existing SOTA methods.

Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

Introduction and Motivation

Wan-Animate presents a unified framework for character animation and replacement, addressing the lack of holistic solutions in open-source character video synthesis. The method is designed to animate a static character image by replicating the motion and expressions from a reference video, or to replace the character in the reference video with the source identity, ensuring seamless environmental integration. The approach leverages the Wan-I2V video foundation model, introducing a modified input paradigm and novel control signal injection strategies to achieve high-fidelity, temporally coherent, and environment-consistent character video generation.

Figure 1: Wan-Animate supports both animation and replacement functionalities, reenacting motion/expression or substituting the character in the reference video with seamless environmental integration.

Model Architecture and Input Formulation

Wan-Animate is built upon the Wan-I2V architecture, with significant modifications to the input formulation to support dual-mode generation (animation and replacement) and holistic control. The input consists of noise latents, conditional latents, and binary masks, unified under a symbolic representation that accommodates reference image injection, temporal frame guidance, and environment information.

Figure 2: Overview of Wan-Animate architecture, showing unified input formulation and dual-mode compatibility, with skeleton and implicit face features for control and Relighting LoRA for environment adaptation.
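
To make the unified input formulation concrete, here is a minimal, hypothetical sketch of how the latents and mask could be assembled before patchification; the tensor shapes, the temporal prepend of the reference latent, and the channel-wise concatenation order are assumptions for illustration, not the released implementation.

```python
import torch

def build_unified_input(noise_latents, cond_latents, mask, ref_latent):
    """Hypothetical assembly of the unified DiT input.

    noise_latents: [B, C, T, H, W]  frames/regions to be generated
    cond_latents:  [B, C, T, H, W]  preserved content (e.g. the reference-video
                                    background in Replacement Mode; zeros elsewhere)
    mask:          [B, 1, T, H, W]  1 = preserve, 0 = generate
    ref_latent:    [B, C, 1, H, W]  VAE-encoded character reference image
    """
    # Prepend the reference latent along the temporal axis so the character's
    # appearance is available as a persistent condition.
    cond = torch.cat([ref_latent, cond_latents], dim=2)
    noise = torch.cat([torch.zeros_like(ref_latent), noise_latents], dim=2)
    full_mask = torch.cat([torch.ones_like(ref_latent[:, :1]), mask], dim=2)

    # Channel-wise concatenation of noise, condition, and mask yields the
    # tensor that is patchified into DiT tokens.
    return torch.cat([noise, cond, full_mask], dim=1)
```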

Control Signal Decoupling

  • Body Motion Control: Utilizes spatially-aligned 2D skeleton signals extracted via VitPose, compressed and patchified to align with video latents. Skeleton signals are injected into the noise latents, enabling robust motion replication across diverse character types.
  • Facial Expression Control: Employs implicit features extracted from cropped face images, encoded into 1D latents and temporally aligned with the DiT latents. These are injected via temporally-aligned cross-attention in dedicated Face Blocks within the Transformer, facilitating fine-grained expression reenactment (a sketch of both injection paths follows below).

    Figure 3: Face images are encoded into frame-wise implicit latents, temporally aligned and injected via cross-attention for precise expression control.
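
The two injection paths can be illustrated with a short, hypothetical PyTorch sketch: skeleton latents are added to the noise tokens, while implicit face latents enter through cross-attention in a dedicated Face Block. Module names, dimensions, and the residual wiring are assumptions, not the paper's exact code.

```python
import torch
import torch.nn as nn

class FaceBlock(nn.Module):
    """Sketch of a Face Block: video tokens cross-attend to temporally
    aligned implicit face latents (illustrative, not the released code)."""

    def __init__(self, dim: int, face_dim: int, heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, kdim=face_dim,
                                          vdim=face_dim, batch_first=True)

    def forward(self, video_tokens, face_latents):
        # video_tokens: [B, N, dim]; face_latents: [B, T_face, face_dim]
        attended, _ = self.attn(self.norm(video_tokens), face_latents, face_latents)
        return video_tokens + attended  # residual injection of expression info


def inject_body_control(noise_tokens, pose_latents, body_adapter: nn.Module):
    """Skeleton latents are patchified to the same token grid as the video
    latents and added to the noise tokens (additive injection)."""
    return noise_tokens + body_adapter(pose_latents)
```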

Environment Integration: Relighting LoRA

For character replacement, an auxiliary Relighting LoRA module is trained to harmonize the character's lighting and color tone with the new environment. Training data is synthesized using IC-Light, which imposes consistent light transport and color adaptation.

Figure 4: Data augmentation examples using IC-Light for training Relighting LoRA, enabling environment-consistent lighting adaptation.
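
As a rough illustration of the adapter mechanism, the sketch below shows a generic low-rank update attached to a frozen linear projection, which is how a LoRA such as the Relighting LoRA could wrap attention layers; the rank, scaling, and placement are assumed values, not the paper's configuration.

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """Generic LoRA wrapper: frozen base projection plus a trainable
    low-rank update (rank/alpha are illustrative defaults)."""

    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weights frozen
        self.down = nn.Linear(base.in_features, rank, bias=False)
        self.up = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.up.weight)       # the adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))
```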

Training Pipeline and Inference Strategies

Wan-Animate employs a progressive training pipeline, summarized in the sketch following this list:

  1. Body Control Training: Initial training with body motion signals only, establishing alignment and input paradigm adaptation.
  2. Face Control Training: Integration of Face Adapter and Face Block modules, focusing on portrait data and expression-driven animation.
  3. Joint Control Training: Combined training on full datasets for holistic control.
  4. Joint Mode Training: Inclusion of both animation and replacement formats for dual-mode compatibility.
  5. Relighting LoRA Training: Exclusive training for environment adaptation in replacement mode.
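
A compact way to read the curriculum is as a staged schedule of which modules train on which data. The dictionary below is a hypothetical summary of the five stages; the module and dataset names, and which components are trainable at each stage, are assumptions rather than the released configuration.

```python
# Hypothetical stage schedule mirroring the progressive pipeline above.
TRAINING_STAGES = [
    {"stage": "body_control",    "trainable": ["body_adapter", "dit"],          "data": "full"},
    {"stage": "face_control",    "trainable": ["face_adapter", "face_blocks"],  "data": "portrait"},
    {"stage": "joint_control",   "trainable": ["body_adapter", "face_adapter",
                                               "face_blocks", "dit"],           "data": "full"},
    {"stage": "joint_mode",      "trainable": ["dit"],
     "modes": ["animation", "replacement"],                                     "data": "full"},
    {"stage": "relighting_lora", "trainable": ["relight_lora"],
     "modes": ["replacement"],                                                  "data": "iclight_composites"},
]
```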

Inference supports pose retargeting for cross-identity animation, with bone length and position adjustments based on reference and source images. Long video generation is achieved via iterative segment-wise synthesis, using temporal guidance from previous segments.
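
The long-video strategy can be sketched as a simple loop in which the tail frames of each generated segment condition the next one; `model.generate`, the segment length, and the number of guidance frames are hypothetical placeholders.

```python
def generate_long_video(model, ref_image, pose_seq, face_seq,
                        segment_len=77, guide_frames=5):
    """Iterative segment-wise synthesis with temporal guidance (sketch)."""
    frames, prev_tail = [], None
    for start in range(0, len(pose_seq), segment_len):
        segment = model.generate(
            reference=ref_image,
            pose=pose_seq[start:start + segment_len],
            face=face_seq[start:start + segment_len],
            temporal_guidance=prev_tail,      # None for the first segment
        )
        frames.extend(segment)
        prev_tail = segment[-guide_frames:]   # last frames seed the next segment
    return frames
```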

Experimental Results

Quantitative and Human Evaluation

Wan-Animate demonstrates superior performance over existing open-source and closed-source frameworks in both animation and replacement tasks. Quantitative metrics (SSIM, LPIPS, FVD) indicate higher fidelity, realism, and temporal coherence. Human evaluation against SOTA commercial solutions (Runway Act-two, DreamActor-M1) shows clear preference for Wan-Animate in terms of identity consistency, motion accuracy, and expression fidelity.

Figure 5: Human evaluation results comparing Wan-Animate with current SOTA, showing clear user preference for Wan-Animate outputs.
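
For readers reproducing the frame-level metrics, a small sketch of SSIM and LPIPS computation follows, using the `scikit-image` and `lpips` packages; FVD is computed over sets of videos and is omitted here. This is illustrative evaluation code, not the paper's benchmark pipeline.

```python
import torch
import lpips                                    # pip install lpips
from skimage.metrics import structural_similarity as ssim

lpips_fn = lpips.LPIPS(net="alex")              # perceptual distance, lower is better

def frame_metrics(gen, gt):
    """gen, gt: HxWx3 uint8 frames at the same resolution."""
    s = ssim(gen, gt, channel_axis=2)           # SSIM, higher is better
    to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).float()[None] / 127.5 - 1.0
    d = lpips_fn(to_tensor(gen), to_tensor(gt)).item()
    return s, d
```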

Qualitative Comparisons

Wan-Animate outperforms Animate Anyone, VACE, and commercial solutions in both animation and replacement modes, exhibiting stable performance, robust identity preservation, and seamless environmental integration.

Figure 6: Qualitative comparison for Animation Mode, highlighting superior motion and expression fidelity.

Figure 7: Qualitative comparison for Replacement Mode, demonstrating robust identity consistency and environment fusion.

Ablation Studies

Progressive training of the Face Adapter significantly improves convergence and expression accuracy. The Relighting LoRA module is shown to be essential for realistic environment adaptation without compromising character identity.

Figure 8: Ablation study on Face Adapter training, showing improved expression accuracy with progressive training.

Figure 9: Ablation study of the Relighting LoRA, illustrating enhanced lighting and color tone adaptation in Replacement Mode.

Application Diversity

Wan-Animate supports a wide range of applications, including performance reenactment, cross-style transfer, complex motion synthesis, dynamic camera movement, and character replacement in commercial and entertainment contexts.

Figure 10: Qualitative results for various applications, demonstrating the versatility of Wan-Animate across scenarios.

Implementation and Scalability Considerations

Wan-Animate is trained on large-scale human-centric video datasets with rigorous quality and identity consistency filtering. The training pipeline leverages FSDP and context parallelism (RingAttention, Ulysses) for efficient scaling. The model supports arbitrary output resolutions and efficient inference via segment-wise synthesis and optional classifier-free guidance for fine-grained control.
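
A minimal sketch of the sharded-training setup referenced above, assuming PyTorch's native FSDP; the context-parallel attention (RingAttention/Ulysses) is omitted for brevity, and the wrapped module is any large backbone such as the DiT.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def shard_for_training(model: torch.nn.Module) -> FSDP:
    """Wrap a large module (e.g. the DiT backbone) with FSDP so that
    parameters, gradients, and optimizer state are sharded across GPUs."""
    dist.init_process_group("nccl")                          # one process per GPU
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    return FSDP(model.cuda(), use_orig_params=True)
```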

Implications and Future Directions

Wan-Animate sets a new baseline for open-source character animation and replacement, providing a unified, high-fidelity solution with robust control over motion, expression, and environment adaptation. The modular architecture and input paradigm facilitate rapid adaptation to new tasks and efficient post-training. The open-sourcing of model weights and code is expected to accelerate research and application development in character video synthesis, lowering the barrier for real-world deployment and fostering innovation in digital content creation.

Potential future directions include:

  • Extension to non-humanoid and multi-character scenarios via generalized control signals.
  • Integration of multimodal inputs (audio, text) for richer animation control.
  • Further optimization of environment adaptation for complex lighting and interaction contexts.
  • Exploration of real-time inference and interactive editing capabilities.

Conclusion

Wan-Animate introduces a unified framework for character animation and replacement, achieving holistic replication of motion, expression, and environmental context. The method demonstrates superior quantitative and qualitative performance, robust scalability, and broad application potential. Its open-source release is poised to advance the state of the art in character video synthesis and catalyze further research and practical adoption in the field.

Explain it Like I'm 14

What is this paper about?

This paper introduces Wan‑Animate, an AI system that can turn a single picture of a character into a moving video. It can do two things:

  • Animate a character so it copies the body movements and facial expressions from a reference video.
  • Replace the person in a video with a new character, while matching the scene’s lighting and colors so it looks natural.

Think of it like high‑tech puppeteering: the reference video is the “puppet master,” and the still image is the “puppet.” Wan‑Animate learns to move the puppet’s body and face the same way the puppet master does, and if you want, it can place that puppet directly into the puppet master’s stage.

What questions did the researchers ask?

In simple terms, they wanted to know:

  • Can one model handle both character animation (making a picture move) and character replacement (swapping someone in a video) well?
  • Can it copy both body motion and subtle facial expressions accurately?
  • Can it make the replaced character blend into the scene’s lighting and colors so it looks real, not pasted on?

How does Wan‑Animate work?

Wan‑Animate builds on a powerful video generator called Wan (a “Diffusion Transformer,” which is just a big model that learns to turn random noise into realistic video, step by step).

The two modes

  • Animation Mode: It keeps the background from the input picture and moves the character to match the reference video’s motion and expressions.
  • Replacement Mode: It keeps the background from the reference video but swaps in the new character, while adjusting light and color so the new character fits the scene.

You choose the mode by changing how you feed inputs into the model (like checking different boxes on a form). The team designed a unified “input setup” so the same model can do both tasks.

Driving the body: “stick‑figure” control

For body motion, Wan‑Animate uses a 2D skeleton (like a stick‑figure with joints). This is extracted from the reference video and aligned with the target character. The skeleton is robust for many character types (including stylized ones), and it tells the model where arms, legs, and torso should move.

Analogy: Imagine tracing a dancer’s pose as a stick figure frame by frame, then using that stick figure to guide your animated character.

Reenacting the face: “expression signals,” not landmarks

Facial expressions are tricky—small details matter. Instead of using face landmarks (dots on eyes, nose, mouth), the model uses the actual cropped face images from the reference video. But it processes them to keep expression info while stripping away identity details, so it doesn’t accidentally make the target character look like the person in the video.

Two key tricks help:

  • Compressing face images into a compact “latent” so low‑level identity traits don’t carry over.
  • Random augmentations (like tiny color or scale changes) so the model learns expressions, not identity.

These face “expression signals” are injected into the model so that at each moment it knows exactly how the eyebrows, eyes, mouth, and cheeks should move.

Fitting into the scene: Relighting LoRA

When replacing a character in a video, the lighting on the new character might not match the room (too bright, wrong color, etc.). Wan‑Animate adds a small extra module (called a LoRA) that tweaks lighting and color tones on the character to match the environment.

Analogy: Think of it as automatic “makeup and lighting correction,” so the character doesn’t look out of place.

Training in stages

To make learning stable and fast, the team trains in steps:

  1. Learn body motion control.
  2. Add facial expression control on portrait‑focused data (so faces are clear and large).
  3. Train both together.
  4. Teach the model to handle both Animation and Replacement modes.
  5. Train the Relighting LoRA to match scene lighting.

Making it work in practice

  • Pose retargeting: People have different body proportions. For animation mode, the system rescales bones (like adjusting a puppet’s limb lengths) so the motion fits the target character better.
  • Long videos: It generates in chunks and uses the last frames as a guide for the next chunk, keeping motion smooth over time.

What did they find?

  • Higher quality and more stable videos: Compared to many open‑source methods, Wan‑Animate makes characters look more realistic, keeps their identity consistent, and produces smoother motion over time.
  • Better expressions: Using face images as expression signals captures subtle facial movements (like eye and mouth changes) more accurately than landmarks.
  • Strong replacement results: With the Relighting LoRA, the swapped character matches the scene’s light and color better, so the final video looks more natural.
  • User preference: In side‑by‑side comparisons against strong commercial systems, more people preferred Wan‑Animate’s results.
  • Versatility: It works across portraits, half‑body, and full‑body shots, and even with different character styles.

Why this matters: It pushes open‑source character animation closer to the best commercial tools, making high‑quality video editing and animation more accessible.

Why it matters and what could happen next?

  • Creative uses: Filmmakers, advertisers, and creators can quickly animate characters, recreate performances, or swap actors while keeping scenes believable.
  • Faster workflows: A single, unified tool that handles both animation and replacement can save time and reduce complexity.
  • Open source impact: The team plans to release the model and code, which can speed up research, lower costs for creators, and inspire new apps.
  • Responsible use: Because this tech can change who appears in a video and how they move or emote, it should be used ethically—get consent, avoid deception, and consider watermarking or disclosure to prevent misuse.

In short, Wan‑Animate is like a Swiss Army knife for character video creation: it animates, replaces, and relights—with special care for realistic motion, expressive faces, and seamless scene blending.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, structured to guide concrete follow-up research:

  • Data transparency and bias: the composition, size, licensing, demographics, and domain diversity of the training corpus are not disclosed; potential demographic and content biases remain uncharacterized.
  • Generalization beyond humans: despite claiming robustness to “humanoid” characters, the model is trained on human-centric videos; performance on non-human, stylized (anime, cartoons), or 3D/CGI characters is untested and unevaluated.
  • Multi-person scenarios: training explicitly assumes a single consistent character per clip; support for multi-subject scenes, identity disambiguation, and cross-subject occlusions is not addressed.
  • Occlusion and interaction modeling: handling of self-occlusions, human–object contacts, inter-person occlusion ordering, and physically plausible interactions is not modeled or evaluated.
  • Replacement segmentation robustness: character masks rely on SAM2 and skeleton tracking; robustness to motion blur, heavy occlusions, fast motion, complex clothing, and scene clutter is not quantified.
  • 2D skeleton limitations: the known ambiguities and failure modes of 2D keypoints (missing joints, depth ambiguity, foreshortening) are acknowledged but not systematically mitigated or measured; no comparison to 3D motion representations (e.g., SMPL, 3D keypoints) is provided.
  • Hand and finger articulation: control signals appear coarse (body joints only); explicit hand/finger keypoints and fine-grained gesture control are not covered.
  • Pose retargeting constraints: retargeting is disabled in Replacement Mode to preserve interactions, causing deformation when source/target body shapes differ; no interaction-preserving retargeting or shape-aware adaptation is explored.
  • T-pose canonicalization reliability: the proposed T-pose editing (via external image-editing models) used to compute limb-length ratios lacks reliability analysis, failure characterization, or alternatives that avoid external editing.
  • Facial identity leakage: the paper proposes augmentations and feature orthogonalization to reduce identity leakage in face features, but does not quantify leakage risks (e.g., via face-recognition similarity across cross-ID reenactment).
  • Expression controllability: there is no explicit control over expression intensity, style, or temporal dynamics beyond optional CFG; no “expression scale” or disentangled control interface is studied.
  • Cross-modal control gaps: audio-driven lip synchronization, speech-driven facial dynamics, and language-conditioned expression/motion control are not supported or evaluated.
  • Camera motion handling: the method claims to handle dynamic camera movement but lacks explicit modeling (e.g., camera trajectory estimation) and provides no targeted evaluation on large camera motion or zooms.
  • Long-horizon stability: the iterative long-video scheme (segment stitching with 1–5-frame temporal guidance) is prone to drift or seam artifacts; no quantitative long-horizon stability/flicker metrics are reported.
  • 3D/geometry awareness: the approach is purely 2D; view-consistent identity under large yaw/pitch rotations and multi-view geometric consistency are not addressed or assessed.
  • Relighting scope limits: the Relighting LoRA is trained on IC-Light composites; generalization to dynamic lighting, multi-source/colored lights, hard cast shadows, interreflections, and environment-induced occlusions is untested.
  • Scene–subject photometric coupling: beyond adjusting the subject’s tone/illumination, the method does not synthesize physically consistent cast shadows or reflected light onto the environment; no metrics/user studies specifically evaluate environmental photometric integration.
  • Temporal consistency of relighting: potential frame-to-frame flicker introduced by the Relighting LoRA is not analyzed; no temporal photometric stability metrics are provided.
  • Segmentation boundary quality: boundary artifacts at subject–background edges (haloing, color spill) are not discussed or evaluated, especially under fast motion and motion blur.
  • Component-level ablations: only two ablations are shown (face training curriculum, relighting); missing ablations include face-injection layer schedule, injection frequency, pose vs no-pose, 2D vs 3D motion, temporal-latent count, and alternative fusion strategies.
  • Error tolerance and robustness: failure modes under noisy/missing keypoints, poor face crops, extreme expressions, occluded faces, unusual clothing/hair, and low-resolution inputs are not systematically studied.
  • Evaluation benchmarks: quantitative results rely on an internal benchmark; there is no evaluation on public, standardized datasets, nor a release of the internal test set to enable reproducible comparisons.
  • Metrics for identity and motion fidelity: no identity-preservation metric (e.g., face embedding similarity), motion-following error (e.g., keypoint reprojection error), facial Action Unit accuracy, or lip-sync metrics are reported.
  • Environmental integration metrics: no objective measures for photometric/illumination consistency (e.g., re-illumination error, shadow consistency) or perceptual realism focused on replacement quality are provided.
  • User study design: the subjective study uses only 20 participants and pairwise preferences, without statistical significance tests, inter-rater agreement, or details on task diversity.
  • Efficiency and deployment: inference latency, throughput, GPU memory requirements, and scalability for real-time or streaming applications are not reported.
  • CFG trade-offs: the impact of enabling CFG (global or face-only) on quality, adherence to control signals, and temporal stability is not analyzed.
  • Text prompt influence: although text is supported, its effect on motion/expression or scene control and potential conflicts with motion signals are not studied.
  • Multilingual and cross-domain robustness: robustness to varied cultural facial expressions, makeup, masks, headwear, and domain shifts (e.g., IR/night videos) is unreported.
  • Multi-identity replacement: replacing multiple subjects in the same scene, identity assignment, and consistency maintenance across subjects and time are unaddressed.
  • Provenance and safety: no discussion of consent, watermarking, detection of manipulated content, or safeguards against misuse (e.g., non-consensual identity swaps) is provided.
  • Privacy and memorization: risks of training data memorization, identity/privacy leakage, and membership inference are not assessed.
  • Reproducibility specifics: while code/weights are promised, concrete training hyperparameters, data preprocessing pipelines, and benchmark release timelines are unspecified.

Glossary

  • Ablation study: A controlled experiment that removes or modifies components to assess their contribution. Example: The paper compares progressive face-adapter training against a joint baseline and shows the baseline struggles to converge.
  • Binary mask: A tensor of 0/1 values indicating which regions or frames are preserved (1) or generated (0). Example: In Replacement Mode, the mask is 1 for environment regions and 0 for the subject to preserve the original background.
  • Body Adapter: A lightweight module that encodes and injects pose (skeleton) signals into the generative model. Example: Pose frames are compressed by the Wan-VAE, patchified, and added to the patchified noise latents via the Body Adapter.
  • Causal 1D convolutional layers: Sequence convolutions that only use current and past timesteps to produce outputs, preserving causality. Example: Used to temporally downsample face latents so their length matches DiT latents.
  • Classifier-Free Guidance (CFG): A diffusion technique that balances conditional and unconditional predictions to control fidelity vs. diversity. Example: Disabled by default for efficiency; optionally enabled for face conditioning to fine-tune expression reenactment.
  • Context Parallelism: A training strategy that partitions long sequences across devices to reduce memory and speed up training. Example: The DiT model is trained with Context Parallelism combining RingAttention and Ulysses.
  • Cross-attention: An attention mechanism where query tokens attend to conditioning tokens from another source. Example: Implicit face latents are injected via temporally aligned cross-attention inside dedicated Face Blocks.
  • Cross-identity (Cross-ID) animation: Driving one character’s image with another identity’s motion and expressions. Example: The user study evaluates cross-ID setups where Wan-Animate animates a target identity using a different driving video.
  • Data Parallelism (DP): Training by replicating the model across GPUs and splitting the batch among them. Example: Non-memory-intensive models (VAE, CLIP) are trained with standard DP.
  • Denoising Diffusion Probabilistic Models (DDPM): Generative models that learn to reverse a noise-adding process. Example: The paper contrasts older GAN/warping methods with diffusion-based approaches like DDPM that improved animation quality.
  • Diffusion Transformer (DiT): A Transformer-based diffusion architecture for image/video generation. Example: Wan-Animate is built on the DiT-based Wan-I2V foundation, benefiting from strong temporal consistency.
  • Face Adapter: An encoder module that extracts implicit facial features from raw face images for expression control. Example: Face images are encoded to 1D latents, augmented, and later injected via cross-attention.
  • Face Blocks: Specific Transformer layers designated to receive and fuse face conditioning features. Example: The paper injects face information every 5 layers in the 40-layer Wan-14B network (total 8 Face Blocks).
  • Fréchet Video Distance (FVD): A metric assessing video realism and temporal coherence via distributional distance. Example: FVD is reported in quantitative comparisons to demonstrate Wan-Animate’s state-of-the-art performance.
  • Fully Sharded Data Parallelism (FSDP): A training approach that shards model parameters, gradients, and optimizer states across devices. Example: DiT and T5 are trained with FSDP to reduce per-GPU memory footprint.
  • IC-Light: A relighting/compositing method that transfers lighting from a new background onto a subject. Example: IC-Light is used to create training pairs for the Relighting LoRA by placing segmented characters onto random backgrounds with new lighting.
  • Image-to-Video (I2V): Generating a video conditioned on a single input image. Example: Animation Mode is analogous to I2V, preserving the source image’s background while driving motion and expressions from the reference video.
  • Implicit facial features: Learned latent representations extracted directly from face images without manual landmarks. Example: The model encodes raw faces into latents that disentangle expressions from identity and injects them via attention.
  • Latent: A compressed representation of images/videos used as model inputs or conditions. Example: Wan-Animate uses reference latents, conditional latents, and noise latents; pose and face latents are aligned and injected into DiT.
  • Learned Perceptual Image Patch Similarity (LPIPS): A perceptual metric comparing image patches using deep features. Example: LPIPS is reported for reconstruction quality in the quantitative evaluation.
  • Linear Motion Decomposition: A technique to orthogonalize feature components, separating motion-related signals. Example: Applied in the Face Adapter to better disentangle expression information from identity cues.
  • Low-Rank Adaptation (LoRA): A parameter-efficient fine-tuning method that adds low-rank matrices to existing layers. Example: The Relighting LoRA is trained on attention layers to adjust a replaced character’s lighting and color tone.
  • Patchifying: Converting continuous latent features into fixed-size patches (tokens) for Transformer processing. Example: Video latents are discretized by patchifying; pose latents are also patchified and added to noise tokens.
  • Pose retargeting: Adjusting skeletal poses to match differences in limb proportions or position between identities. Example: During Animation Mode, limb lengths and positions are retargeted, optionally using T-pose edits to compute scale factors.
  • ReferenceNet: A network component that injects appearance features from a reference image/video into the generator. Example: Prior work (Animate Anyone) uses ReferenceNet for consistency preservation; Wan-Animate adopts related ideas via reference latents.
  • Relighting LoRA: A LoRA specialized for adapting the character’s lighting and color tone to the target environment. Example: Trained with IC-Light composites, it improves realism in Replacement Mode without breaking identity.
  • RingAttention: An attention parallelization scheme enabling context-splitting across devices. Example: Combined with Ulysses in the paper’s Context Parallelism to train long sequence DiT models efficiently.
  • SAM2 (Segment Anything Model v2): A segmentation model that produces masks for arbitrary objects. Example: Used to extract character masks from videos for Replacement Mode’s environment formulation.
  • Skeleton-based representation: A 2D keypoint structure representing body poses over time. Example: Chosen over SMPL for generality; pose keypoints from VitPose are aligned and injected to control motion.
  • Skinned Multi-Person Linear (SMPL): A parametric 3D human body model with shape and pose parameters. Example: Discussed as an alternative motion signal; rejected due to shape leakage and poorer generalization to non-humanoids.
  • Structural Similarity Index (SSIM): A metric that quantifies perceptual image similarity using luminance, contrast, and structure. Example: Reported in quantitative comparisons alongside LPIPS and FVD.
  • Temporal guidance: Conditioning generation on previously generated frames to maintain continuity. Example: The last 1 or 5 frames of a segment are reused as condition latents (mask=1) to drive the next segment.
  • Three-dimensional VAE (3D VAE): A VAE that compresses video spatially and temporally before tokenization. Example: Used to reduce sequence length so DiT can process video tokens efficiently.
  • Tokenization (video tokens): Representing compressed video as discrete patch tokens for Transformer input. Example: The paper selects output resolution based on a target token count after patchify.
  • T-pose: A standard reference posture with arms outstretched used for measuring body proportions. Example: The authors edit images to T-pose to compute more accurate limb length scaling for retargeting.
  • Ulysses: A sequence-parallel framework for splitting long sequences across devices. Example: Paired with RingAttention to implement Context Parallelism for training the DiT.
  • Variational Autoencoder (VAE): A probabilistic encoder-decoder that learns low-dimensional latent representations. Example: Wan-VAE encodes reference images and compresses pose frames to align with target latents.
  • Video-to-Video (V2V): Translating one video into another conditioned on content/motion. Example: Replacement Mode corresponds to V2V, integrating the source character into the reference video environment.
  • VitPose: A high-accuracy 2D human pose estimation model. Example: Used to extract skeleton keypoints that drive body motion in Wan-Animate.