AniX: Animating Characters in 3DGS Scenes
- AniX is a system for animating characters in static 3DGS scenes using conditional autoregressive video generation, ensuring temporal coherence and visual fidelity.
- It employs a conditional framework that integrates multi-view character images, scene masks, and textual instructions with a Transformer enhanced by LoRA modules.
- The approach accelerates inference via diffusion model distillation, reducing a 30-step process to a 4-step schedule with minimal quality loss.
AniX is a system for animating any user-specified character within any static 3D Gaussian Splatting (3DGS) scene under natural language instruction, synthesizing temporally coherent video clips while preserving the visual fidelity and structural grounding of the input scene and character. The core innovation of AniX is autoregressive video generation conditioned on scene, character, mask, and text instruction, enabling open-ended character actions and object-centric interactions that generalize beyond the simple locomotion and limited controllability of prior approaches (Wang et al., 18 Dec 2025).
1. Conditional Autoregressive Video Generation
AniX formulates the character animation task as a conditional autoregressive video generation problem. At interaction step $k$, the system generates a new video clip conditioned on (i) the previous clip (if any), (ii) the given static 3DGS scene $S$, (iii) multi-view character images $C$, (iv) a per-frame character anchor mask $M_k$, and (v) the current text instruction $T_k$. In latent token space (after VAE encoding), with $x_k$ denoting the clip's latent tokens, this conditional generation is represented as:

$$p_\theta\left(x_k \mid x_{k-1}, S, C, M_k, T_k\right)$$

Decomposed autoregressively over latent frames $f = 1, \dots, F$:

$$p_\theta\left(x_k \mid x_{k-1}, S, C, M_k, T_k\right) = \prod_{f=1}^{F} p_\theta\left(x_k^{(f)} \mid x_k^{(<f)}, x_{k-1}, S, C, M_k, T_k\right)$$
Instead of maximizing log-likelihood, AniX minimizes a continuous-time velocity-matching loss (“Flow Matching”). Defining:
- $x_1$ as noise, $x_1 \sim \mathcal{N}(0, I)$,
- $x_0$ as the ground-truth token sequence,
- for $t \in [0, 1]$, the interpolant $x_t = (1 - t)\,x_0 + t\,x_1$,
- the ground-truth “velocity” $v_t = x_1 - x_0$,

the model prediction $v_\theta(x_t, t, \text{conditions})$ is trained to match $v_t$ via MSE:

$$\mathcal{L} = \mathbb{E}_{x_0,\, x_1,\, t}\left[\left\| v_\theta(x_t, t, \text{conditions}) - v_t \right\|_2^2\right]$$
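Under these definitions, constructing one training example takes only a few lines. Below is a minimal NumPy sketch with toy shapes; `make_flow_sample` and `velocity_matching_loss` are illustrative names, not from the paper:

```python
import numpy as np

def make_flow_sample(x0, rng):
    """Draw one Flow Matching training example from ground-truth tokens x0."""
    x1 = rng.standard_normal(x0.shape)   # pure-noise endpoint x_1
    t = rng.uniform()                    # continuous time t in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1        # linear interpolation path
    v_target = x1 - x0                   # ground-truth velocity
    return x_t, t, v_target

def velocity_matching_loss(v_pred, v_target):
    """MSE between predicted and ground-truth velocity."""
    return float(np.mean((v_pred - v_target) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((16, 8))        # toy latent token sequence
x_t, t, v_target = make_flow_sample(x0, rng)
```

A useful sanity check on the linear path: since $x_t = (1-t)x_0 + t x_1$ and $v_t = x_1 - x_0$, the identity $x_t - t\,v_t = x_0$ holds exactly for any $t$.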
Conditioning is implemented by projecting scene and mask tokens into the token feature space, then concatenating the token embeddings for text, character views, previous video, and the noisy input tokens before processing with the Transformer.
2. Model Backbone and Conditional Encoding
AniX is based on a pre-trained “HunyuanCustom” video generator, consisting of a VAE and a Multimodal Diffusion Transformer (MMDiT). The VAE encoder downsamples the input by ×8 spatially and ×4 temporally. The decoder and the full-attention Transformer stack (≈13B parameters) are frozen; only LoRA modules (rank 64) inserted into each attention and feed-forward layer are trainable.
The system encodes its conditions as follows:
- Scene ($S$): Rendered as a “scene video” by splatting the input 3DGS along a predefined camera path, then VAE-encoded into the token sequence $z_S$.
- Character ($C$): Represented by four canonical multi-view images (front, left, right, back); each view is VAE-encoded into tokens $z_C$.
- Mask ($M$): Per-frame binary mask around the character, VAE-encoded into tokens $z_M$ to help delineate “dynamic” from “static” regions.
- Text ($T$): Encoded with a frozen LLaVA multimodal encoder, using both the instruction and the set of character views to produce text-token embeddings $z_T$.
The fusion strategy projects the scene and mask tokens into the token feature space, sums them with the video input tokens, and then concatenates all condition tokens along the sequence dimension for Transformer input.
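The fusion can be illustrated as follows. This is a NumPy sketch: the token width, sequence lengths, and the random projection matrices are toy stand-ins for the learned components, and the function name is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 32                         # Transformer token width (illustrative)
L_vid = 20                     # noisy video tokens (one per latent patch)
L_txt, L_chr, L_prev = 8, 16, 5

# Learned linear projections for scene and mask tokens (random stand-ins).
W_scene = rng.standard_normal((D, D)) * 0.02
W_mask = rng.standard_normal((D, D)) * 0.02

def fuse_conditions(z_video, z_scene, z_mask, z_text, z_char, z_prev):
    """Additively fuse projected scene/mask tokens into the video tokens,
    then concatenate the remaining condition streams along the sequence."""
    fused = z_video + z_scene @ W_scene + z_mask @ W_mask
    return np.concatenate([z_text, z_char, z_prev, fused], axis=0)

tokens = fuse_conditions(
    rng.standard_normal((L_vid, D)),
    rng.standard_normal((L_vid, D)),   # scene tokens, aligned per latent frame
    rng.standard_normal((L_vid, D)),   # mask tokens, aligned per latent frame
    rng.standard_normal((L_txt, D)),
    rng.standard_normal((L_chr, D)),
    rng.standard_normal((L_prev, D)),
)
```

Additive fusion keeps the sequence length independent of the scene and mask streams, so only text, character-view, and previous-video tokens lengthen the Transformer input.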
3. Temporal Coherence and Autoregressive Conditioning
To support long-horizon character behaviors and robust temporal coherence, AniX employs an autoregressive mode. Each clip’s target tokens are split temporally into the first quarter and the remaining three-quarters. During training, the first-quarter tokens (perturbed with Gaussian jitter) serve as extra conditioning for predicting the remaining three-quarters, in conjunction with the other conditions. At inference, the first-quarter tokens are taken from the previously generated clip, enforcing inter-clip consistency.
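The temporal split used for autoregressive conditioning can be sketched as below; the quarter split follows the description above, while the array shapes and jitter scale are toy values:

```python
import numpy as np

def split_clip_for_ar(z_clip, jitter_std, rng):
    """Split a clip's latent tokens along time: the first quarter becomes
    (jittered) autoregressive context, the rest is the prediction target."""
    n_frames = z_clip.shape[0]
    n_ctx = n_frames // 4
    ctx = z_clip[:n_ctx] + jitter_std * rng.standard_normal(z_clip[:n_ctx].shape)
    target = z_clip[n_ctx:]
    return ctx, target

rng = np.random.default_rng(0)
z_clip = rng.standard_normal((32, 8, 4))   # (latent frames, tokens, dim), toy
ctx, target = split_clip_for_ar(z_clip, jitter_std=0.05, rng=rng)
```

The Gaussian jitter on the context mimics the imperfection of previously generated tokens, so the model does not overfit to clean ground-truth prefixes.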
Standard 3D rotary positional embeddings (3D-RoPE; time×height×width) are applied to video tokens. For character-view sequences, “shifted” 3D-RoPE prevents positional embedding collisions between views. No positional embeddings are applied to text tokens.
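One way to realize the “shifted” 3D-RoPE scheme is to offset each character view’s temporal coordinate past the video’s own range, so no two tokens share a (time, height, width) triple. This is a sketch under an assumed indexing; the exact offsets used by AniX are not specified here:

```python
import numpy as np

def video_pos_ids(T, H, W, t_start=0):
    """(t, h, w) coordinate triples for a T x H x W token grid."""
    t, h, w = np.meshgrid(np.arange(t_start, t_start + T),
                          np.arange(H), np.arange(W), indexing="ij")
    return np.stack([t, h, w], axis=-1).reshape(-1, 3)

def shifted_view_pos_ids(n_views, H, W, t_shift):
    """Shifted 3D-RoPE for character views: each view is placed at a
    temporal index past the video's range, avoiding collisions with
    video tokens and with the other views."""
    return np.concatenate(
        [video_pos_ids(1, H, W, t_start=t_shift + v) for v in range(n_views)],
        axis=0,
    )

vid_ids = video_pos_ids(T=8, H=4, W=4)               # video token positions
view_ids = shifted_view_pos_ids(4, 4, 4, t_shift=8)  # four character views
```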
4. Training Workflow and Data
AniX is trained in two distinct stages:
- Stage 1: The base model ("HunyuanCustom") is pre-trained on large-scale, broad-coverage text-to-video data.
- Stage 2: Fine-tuning is conducted only on LoRA modules using a curated GTA-V dataset (“locomotion-and-camera” post-training), sharpening motion dynamics and camera tracking without compromising generalization.
The training data pipeline involves:
- 2,084 GTA-V gameplay clips (129 frames per clip, five characters, four locomotion and two camera-motion patterns).
- For each clip: segmenting the character to create the per-frame mask, inpainting the background to obtain an isolated scene video, labeling the clip with a short action text, and rendering the 3DGS character model as four multi-view images.
Key training strategies include scene-condition dropout, Gaussian jitter on preceding video tokens for robust autoregressive conditioning, and minimal regularization, relying instead on the frozen foundation model’s priors.
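Scene-condition dropout can be sketched as follows; the paper’s dropout probability is not restated above, so `p_drop` is a placeholder, and the zero-token “null” embedding is an assumed implementation choice:

```python
import numpy as np

def maybe_drop_scene(z_scene, p_drop, rng):
    """Scene-condition dropout: with probability p_drop, replace the scene
    tokens with a null (zero) embedding so the model also learns a pathway
    that does not depend on the scene condition."""
    if rng.uniform() < p_drop:
        return np.zeros_like(z_scene)
    return z_scene

rng = np.random.default_rng(0)
z_scene = rng.standard_normal((10, 16))
kept = maybe_drop_scene(z_scene, p_drop=0.0, rng=rng)     # never dropped
dropped = maybe_drop_scene(z_scene, p_drop=1.0, rng=rng)  # always dropped
```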
5. Acceleration and Inference Optimization
After full model training, AniX applies distribution matching distillation (DMD2) to convert the original 30-step diffusion schedule into a 4-step process, roughly 7.5× fewer denoising steps, with minimal loss of visual or temporal fidelity. Only the LoRA modules within the student and fake-score networks are fine-tuned during distillation.
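The effect of a few-step schedule can be illustrated with Euler integration of a velocity field. This is a toy sketch, not the DMD2 procedure itself: with an oracle straight-line velocity, the 4-step sampler recovers the data endpoint exactly, while a learned model only approximates it:

```python
import numpy as np

def few_step_sample(velocity_fn, x1, n_steps=4):
    """Few-step Euler integration of a velocity field from t=1 (noise)
    down to t=0 (data), mimicking a distilled 4-step schedule."""
    x = x1.copy()
    ts = np.linspace(1.0, 0.0, n_steps + 1)
    for t_cur, t_next in zip(ts[:-1], ts[1:]):
        v = velocity_fn(x, t_cur)
        x = x + (t_next - t_cur) * v   # dt is negative: move toward data
    return x

rng = np.random.default_rng(0)
x0 = rng.standard_normal((6, 4))       # "data" latents
x1 = rng.standard_normal((6, 4))       # pure noise

# Oracle whose velocity is the straight-line field x1 - x0: Euler
# integration recovers x0 exactly for any step count.
oracle = lambda x, t: x1 - x0
x_hat = few_step_sample(oracle, x1, n_steps=4)
```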
6. Evaluation and Capabilities
AniX is evaluated on visual quality, character consistency, action controllability, and long-horizon coherence. The system is designed to generalize across actions and characters, providing user-driven, text-conditioned animation in complex 3DGS environments. Users can direct a character across a 3D scene to perform diverse actions—ranging from basic locomotion to object-centric behaviors—over arbitrary time horizons, with each clip seamlessly building on prior context while maintaining structural integrity and visual continuity throughout (Wang et al., 18 Dec 2025).