- The paper introduces FlexiAct, an Image-to-Video diffusion framework for flexible action transfer in heterogeneous scenarios, adapting actions to diverse target subjects despite significant differences in spatial structure.

- FlexiAct employs two novel components: RefAdapter, a LoRA-based adapter trained on randomly sampled condition frames for spatial adaptation and appearance consistency, and Frequency-aware Action Extraction (FAE), which dynamically reweights learned embeddings according to the denoising timestep for precise action control.

- FlexiAct outperforms baselines in transferring actions across subjects and domains while preserving appearance, but its FAE component requires per-video training, which the paper notes as a limitation.
 
FlexiAct addresses the challenging problem of action transfer in video generation, particularly in heterogeneous scenarios where the subject in the target image differs significantly in spatial structure (layout, skeleton, viewpoint) from the subject in the reference video. Existing methods often require strict alignment between the source and target or are limited to global motion transfer without subject-specific adaptation. FlexiAct is an Image-to-Video (I2V) diffusion framework built upon CogVideoX-I2V (Yang et al., 12 Aug 2024) that overcomes these limitations by enabling flexible action adaptation to diverse subjects while maintaining appearance consistency.
The core of FlexiAct lies in two novel components: RefAdapter and Frequency-aware Action Extraction (FAE).
RefAdapter for Spatial Adaptation and Consistency:
RefAdapter is a lightweight, image-conditioned adapter designed to facilitate spatial structure adaptation while preserving appearance consistency. It is implemented by injecting LoRA (Low-Rank Adaptation) (Hu et al., 2021) layers into the MMDiT blocks of the base I2V model. The key innovation in RefAdapter's training is breaking the typical I2V constraint of conditioning only on the first frame: instead, randomly sampled frames from untrimmed videos are used as condition images. This introduces a gap between the condition image's spatial structure and the video's actual starting point, so the model learns to adapt to diverse initial spatial structures. Additionally, the first embedding along the temporal dimension of the video latent ($L_{\text{video}}$) is replaced with the image latent ($L_{\text{image}}$), allowing the model to treat the image as a reference for guidance rather than a hard constraint on the first frame. RefAdapter is trained once for 40,000 steps on the Miradata dataset (Ju et al., 8 Jul 2024) using the AdamW optimizer with a learning rate of 1e-5 and a batch size of 8. It adds only 66 million parameters, about 5% of the backbone's parameter count, making it far cheaper to train than methods that duplicate full modules, such as ControlNet (Zhang et al., 2023) or ReferenceNet (He et al., 23 Apr 2024).
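The conditioning strategy can be sketched in a few lines. The snippet below is a simplified illustration, assuming a per-frame latent encoder and latents shaped (batch, frames, channels, height, width); the function and variable names are hypothetical, not the paper's implementation.

```python
import torch

def prepare_conditioned_latents(video_frames: torch.Tensor, encode) -> torch.Tensor:
    """Simplified RefAdapter-style conditioning (illustrative only)."""
    b, f = video_frames.shape[:2]

    # Condition on a randomly sampled frame instead of always frame 0, so the
    # condition image's spatial structure can differ from the video's start.
    cond_idx = int(torch.randint(0, f, (1,)))
    image_latent = encode(video_frames[:, cond_idx])             # (b, c, h, w)

    # Encode the clip frame by frame (the real backbone uses a video VAE).
    video_latents = torch.stack(
        [encode(video_frames[:, i]) for i in range(f)], dim=1)   # (b, f, c, h, w)

    # Replace the first temporal embedding with the image latent, so the image
    # guides generation as a reference rather than fixing the first frame.
    video_latents[:, 0] = image_latent
    return video_latents
```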
Frequency-aware Action Extraction (FAE) for Precise Action Control:
FAE is designed to precisely extract action information from a reference video and transfer it to the target subject. It uses a set of learnable embeddings concatenated to the inputs of the MMDiT layers. These embeddings are trained per reference video, typically requiring 1,500 to 3,000 steps. During FAE training, random cropping is applied to the input video to prevent the embeddings from overfitting to the reference video's specific layout. A key observation guiding FAE's design is that diffusion models focus on low-frequency motion details in early denoising steps (large timesteps) and high-frequency appearance details in later steps (small timesteps), as visualized by attention maps (Figure 1 in the paper).
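A minimal sketch of what such embeddings might look like is shown below; the class name, token count, and hidden dimension are illustrative assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class FrequencyAwareEmbeddings(nn.Module):
    """Per-reference-video learnable tokens appended to the MMDiT inputs (sketch)."""

    def __init__(self, num_tokens: int = 32, dim: int = 3072):
        super().__init__()
        # Only these parameters are optimized when fitting a new reference video.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, seq_len, dim) -> (batch, seq_len + num_tokens, dim)
        batch = video_tokens.shape[0]
        extra = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([video_tokens, extra], dim=1)
```

During this per-video optimization, random crops of the reference video keep the tokens from memorizing its exact layout.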
Leveraging this, FAE performs action extraction during inference by dynamically adjusting the attention weights of video tokens toward the frequency-aware embeddings based on the current denoising timestep. The reweighting strategy is formulated as a bias $W_{\text{bias}}$ added to the original attention weight $W_{\text{ori}}$.
The bias is held at its maximum value $\alpha$ at large timesteps ($t_l \le t \le T$) to prioritize low-frequency motion information, decays smoothly via a cosine schedule of the form $\frac{\alpha}{2}\left[\cos(\cdot)+1\right]$ over intermediate timesteps ($t_h \le t < t_l$), and is zero at small timesteps ($0 \le t < t_h$), where high-frequency details are processed. In practice, the paper uses $\alpha=1$, $t_h=700$, and $t_l=800$. This dynamic modulation guides the generation process to emphasize the reference action at the most relevant stages of denoising.
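This schedule is easy to express as a small helper. The sketch below assumes a linear interpolation of the cosine argument over $[t_h, t_l]$, since the exact form is not spelled out here; the default values match those quoted above.

```python
import math

def attention_bias(t: float, alpha: float = 1.0, t_h: float = 700.0, t_l: float = 800.0) -> float:
    """Timestep-dependent bias W_bias added to the attention weights (sketch)."""
    if t >= t_l:
        # Early denoising (large t): full bias, emphasizing low-frequency motion.
        return alpha
    if t >= t_h:
        # Intermediate steps: cosine decay from alpha down to 0.
        progress = (t_l - t) / (t_l - t_h)   # 0 at t_l, 1 at t_h (assumed argument)
        return 0.5 * alpha * (math.cos(math.pi * progress) + 1.0)
    # Late denoising (small t): no bias, leaving high-frequency appearance
    # details unconstrained by the reference action.
    return 0.0
```

At $t=t_l$ the cosine term evaluates to $\alpha$ and at $t=t_h$ to $0$, so the schedule is continuous across the three regimes.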
Training and Inference Pipeline:
The FlexiAct pipeline involves a two-stage training process and a specific inference strategy (illustrated in Figure 2).
- Stage 1 (RefAdapter Training): Train the RefAdapter (LoRA layers) on a large dataset like Miradata, conditioning on random frames to enhance adaptability.
 
- Stage 2 (FAE Training): For each reference video, train the frequency-aware embeddings with random cropping, without loading the RefAdapter.
 
- Inference: Load the trained RefAdapter. Use the target image as the condition for the I2V model. Incorporate the trained frequency-aware embeddings for the desired reference action. Apply the dynamic attention reweighting strategy of FAE during the denoising process based on the current timestep (see the sketch after this list).
 
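Putting the pieces together, the inference loop might look roughly as follows. This is a high-level sketch: `model` and `scheduler` stand in for the CogVideoX-I2V backbone (with the RefAdapter LoRA loaded) and its noise scheduler, their methods are hypothetical, and `attention_bias` is the helper sketched earlier.

```python
import torch

@torch.no_grad()
def flexiact_inference(model, scheduler, target_image, fae_embeddings):
    # Start from Gaussian noise; replace the first temporal embedding with the
    # target-image latent, mirroring how RefAdapter was trained.
    latents = model.init_noise_latents()              # hypothetical helper
    latents[:, 0] = model.encode_image(target_image)  # (b, c, h, w) image latent

    for t in scheduler.timesteps:  # from T down to 0
        # FAE: strong bias toward the frequency-aware embeddings early in
        # denoising, fading to zero for the final (high-frequency) steps.
        bias = attention_bias(float(t))
        noise_pred = model(latents, t, extra_tokens=fae_embeddings, attn_bias=bias)
        latents = scheduler.step(noise_pred, t, latents)  # hypothetical interface
    return model.decode(latents)
```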
Practical Applications and Results:
FlexiAct enables flexible action customization for arbitrary target images, including real humans, animals, and animated characters, using reference videos of subjects with potentially different spatial structures. This is valuable for applications like character animation in games or films, creating personalized video content, or generating animated assets from still images.
The paper evaluates FlexiAct on a diverse benchmark of 250 video-image pairs. Quantitative results (Table 1) show that FlexiAct significantly outperforms baselines, including an I2V adaptation of MotionDirector (Zhao et al., 2023), denoted MD-I2V, and the base model with standard learnable embeddings, in both Motion Fidelity and Appearance Consistency. Qualitative results (Figures 4, 5, 6, 7) demonstrate accurate transfer of complex actions (e.g., stretching, turning, jumping) to subjects with varying body shapes, viewpoints, and even different domains (human-to-animal, animal-to-animal), while preserving the target subject's appearance details.
Limitations:
A key limitation mentioned is that FAE requires optimization for each reference video. This per-video training step can be time-consuming and resource-intensive, making it less suitable for applications requiring real-time or on-the-fly action transfer from new reference videos. Future work aims to develop feed-forward methods that can achieve similar performance without per-video training.