- The paper introduces FlexiAct, an Image-to-Video diffusion framework for flexible action transfer in heterogeneous scenarios, adapting actions to diverse target subjects despite significant differences in spatial structure.

- FlexiAct employs two novel components: RefAdapter, a LoRA-based adapter trained on randomly sampled condition frames for spatial adaptation and appearance consistency, and Frequency-aware Action Extraction (FAE), which dynamically reweights learned embeddings according to the denoising timestep for precise action control.

- FlexiAct outperforms baselines in transferring actions across subjects and domains while preserving appearance, but its FAE component requires per-video training, which the paper notes as a limitation.
 
FlexiAct addresses the challenging problem of action transfer in video generation, particularly in heterogeneous scenarios where the subject in the target image differs significantly in spatial structure (layout, skeleton, viewpoint) from the subject in the reference video. Existing methods often require strict alignment between the source and target or are limited to global motion transfer without subject-specific adaptation. FlexiAct is an Image-to-Video (I2V) diffusion framework built upon CogVideoX-I2V (Yang et al., 12 Aug 2024) that overcomes these limitations by enabling flexible action adaptation to diverse subjects while maintaining appearance consistency.
The core of FlexiAct lies in two novel components: RefAdapter and Frequency-aware Action Extraction (FAE).
RefAdapter for Spatial Adaptation and Consistency:
RefAdapter is a lightweight, image-conditioned adapter designed to facilitate spatial structure adaptation while preserving appearance consistency. It is implemented by injecting LoRA (Low-Rank Adaptation) (Hu et al., 2021) layers into the MMDiT blocks of the base I2V model. The key innovation in RefAdapter's training is breaking the typical I2V constraint of conditioning only on the first frame: instead, randomly sampled frames from untrimmed videos are used as condition images. This introduces a gap between the condition image's spatial structure and the video's actual starting point, so the model learns to adapt to diverse initial spatial structures. Additionally, the first embedding along the temporal dimension of the video latent ($L_{\text{video}}$) is replaced with the image latent ($L_{\text{image}}$), allowing the model to treat the image as a reference for guidance rather than a hard constraint on the first frame. RefAdapter is trained once for 40,000 steps on the Miradata dataset (Ju et al., 8 Jul 2024) using the AdamW optimizer with a learning rate of 1e-5 and a batch size of 8. It adds only 66 million parameters, about 5% of the backbone's parameter count, making it far cheaper to train than methods that duplicate full modules, such as ControlNet (Zhang et al., 2023) or ReferenceNet (He et al., 23 Apr 2024).
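The conditioning strategy can be sketched in a few lines. The snippet below is a simplified illustration, assuming a per-frame latent encoder and latents shaped (batch, frames, channels, height, width); the function and variable names are hypothetical, not the paper's implementation.

```python
import torch

def prepare_conditioned_latents(video_frames: torch.Tensor, encode) -> torch.Tensor:
    """Simplified RefAdapter-style conditioning (illustrative only)."""
    b, f = video_frames.shape[:2]

    # Condition on a randomly sampled frame instead of always frame 0, so the
    # condition image's spatial structure can differ from the video's start.
    cond_idx = int(torch.randint(0, f, (1,)))
    image_latent = encode(video_frames[:, cond_idx])             # (b, c, h, w)

    # Encode the clip frame by frame (the real backbone uses a video VAE).
    video_latents = torch.stack(
        [encode(video_frames[:, i]) for i in range(f)], dim=1)   # (b, f, c, h, w)

    # Replace the first temporal embedding with the image latent, so the image
    # guides generation as a reference rather than fixing the first frame.
    video_latents[:, 0] = image_latent
    return video_latents
```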
Frequency-aware Action Extraction (FAE) for Precise Action Control:
FAE is designed to precisely extract action information from a reference video and transfer it to the target subject. It uses a set of learnable embeddings concatenated to the inputs of the MMDiT layers. These embeddings are trained per reference video, typically requiring 1,500 to 3,000 steps. During FAE training, random cropping is applied to the input video to prevent the embeddings from overfitting to the reference video's specific layout. A key observation guiding FAE's design is that diffusion models focus on low-frequency motion details in early denoising steps (large timesteps) and high-frequency appearance details in later steps (small timesteps), as visualized by attention maps (Figure 1 in the paper).
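A minimal sketch of what such embeddings might look like is shown below; the class name, token count, and hidden dimension are illustrative assumptions rather than the paper's actual configuration.

```python
import torch
import torch.nn as nn

class FrequencyAwareEmbeddings(nn.Module):
    """Per-reference-video learnable tokens appended to the MMDiT inputs (sketch)."""

    def __init__(self, num_tokens: int = 32, dim: int = 3072):
        super().__init__()
        # Only these parameters are optimized when fitting a new reference video.
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim) * 0.02)

    def forward(self, video_tokens: torch.Tensor) -> torch.Tensor:
        # video_tokens: (batch, seq_len, dim) -> (batch, seq_len + num_tokens, dim)
        batch = video_tokens.shape[0]
        extra = self.tokens.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([video_tokens, extra], dim=1)
```

During this per-video optimization, random crops of the reference video keep the tokens from memorizing its exact layout.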
Leveraging this, FAE performs action extraction during inference by dynamically adjusting the attention weights of video tokens toward the frequency-aware embeddings based on the current denoising timestep. The reweighting strategy is formulated as a bias $W_{\text{bias}}$ added to the original attention weight $W_{\text{ori}}$.
The bias is held at its maximum value $\alpha$ at large timesteps ($t_l \le t \le T$) to prioritize low-frequency motion information, decays smoothly via a cosine schedule of the form $\frac{\alpha}{2}\left[\cos(\cdot)+1\right]$ over intermediate timesteps ($t_h \le t < t_l$), and is zero at small timesteps ($0 \le t < t_h$), where high-frequency details are processed. In practice, the paper uses $\alpha=1$, $t_h=700$, and $t_l=800$. This dynamic modulation guides the generation process to emphasize the reference action at the most relevant stages of denoising.
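This schedule is easy to express as a small helper. The sketch below assumes a linear interpolation of the cosine argument over $[t_h, t_l]$, since the exact form is not spelled out here; the default values match those quoted above.

```python
import math

def attention_bias(t: float, alpha: float = 1.0, t_h: float = 700.0, t_l: float = 800.0) -> float:
    """Timestep-dependent bias W_bias added to the attention weights (sketch)."""
    if t >= t_l:
        # Early denoising (large t): full bias, emphasizing low-frequency motion.
        return alpha
    if t >= t_h:
        # Intermediate steps: cosine decay from alpha down to 0.
        progress = (t_l - t) / (t_l - t_h)   # 0 at t_l, 1 at t_h (assumed argument)
        return 0.5 * alpha * (math.cos(math.pi * progress) + 1.0)
    # Late denoising (small t): no bias, leaving high-frequency appearance
    # details unconstrained by the reference action.
    return 0.0
```

At $t=t_l$ the cosine term evaluates to $\alpha$ and at $t=t_h$ to $0$, so the schedule is continuous across the three regimes.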
Training and Inference Pipeline:
The FlexiAct pipeline involves a two-stage training process and a specific inference strategy (illustrated in Figure 2).
- Stage 1 (RefAdapter Training): Train the RefAdapter (LoRA layers) on a large dataset like Miradata, conditioning on random frames to enhance adaptability.
 
- Stage 2 (FAE Training): For each reference video, train the frequency-aware embeddings with random cropping, without loading the RefAdapter.
 
- Inference: Load the trained RefAdapter. Use the target image as the condition for the I2V model. Incorporate the trained frequency-aware embeddings for the desired reference action. Apply the dynamic attention reweighting strategy of FAE during the denoising process based on the current timestep (see the sketch after this list).
 
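Putting the pieces together, the inference loop might look roughly as follows. This is a high-level sketch: `model` and `scheduler` stand in for the CogVideoX-I2V backbone (with the RefAdapter LoRA loaded) and its noise scheduler, their methods are hypothetical, and `attention_bias` is the helper sketched earlier.

```python
import torch

@torch.no_grad()
def flexiact_inference(model, scheduler, target_image, fae_embeddings):
    # Start from Gaussian noise; replace the first temporal embedding with the
    # target-image latent, mirroring how RefAdapter was trained.
    latents = model.init_noise_latents()              # hypothetical helper
    latents[:, 0] = model.encode_image(target_image)  # (b, c, h, w) image latent

    for t in scheduler.timesteps:  # from T down to 0
        # FAE: strong bias toward the frequency-aware embeddings early in
        # denoising, fading to zero for the final (high-frequency) steps.
        bias = attention_bias(float(t))
        noise_pred = model(latents, t, extra_tokens=fae_embeddings, attn_bias=bias)
        latents = scheduler.step(noise_pred, t, latents)  # hypothetical interface
    return model.decode(latents)
```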
Practical Applications and Results:
FlexiAct enables flexible action customization for arbitrary target images, including real humans, animals, and animated characters, using reference videos of subjects with potentially different spatial structures. This is valuable for applications like character animation in games or films, creating personalized video content, or generating animated assets from still images.
The paper evaluates FlexiAct on a diverse benchmark of 250 video-image pairs. Quantitative results (Table 1) show that FlexiAct significantly outperforms baselines, including an I2V adaptation of MotionDirector (Zhao et al., 2023), denoted MD-I2V, and the base model with standard learnable embeddings, in both Motion Fidelity and Appearance Consistency. Qualitative results (Figures 4, 5, 6, 7) demonstrate accurate transfer of complex actions (e.g., stretching, turning, jumping) to subjects with varying body shapes, viewpoints, and even different domains (human-to-animal, animal-to-animal), while preserving the target subject's appearance details.
Limitations:
A key limitation mentioned is that FAE requires optimization for each reference video. This per-video training step can be time-consuming and resource-intensive, making it less suitable for applications requiring real-time or on-the-fly action transfer from new reference videos. Future work aims to develop feed-forward methods that can achieve similar performance without per-video training.