DanceTogether: Identity-Preserving Multi-Person Interactive Video Generation
The paper introduces DanceTogether, a diffusion framework for controllable video generation with multiple interacting actors. It targets a gap in existing controllable video generation systems, which handle single subjects well but break down when several actors move and interact: occlusion, motion blur, and viewpoint changes make the per-frame control signals noisy, leading to identity drift and appearance bleeding. DanceTogether addresses these failures through rigorous identity preservation and interaction coherence, conditioning generation on a single reference image together with per-actor pose and mask sequences.
Key Contributions
- MaskPoseAdapter: A component that fuses robust tracking masks with noisy pose heat-maps into a single control signal at every denoising step, binding each identity to its action and reducing the frame-wise inconsistencies that cause identity drift and appearance bleeding (a minimal fusion sketch follows this list).
- MultiFace Encoder: Extracts identity tokens for each actor from a single reference image, keeping every actor's appearance consistent across the generated sequence (a hedged encoder sketch also appears after this list).
- Data and Evaluation Benchmark: The paper introduces PairFS-4K and HumanRob-300 datasets to facilitate training and evaluation at scale, along with TogetherVideoBench for assessing DanceTogether’s performance against existing methods across three tracks: Identity-Consistency, Interaction-Coherence, and Video Quality.
- Superior Performance: DanceTogether substantially outperforms prior methods on identity consistency, with gains of +12.6 HOTA, +7.1 IDF1, and +5.9 MOTA. MPJPE-2D is reduced by 69%, indicating a marked improvement in interaction coherence, and the framework also improves visual fidelity, with better FVD and FID scores.
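To make the MaskPoseAdapter idea concrete, the sketch below shows one plausible way to fuse per-actor tracking masks with pose heat-maps into a control feature map that is recomputed at every denoising step. The layer sizes, the mask-gating scheme, and the name `MaskPoseAdapterSketch` are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MaskPoseAdapterSketch(nn.Module):
    """Illustrative fusion of per-actor tracking masks with noisy pose
    heat-maps into one control feature map, recomputed at each denoising
    step. Layer sizes and the gating scheme are assumptions."""

    def __init__(self, num_actors: int, num_joints: int, feat_dim: int = 64):
        super().__init__()
        # One mask channel plus one heat-map channel per joint, per actor.
        in_channels = num_actors * (1 + num_joints)
        self.fuse = nn.Sequential(
            nn.Conv2d(in_channels, feat_dim, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv2d(feat_dim, feat_dim, kernel_size=3, padding=1),
        )

    def forward(self, masks: torch.Tensor, pose_heatmaps: torch.Tensor) -> torch.Tensor:
        # masks: (B, A, 1, H, W); pose_heatmaps: (B, A, J, H, W)
        b, a, _, h, w = masks.shape
        # Gate each actor's heat-maps by its tracking mask so that joints
        # bleeding into another actor's region are suppressed.
        gated = pose_heatmaps * masks
        control = torch.cat([masks, gated], dim=2).reshape(b, -1, h, w)
        return self.fuse(control)  # (B, feat_dim, H, W), injected into the denoiser


# Usage: recompute the fused control for every denoising step.
adapter = MaskPoseAdapterSketch(num_actors=2, num_joints=17)
masks = torch.rand(1, 2, 1, 64, 64)
heatmaps = torch.rand(1, 2, 17, 64, 64)
control = adapter(masks, heatmaps)  # shape: (1, 64, 64, 64)
```

Gating the heat-maps by each actor's mask is one simple way to keep a joint from leaking into the wrong actor's region when poses overlap, which is the kind of identity-action binding the adapter is meant to enforce.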
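Similarly, the following is a minimal sketch of per-actor identity tokenization in the spirit of the MultiFace Encoder: each actor's crop from the reference image is embedded once and projected to a fixed set of identity tokens that condition every generated frame. The small CNN backbone, the token count, and the dimensions are placeholders, not the paper's components.

```python
import torch
import torch.nn as nn

class MultiFaceEncoderSketch(nn.Module):
    """Illustrative per-actor identity tokenizer: each reference crop is
    encoded by a shared backbone and projected to a fixed number of
    identity tokens for cross-attention conditioning. The backbone and
    token count are assumptions."""

    def __init__(self, token_dim: int = 768, tokens_per_actor: int = 4):
        super().__init__()
        self.tokens_per_actor = tokens_per_actor
        self.token_dim = token_dim
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.SiLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.proj = nn.Linear(128, tokens_per_actor * token_dim)

    def forward(self, actor_crops: torch.Tensor) -> torch.Tensor:
        # actor_crops: (B, A, 3, H, W) -- one crop per actor from the reference image.
        b, a = actor_crops.shape[:2]
        feats = self.backbone(actor_crops.flatten(0, 1))  # (B*A, 128)
        tokens = self.proj(feats).view(b, a * self.tokens_per_actor, self.token_dim)
        return tokens  # (B, A*tokens_per_actor, token_dim), fixed for all frames


# Usage: two actors cropped from one reference image.
encoder = MultiFaceEncoderSketch()
tokens = encoder(torch.rand(1, 2, 3, 128, 128))  # shape: (1, 8, 768)
```

Computing the tokens once from the reference image and reusing them for every frame is what keeps each actor's appearance from drifting over the generated sequence.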
Experimental Setup and Results
DanceTogether is trained on large-scale single- and multi-person datasets, including the new PairFS-4K dataset, which contains thousands of unique identities in dual-person figure skating. Evaluation uses multi-object tracking metrics (HOTA, IDF1, MOTA) for identity consistency and 2D pose error (MPJPE-2D) for interaction coherence. Quantitative results show that DanceTogether generates high-resolution videos that preserve each actor's identity and produce realistic interactions, even under challenging conditions such as position exchanges and fast choreography.
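As a reference point for the interaction-coherence track, the snippet below computes MPJPE-2D in the usual way: the mean Euclidean distance between 2D joints re-estimated from the generated video and the driving pose sequence. Treating actors as already matched to their driving skeletons is a simplifying assumption of this sketch.

```python
import numpy as np

def mpjpe_2d(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """Mean per-joint position error in pixels.

    pred_joints, gt_joints: (T, A, J, 2) arrays of 2D joint locations for
    T frames, A actors, J joints. Assumes actors are already matched to
    their driving skeletons (e.g. via the tracking step).
    """
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())


# Example: 8 frames, 2 actors, 17 joints -- poses re-estimated from the
# generated video vs. the driving pose sequence (random placeholders here).
pred = np.random.rand(8, 2, 17, 2) * 512
gt = np.random.rand(8, 2, 17, 2) * 512
print(f"MPJPE-2D: {mpjpe_2d(pred, gt):.1f} px")
```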
Implications and Future Directions
From a practical perspective, DanceTogether holds substantial promise for digital media production, enabling visually coherent, identity-consistent multi-actor videos. Its adaptability to new scenarios such as human-robot interaction also opens avenues for embodied-AI applications and further research in human-computer interaction.
Theoretically, DanceTogether presents a departure from conventional frame-wise approaches, underscoring the importance of persistent identity-action binding in video diffusion processes. Future research may explore larger group interactions, adaptive camera motions, and more complex scene settings, potentially enhancing the robustness and scalability of motion-driven video synthesis.
In summary, DanceTogether significantly advances controllable video generation by addressing fundamental limitations in identity preservation and multi-actor interaction fidelity. Its contributions offer researchers and practitioners new methodologies for producing realistic and coherent human motion videos across various domains.