DanceTogether! Identity-Preserving Multi-Person Interactive Video Generation (2505.18078v1)

Published 23 May 2025 in cs.CV

Abstract: Controllable video generation (CVG) has advanced rapidly, yet current systems falter when more than one actor must move, interact, and exchange positions under noisy control signals. We address this gap with DanceTogether, the first end-to-end diffusion framework that turns a single reference image plus independent pose-mask streams into long, photorealistic videos while strictly preserving every identity. A novel MaskPoseAdapter binds "who" and "how" at every denoising step by fusing robust tracking masks with semantically rich but noisy pose heat-maps, eliminating the identity drift and appearance bleeding that plague frame-wise pipelines. To train and evaluate at scale, we introduce (i) PairFS-4K, 26 hours of dual-skater footage with 7,000+ distinct IDs, (ii) HumanRob-300, a one-hour humanoid-robot interaction set for rapid cross-domain transfer, and (iii) TogetherVideoBench, a three-track benchmark centered on the DanceTogEval-100 test suite covering dance, boxing, wrestling, yoga, and figure skating. On TogetherVideoBench, DanceTogether outperforms prior art by a significant margin. Moreover, we show that a one-hour fine-tune yields convincing human-robot videos, underscoring broad generalization to embodied-AI and HRI tasks. Extensive ablations confirm that persistent identity-action binding is critical to these gains. Together, our model, datasets, and benchmark lift CVG from single-subject choreography to compositionally controllable, multi-actor interaction, opening new avenues for digital production, simulation, and embodied intelligence. Our video demos and code are available at https://DanceTog.github.io/.

Summary

DanceTogether: Identity-Preserving Multi-Person Interactive Video Generation

The paper presents DanceTogether, an end-to-end diffusion framework for controllable video generation involving multiple actors. It targets the shortcomings of existing controllable video generation systems in scenarios where several actors must move, interact, and exchange positions. Existing methods often suffer from identity drift and appearance bleeding in multi-person settings because control signals become noisy under occlusion, motion blur, and viewpoint changes. DanceTogether mitigates these limitations by enforcing identity preservation and interaction coherence while conditioning on a single reference image together with independent per-actor pose and mask streams.
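
To make the conditioning interface concrete, the sketch below lays out plausible tensor shapes for these inputs; the variable names, resolutions, and actor/frame counts are illustrative assumptions, not the authors' published interface.

```python
import torch

# Hypothetical shapes for the conditioning inputs described above; all
# names and dimensions are illustrative assumptions, not the paper's API.
num_actors, num_frames, num_joints = 2, 48, 17
height, width = 576, 1024

reference_image = torch.rand(3, height, width)        # single RGB appearance reference
pose_heatmaps = torch.rand(num_actors, num_frames,    # per-actor, per-frame joint heat-maps
                           num_joints, height // 8, width // 8)
tracking_masks = torch.rand(num_actors, num_frames,   # per-actor tracking masks (soft occupancy)
                            1, height // 8, width // 8)
# A DanceTogether-style model would map (reference_image, pose_heatmaps,
# tracking_masks) to a sequence of num_frames photorealistic video frames.
```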

Key Contributions

  1. MaskPoseAdapter: A novel component of DanceTogether that fuses robust tracking masks with noisy pose heat-maps to ensure consistent identity preservation and action fidelity. This fusion is performed at every denoising step, reducing the frame-wise inconsistencies that lead to identity drift (a hedged sketch of such a fusion block follows this list).
  2. MultiFace Encoder: Encodes identity tokens from a single image, ensuring each actor's appearance remains constant throughout the generated video sequence. This component is crucial for maintaining identity consistency.
  3. Data and Evaluation Benchmark: The paper introduces PairFS-4K and HumanRob-300 datasets to facilitate training and evaluation at scale, along with TogetherVideoBench for assessing DanceTogether’s performance against existing methods across three tracks: Identity-Consistency, Interaction-Coherence, and Video Quality.
  4. Superior Performance: DanceTogether substantially outperforms prior methods in identity consistency, with improvements of +12.6 HOTA, +7.1 IDF1, and +5.9 MOTA. Furthermore, 2D MPJPE is reduced by 69%, indicating a marked improvement in interaction coherence. The framework also improves visual fidelity, achieving lower FVD and FID scores.
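
The following is a minimal sketch of how a MaskPoseAdapter-style fusion could bind identity ("who", via tracking masks) to action ("how", via pose heat-maps) before injecting the result into the denoiser at each step. The module name, layer choices, and dimensions are assumptions for illustration; the paper's actual architecture may differ.

```python
import torch
import torch.nn as nn

class MaskPoseFusion(nn.Module):
    """Illustrative sketch of a MaskPoseAdapter-style fusion block (not the paper's code).

    Idea: gate semantically rich but noisy pose heat-maps with per-actor tracking
    masks so the denoiser receives a per-frame signal that binds each actor's
    identity to their action at every denoising step.
    """

    def __init__(self, num_joints: int, cond_dim: int):
        super().__init__()
        # Project per-actor pose and mask inputs into a shared conditioning space.
        self.pose_proj = nn.Conv2d(num_joints, cond_dim, kernel_size=3, padding=1)
        self.mask_proj = nn.Conv2d(1, cond_dim, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * cond_dim, cond_dim, kernel_size=1)

    def forward(self, pose_heatmaps: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
        # pose_heatmaps: (B, A, J, H, W); masks: (B, A, 1, H, W); A = actors, J = joints.
        b, a, j, h, w = pose_heatmaps.shape
        pose = self.pose_proj(pose_heatmaps.flatten(0, 1))   # (B*A, C, H, W)
        mask = self.mask_proj(masks.flatten(0, 1))           # (B*A, C, H, W)
        gated = pose * torch.sigmoid(mask)                   # suppress pose noise outside each actor's mask
        fused = self.fuse(torch.cat([gated, mask], dim=1))   # (B*A, C, H, W)
        # Aggregate per-actor features into one conditioning map for the denoiser.
        return fused.view(b, a, -1, h, w).sum(dim=1)         # (B, C, H, W)
```

A block like this would run at every denoising step, with its output added to (or attended over by) the video denoiser's features; the point the paper stresses is that the identity-action binding persists across all steps rather than being applied frame by frame.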

Experimental Setup and Results

DanceTogether is trained on large-scale single- and multi-person datasets, including the new PairFS-4K dataset, which contains thousands of unique identities in dual-person figure skating. The evaluation uses multi-object tracking metrics (HOTA, IDF1, MOTA) to measure identity consistency and 2D pose-estimation error to assess interaction coherence. Quantitative results show that DanceTogether generates high-resolution videos that maintain actor identity and realistic interactions, even under challenging conditions such as position exchanges and lively choreography.
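
As a reference for the interaction-coherence measure mentioned above, here is the standard definition of 2D MPJPE (mean per-joint position error) over estimated keypoints. The array shapes and random example data are illustrative; the benchmark's exact protocol (keypoint extractor, normalization) is defined by the paper.

```python
import numpy as np

def mpjpe_2d(pred_joints: np.ndarray, gt_joints: np.ndarray) -> float:
    """Mean Per-Joint Position Error in 2D (pixels).

    pred_joints, gt_joints: arrays of shape (frames, actors, joints, 2)
    holding (x, y) keypoint coordinates. Lower is better.
    """
    return float(np.linalg.norm(pred_joints - gt_joints, axis=-1).mean())

# Example: 48 frames, two actors, 17 COCO-style joints of estimated vs. reference poses.
pred = np.random.rand(48, 2, 17, 2) * 512
gt = np.random.rand(48, 2, 17, 2) * 512
print(f"2D MPJPE: {mpjpe_2d(pred, gt):.2f} px")
```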

Implications and Future Directions

From a practical perspective, DanceTogether holds substantial promise for digital media production, enabling the creation of visually coherent, identity-consistent multi-actor videos. Its demonstrated adaptability to new scenarios, such as the one-hour humanoid-robot fine-tune, also opens avenues for embodied-AI and human-robot interaction (HRI) research.

Theoretically, DanceTogether presents a departure from conventional frame-wise approaches, underscoring the importance of persistent identity-action binding in video diffusion processes. Future research may explore larger group interactions, adaptive camera motions, and more complex scene settings, potentially enhancing the robustness and scalability of motion-driven video synthesis.

In summary, DanceTogether significantly advances controllable video generation by addressing fundamental limitations in identity preservation and multi-actor interaction fidelity. Its contributions offer researchers and practitioners new methodologies for producing realistic and coherent human motion videos across various domains.