TIMotion: Temporal and Interactive Framework for Efficient Human-Human Motion Generation (2408.17135v4)
Abstract: Human-human motion generation is essential for understanding humans as social beings. Current methods fall into two main categories: single-person-based methods and separate modeling-based methods. To delve into this field, we abstract the overall generation process into a general framework, MetaMotion, which consists of two phases: temporal modeling and interaction mixing. For temporal modeling, single-person-based methods directly concatenate two people into one, while separate modeling-based methods skip modeling the interaction sequences. This inadequate modeling results in sub-optimal performance and redundant model parameters. In this paper, we introduce TIMotion (Temporal and Interactive Modeling), an efficient and effective framework for human-human motion generation. Specifically, we first propose Causal Interactive Injection to model the two separate sequences as a single causal sequence, leveraging their temporal and causal properties. We then present Role-Evolving Scanning to adapt to the shifts between active and passive roles throughout the interaction. Finally, to generate smoother and more natural motion, we design Localized Pattern Amplification to capture short-term motion patterns. Extensive experiments on InterHuman and InterX demonstrate that our method achieves superior performance. Project page: https://aigc-explorer.github.io/TIMotion-page/
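The abstract does not spell out the exact operator behind Causal Interactive Injection, so the following is only a minimal sketch of one plausible reading: the two per-person motion sequences are interleaved frame by frame into a single causal sequence, so that each person's frame at step t can attend to the other's frames up to step t. The function names `causal_interleave` and `split_causal` are hypothetical, not from the paper.

```python
import numpy as np

def causal_interleave(seq_a: np.ndarray, seq_b: np.ndarray) -> np.ndarray:
    """Interleave two per-person motion sequences of shape (T, D) into
    a single causal sequence of shape (2T, D): a_1, b_1, a_2, b_2, ...

    Under a causal (left-to-right) model, person B's frame at step t then
    conditions on person A's frames up to and including step t.
    """
    assert seq_a.shape == seq_b.shape, "both sequences must be (T, D)"
    T, D = seq_a.shape
    out = np.empty((2 * T, D), dtype=seq_a.dtype)
    out[0::2] = seq_a  # even positions: person A
    out[1::2] = seq_b  # odd positions: person B
    return out

def split_causal(seq: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Inverse of causal_interleave: recover the two per-person sequences."""
    return seq[0::2], seq[1::2]
```

A Role-Evolving Scanning step could then, for example, run the same causal model over both interleaving orders (A-first and B-first) so that neither person is fixed as the "leading" role; that detail is likewise an assumption here, not the paper's stated design.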
- Rhythmic gesticulator: Rhythm-aware co-speech gesture synthesis with hierarchical neural embeddings. ACM Transactions on Graphics (TOG), 41(6): 1–19.
- Bethke, E. 2003. Game development and production. Wordware Publishing, Inc.
- Digital life project: Autonomous 3d characters with social intelligence. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 582–592.
- Executing your commands via motion diffusion in latent space. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 18000–18010.
- Diffusion models beat GANs on image synthesis. Advances in Neural Information Processing Systems, 34: 8780–8794.
- Vision-RWKV: Efficient and scalable visual perception with RWKV-like architectures. arXiv preprint arXiv:2403.02308.
- Diffusion-RWKV: Scaling RWKV-like architectures for diffusion models. arXiv preprint arXiv:2404.04478.
- Tm2d: Bimodality driven 3d dance generation via music-text integration. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 9942–9952.
- Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.
- RWKV-CLIP: A Robust Vision-Language Representation Learner. arXiv preprint arXiv:2406.06973.
- Momask: Generative masked modeling of 3d human motions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1900–1910.
- Generating diverse and natural 3d human motions from text. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5152–5161.
- Action2motion: Conditioned generation of 3d human motions. In Proceedings of the 28th ACM International Conference on Multimedia, 2021–2029.
- A motion matching-based framework for controllable gesture synthesis from speech. In ACM SIGGRAPH 2022 conference proceedings, 1–9.
- PointRWKV: Efficient RWKV-Like Model for Hierarchical Point Cloud Learning. arXiv preprint arXiv:2405.15214.
- Denoising diffusion probabilistic models. Advances in neural information processing systems, 33: 6840–6851.
- Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598.
- Long short-term memory. Neural computation, 9(8): 1735–1780.
- Intergen: Diffusion-based multi-human motion generation under complex interactions. International Journal of Computer Vision, 1–21.
- Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
- Computer animation. Springer.
- Improved denoising diffusion probabilistic models. In International conference on machine learning, 8162–8171. PMLR.
- Parent, R. 2012. Computer animation: algorithms and techniques. Newnes.
- RWKV: Reinventing RNNs for the transformer era. arXiv preprint arXiv:2305.13048.
- Action-conditioned 3d human motion synthesis with transformer vae. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10985–10995.
- TEMOS: Generating diverse human motions from textual descriptions. In European Conference on Computer Vision, 480–497. Springer.
- Mmm: Generative masked motion model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1546–1555.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
- Saridis, G. 1983. Intelligent robotic control. IEEE Transactions on Automatic Control, 28(5): 547–557.
- Human motion diffusion as a generative prior. arXiv preprint arXiv:2303.01418.
- Denoising diffusion implicit models. arXiv preprint arXiv:2010.02502.
- Role-aware interaction generation from textual description. In Proceedings of the IEEE/CVF international conference on computer vision, 15999–16009.
- Motionclip: Exposing human motion generation to clip space. In European Conference on Computer Vision, 358–374. Springer.
- Human motion diffusion model. arXiv preprint arXiv:2209.14916.
- Urbain, J. 2010. Introduction to game development.
- Neural discrete representation learning. Advances in neural information processing systems, 30.
- Vaswani, A. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.
- Towards domain generalization for multi-view 3d object detection in bird-eye-view. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13333–13342.
- Intercontrol: Generate human motion interactions by controlling every joint. arXiv preprint arXiv:2311.15864.
- Inter-x: Towards versatile human-human interaction analysis. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 22260–22271.
- Restore-RWKV: Efficient and Effective Medical Image Restoration with RWKV. arXiv preprint arXiv:2407.11087.
- Mamba or RWKV: Exploring High-Quality and High-Efficiency Segment Anything Model. arXiv preprint arXiv:2406.19369.
- Generating human motion from textual descriptions with discrete representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 14730–14740.
- Motiondiffuse: Text-driven human motion generation with diffusion model. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Remodiffuse: Retrieval-augmented motion diffusion model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 364–373.
- Finemogen: Fine-grained spatio-temporal motion generation and editing. Advances in Neural Information Processing Systems, 36.
- Attt2m: Text-driven human motion generation with multi-perspective attention mechanism. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 509–519.