MotionStreamer: Streaming Motion Generation via Diffusion-based Autoregressive Model in Causal Latent Space (2503.15451v2)

Published 19 Mar 2025 in cs.CV

Abstract: This paper addresses the challenge of text-conditioned streaming motion generation, which requires us to predict the next-step human pose based on variable-length historical motions and incoming texts. Existing methods struggle to achieve streaming motion generation, e.g., diffusion models are constrained by pre-defined motion lengths, while GPT-based methods suffer from delayed responses and error accumulation due to discretized non-causal tokenization. To solve these problems, we propose MotionStreamer, a novel framework that incorporates a continuous causal latent space into a probabilistic autoregressive model. The continuous latents mitigate information loss caused by discretization and effectively reduce error accumulation during long-term autoregressive generation. In addition, by establishing temporal causal dependencies between current and historical motion latents, our model fully utilizes the available information to achieve accurate online motion decoding. Experiments show that our method outperforms existing approaches while offering more applications, including multi-round generation, long-term generation, and dynamic motion composition. Project Page: https://zju3dv.github.io/MotionStreamer/

Summary

  • The paper introduces MotionStreamer, a diffusion-based autoregressive framework that generates text-conditioned streaming motion by integrating continuous latents in a causal latent space.
  • Evaluations on HumanML3D and BABEL datasets show MotionStreamer achieves state-of-the-art results with lower FID and higher R-Precision, indicating better data distribution alignment and text-motion correlation.
  • MotionStreamer enables real-time applications like gaming and animation through dynamic motion composition and long-term generation, supported by innovative two-forward and mixed training strategies.

MotionStreamer: An AR Model for Streaming Motion Generation in Causal Latent Spaces

The paper introduces "MotionStreamer," a diffusion-based autoregressive (AR) framework designed for text-conditioned streaming motion generation. The work responds to the shortcomings of existing systems: diffusion models are constrained to pre-defined motion lengths, while GPT-based models are hindered by non-causal discrete tokenization, which leads to delayed responses and error accumulation. MotionStreamer addresses these challenges by incorporating a continuous causal latent space into a probabilistic autoregressive model, offering a novel approach to streaming motion generation.

Methodology

MotionStreamer is built on the integration of continuous latents into motion generation, which alleviates the information loss typically associated with discretization. The model predicts the next human pose by incorporating textual inputs incrementally while retaining historical motion context. A diffusion head is embedded in the AR model and paired with a causal motion compressor, enabling real-time decoding of motion sequences (a sketch of the resulting generation loop follows the list below). This methodology underpins several key functionalities:

  1. Temporal Causality: By constructing temporal causal dependencies between current and previous motion latents, the framework utilizes complete available information for accurate online motion prediction.
  2. Continuous Latent Space: This space mitigates errors that occur with discrete tokens, particularly in long-sequence generation.
  3. Multiple Applications: MotionStreamer is shown to support applications including multi-round generation, long-term motion generation, and dynamic motion composition.
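
For concreteness, the sketch below shows how these pieces could fit together at inference time: a causal AR transformer attends over text and historical motion latents, a small diffusion head samples the next continuous latent from noise, and a decoder maps it to a pose online. All names, dimensions, and the simplified denoising update are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of a MotionStreamer-style streaming generation loop.
# Hypothetical names and dimensions; the denoising update is deliberately
# simplified and does not reproduce the paper's exact sampler.
import torch
import torch.nn as nn

LATENT_DIM, COND_DIM, POSE_DIM = 64, 256, 263  # 263 matches HumanML3D pose features

class DiffusionHead(nn.Module):
    """MLP denoiser: predicts the noise on a motion latent, conditioned on
    the AR transformer's output for the current step."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + COND_DIM + 1, 512), nn.SiLU(),
            nn.Linear(512, LATENT_DIM),
        )

    def forward(self, x_t, cond, t):
        # t is a scalar diffusion timestep, appended as an extra feature.
        return self.net(torch.cat([x_t, cond, t.expand(len(x_t), 1)], dim=-1))

class StreamingGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(COND_DIM, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.latent_in = nn.Linear(LATENT_DIM, COND_DIM)
        self.head = DiffusionHead()
        self.decoder = nn.Linear(LATENT_DIM, POSE_DIM)  # stand-in for the causal decoder

    @torch.no_grad()
    def step(self, text_emb, history):
        # Causal mask: each position attends only to text and earlier latents.
        seq = torch.cat([text_emb] + [self.latent_in(h) for h in history], dim=1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
        cond = self.backbone(seq, mask=mask)[:, -1]  # condition for the next step

        # Sample the next continuous latent with a few denoising steps.
        x = torch.randn(seq.size(0), LATENT_DIM)
        steps = 8
        for i in reversed(range(steps)):
            t = torch.tensor([[(i + 1) / steps]])
            x = x - self.head(x, cond, t) / steps  # simplified update rule
        return x, self.decoder(x)  # next latent + pose decoded online

# Usage: stream poses one step at a time as text arrives.
gen = StreamingGenerator().eval()
text = torch.randn(1, 4, COND_DIM)       # pretend text-encoder output
history = []
for _ in range(5):
    z, pose = gen.step(text, history)
    history.append(z.unsqueeze(1))       # grow the causal history
```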

Results

When evaluated on prominent benchmarks such as HumanML3D and BABEL, MotionStreamer outperforms existing alternatives, achieving state-of-the-art results on several metrics:

  • FID (Fréchet Inception Distance): The method consistently yields lower FID scores than baseline models, indicating a narrower gap between the generated and real data distributions (a reference computation is sketched after this list).
  • R-Precision: The method shows higher precision in retrieving the correct text-conditioned motions, indicating improved alignment between textual input and motion output.
  • Quantitative Robustness: MotionStreamer maintains diverse and smooth motions, as evidenced by its Diversity and Peak Jerk scores.
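
For reference, FID compares Gaussian fits to feature embeddings of real and generated samples; motion benchmarks typically swap the Inception network for a pretrained motion feature extractor. Below is the standard computation, not code from this paper:

```python
# Standard FID between real and generated feature sets; a reference
# computation, not this paper's evaluation code.
import numpy as np
from scipy.linalg import sqrtm

def fid(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```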

Two innovative training strategies are proposed: two-forward training and mixed training. Together, these techniques address error accumulation and improve the model's adaptability to dynamic textual inputs without predefined sequence lengths; a hedged sketch of one plausible two-forward step appears below.
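
The summary does not spell out the mechanics of two-forward training. One plausible reading, sketched below purely as an assumption, is a scheduled-sampling-like scheme: a second forward pass consumes the model's own first-pass predictions so that training matches the input distribution the model sees during autoregressive inference. `model`, `loss_fn`, and `opt` are hypothetical stand-ins (with a diffusion head, `loss_fn` would be a denoising loss rather than plain MSE).

```python
# Hypothetical sketch of a two-forward training step: an assumption about
# the mechanism, inferred from its stated goal of curbing error
# accumulation; not confirmed by the paper text quoted here.
def two_forward_step(model, gt_latents, text_emb, loss_fn, opt):
    # gt_latents: (B, T, D) continuous causal motion latents.
    inp, tgt = gt_latents[:, :-1], gt_latents[:, 1:]

    # Pass 1: teacher forcing on ground-truth history.
    pred1 = model(inp, text_emb)                      # (B, T-1, D)
    loss = loss_fn(pred1, tgt)

    # Pass 2: feed the model's own detached predictions as history, so
    # training sees the same inputs as autoregressive inference.
    pred2 = model(pred1.detach()[:, :-1], text_emb)   # (B, T-2, D)
    loss = loss + loss_fn(pred2, tgt[:, 1:])

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```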

Implications and Future Directions

Theoretically, this research marks a shift toward continuous latent modeling in autoregressive frameworks, which proves essential for real-time applications such as gaming, animation, and robotics. Practically, adopting a diffusion-based paradigm within a continuous latent space is a step toward overcoming the inefficiencies of previous models, particularly those that rely on quantization or inflexible tokenization.

Future research could extend hybrid strategies that balance causal inference for streaming generation with bidirectional refinement for applications such as motion editing. Another prospective direction is evaluating hybrid latent spaces that combine discrete and continuous elements, capturing nuanced motion details while supporting dynamic real-time use cases.

This paper's methodologies and results illustrate a significant step forward in streaming motion generation, providing a framework that better integrates with real-time applications requiring dynamic and extended motion sequences.