Latent-aware Action Streaming

Updated 10 March 2026

Latent-aware Action Streaming is a framework that leverages temporally-consistent latent representations for sequential prediction and control.
The method streams latent features alongside autoregressive models to decouple core action information from high-dimensional observations, ensuring low-latency inference.
It is designed to support long-horizon consistency and has been applied to human motion stylization, online action understanding, embodied navigation, and policy transfer.

Latent-aware Action Streaming is a methodological paradigm in sequential prediction and control. It structures action or motion generation around compact, temporally-consistent latent representations that are processed incrementally, with the aim of supporting real-time responsiveness, efficient adaptation, and long-horizon consistency. The approach has emerged as a unifying framework across human motion stylization (Ren et al., 17 Oct 2025), online action understanding (Yang et al., 2024), navigation in embodied vision-language settings (Fan et al., 4 Mar 2026), and world modeling for policy transfer and adaptation (Gao et al., 24 Mar 2025). The shared insight is to encode action-relevant transitions or contextual cues into a latent space and to stream these representations, often in tandem with predictive, autoregressive models. Doing so decouples core action information from high-dimensional observations while keeping inference causal, efficient, and low-latency.
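The streaming pattern described above can be illustrated with a minimal Python sketch. It is not the method of any of the cited works: the `encode` function, the `StreamingPredictor` class, and the `alpha` parameter are hypothetical names chosen for illustration, average pooling stands in for a learned encoder, and exponential smoothing stands in for a learned autoregressive model. The sketch only shows the control flow: each observation is compressed to a small latent, and the predictor emits an action from that latent and its own previous output, without access to future frames.

```python
from typing import Iterable, Iterator, List, Optional

Vector = List[float]


def encode(observation: Vector, latent_dim: int = 2) -> Vector:
    """Hypothetical encoder: average-pool the observation into latent_dim values.

    A real system would use a learned encoder; pooling is an assumption
    made only to keep the example self-contained.
    """
    chunk = len(observation) // latent_dim
    if chunk == 0:
        raise ValueError("observation is shorter than latent_dim")
    # Trailing elements that do not fill a whole chunk are ignored.
    return [
        sum(observation[i * chunk:(i + 1) * chunk]) / chunk
        for i in range(latent_dim)
    ]


class StreamingPredictor:
    """Toy autoregressive predictor with constant-size state.

    Each action blends the current latent with the previous action
    (exponential smoothing), so the output depends only on the past.
    """

    def __init__(self, alpha: float = 0.5) -> None:
        self.alpha = alpha
        self.prev_action: Optional[Vector] = None

    def step(self, latent: Vector) -> Vector:
        if self.prev_action is None:
            # First step: no history yet, so pass the latent through.
            action = list(latent)
        else:
            action = [
                self.alpha * z + (1.0 - self.alpha) * a
                for z, a in zip(latent, self.prev_action)
            ]
        self.prev_action = action
        return action


def stream_actions(
    observations: Iterable[Vector], alpha: float = 0.5, latent_dim: int = 2
) -> Iterator[Vector]:
    """Yield one action per observation as soon as that observation arrives."""
    predictor = StreamingPredictor(alpha)
    for obs in observations:
        yield predictor.step(encode(obs, latent_dim))
```

For example, `list(stream_actions([[1.0, 1.0, 3.0, 3.0], [2.0, 2.0, 4.0, 4.0]]))` returns `[[1.0, 3.0], [1.5, 3.5]]`. Because `stream_actions` is a generator and the predictor keeps only its previous action, per-step cost and memory stay constant regardless of how long the stream runs, which is the property the low-latency claim relies on.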