MARRS: Masked Autoregressive Unit-based Reaction Synthesis (2505.11334v1)

Published 16 May 2025 in cs.CV

Abstract: This work aims at a challenging task: human action-reaction synthesis, i.e., generating human reactions based on the action sequence of the other as conditions. Currently, autoregressive modeling approaches have achieved remarkable performance in motion generation tasks, e.g. text-to-motion. However, vector quantization (VQ) accompanying autoregressive generation has inherent disadvantages, including loss of quantization information, low codebook utilization, etc. Moreover, unlike text-to-motion, which focuses solely on the movement of body joints, human action-reaction synthesis also encompasses fine-grained hand movements. In this work, we propose MARRS, a novel framework designed to generate coordinated and fine-grained reaction motions in continuous representations. Initially, we present the Unit-distinguished Motion Variational AutoEncoder (UD-VAE), which segments the entire body into distinct body and hand units, encoding them independently. Subsequently, we propose Action-Conditioned Fusion (ACF), which involves randomly masking a subset of reactive tokens and extracting specific information about the body and hands from the active tokens. Furthermore, we introduce Adaptive Unit Modulation (AUM) to facilitate interaction between body and hand units by using the information from one unit to adaptively modulate the other. Finally, for the diffusion model, we employ a compact MLP as a noise predictor for each distinct body unit and incorporate the diffusion loss to model the probability distribution of each token. Quantitative and qualitative results demonstrate that our method achieves superior performance. The code will be released upon acceptance.

Summary

  • The paper presents MARRS, a novel framework for generating realistic human action-reaction sequences using masked autoregressive modeling without the limitations of vector quantization.
  • MARRS introduces components like UD-VAE for unit-based encoding, ACF for masked fusion, and AUM for adaptive unit coordination to enhance detailed and coordinated motion generation.
  • Evaluated on the NTU120-AS dataset, MARRS achieves superior quantitative and qualitative results, demonstrating high fidelity, accurate action recognition, and improved diversity in generated human motion sequences.

Overview of MARRS: Masked Autoregressive Unit-based Reaction Synthesis

The paper presents MARRS, an approach for synthesizing human reactions conditioned on another person's action sequence via masked autoregressive modeling. This task matters for computer animation, game development, and robotic control, where synthesizing realistic human interactions is vital. Unlike conventional autoregressive pipelines, which lose information through vector quantization and often leave much of the codebook unused, MARRS generates motion directly in continuous representations, avoiding quantization while retaining masked autoregressive generation.

Key Contributions and Methodology

MARRS introduces several novel components that enhance the generation of human motions:

  1. Unit-distinguished Motion Variational AutoEncoder (UD-VAE): This module segments the whole body into distinct body and hand units and encodes each independently. The separation yields a more nuanced representation of movement, in particular capturing fine-grained hand gestures alongside body motion.
  2. Action-Conditioned Fusion (ACF): ACF randomly masks a subset of reactive tokens and extracts body- and hand-specific information from the active (actor) tokens. This masked conditioning ties the generated reaction to the observed action and captures the interplay between the two participants.
  3. Adaptive Unit Modulation (AUM): AUM facilitates the coordination between body and hand units by adaptively modulating one unit using information from the other. This mutual modulation ensures coordinated movement generation, maintaining consistency and realism.
  4. Diffusion Model for Prediction: The approach employs a compact multilayer perceptron (MLP) as the noise predictor for each unit and incorporates a diffusion loss to model the probability distribution of each continuous token. This replaces VQ-based categorical prediction, enabling finer control over token predictions without the information loss of quantization (a minimal sketch of these components follows this list).
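
To make these components concrete, the following is a minimal PyTorch sketch of the unit-based ideas described above: per-unit continuous encoding (UD-VAE), random masking of reactive tokens (ACF), cross-unit modulation (AUM), and a compact MLP noise predictor trained with a diffusion loss. All class names, tensor shapes, and the noise schedule are illustrative assumptions rather than the authors' released implementation.

```python
# Minimal sketch of the unit-based pipeline; names, shapes, and the noise
# schedule are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class UnitVAE(nn.Module):
    """UD-VAE idea: encode one unit (body or hands) into continuous latents, no VQ."""
    def __init__(self, in_dim: int, latent_dim: int):
        super().__init__()
        self.encoder = nn.Linear(in_dim, 2 * latent_dim)   # predicts mean and log-variance
        self.decoder = nn.Linear(latent_dim, in_dim)

    def encode(self, x: torch.Tensor) -> torch.Tensor:
        mu, logvar = self.encoder(x).chunk(2, dim=-1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization


def mask_reactive_tokens(tokens: torch.Tensor, mask_ratio: float = 0.5):
    """ACF-style step: randomly mask a subset of the reactor's tokens."""
    B, T, _ = tokens.shape
    mask = torch.rand(B, T, device=tokens.device) < mask_ratio
    masked = tokens.clone()
    masked[mask] = 0.0                                      # masked positions are zeroed here
    return masked, mask


class AdaptiveUnitModulation(nn.Module):
    """AUM idea: scale/shift one unit's tokens using features from the other unit."""
    def __init__(self, dim: int):
        super().__init__()
        self.to_scale_shift = nn.Linear(dim, 2 * dim)

    def forward(self, tokens: torch.Tensor, other_unit: torch.Tensor) -> torch.Tensor:
        scale, shift = self.to_scale_shift(other_unit).chunk(2, dim=-1)
        return tokens * (1 + scale) + shift


class MLPNoisePredictor(nn.Module):
    """Compact per-unit MLP that regresses the noise added to a continuous token."""
    def __init__(self, dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, noisy_token, cond, t):
        t = t[:, None].float()                              # timestep as an extra scalar input
        return self.net(torch.cat([noisy_token, cond, t], dim=-1))


def diffusion_loss(predictor, clean_token, cond, num_steps: int = 1000):
    """Per-token diffusion loss: noise each token at a random step and regress the noise."""
    B = clean_token.shape[0]
    t = torch.randint(0, num_steps, (B,), device=clean_token.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_steps)[:, None] ** 2  # toy cosine schedule
    noise = torch.randn_like(clean_token)
    noisy = alpha_bar.sqrt() * clean_token + (1 - alpha_bar).sqrt() * noise
    return F.mse_loss(predictor(noisy, cond, t), noise)
```

At inference time, the same per-unit MLP would be applied iteratively to denoise each masked token from Gaussian noise into a continuous latent, which the corresponding unit decoder then maps back to poses.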

Experimental Results and Implications

The efficacy of the MARRS framework is demonstrated through extensive experiments using the NTU120-AS dataset. MARRS achieves superior quantitative and qualitative results compared to state-of-the-art methods, particularly in settings that require real-time and responsive generation. Notable achievements include:

  • Low FID Scores: MARRS attains low Fréchet Inception Distance (FID) values across experimental settings, indicating high fidelity in the generated motion sequences (see the metric sketch after this list).
  • Accurate Action Recognition: The model also performs well on metrics assessing action-motion matching, underscoring its robustness in diverse motion scenarios.
  • Diversity and Multimodality: MARRS exhibits substantial improvements in generating diverse and multi-modal sequences, addressing prior limitations in capturing the variability inherent in human interactions.
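
For reference, the FID reported on this benchmark compares Gaussian statistics of features extracted from real and generated motion by an evaluation backbone (lower is better). The snippet below is a generic sketch of the distance itself over pre-extracted feature matrices; the benchmark's motion feature extractor is assumed to exist separately and is not reproduced here.

```python
# Generic Fréchet distance over pre-extracted feature matrices (rows = samples).
import numpy as np
from scipy import linalg


def frechet_distance(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||^2 + Tr(C_r + C_g - 2 (C_r C_g)^(1/2))."""
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    cov_r = np.cov(real_feats, rowvar=False)
    cov_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)    # matrix square root
    if np.iscomplexobj(covmean):                            # drop tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```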

Implications for Future AI Development

The implications of MARRS extend beyond immediate applications in animation and robotics. Its innovative approach to unit-based encoding and autoregressive modeling without vector quantization paves the way for more flexible, detailed, and efficient generative models. Anticipated future developments could explore expanding the unit division strategy to incorporate additional movement details or environmental interactions. Additionally, the principles of ACF and AUM could inform advancements in other domains such as audio-visual interactions and complex event simulations.

In summary, the paper presents MARRS as a significant methodological advancement in the domain of human motion generation, providing promising directions for research and applications involving intricate human-human interactions.