- The paper introduces MOCHA, a framework that transfers a target character's style and body proportions to neutral input motion in real time, using a Neural Context Matcher to retrieve contextually matched target features.
- It leverages a transformer-based Characterizer with Adaptive Instance Normalization and context-mapping cross-attention to balance style transfer with motion integrity.
- The system enables low-latency, frame-by-frame processing suitable for interactive applications and demonstrates robust performance even with sparse input data.
MOCHA presents a framework for real-time motion characterization, enabling the transformation of neutral input motions to reflect the distinct style and body proportions of a target character (2310.10079). The system operates online, processing motion frame by frame to achieve low-latency performance suitable for interactive applications. Its core contributions lie in the Neural Context Matcher, which identifies contextually similar motion features from the target character, and the Characterizer network, which integrates these features into the source motion while preserving its underlying structure.
Methodology Overview
The MOCHA framework processes a stream of input motion data, typically represented as joint positions or rotations over time. For each input frame (or a small window of frames), the system performs the following steps:
- Input Motion Encoding: The input motion sequence is encoded into a latent motion feature representation. This representation is designed to capture both the kinematic structure (body part topology) and the temporal dependencies within the motion.
- Neural Context Matching: A novel Neural Context Matcher module takes the encoded input motion feature. Its goal is to retrieve or generate a corresponding motion feature from the target character's repertoire that exhibits the most similar context. This module operates autoregressively, ensuring temporal coherence in the generated target character features across consecutive frames.
- Characterization: The Characterizer network receives both the input motion feature and the context-matched target character feature. It then synthesizes the final output pose by injecting the stylistic elements (timing, exaggeration, etc.) and proportional differences from the target feature into the input feature, while critically preserving the semantic content or context of the original input motion.
- Output Pose Generation: The output feature from the Characterizer is decoded into the final characterized pose for the current time step.
This pipeline is designed for online, frame-by-frame processing, making it suitable for real-time character animation tasks.
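The sketch below illustrates how such an online loop could be organized. It is a minimal PyTorch mock-up: the module names follow the paper's terminology, but the interfaces, feature sizes, and stub architectures are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn

FEAT_DIM = 256      # latent motion feature size (assumed)
POSE_DIM = 22 * 6   # e.g. 22 joints x 6D rotations (assumed)

class MotionEncoder(nn.Module):
    """Stub: maps the current input pose to a latent motion feature f_in."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(POSE_DIM, FEAT_DIM), nn.ReLU(),
                                 nn.Linear(FEAT_DIM, FEAT_DIM))
    def forward(self, pose):
        return self.net(pose)

class NeuralContextMatcher(nn.Module):
    """Stub: autoregressively produces a context-matched character feature f_char."""
    def __init__(self):
        super().__init__()
        self.cell = nn.GRUCell(FEAT_DIM, FEAT_DIM)
    def forward(self, f_in, f_char_prev):
        return self.cell(f_in, f_char_prev)  # conditioned on f_in, carries f_char state

class Characterizer(nn.Module):
    """Stub: fuses f_in and f_char into the characterized feature f_out."""
    def __init__(self):
        super().__init__()
        self.fuse = nn.Linear(2 * FEAT_DIM, FEAT_DIM)
    def forward(self, f_in, f_char):
        return torch.relu(self.fuse(torch.cat([f_in, f_char], dim=-1)))

class PoseDecoder(nn.Module):
    """Stub: decodes f_out back to an output pose."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(FEAT_DIM, POSE_DIM)
    def forward(self, f_out):
        return self.net(f_out)

encoder, matcher = MotionEncoder(), NeuralContextMatcher()
characterizer, decoder = Characterizer(), PoseDecoder()

f_char = torch.zeros(1, FEAT_DIM)              # autoregressive state
for t in range(120):                           # streamed input frames
    input_pose = torch.randn(1, POSE_DIM)      # stand-in for a tracked frame
    f_in = encoder(input_pose)                 # 1. encode input motion
    f_char = matcher(f_in, f_char)             # 2. context-match target feature
    f_out = characterizer(f_in, f_char)        # 3. inject style and proportions
    output_pose = decoder(f_out)               # 4. decode characterized pose
```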
Motion Feature Encoding
The first step transforms the raw input motion data into a structured feature representation. MOCHA employs an encoder that explicitly accounts for the character's articulated structure and the temporal dynamics of the motion; the abstract describes structuring the body part topology and capturing motion dependencies. While the specific encoder architecture is not fully detailed in the abstract, common approaches in motion synthesis use Graph Neural Networks (GNNs) to model the skeletal structure or Transformers to capture long-range temporal dependencies. The resulting motion feature f_in(t) serves as a condensed representation of the input motion's state and recent history at time t.
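As a concrete illustration, one way to structure such an encoder is to treat each body part as a token and mix the tokens with self-attention. The grouping into parts, the dimensions, and the transformer layout below are assumptions, not the paper's documented architecture.

```python
import torch
import torch.nn as nn

class BodyPartMotionEncoder(nn.Module):
    """Illustrative encoder: one token per body part, mixed with self-attention."""
    def __init__(self, num_parts=6, part_dim=24, feat_dim=256, num_layers=2):
        super().__init__()
        self.part_embed = nn.Linear(part_dim, feat_dim)        # lift each part
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8,
                                           batch_first=True)
        self.mixer = nn.TransformerEncoder(layer, num_layers=num_layers)
        self.pool = nn.Linear(num_parts * feat_dim, feat_dim)  # fuse into f_in

    def forward(self, parts):
        # parts: (batch, num_parts, part_dim) -- joint data grouped by limb,
        # optionally stacked with a few past frames to expose temporal context
        tokens = self.part_embed(parts)
        mixed = self.mixer(tokens)          # attention across body-part tokens
        return self.pool(mixed.flatten(1))  # (batch, feat_dim) motion feature

# f_in = BodyPartMotionEncoder()(torch.randn(2, 6, 24))   # -> shape (2, 256)
```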
Neural Context Matcher
This component is central to MOCHA's ability to find appropriate stylistic references from the target character. It takes the input motion feature f_in(t) at time t and aims to produce a target character motion feature f_char(t) that corresponds contextually. The "context" refers to the nature of the action being performed (e.g., walking, running, waving). The matcher needs to find a segment in the target character's motion data (or generate a feature) that represents the same type of action, even if performed with a different style and body shape.
The Neural Context Matcher is implemented as a conditioned autoregressive model. Let f_char(t) be the target character feature at time t. Its generation is conditioned on the input feature f_in(t) and the previously generated character feature f_char(t−1):

f_char(t) = NeuralContextMatcher(f_in(t), f_char(t−1))

This autoregressive formulation promotes temporal smoothness and coherence in the generated target style features. The conditioning on f_in(t) ensures that the generated f_char(t) is contextually relevant to the current input motion. The exact mechanism for "context matching" might involve attention mechanisms or learned distance metrics within the latent space to identify the most similar motion context in the target character's style manifold based on the input feature.
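One plausible realization of this matcher retrieves from a bank of pre-encoded target-character features with attention and updates the result autoregressively. The feature bank, the scaled dot-product retrieval, and the GRU-style update in the sketch below are assumptions used only to make the formulation concrete.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralContextMatcher(nn.Module):
    """Illustrative conditioned autoregressive matcher over a feature bank."""
    def __init__(self, feat_dim=256, bank_size=512):
        super().__init__()
        # pre-encoded motion features from the target character's repertoire
        self.char_bank = nn.Parameter(torch.randn(bank_size, feat_dim))
        self.to_query = nn.Linear(2 * feat_dim, feat_dim)
        self.update = nn.GRUCell(feat_dim, feat_dim)

    def forward(self, f_in_t, f_char_prev):
        # query depends on the current input feature and the previous output,
        # keeping consecutive frames contextually relevant and temporally coherent
        q = self.to_query(torch.cat([f_in_t, f_char_prev], dim=-1))
        attn = F.softmax(q @ self.char_bank.t() / q.shape[-1] ** 0.5, dim=-1)
        retrieved = attn @ self.char_bank                 # context-matched feature
        return self.update(retrieved, f_char_prev)        # autoregressive update

# matcher = NeuralContextMatcher()
# f_char_t = matcher(torch.randn(1, 256), torch.zeros(1, 256))
```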
Characterizer Network
The Characterizer network is responsible for the actual fusion of style and content. It integrates the target character feature f_char(t) into the input motion feature f_in(t) to produce the final characterized feature f_out(t), which is then decoded into the output pose. The key challenge is to inject the style (idiosyncrasies, timing, posing) and proportions of the target character without corrupting the fundamental action defined by f_in(t).
MOCHA employs a transformer-based architecture for the Characterizer. Two specific mechanisms are highlighted:
- Adaptive Instance Normalization (AdaIN): AdaIN is commonly used for style transfer in images and is adapted here for motion. It aligns the mean and variance of the input feature map (derived from f_in(t)) to match those of the target style feature map (derived from f_char(t)). If x is an activation map within the Characterizer derived from f_in(t), and μ(⋅), σ(⋅) denote channel-wise mean and standard deviation, then AdaIN performs:
AdaIN(x, f_char) = σ(f_char) · (x − μ(x)) / σ(x) + μ(f_char)
This effectively transfers the second-order statistics, often associated with style, from f_char to x.
- Context Mapping-based Cross-Attention: To ensure the input motion's context is preserved and accurately guides the stylization, MOCHA uses a specialized cross-attention mechanism. This allows the network to selectively focus on relevant parts of the input feature f_in(t) when integrating information from the target character feature f_char(t). The "context mapping" likely refers to how the attention weights are computed or applied to ensure that stylistic modifications are appropriate for the specific action context encoded in f_in(t). This prevents artifacts where the style transfer overrides the intended motion.
The combination of AdaIN and context-aware cross-attention within the transformer allows the Characterizer to blend the source motion's structure with the target character's unique style and proportions; a minimal sketch of one such block follows.
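In the sketch below, cross-attention uses the input feature as the query over the target-character feature, and AdaIN then re-normalizes the result with the target's channel statistics. Treating the features as short token sequences and the exact placement of AdaIN are assumptions about the block layout, not details confirmed by the paper.

```python
import torch
import torch.nn as nn

def adain(x, f_char, eps=1e-5):
    """AdaIN for motion features: channel-wise stats over the token dimension."""
    mu_x, std_x = x.mean(1, keepdim=True), x.std(1, keepdim=True) + eps
    mu_c, std_c = f_char.mean(1, keepdim=True), f_char.std(1, keepdim=True) + eps
    return std_c * (x - mu_x) / std_x + mu_c

class CharacterizerBlock(nn.Module):
    """Illustrative block: context-guided cross-attention followed by AdaIN."""
    def __init__(self, feat_dim=256, nhead=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(feat_dim, nhead, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.GELU(),
                                nn.Linear(feat_dim, feat_dim))

    def forward(self, f_in_tokens, f_char_tokens):
        # queries come from the input motion, so its context steers what is
        # taken from the target-character feature
        attended, _ = self.cross_attn(query=f_in_tokens,
                                      key=f_char_tokens, value=f_char_tokens)
        styled = adain(f_in_tokens + attended, f_char_tokens)  # transfer style stats
        return styled + self.ff(styled)

# block = CharacterizerBlock()
# f_out = block(torch.randn(1, 8, 256), torch.randn(1, 8, 256))  # -> (1, 8, 256)
```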
Real-Time Implementation and Applications
A significant aspect of MOCHA is its suitability for real-time applications. The frame-by-frame processing pipeline, built on efficient neural components such as transformers, facilitates low-latency characterization. This capability enables direct use in interactive systems such as games, virtual reality avatars, or live performance retargeting.
The paper explicitly mentions applicability to scenarios with sparse input. This suggests the motion encoder and the overall framework are robust to incomplete input data, potentially using interpolation or learned priors to handle missing joint information, which is common in real-time tracking systems.
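One simple way to make an encoder tolerant to missing joints is to fill untracked joints with a rest pose and pass the validity mask along as an extra channel. This is an assumption about how sparse input could be handled, not the paper's documented strategy.

```python
import torch

def prepare_sparse_input(joint_positions, valid_mask, rest_pose):
    """Fill untracked joints with a rest pose and append the validity mask.

    joint_positions: (batch, joints, 3)  tracked positions (unreliable where invalid)
    valid_mask:      (batch, joints)     1.0 where the joint is tracked
    rest_pose:       (joints, 3)         neutral pose used as a fallback
    """
    mask = valid_mask.unsqueeze(-1)                       # (batch, joints, 1)
    filled = joint_positions * mask + rest_pose * (1.0 - mask)
    return torch.cat([filled, mask], dim=-1)              # (batch, joints, 4)
```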
Furthermore, MOCHA contributes a high-quality motion dataset featuring six distinct characters performing various actions. This dataset serves not only to train and evaluate MOCHA but also as a resource for future research in character animation and style transfer.
Evaluation
The effectiveness of MOCHA was assessed through comparisons with existing motion style transfer techniques and ablation studies. The comparisons likely focused on metrics evaluating style similarity to the target character, preservation of the source motion content, temporal coherence, and computational performance (latency). The ablation studies systematically removed or replaced key components (like the Neural Context Matcher, AdaIN, or context-mapping cross-attention) to quantify their individual contributions to the overall performance. The results presented in the paper presumably demonstrate MOCHA's advantages in terms of characterization quality, real-time capability, and temporal stability compared to prior art.
Conclusion
MOCHA offers a coherent framework for online motion characterization, addressing the simultaneous transfer of motion style and body proportions in real time. Its key innovations, the Neural Context Matcher for temporally coherent style retrieval and the transformer-based Characterizer utilizing AdaIN and context-mapping cross-attention, enable effective style integration while preserving the input motion's context. The framework's real-time performance and robustness to sparse inputs make it a practical solution for various interactive character animation applications.