Lifelong Empathic Motion Generation
- L²-EMG is a machine learning paradigm that enables continuous, emotion-conditioned motion synthesis across diverse scenarios, leveraging a dynamic mixture of experts.
- The ES-MoE framework employs motion tokenization, causal-guided emotion decoupling, and scenario-adapted expert constructing to prevent catastrophic forgetting.
- Extensive evaluations show enhanced emotion transfer, superior scenario adaptation, and reduced forgetting compared to advanced baseline models.
Lifelong Empathic Motion Generation (L²-EMG) is a machine learning paradigm that establishes continual, scenario-adaptive emotional motion synthesis with LLMs. Unlike traditional approaches restricted to fixed-scale datasets, L²-EMG targets open-ended, continually growing motion domains, such as sports and dance, requiring persistent acquisition of emotion-conditioned motion knowledge across previously unseen contexts. The framework is architected to support a closed-loop, self-evolving embodied agent, emphasizing both empathic expressivity and incremental scenario adaptation, and addresses two critical technical challenges: emotion decoupling (separating emotion signals from context) and scenario adaptation (retaining prior skills while incorporating new ones). The reference implementation is the Emotion-Transferable and Scenario-Adapted Mixture of Experts (ES-MoE) approach, validated on newly constructed lifelong motion datasets (Wang et al., 22 Dec 2025).
1. Architectural Foundations
The ES-MoE approach is structured into three sequential components:
- Motion Tokenization: A VQ-VAE encodes raw 3D motion sequences into discrete tokens, enabling compact, language-model-friendly motion representations.
- Causal-Guided Emotion Decoupling (CGED): This block receives text and motion tokens and employs causal front-door adjustment to extract emotion features that are scenario-agnostic, mitigating confounding motion content.
- Scenario-Adapted Expert Constructing (SAMoE): This module assembles a Mixture-of-Experts (MoE), where each expert is a scenario-specific Low-Rank Adapter (LoRA), and a gating network dynamically aggregates expert outputs. Only the LoRA corresponding to the current scenario is updated, while all previous adapters are frozen.
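The tokenization step above reduces to a nearest-neighbor lookup against a learned codebook. The following is a minimal sketch of that quantization only (the VQ-VAE encoder/decoder and all sizes are illustrative assumptions, not the paper's configuration):

```python
import numpy as np

rng = np.random.default_rng(3)
K, d, T = 512, 64, 30                 # codebook size, latent dim, motion frames (assumed)

codebook = rng.normal(size=(K, d))    # VQ-VAE codebook, assumed already pretrained
latents = rng.normal(size=(T, d))     # encoder outputs for one raw 3D motion sequence

# Quantize: each frame latent maps to the index of its nearest codebook entry,
# yielding the discrete, language-model-friendly motion tokens.
dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
tokens = dists.argmin(axis=1)
assert tokens.shape == (T,) and tokens.max() < K
```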
The model parameterization at scenario $t$ is

$$W_t = W_0 + \sum_{i=1}^{t} g_i \, \Delta W_i,$$

where $W_0$ denotes the static LLM weights, $\Delta W_i$ is the LoRA parameter for scenario $i$, and $g_i$ its gating weight. This schema is designed to prevent catastrophic forgetting and support lifelong accumulation of scenario-specific competencies.
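The parameterization, frozen base weights plus a gated sum of low-rank adapter updates, can be sketched as follows (all dimensions and values are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
d, r, num_scenarios = 16, 4, 3                  # toy sizes, not from the paper

W0 = rng.normal(size=(d, d))                    # frozen base LLM weight W_0
loras = [(rng.normal(size=(d, r)) * 0.01,       # low-rank factors B_i, A_i so that
          rng.normal(size=(r, d)) * 0.01)       # Delta W_i = B_i @ A_i
         for _ in range(num_scenarios)]

def gated_forward(x, gates):
    """Aggregate frozen LoRA experts: y = x @ (W0 + sum_i g_i * B_i A_i)."""
    W_eff = W0 + sum(g * (B @ A) for g, (B, A) in zip(gates, loras))
    return x @ W_eff

x = rng.normal(size=(1, d))
gates = np.array([0.2, 0.1, 0.7])  # in practice produced by the gating network
y = gated_forward(x, gates)
assert y.shape == (1, d)
```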
2. Causal-Guided Emotion Decoupling Block
The CGED block addresses the emotion decoupling challenge by modeling the causal structure $X \rightarrow Z \rightarrow Y$ with confounder $C$ ($X \leftarrow C \rightarrow Y$), where $X$ is the concatenated input (text embeddings and motion tokens), $Z$ denotes decoupled intermediate features, $C$ represents confounding shallow motion semantics, and $Y$ is the emotion label. The back-door path is neutralized using Pearl's front-door adjustment:

$$P(Y \mid \mathrm{do}(X)) = \sum_{z} P(z \mid X) \sum_{x'} P(Y \mid z, x') \, P(x').$$

This formulation is approximated via the Normalized Weighted Geometric Mean (NWGM) through attention-based mechanisms. Two key sampling processes are leveraged:
- Self-sampling: Computes the in-sample estimate $\hat{Z}$ by self-attention within a sample.
- Cross-sampling: Computes the cross-sample estimate $\hat{X}$ by attending to a global emotion codebook.
The concatenated vectors $[\hat{Z}; \hat{X}]$ are projected to obtain emotion-highlighted features $F_e$. Emotion clarity is enforced by an attached classifier optimized with cross-entropy loss:

$$\mathcal{L}_{\text{emo}} = -\sum_{c} y_c \log \hat{y}_c,$$

where $y$ is the one-hot emotion label and $\hat{y}$ the classifier prediction.
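The attention-based NWGM approximation above can be sketched end to end: self-attention for the in-sample estimate, cross-attention over a global emotion codebook for the cross-sample estimate, then concatenation, projection, and an auxiliary classifier. Every size, weight, and label below is an illustrative assumption:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
L, d, K, num_emotions = 8, 32, 6, 6   # seq length, dim, codebook size, classes (assumed)

H = rng.normal(size=(L, d))           # fused text + motion-token features (X)
codebook = rng.normal(size=(K, d))    # global emotion codebook for cross-sampling

# Self-sampling: in-sample estimate via self-attention within the sample
A_self = softmax(H @ H.T / np.sqrt(d))
Z_hat = A_self @ H

# Cross-sampling: estimate via cross-attention to the global codebook
A_cross = softmax(H @ codebook.T / np.sqrt(d))
X_hat = A_cross @ codebook

# Concatenate and project to emotion-highlighted features F_e
W_proj = rng.normal(size=(2 * d, d)) * 0.1
F_e = np.concatenate([Z_hat, X_hat], axis=-1) @ W_proj

# Auxiliary classifier with cross-entropy on the pooled feature
W_cls = rng.normal(size=(d, num_emotions)) * 0.1
probs = softmax(F_e.mean(axis=0) @ W_cls)
label = 2                             # hypothetical ground-truth emotion index
ce_loss = -np.log(probs[label])
assert probs.shape == (num_emotions,) and ce_loss > 0
```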
3. Scenario-Adapted Expert Constructing Block
Upon introduction of a new scenario $t$, a LoRA module $\Delta W_t$ with fixed rank $r$ is instantiated. The scenario-to-adapter assignment is retained lifelong; $W_0$ and prior $\Delta W_{i<t}$ are held static. The MoE gating network maintains for each expert $i$ a key $k_i$. Given emotion-highlighted features $F_e$, a query $q$ is produced, and gate weights are computed as

$$g_i = \operatorname{softmax}_i\!\left(\frac{q^{\top} k_i}{\sqrt{d}}\right).$$

During training, a random subset of old experts may be masked to limit parameter growth and to encourage balanced relevance across scenarios. No explicit meta-learning objective is required; the MoE with frozen adapters implicitly regularizes the system and preserves prior task abilities.
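The key-query gating with random masking of old experts can be sketched as follows (sizes, keys, and which experts get masked are illustrative assumptions):

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(2)
d, num_experts = 32, 4                     # toy sizes; last expert is the current one

keys = rng.normal(size=(num_experts, d))   # one learned key per LoRA expert
q = rng.normal(size=(d,))                  # query derived from emotion-highlighted F_e

def gate_weights(q, keys, mask=None):
    """Softmax over scaled query-key scores; masked experts get weight zero."""
    scores = keys @ q / np.sqrt(d)
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)
    return softmax(scores)

# Training: randomly mask one of the *old* experts (indices 0..num_experts-2)
mask = np.ones(num_experts, dtype=bool)
mask[rng.choice(num_experts - 1, size=1, replace=False)] = False
g_train = gate_weights(q, keys, mask)
assert np.isclose(g_train.sum(), 1.0) and (g_train[~mask] == 0).all()

# Inference: all experts participate
g_infer = gate_weights(q, keys)
assert np.isclose(g_infer.sum(), 1.0)
```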
4. Lifelong Continual Learning Regime
L²-EMG employs an incremental learning protocol over scenarios:
- At each step $t$, the available data is restricted to the current scenario $S_t$, and only $\Delta W_t$ and the gating query parameters are updated.
- The training objective combines the next-token loss for the LLM with the emotion classification loss,

$$\mathcal{L} = \mathcal{L}_{\text{token}} + \lambda \, \mathcal{L}_{\text{emo}},$$

with $\lambda$ a weighting coefficient, and with $W_0$ and historical adapters frozen.
- The MoE's aggregation at inference uses all adapters, retaining consolidated knowledge.
This yields resistance to catastrophic forgetting. Freezing previous LoRAs prevents parameter drift, and masking in the gating step discourages dominance by any single expert module.
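The incremental protocol, instantiate a fresh adapter per scenario, freeze all previous ones, train only on current-scenario data, can be expressed as a small framework-agnostic sketch (scenario subset and bookkeeping are illustrative assumptions):

```python
# Continual-learning protocol sketch in plain Python.
scenarios = ["Daily Life", "Sports", "Dance"]       # subset for illustration
adapters, log = {}, []

for t, scenario in enumerate(scenarios):
    adapters[scenario] = {"trainable": True}        # new LoRA for scenario t
    for past in list(adapters)[:-1]:
        adapters[past]["trainable"] = False         # freeze every prior adapter
    # ...train on scenario-t data only, minimizing L = L_token + lambda * L_emo...
    log.append([name for name, a in adapters.items() if a["trainable"]])

# At every stage exactly one adapter (the current scenario's) was trainable.
assert log == [["Daily Life"], ["Sports"], ["Dance"]]
```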
5. Dataset Construction and Preprocessing
Datasets for L²-EMG are composed of approximately 20,000 text–motion pairs, each sample labeled with one of six emotions (Happy, Sad, Angry, Fear, Surprise, Neutral) and partitioned into eight defined scenarios: Daily Life, Sports, Dance, Shows, Game, Animation, Instrument Play, and Acrobatics.
Two evaluation splits are constructed:
- Unseen L²-EMG: Strictly sequential training on scenario-specific partitions.
- Mixed L²-EMG: Each mini-batch is scenario-skewed but includes samples from all scenarios to varying degrees.
Motion sequences are encoded into discrete tokens by a pretrained VQ-VAE. Texts are systematically prefixed with "Generate a motion sequence that aligns with the following emotional text description." Each scenario subset is partitioned into train/val/test splits at a ratio of 80/5/15.
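The prompt prefixing and per-scenario 80/5/15 partitioning described above can be sketched as follows (the helper name, seed, and toy samples are assumptions; only the prefix string and split ratio come from the text):

```python
import random

PREFIX = ("Generate a motion sequence that aligns with the "
          "following emotional text description. ")

def split_scenario(samples, seed=0):
    """80/5/15 train/val/test split within one scenario subset."""
    rng = random.Random(seed)
    samples = samples[:]
    rng.shuffle(samples)
    n = len(samples)
    n_train, n_val = int(0.80 * n), int(0.05 * n)
    return (samples[:n_train],
            samples[n_train:n_train + n_val],
            samples[n_train + n_val:])

data = [{"text": f"caption {i}", "emotion": "Happy"} for i in range(100)]
train, val, test = split_scenario(data)
prompted = [PREFIX + s["text"] for s in train]
assert (len(train), len(val), len(test)) == (80, 5, 15)
assert prompted[0].startswith("Generate a motion sequence")
```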
6. Evaluation Metrics and Comparative Results
Performance is quantified with the following metrics, averaged over all scenarios after the final training stage ($t = T$):
- AF: average FID (Fréchet Inception Distance; lower is better)
- AR: average top-1 R-Precision (higher is better)
- AD: average diversity score
- AMM: average multimodality score
- AWF: average weighted F1 for emotion classification
- FR: forgetting rate (negative values indicate less forgetting)
| Metric | AF↓ | AR↑ | AD↑ | AMM↑ | AWF↑ | FR↓ |
|---|---|---|---|---|---|---|
| SAPT (Unseen) | 2.12 | 0.237 | 9.61 | 1.59 | 0.313 | –0.54 |
| ES-MoE (Unseen) | 1.89 | 0.241 | 9.74 | 1.47 | 0.340 | –1.03 |
| SAPT (Mixed) | 1.65 | 0.245 | 9.47 | 1.82 | 0.327 | –1.97 |
| ES-MoE (Mixed) | 1.39 | 0.259 | 9.87 | 1.65 | 0.347 | –3.03 |
ES-MoE achieves consistently lower FID, higher R-Precision and AWF, and more negative FR—reflecting superior emotion transfer, scenario retention, and lower catastrophic forgetting relative to advanced baselines.
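The exact FR formula is not reproduced here; a common continual-learning formulation, assumed purely for illustration, compares each old scenario's final-stage metric against its best earlier value (on a lower-is-better metric such as FID, negative values mean the final model improved, i.e., less forgetting). All values below are toy numbers:

```python
import numpy as np

# fid[t, i]: FID for scenario i after training stage t (lower is better; toy values)
fid = np.array([[2.0, np.nan, np.nan],
                [2.1, 1.9,    np.nan],
                [1.8, 1.7,    1.5]])

T = fid.shape[0]
# Per old scenario: final FID minus best (lowest) earlier FID; then average.
deltas = fid[T - 1, :T - 1] - np.nanmin(fid[:T - 1, :T - 1], axis=0)
FR = deltas.mean()
assert FR < 0   # negative: final model beats its earlier best, i.e., no forgetting
```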
7. Technical Challenges, Limitations, and Future Directions
CGED enables the consistent insertion of emotion cues (e.g., "sad" implying slow gait, lowered head) irrespective of scenario content, while SAMoE maintains motion style identity across newly learned contexts. The architecture’s use of frozen LoRA modules and expert gating strikes a balance between plasticity for new scenarios and retention of prior emotional motion strategies.
Notable limitations include reliance on explicit emotion annotations, potential underfitting for complex scenarios due to fixed LoRA rank, and uniform rather than adaptive sparsity in gate assignment. Proposed directions for further research include integration of zero-shot human-scene interaction (4D synthesis), end-to-end co-training of the VQ-VAE and ES-MoE, and hierarchical expert allocation with emotion-specific sub-modules (Wang et al., 22 Dec 2025).
L²-EMG, as instantiated by ES-MoE, substantiates a scalable mechanism for lifelong, continuously empathic motion generation in LLMs, resolving emotion decoupling and scenario adaptation through causal inference-driven feature isolation and modular expert aggregation.