Exploring Motion-Language Alignment for Text-driven Motion Generation

Published 3 Apr 2026 in cs.CV | (2604.02973v1)

Abstract: Text-driven human motion generation aims to synthesize realistic motion sequences that follow textual descriptions. Despite recent advances, accurately aligning motion dynamics with textual semantics remains a fundamental challenge. In this paper, we revisit text-to-motion generation from the perspective of motion-language alignment and propose MLA-Gen, a framework that integrates global motion priors with fine-grained local conditioning. This design enables the model to capture common motion patterns, while establishing detailed alignment between texts and motions. Furthermore, we identify a previously overlooked attention sink phenomenon in human motion generation, where attention disproportionately concentrates on the start text token, limiting the utilization of informative textual cues and leading to degraded semantic grounding. To analyze this issue, we introduce SinkRatio, a metric for measuring attention concentration, and develop alignment-aware masking and control strategies to regulate attention during generation. Extensive experiments demonstrate that our approach consistently improves both motion quality and motion-language alignment over strong baselines. Code will be released upon acceptance.


Authors (3)

Summary

The paper introduces the MLA-Gen framework that leverages learnable memory slots, cross-modal token alignment, and sink-aware strategies to improve motion synthesis.
It employs fine-grained alignment mechanisms and adaptive guidance to mitigate the attention sink phenomenon, ensuring robust temporal and semantic consistency.
Empirical evaluations on HumanML3D demonstrate significant gains in FID, R-Precision, and semantic matching compared to prior methods.

Motion-Language Alignment for Text-Driven Motion Generation: The MLA-Gen Framework

Motivation and Challenges in Text-to-Motion Generation

Text-driven human motion generation requires synthesizing temporally and spatially coherent 3D motion sequences conditioned on natural language descriptions. Most existing methods rely on global text embeddings (e.g., CLIP-based features) to guide generation. These embeddings capture high-level semantics well, but they cannot establish detailed, temporally sensitive alignment between individual tokens and specific motion dynamics. As a result, generated motions often match the coarse intent of the description but miss fine-grained movement details and semantic grounding, as the failure cases of prior methods illustrate.

Figure 1: Failure cases from a previous text-to-motion generation framework, which captures global motion patterns but often overlooks fine-grained motion details.

A central technical challenge, therefore, is achieving fine-grained motion-language alignment, ensuring that the motion generator not only adheres to the global context but also exploits local textual cues for enhanced realism and semantic precision.

The MLA-Gen Framework: Components and Architectural Innovations

MLA-Gen addresses these alignment issues with three complementary modules: learnable memory slots, a motion-language alignment mechanism, and attention sink mitigation strategies.

Figure 2: Overview of the MLA-Gen framework, comprising memory slots, fine-grained motion-language alignment, and attention sink mitigation strategies.

Memory Slots for Global Motion Priors:

MLA-Gen employs a set of learnable memory slots that serve as global motion prototypes, encoding structural patterns shared across the motion dataset. Within each transformer layer, motion representations query these slots via multi-head attention and retrieve contextually relevant priors that improve overall motion coherence. Heatmap visualizations of slot activations show that different slots are attended to different degrees, which is consistent with the slots encoding distinct motion prototypes.

Figure 3: Heatmap of memory slot activation reveals heterogeneous retrieval of motion prototypes.
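
The slot retrieval described above can be illustrated with a minimal sketch. It is not the paper's implementation: it uses single-head attention without learned projections, the residual update is an assumption, and the names `read_memory_slots`, `motion`, and `slots` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def read_memory_slots(motion, slots):
    """Let each motion frame attend over a bank of memory slots.

    motion: (T, d) frame features, used as queries.
    slots:  (M, d) slot vectors, used as keys and values
            (these would be learnable parameters during training).
    Returns the updated frames (T, d) and the slot attention (T, M).
    """
    d = motion.shape[-1]
    scores = motion @ slots.T / np.sqrt(d)   # (T, M) scaled dot products
    attn = softmax(scores, axis=-1)          # each row sums to 1
    retrieved = attn @ slots                 # (T, d) retrieved prior
    # Assumed residual connection: add the retrieved prior to the frame.
    return motion + retrieved, attn

rng = np.random.default_rng(0)
motion = rng.normal(size=(6, 16))   # 6 frames, 16-dim features
slots = rng.normal(size=(4, 16))    # 4 memory slots
out, attn = read_memory_slots(motion, slots)
print(out.shape, attn.shape)
```

The `attn` matrix in this sketch is the kind of quantity a slot-activation heatmap would display: one row per frame, one column per slot.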

Fine-Grained Motion-Language Alignment:

Motion frames and text tokens are mapped into embedding spaces through respective encoders. A cross-modal attention mechanism computes dynamic alignment between each motion frame and textual token, facilitating granular conditioning. Local text-token features are projected, reweighted, and fused with global text embeddings, providing both coarse and fine semantic guidance. Attention heatmaps demonstrate robust alignment to semantically rich tokens (e.g., "aims", "throws", "baseball").

Figure 4: Heatmap of motion-language alignment, with high attention on semantically rich text tokens.
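
A rough sketch of frame-to-token alignment followed by fusion with a global embedding is given below. The source does not specify the projection, reweighting, or fusion operators, so the fixed scalar `gate`, the function name `align_and_fuse`, and the use of a mean token vector as the global embedding are all assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def align_and_fuse(motion, tokens, global_text, gate=0.5):
    """Compute per-frame text conditioning from local and global cues.

    motion:      (T, d) motion frame embeddings (queries).
    tokens:      (N, d) text token embeddings (keys and values).
    global_text: (d,)   sentence-level text embedding.
    gate:        assumed scalar mixing weight between local and global.
    """
    d = motion.shape[-1]
    attn = softmax(motion @ tokens.T / np.sqrt(d), axis=-1)  # (T, N)
    local = attn @ tokens                                    # (T, d)
    # Assumed fusion: convex combination, global term broadcast over frames.
    cond = gate * local + (1.0 - gate) * global_text
    return cond, attn

rng = np.random.default_rng(0)
motion = rng.normal(size=(8, 32))
tokens = rng.normal(size=(5, 32))
global_text = tokens.mean(axis=0)   # stand-in for a sentence embedding
cond, attn = align_and_fuse(motion, tokens, global_text)
print(cond.shape, attn.shape)
```

Here `attn` plays the role of the motion-language alignment map: row `t` shows how strongly frame `t` attends to each text token.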

Attention Sink Phenomenon and SinkRatio Metric

The paper identifies, quantifies, and mitigates an "attention sink" phenomenon: cross-modal attention systematically fixates on the start token of the text input, which inhibits the use of informative tokens. This finding parallels the attention sink issues documented in LLMs [xiao2023efficient, barbero2025llms, rulli2025attention].

SinkRatio Metric:

A top-$K$ strategy is adopted to compute SinkRatio, the mean concentration of attention weights on the highest-attended tokens for each motion frame. Higher SinkRatio values indicate stronger attention fixation and therefore less use of semantically informative tokens. Ablative masking experiments show that the attention sink persists even after the start token is masked: it relocates to subsequent neutral tokens, indicating that the bias is deeply ingrained in the model rather than tied to one particular token.

Figure 5: Heatmap comparison of alignment for the masked model (left) and the unmasked model (right); masking reduces but does not eliminate the sink.
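
One plausible reading of the SinkRatio definition above is sketched here: for each motion frame, sum the attention mass on its $K$ highest-attended tokens, then average over frames. The exact formula is not given in this summary, so the aggregation and the name `sink_ratio` are assumptions.

```python
import numpy as np

def sink_ratio(attn, k=1):
    """Mean top-k attention mass per frame (assumed SinkRatio form).

    attn: (T, N) attention weights, each row summing to 1.
    k:    number of highest-attended tokens to count per frame.
    """
    # Sort each row ascending and keep the k largest weights.
    topk = np.sort(attn, axis=-1)[:, -k:]
    return float(topk.sum(axis=-1).mean())

# Evenly spread attention over 5 tokens.
uniform = np.full((4, 5), 0.2)
# Attention collapsed onto the first token in every frame.
sunk = np.tile(np.array([0.9, 0.025, 0.025, 0.025, 0.025]), (4, 1))

print(sink_ratio(uniform, k=1))  # close to 0.2
print(sink_ratio(sunk, k=1))     # close to 0.9
```

Under this reading, the value ranges from $K/N$ for perfectly uniform attention over $N$ tokens up to 1 when all mass sits on at most $K$ tokens.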

Temporal SinkRatio Analysis:

Temporal curves demonstrate that sink-mask mechanisms induce decreasing SinkRatio across timesteps, mitigating the intensification of attention sink and promoting balanced token utilization.

Figure 6: SinkRatio curves for masked and unmasked models; masking reduces attention concentration.

Sink-Aware Generation Strategies

MLA-Gen introduces two sink-aware strategies:

Sink-Mask (Token Weight Masking):

Start-token attention weights are masked above a timestep threshold, enforcing more distributed attention and compelling the model to utilize broader semantic cues.
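
A minimal sketch of such a mask follows. It assumes the mask is applied to the pre-softmax scores so that the remaining weights renormalize, and it treats `timestep` and `threshold` as plain scalars; the function name `sink_mask_attention` and the choice of token index 0 as the start token are hypothetical.

```python
import numpy as np

def sink_mask_attention(scores, timestep, threshold, start_index=0):
    """Softmax over text tokens, suppressing the start token late in the schedule.

    scores:    (T, N) pre-softmax frame-to-token scores, with N >= 2.
    timestep:  current step index of the generation process.
    threshold: steps strictly above this value mask the start token.
    """
    scores = np.array(scores, dtype=float)   # work on a copy
    if timestep > threshold:
        # -inf before softmax gives the start token exactly zero weight.
        scores[:, start_index] = -np.inf
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

scores = [[5.0, 1.0, 1.0],
          [4.0, 0.0, 2.0]]
unmasked = sink_mask_attention(scores, timestep=10, threshold=50)
masked = sink_mask_attention(scores, timestep=80, threshold=50)
print(unmasked[:, 0])  # start token dominates
print(masked[:, 0])    # start token weight is zero
```

Masking the scores rather than zeroing the weights afterwards keeps each row a valid distribution, so the attention that the start token would have absorbed is redistributed over the remaining tokens.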

Sink-Ctrl (Adaptive Classifier-Free Guidance):

CFG guidance strength is rectified by an adaptive coefficient scaled by SinkRatio, amplifying corrective updates when SinkRatio is high. Local text features are injected into unconditional branches to stabilize generation.
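
The idea of scaling guidance by SinkRatio can be sketched as follows. This starts from one common form of classifier-free guidance, `uncond + scale * (cond - uncond)`, and assumes a linear adaptive coefficient `1 + alpha * sink_ratio`; the actual coefficient used by Sink-Ctrl is not specified in this summary, and the injection of local text features into the unconditional branch is not modeled. The names `sink_ctrl_guidance` and `alpha` are hypothetical.

```python
import numpy as np

def sink_ctrl_guidance(pred_cond, pred_uncond, base_scale, sink_ratio, alpha=1.0):
    """Classifier-free guidance with a SinkRatio-dependent scale (assumed form).

    pred_cond:   model output with text conditioning.
    pred_uncond: model output without text conditioning.
    base_scale:  ordinary CFG guidance scale.
    sink_ratio:  current attention concentration, in [0, 1].
    alpha:       assumed sensitivity of the scale to sink_ratio.
    """
    # A higher sink_ratio yields a larger guidance scale.
    scale = base_scale * (1.0 + alpha * sink_ratio)
    guided = pred_uncond + scale * (pred_cond - pred_uncond)
    return guided, scale

pred_cond = np.array([1.0, 2.0])
pred_uncond = np.array([0.0, 1.0])

low, low_scale = sink_ctrl_guidance(pred_cond, pred_uncond, 2.0, sink_ratio=0.0)
high, high_scale = sink_ctrl_guidance(pred_cond, pred_uncond, 2.0, sink_ratio=0.5)
print(low_scale, low)     # scale 2.0, guided [2. 3.]
print(high_scale, high)   # scale 3.0, guided [3. 4.]
```

With `sink_ratio=0` the sketch reduces to plain CFG at `base_scale`, which makes the adaptive term easy to ablate.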

Empirical Evaluation and Ablation

MLA-Gen is evaluated on the HumanML3D benchmark. FID improves from $0.107 \rightarrow 0.056$ at small scale and from $0.083 \rightarrow 0.040$ at large scale, and gains are also reported in R-Precision, Matching, and CLIP score, indicating improved distributional quality, semantic alignment, and motion diversity. Ablation studies confirm that both the memory slots and the local alignment module are necessary, and that the sink-aware strategies yield further performance gains. Visual comparisons against ACMDM show higher fidelity in temporal consistency and joint-level semantic matching.

Figure 7: Visualization comparison between ACMDM-S and MLA-Gen-S, highlighting joint-level and temporal semantic fidelity.

Limitations and Prospects

Despite robust alignment improvements, MLA-Gen exhibits limitations with lengthy or syntactically ambiguous textual descriptions, where attention-based mechanisms struggle to capture complex hierarchical semantics.

Figure 8: A failure case of MLA-Gen under a very long textual description.

SinkRatio, while effective for quantifying attention concentration, does not directly capture higher-order or structured semantic dependencies. Future research may involve extending alignment diagnostics, integrating structured priors, and scaling alignment-aware strategies to broader multimodal domains (e.g., video generation, agent control).

Conclusion

MLA-Gen advances text-driven motion generation through explicit motion-language alignment, mitigation of attention sink bias, and adaptive generation strategies. These findings underscore the importance of detailed cross-modal alignment in multimodal generation architectures and point to further work on semantically grounded synthesis.
