MoLingo: Text-to-Motion Generation Model

Updated 4 July 2026

MoLingo is a text-to-motion model that creates realistic 3D human movements using a semantically aligned latent space and auto-regressive denoising.
It leverages multi-token cross-attention with a T5-Large encoder to inject detailed text conditioning, enhancing temporal coherence and motion quality.
The approach emphasizes latent space design and continuous motion generation; it models main-body dynamics and does not generate fine-grained hand articulation.

MoLingo is a text-to-motion generation model introduced in "MoLingo: Motion-Language Alignment for Text-to-Motion Generation" (He et al., 15 Dec 2025). It generates 3D human motion from natural language by denoising in a continuous latent space, and it is organized around three coupled design choices: a semantic-aligned motion encoder, masked auto-regressive rectified-flow generation, and multi-token cross-attention text conditioning. In the reported formulation, the central claim is that latent-space design and conditioning strategy are as important as the generative backbone itself for obtaining realistic motion and strong text-motion alignment (He et al., 15 Dec 2025).

1. Definition, scope, and nomenclature

In the supplied literature, MoLingo is the proper name of a text-to-motion (T2M) system, not of a multilingual ASR model or a sign-language model. Its formal scope is the synthesis of human motion from text prompts, with generation performed over continuous motion latents instead of directly over pose trajectories (He et al., 15 Dec 2025).

The model is motivated by two questions. The first is how to construct a latent space whose geometry is semantically aligned with language, so that diffusion-style denoising becomes more effective. The second is how to inject textual conditioning so that motion follows the prompt closely without collapsing rich linguistic structure into a single vector (He et al., 15 Dec 2025). The resulting design combines continuous latent motion generation, semantic alignment in the latent space, auto-regressive generation, and cross-attention-based conditioning.

The surrounding literature contains several nearby but distinct uses that can cause terminological confusion. "Mixture-of-Expert Conformer for Streaming Multilingual ASR" proposes a streaming multilingual ASR system with MoE layers inside a Conformer encoder; it is described as relevant to a "MoLingo-style" multilingual streaming ASR setting, but its actual method is a Mixture-of-Expert Conformer, not MoLingo (Hu et al., 2023). "Multilingual Gloss-free Sign Language Translation: Towards Building a Sign Language Foundation Model" describes a multilingual gloss-free SLT framework explicitly named Sign2(LID+Text); the supplied description associates it with "MoLingo" informally, but the paper’s method name is not MoLingo (Tan et al., 30 May 2025). "Mixture of Lookup Key-Value Experts" names its model MoLKV and explicitly states that there is no evidence that "MoLingo" is a distinct term in that work (Wang, 10 Dec 2025). A separate sign-language production paper proposes MoMP, a Mixture of Motion Primitives architecture, rather than MoLingo (Saunders et al., 2021). Accordingly, within this corpus, MoLingo properly denotes the 2025 text-to-motion model (He et al., 15 Dec 2025).

2. Overall formulation and architectural decomposition

MoLingo is organized into two major stages. The first is motion autoencoding, in which a motion sequence is encoded into a shorter latent sequence and decoded back to motion. The second is auto-regressive latent denoising, in which a transformer plus rectified flow iteratively denoises masked latents while conditioning on text through multi-token cross-attention (He et al., 15 Dec 2025).

Given a motion sequence

$\mathbf{m}_{1:N}, \qquad \mathbf{m}_i \in \mathbb{R}^D,$

the encoder maps it to a shorter latent sequence

$m_{1:l} \in \mathbb{R}^{l \times d}, \qquad l = N/h.$

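Here $l$ is the latent sequence length, $d$ the latent dimension, and $h$ the temporal downsampling factor implied by $l = N/h$. The encoder and decoder are built with causal 1D convolutions, which the paper uses to preserve temporal structure (He et al., 15 Dec 2025).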
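The length bookkeeping can be illustrated with a minimal sketch, assuming a single scalar channel and a strided causal convolution with left-only padding. The function name `causal_conv1d`, the averaging kernel, and the values N = 16 and h = 4 are illustrative assumptions; the paper's encoder operates on D-dimensional frames with learned filters.

```python
from typing import List


def causal_conv1d(x: List[float], kernel: List[float], stride: int = 1) -> List[float]:
    """Causal 1D convolution: the output at time t only sees inputs at times <= t."""
    k = len(kernel)
    # Pad on the left only, so no future frame can leak into an output.
    padded = [0.0] * (k - 1) + list(x)
    out = []
    for t in range(0, len(x), stride):
        window = padded[t:t + k]  # corresponds to x[t-k+1 .. t]
        out.append(sum(w * v for w, v in zip(kernel, window)))
    return out


# Hypothetical sizes: N frames, temporal downsampling factor h.
N, h = 16, 4
frames = [float(i) for i in range(N)]
latents = causal_conv1d(frames, kernel=[0.25, 0.25, 0.25, 0.25], stride=h)

# The latent sequence is h times shorter than the motion sequence.
print(len(latents), N // h)
```

Because padding is applied only on the left, changing a later frame cannot alter an earlier latent, which is the property a causal design is meant to guarantee.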

The generative stage adopts a masked auto-regressive factorization:

$p(m_1,\dots,m_l)=\prod_{i=1}^{l} p(m_i \mid c, m_1,\dots,m_{i-1}),$

where $c$ is the text prompt. Generation therefore proceeds sequentially rather than denoising the full latent sequence in a single step. The supplied interpretation is that this improves temporal coherence and preserves fine motion structure (He et al., 15 Dec 2025).

This decomposition is explicitly framed against several alternatives. Direct pose-space diffusion is described as difficult because the motion distribution is complex and mocap noise can introduce artifacts. Latent diffusion is presented as helpful, but prior latent choices are said to suffer from three recurrent problems: single global latents can lose temporal detail, VQ or tokenized models introduce quantization error, and latent spaces are often insufficiently aligned with text semantics (He et al., 15 Dec 2025). MoLingo’s architecture is therefore a response to both a representation problem and a conditioning problem.

3. Semantic-aligned latent space

A defining feature of MoLingo is its semantic-aligned motion encoder. The latent space is not trained only for reconstruction; it is additionally regularized so that latents with similar textual meaning remain close. The paper states that this makes the latent space more diffusion-friendly because semantically similar motions are nearer in latent geometry and denoising becomes smoother (He et al., 15 Dec 2025).

Semantic alignment is trained using BABEL, which provides frame-level text annotations. For each motion latent $m_i$, the procedure is: collect the text labels from the temporally aligned frames, encode those labels using a frozen text encoder, average the resulting text embeddings, and project them to the latent dimension to produce a class token $\kappa_i$ (He et al., 15 Dec 2025). The semantic loss is

$\mathcal{L}_{\text{sem}}= \frac{1}{|\mathcal{I}|}\sum_{i\in \mathcal{I}} \left( 1-\frac{m_i \cdot \kappa_i}{\|m_i\|\,\|\kappa_i\|} \right).$

This is a cosine-similarity objective between motion latents and text-derived class tokens.
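The class-token construction and the loss can be sketched in a few lines. This is an illustration under stated assumptions: the label embeddings are given as plain vectors (no real text encoder is called), the projection is a fixed matrix `W` rather than a learned layer, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vec = List[float]


def mean_vec(vectors: Sequence[Vec]) -> Vec:
    """Average a set of equally sized vectors component-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def project(weight: Sequence[Vec], v: Vec) -> Vec:
    """Matrix-vector product standing in for the projection to latent dimension."""
    return [sum(w * x for w, x in zip(row, v)) for row in weight]


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def class_token(label_embeddings: Sequence[Vec], weight: Sequence[Vec]) -> Vec:
    """Average the label embeddings of the aligned frames, then project."""
    return project(weight, mean_vec(label_embeddings))


def semantic_loss(latents: Sequence[Vec], tokens: Sequence[Vec], keep: Sequence[int]) -> float:
    """Mean of (1 - cosine similarity) over the retained index set."""
    return sum(1.0 - cosine(latents[i], tokens[i]) for i in keep) / len(keep)


# Toy data: text dimension 3, latent dimension 2 (both hypothetical).
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
kappa = class_token([[1.0, 0.0, 0.0], [1.0, 0.0, 2.0]], W)  # -> [1.0, 0.0]
latents = [[2.0, 0.0], [0.0, 3.0]]
tokens = [kappa, kappa]
loss = semantic_loss(latents, tokens, keep=[0, 1])
print(loss)  # 0.5: the first latent is aligned, the second is orthogonal
```

The loss depends only on direction, not on latent norm, so it constrains where latents point without fixing their scale.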

The paper also introduces repetitive-label filtering. Because BABEL contains repetitive frame-level annotations, adjacent latents may receive identical or nearly identical class tokens too frequently. To reduce over-constraint, the model computes

$\Delta_i = \langle \kappa_i, \kappa_{i+1} \rangle$

and discards a pair from the semantic loss when $\Delta_i > \tau$ (He et al., 15 Dec 2025). The remaining index set is denoted by $\mathcal{I}$. In the supplied interpretation, this functions as a soft semantic regularizer: it preserves local variability while still aligning motion with semantics.
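One possible reading of the filtering step is sketched below. Several details are assumptions rather than statements from the paper: class tokens are taken to be unit-normalized so that the inner product behaves like a cosine similarity, "discarding a pair" is interpreted as removing index i from the retained set, the last index is always kept because it has no successor, and the threshold value 0.95 is hypothetical.

```python
from typing import List, Sequence

Vec = List[float]


def dot(a: Vec, b: Vec) -> float:
    return sum(x * y for x, y in zip(a, b))


def filter_repetitive(tokens: Sequence[Vec], tau: float) -> List[int]:
    """Return the retained index set.

    Assumption: index i is dropped when <kappa_i, kappa_{i+1}> > tau.
    """
    keep = []
    for i in range(len(tokens)):
        has_next = i + 1 < len(tokens)
        if has_next and dot(tokens[i], tokens[i + 1]) > tau:
            continue  # adjacent class tokens are near-duplicates
        keep.append(i)
    return keep


# Toy unit-norm class tokens: two repeated "walk" tokens, two repeated "turn" tokens.
walk, turn = [1.0, 0.0], [0.0, 1.0]
tokens = [walk, walk, turn, turn, walk]
retained = filter_repetitive(tokens, tau=0.95)
print(retained)  # [1, 3, 4]
```

Under this reading, a long run of identical labels contributes a single latent to the semantic loss, which matches the stated goal of avoiding over-constraint.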

The semantic-aligned autoencoder objective extends the reconstruction loss with both semantic and KL terms. Schematically,

$\mathcal{L}_{\text{SAE}} = \mathcal{L}_{\text{rec}} + \lambda_{\text{sem}}\,\mathcal{L}_{\text{sem}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}},$

with $\lambda_{\text{sem}}$ and $\lambda_{\text{KL}}$ weighting the semantic and KL regularizers. The reconstruction term itself combines feature, joint, and velocity losses:

$\mathcal{L}_{\text{rec}} = \mathcal{L}_{\text{feat}} + \mathcal{L}_{\text{joint}} + \mathcal{L}_{\text{vel}}.$

The reported ablations state that cosine semantic loss works better than InfoNCE, and that a small semantic weight $\lambda_{\text{sem}}$ is strongest (He et al., 15 Dec 2025).

4. Text conditioning and masked auto-regressive rectified flow

MoLingo compares two conditioning strategies. The first is single-token conditioning, in which the text is compressed into one embedding or token and injected through AdaLN-style modulation. The second is multi-token cross-attention, in which multiple text tokens are preserved and used as keys and values for cross-attention with motion latents (He et al., 15 Dec 2025). The paper reports that cross-attention is clearly better for both realism and text-motion alignment.

Text is encoded by a frozen T5-Large encoder, after which a text adapter consisting of a stack of transformer encoder blocks further processes the embeddings into a sequence of text tokens rather than a single pooled vector.

Within the transformer decoder, motion latents form the main sequence, self-attention operates over motion tokens, and cross-attention uses text tokens as keys and values (He et al., 15 Dec 2025). The strongest reported configuration is T5 + CrossAttn + SAE with a 6-layer text adapter.
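The difference between the two conditioning strategies can be seen in a toy single-head cross-attention, sketched here without learned query, key, or value projections (an assumption made for brevity). Motion latents act as queries and text tokens as keys and values; the function names and the tiny 2-dimensional vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vec = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(motion: Sequence[Vec], text: Sequence[Vec]) -> List[Vec]:
    """Each motion latent (query) attends over all text tokens (keys and values)."""
    d = len(text[0])
    outputs = []
    for q in motion:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in text]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, text)) for j in range(d)])
    return outputs


motion = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[4.0, 0.0], [0.0, 4.0]]

# Multi-token: each motion position reads a different mixture of text tokens.
multi = cross_attention(motion, text_tokens)

# Single pooled token: attention has only one choice, so every position reads the same vector.
pooled = [sum(col) / len(text_tokens) for col in zip(*text_tokens)]
single = cross_attention(motion, [pooled])
print(multi)
print(single)
```

The pooled case is not the AdaLN pathway the paper compares against; it only illustrates what is discarded when the prompt is collapsed into one vector, since position-specific access to individual words is no longer possible.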

The generative objective is rectified flow. For a latent $m_i$, a noisy version is formed by linear interpolation between Gaussian noise and the clean latent; in the standard rectified-flow form,

$m_i^{t} = t\,m_i + (1-t)\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I), \quad t \in [0,1].$

The model then learns a flow field $v_\theta$ via

$\mathcal{L}_{\text{RF}} = \mathbb{E}_{t,\epsilon}\left\| v_\theta(m_i^{t}, t, z_i) - (m_i - \epsilon) \right\|^2,$

where $z_i$ is the conditioning vector from the transformer (He et al., 15 Dec 2025).
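A minimal sketch of rectified-flow training targets and Euler sampling follows. It assumes the common convention in which t = 0 is pure noise and t = 1 is the clean latent, so the interpolant is t * clean + (1 - t) * noise and the regression target is clean - noise; the paper's exact time convention may differ. The velocity function below is an oracle that already knows the clean latent, standing in for a trained network, and the conditioning vector is treated as closed over by that function. All names are hypothetical.

```python
import random
from typing import Callable, List

Vec = List[float]


def interpolate(clean: Vec, noise: Vec, t: float) -> Vec:
    """Noisy latent on the straight line from noise (t=0) to clean (t=1)."""
    return [t * c + (1.0 - t) * n for c, n in zip(clean, noise)]


def target_velocity(clean: Vec, noise: Vec) -> Vec:
    """Time derivative of the interpolant, constant along the path."""
    return [c - n for c, n in zip(clean, noise)]


def rf_loss(pred_v: Vec, clean: Vec, noise: Vec) -> float:
    """Mean squared error between predicted and target velocity."""
    tgt = target_velocity(clean, noise)
    return sum((p - g) ** 2 for p, g in zip(pred_v, tgt)) / len(tgt)


def euler_sample(velocity_fn: Callable[[Vec, float], Vec], noise: Vec, steps: int) -> Vec:
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
clean = [0.5, -1.0, 2.0]
noise = [rng.gauss(0.0, 1.0) for _ in clean]


def oracle(x: Vec, t: float) -> Vec:
    # Stand-in for a trained network: exact velocity toward the known clean latent.
    return [(c - xi) / (1.0 - t) for c, xi in zip(clean, x)]


sample = euler_sample(oracle, noise, steps=10)
print(sample)  # close to `clean`, since the oracle path is a straight line
```

With a perfect velocity field the path is a straight line, so even a coarse Euler integrator recovers the clean latent; a learned field is only approximately straight, which is why the number of steps matters in practice.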

The generation backbone is specified as a standard decoder-only Transformer with 16 layers and 16 attention heads. Its predicted conditioning vector is passed to an MLP with 8 residual blocks and width 1280. The MLP uses LayerNorm, linear layers, SiLU, and residual connections, and is modulated using AdaLN with the conditioning vector plus the time embedding (He et al., 15 Dec 2025).

Training and inference use a masking scheme. During training, some motion latents are randomly masked and replaced by learnable mask tokens; the paper notes that bidirectional attention is used so masked latents can attend to all unmasked latents. During inference, all latents begin masked, the model iteratively fills them in, and the decoder finally maps the clean latent sequence back to motion (He et al., 15 Dec 2025). The training recipe further includes classifier-free guidance with 10% null prompts during training and CFG scale 6.0 at inference.
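The masking scheme and the guidance recipe can be sketched together. The 10% null-prompt rate and the guidance scale 6.0 come from the text above; everything else is an assumption for illustration: the empty string as the null prompt, the standard classifier-free guidance combination applied to predicted velocities, a left-to-right unmasking order, scalar latents, and the toy `fill_fn` that replaces the real denoiser.

```python
import random
from typing import Callable, List, Optional

MASK = None  # stands in for the learnable mask token


def drop_prompt(prompt: str, rng: random.Random, p_null: float = 0.10) -> str:
    """Training-time prompt dropout: occasionally train on the null prompt."""
    return "" if rng.random() < p_null else prompt


def guide(v_cond: List[float], v_uncond: List[float], scale: float = 6.0) -> List[float]:
    """Standard classifier-free guidance: extrapolate from unconditional toward conditional."""
    return [u + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


def generate(length: int,
             fill_fn: Callable[[List[Optional[float]], int], float]) -> List[Optional[float]]:
    """Start fully masked and fill one latent per iteration (assumed left-to-right)."""
    latents: List[Optional[float]] = [MASK] * length
    while MASK in latents:
        i = latents.index(MASK)           # next masked position
        latents[i] = fill_fn(latents, i)  # the real model would denoise here
    return latents


def toy_fill(latents: List[Optional[float]], i: int) -> float:
    # Toy rule: each latent depends on the previously generated one.
    return 0.0 if i == 0 else latents[i - 1] + 1.0


print(generate(4, toy_fill))            # [0.0, 1.0, 2.0, 3.0]
print(guide([1.0, 2.0], [0.0, 2.0]))    # [6.0, 2.0]
```

At scale 1.0 the guided velocity equals the conditional prediction; larger scales push further in the direction that distinguishes the prompt from the null prompt.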

5. Datasets, metrics, and empirical results

MoLingo is trained and evaluated primarily on HumanML3D, which contains 29,024 motions and 87,834 text descriptions, with motions sourced from AMASS and HumanAct12 and standardized to 20 FPS and up to 10 seconds. BABEL is used for semantic alignment, and additional evaluation settings include HumanML3D-272, HumanML3D with TMR-263, MARDM-67, and MS-272 protocols (He et al., 15 Dec 2025).

The reported metrics are standard T2M measures: FID, R-Precision (Top-1, Top-2, Top-3), CLIP-Score, Matching Score, and MultiModality (MModality) (He et al., 15 Dec 2025). The paper emphasizes that protocol choice matters, particularly across MARDM-67, TMR-263, and MS-272.
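As a reminder of how one of these metrics is computed, the sketch below implements R-Precision on a toy distance matrix. It assumes that text and motion embeddings from a pretrained evaluator have already been reduced to pairwise distances, with entry (i, j) the distance between text i and motion j and the ground-truth match on the diagonal. Benchmark implementations commonly rank within a pool of 32 candidates; the 3-by-3 matrix here is only for illustration.

```python
from typing import List, Sequence


def r_precision(dist: Sequence[Sequence[float]], k: int) -> float:
    """Fraction of rows whose ground-truth column (the diagonal) ranks in the top k."""
    hits = 0
    for i, row in enumerate(dist):
        ranked = sorted(range(len(row)), key=lambda j: row[j])  # nearest first
        if i in ranked[:k]:
            hits += 1
    return hits / len(dist)


# Toy distances: rows 0 and 2 retrieve their match first, row 1 ranks it second.
dist = [
    [0.10, 0.50, 0.90],
    [0.40, 0.30, 0.20],
    [0.60, 0.70, 0.05],
]
top1 = r_precision(dist, k=1)
top2 = r_precision(dist, k=2)
top3 = r_precision(dist, k=3)
print(top1, top2, top3)
```

Top-k values are non-decreasing in k by construction, which is why Top-3 is always at least as high as Top-1 in the reported tables.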

The following table collects the principal quantitative results stated for MoLingo.

Protocol / comparison	Reported MoLingo result	Notes
MARDM-67, MoLingo (VAE)	FID 0.049, Top-1 0.528, CLIP 0.672	Best FID among MoLingo variants
MARDM-67, MoLingo (SAE)	FID 0.064, Top-1 0.542, CLIP 0.686	Strongest text alignment
TMR-263	FID 0.014, Top-1 0.772, Top-2 0.889, Top-3 0.928	Reported state of the art
MotionStreamer comparison (272D)	rFID 0.280, FID 3.444, Top-1 0.788, Matching 14.591	MotionStreamer: rFID 0.661, FID 11.979, Top-1 0.629, Matching 16.019

The ablation findings are tightly aligned with the model design. Semantically aligned latents help: the SAE variant improves R-Precision and CLIP-score while keeping FID competitive. Auto-regressive generation helps: the paper associates it with better temporal coherence, finer motion detail, and improved handling of ordered actions, dynamic transitions, and complex motions such as cartwheels and turns. Cross-attention beats single-token conditioning: single-token AdaLN conditioning is weaker, T5 improves conditioning quality, and full-token cross-attention gives the best overall balance of FID, R-Precision, and CLIP-score (He et al., 15 Dec 2025).

Training details are also explicitly reported. The motion autoencoder uses batch size 256 and 5000 epochs. The auto-regressive rectified-flow model uses batch size 256, approximately 800 epochs, and a learning-rate schedule with 100 epochs of warmup; EMA is used. The best model reportedly trains in about 10 hours on 4 Nvidia H100 GPUs (He et al., 15 Dec 2025).

The user study compares MoLingo with MoMask, DisCoRD, and MotionStreamer for realism and text-motion alignment. Reported preference rates for MoLingo are 83.75% versus DisCoRD, 77.70% versus MoMask, and 84.70% versus MotionStreamer (He et al., 15 Dec 2025).

6. Relation to neighboring architectures, limitations, and significance

MoLingo belongs to the broader family of language-conditioned generative models for structured sequences, but it is distinct from several neighboring architectures that may appear similar at the level of terminology. It is not a lookup-expert model like MoLKV, whose design centers on token-id-associated experts and context-aware retrieval from cached key-value experts for on-device LLM inference (Wang, 10 Dec 2025). It is not the multilingual streaming ASR architecture built from Mixture-of-Expert Conformer layers inside a causal or cascaded Conformer encoder (Hu et al., 2023). It is also distinct from sign-language systems such as MoMP, which decomposes sign language production into translation and animation with a transformer-based mixture-of-experts over motion primitives (Saunders et al., 2021), and from Sign2(LID+Text), which addresses multilingual gloss-free sign language translation through dual CTC objectives for token-level sign language identification and spoken text generation (Tan et al., 30 May 2025).

The principal limitation stated for MoLingo is narrow but consequential: it focuses on main-body dynamics and does not generate detailed hand movements or hand articulation as part of full-body motion (He et al., 15 Dec 2025). This is particularly significant because fine-grained hand behavior is often central to perceptual realism and semantic specificity in motion synthesis. A plausible implication is that, despite strong standard benchmark performance, the method leaves open a substantial modeling gap for hand-centric actions and higher-fidelity whole-body motion.

The broader research significance claimed for MoLingo is methodological. It suggests that latent space design matters as much as the generator, that semantic regularization can improve diffusion-like modeling in motion domains, and that multi-token conditioning is preferable to collapsing text into a single embedding when prompts contain fine-grained or compositional information (He et al., 15 Dec 2025). Within the supplied literature, that positions MoLingo as a T2M model whose contribution lies less in introducing a wholly new generative paradigm than in specifying a particular combination of semantically structured latent geometry, cross-attentive language conditioning, and masked auto-regressive rectified-flow generation.