HumanML3D: 3D Motion-Language Benchmark
- HumanML3D is a large-scale benchmark that aligns 3D human motions with detailed language descriptions for text-to-motion generation and cross-modal retrieval.
- It re-annotates AMASS and HumanAct12 datasets to provide 14,616 motion sequences paired with multiple English texts, processed with standardized frame rates and normalization.
- Models benchmarked on it, using techniques such as VQ-VAE tokenization and attention-based dynamic masking, achieve state-of-the-art results on metrics like FID and R-Precision.
HumanML3D is a large-scale human motion-language benchmark designed for evaluating models that connect natural language and 3D human motion. It has become the central testbed for research in text-conditioned motion generation, cross-modal retrieval, and, more recently, arbitrary text-to-motion synthesis. HumanML3D provides a richly annotated dataset, well-established evaluation protocols, and a reference point for comparative studies on both generation and retrieval tasks.
1. Dataset Structure and Preprocessing
HumanML3D was constructed by re-annotating the AMASS and HumanAct12 MoCap collections and contains 14,616 distinct motion sequences (“entries”), each associated with 3–5 independent English textual descriptions (44,970 pairs in total). These texts describe human actions at varying levels of granularity. Motion sequences are encoded as 3D joint trajectories downsampled to 20 Hz. Standardized preprocessing removes global translation (root-centering), normalizes sequences, pads those shorter than 196 frames, and truncates longer ones to that maximum (about 9.8 seconds at 20 Hz). Each motion’s representation typically includes SMPL-based features: root velocities, 6D joint rotations, joint/global positions, velocities, and foot-contact signals, commonly packed into a 263-dimensional feature vector per frame (Zhang et al., 10 Mar 2025, Fu, 5 Dec 2024, Yan et al., 2023).
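The following minimal sketch illustrates this preprocessing pipeline (downsampling, root-centering, normalization, and padding/truncation to 196 frames). It assumes raw joint positions of shape (T, J, 3); the function name, the placeholder feature flattening, and the optional mean/std arguments are illustrative rather than the dataset's exact tooling.

```python
import numpy as np

MAX_FRAMES = 196   # ~9.8 s at the benchmark's 20 Hz frame rate

def preprocess_motion(joints, src_fps=30, mean=None, std=None):
    """Illustrative preprocessing: downsample to 20 Hz, root-center,
    normalize, and pad/truncate to a fixed length of MAX_FRAMES.

    joints: (T, J, 3) array of 3D joint positions. The real pipeline maps
    each frame to a 263-dim feature vector; a simple flattening stands in
    for that step here.
    """
    # Downsample to 20 Hz by frame striding.
    step = max(int(round(src_fps / 20)), 1)
    joints = joints[::step]

    # Root-centering: remove global translation relative to the root joint (index 0).
    joints = joints - joints[:, :1, :]

    # Placeholder per-frame features (stand-in for the 263-dim encoding).
    feats = joints.reshape(len(joints), -1).astype(np.float32)

    # Z-normalize with precomputed dataset statistics when provided.
    if mean is not None and std is not None:
        feats = (feats - mean) / (std + 1e-8)

    # Truncate long sequences, zero-pad short ones, and keep a validity mask.
    T = min(len(feats), MAX_FRAMES)
    out = np.zeros((MAX_FRAMES, feats.shape[1]), dtype=np.float32)
    mask = np.zeros(MAX_FRAMES, dtype=bool)
    out[:T], mask[:T] = feats[:T], True
    return out, mask
```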
The canonical split is approximately 80% train, 5% validation, and 15% test by motion entry. Alternate retrieval protocols may select a subset (e.g., a 1,000-clip test set), treating sub-motions as independent samples for tasks focused on fine-grained action discriminability (Yan et al., 2023).
2. Tasks Supported and Modalities
HumanML3D is the standard benchmark for several key tasks:
- Text-to-Motion Generation: Mapping a free-form textual command to a plausible 3D human motion sequence (sequence-to-sequence modeling).
- Cross-modal Retrieval: Ranking motions by text queries and/or ranking text by motion queries. Both text→motion and motion→text retrieval are supported.
- Diversity Modeling: Evaluating generative models' ability to produce multiple plausible motion outputs from a single textual prompt.
- Scene-based and Arbitrary Text: While the original benchmark contains only action descriptions, extensions such as HumanML3D++ introduce scene texts (e.g., descriptions of context or environment that imply an action but do not state it directly) (Wang et al., 23 Apr 2024).
No audio or other modalities are present in the original dataset or in the core benchmarks as used by state-of-the-art models (Zhang et al., 10 Mar 2025).
3. Evaluation Metrics and Benchmarks
The evaluation protocol for HumanML3D is highly standardized, centered on quantitative measures of generation quality, cross-modal alignment, and diversity:
| Metric | Purpose | Standard Definition |
|---|---|---|
| FID ↓ | Distributional similarity between real/generated motion | Fréchet distance between feature distributions of real and generated motions |
| R-Precision ↑ | Retrieval accuracy (top-K recall) | Fraction of queries whose ground-truth motion ranks in the top K of a fixed candidate pool (typically K = 1, 2, 3 over 32 candidates) |
| MM-Dist ↓ | Text-motion embedding agreement (faithfulness) | Average Euclidean distance between text embeddings and matched motion embeddings |
| Diversity → | Sample diversity across the generated set | Average pairwise distance between features of randomly sampled generated motions (closer to real-data diversity is better) |
| MultiModality ↑ | Output diversity for identical prompt | Average pairwise feature distance across multiple generations from the same prompt |
| Recall@K ↑ | Retrieval: fraction of true pairs in top K | Applicable to both text→motion and motion→text retrieval (Yan et al., 2023) |
| Median Rank ↓ | Retrieval rank of ground truth | Median rank of the ground-truth item in the ranked retrieval list |
| R-sum ↑ | Retrieval: sum of six R@K values (bi-directional) | Sum of R@1, R@5, and R@10 in both retrieval directions |
Single-solution metrics (FID, R-Precision, MM-Dist) compare generated outputs to a fixed ground truth. Multi-solution metrics (Diversity, MultiModality) account for the fact that some queries admit multiple correct motions (Wang et al., 23 Apr 2024).
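As a concrete illustration of the core metrics, the sketch below computes R-Precision, MM-Dist, and Diversity from precomputed text and motion embeddings in the shared evaluation space. The pool size, pair count, and random sampling scheme follow common practice but are assumptions; exact protocols vary slightly across papers.

```python
import numpy as np

def r_precision(text_emb, motion_emb, k=3, pool=32, seed=0):
    """Top-k retrieval accuracy: for each text embedding, rank its true motion
    against (pool - 1) randomly drawn distractors. Assumes len(text_emb) >= pool."""
    rng = np.random.default_rng(seed)
    n, hits = len(text_emb), 0
    for i in range(n):
        distractors = rng.choice([j for j in range(n) if j != i], pool - 1, replace=False)
        cand = np.concatenate([[i], distractors])
        d = np.linalg.norm(motion_emb[cand] - text_emb[i], axis=1)
        hits += int(np.argsort(d).tolist().index(0) < k)  # true pair ranked in top k?
    return hits / n

def mm_dist(text_emb, motion_emb):
    """Average Euclidean distance between matched text and motion embeddings."""
    return float(np.linalg.norm(text_emb - motion_emb, axis=1).mean())

def diversity(motion_emb, n_pairs=300, seed=0):
    """Average distance between randomly paired generated-motion features."""
    rng = np.random.default_rng(seed)
    a = rng.integers(0, len(motion_emb), n_pairs)
    b = rng.integers(0, len(motion_emb), n_pairs)
    return float(np.linalg.norm(motion_emb[a] - motion_emb[b], axis=1).mean())
```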
4. Models and Comparative Results
Recent benchmarks on HumanML3D demonstrate the relative strengths of various modeling approaches:
- Masked Transformer Models (e.g., BAMM, MoGenTS, MMM, MoMask): These BERT-style models achieve strong FID and retrieval performance but traditionally lack autoregressive streaming capabilities.
- Autoregressive GPT-style Models (e.g., T2M-GPT, AttT2M): These enable streaming generation but have lagged in FID and diversity metrics.
- Hybrid/Advanced Architectures:
- Motion Anything (Zhang et al., 10 Mar 2025) introduces masked-transformer networks with explicit attention-based masking over temporally and spatially salient frames and joints, achieving state-of-the-art FID (0.028) and R-Precision scores (Top-1: 0.546) on standard test splits. The key innovation is prioritizing mask selection to focus on condition-relevant regions, with ablations identifying robust choices of transformer block depth and masking ratio (see the masking sketch after this list).
- Mogo (Fu, 5 Dec 2024) leverages a hierarchical causal transformer with RVQ-VAE tokenization, producing competitive FID (0.079) while supporting longer coherent sequence generation (up to 260 frames). Mogo also achieves superior out-of-distribution (OOD) generalization as measured on the CMP dataset.
- Retrieval-focused Models:
- DropTriple Loss (Yan et al., 2023): A dual-stream Transformer with a novel triplet loss that removes semantically overlapping "false negative" samples, surpassing prior triplet-based retrieval scores (e.g., 62.9% for text→motion and 71.5% for motion→text retrieval on HumanML3D).
- Arbitrary Text to Motion:
- TAAT Framework (Wang et al., 23 Apr 2024): Extends HumanML3D to HumanML3D++ by introducing LLM-generated "scene texts," then uses a two-stage LLM+Transformer approach. Evaluates both single- and multi-solution metrics to benchmark robustness to arbitrary input phrasing and diverse output.
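To make the attention-based masking idea concrete, the sketch below selects the most condition-salient motion tokens to mask before masked-transformer training. It is a generic illustration of the principle described for Motion Anything, not the paper's exact procedure; the tensor shapes and the fixed masking ratio are assumptions.

```python
import torch

def attention_guided_mask(attn_scores, mask_ratio=0.5):
    """Choose which motion tokens to mask from text-conditioned attention.

    attn_scores: (batch, num_tokens) saliency of each motion token with
    respect to the text condition (e.g., averaged cross-attention weights).
    The highest-scoring tokens are masked so the model must reconstruct the
    most condition-relevant frames/joints.
    """
    b, n = attn_scores.shape
    k = max(1, int(n * mask_ratio))
    topk = attn_scores.topk(k, dim=1).indices      # most salient token positions
    mask = torch.zeros(b, n, dtype=torch.bool)
    mask.scatter_(1, topk, True)                   # True = token gets masked
    return mask

# During training, masked positions would be replaced by a learned [MASK]
# embedding and the reconstruction loss computed only on those positions.
```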
The following table summarizes reported FID and R-Precision@1 results for selected models on HumanML3D:
| Method | FID ↓ | R-Prec@1 ↑ |
|---|---|---|
| T2M-GPT | 0.116 | – |
| MMM | 0.080 | – |
| MoGenTS | 0.033 | 0.529 |
| Motion Anything | 0.028 | 0.546 |
| Mogo | 0.079 | – |
Note: A direct comparison of retrieval metrics across all models is sometimes impeded by differences in test splits or evaluation granularity.
5. Architectural and Methodological Innovations
Multiple advances in HumanML3D modeling are anchored in both motion representation and alignment with language:
- VQ-VAE/RVQ-VAE Tokenization: Discretization of motion sequences lets transformer architectures operate on compact, highly compressible codebooks without sacrificing precision (Fu, 5 Dec 2024, Wang et al., 23 Apr 2024); a tokenization sketch follows this list.
- Attention-based Dynamic Masking: Mask selection conditioned on attention scores highlights condition-relevant frames and joints, optimizing masked transformer reconstruction (Zhang et al., 10 Mar 2025).
- Modality-Aware Attention: Self-attention or cross-modal attention layers are selectively applied depending on condition type (text or otherwise), as empirically shown to affect FID (Zhang et al., 10 Mar 2025).
- Code-masking Data Augmentation: Stochastically substituting codebook indices during training improves generalization, especially for longer coherent sequence generation (Fu, 5 Dec 2024); a simple variant is included in the tokenization sketch below.
- DropTriple Loss: Dynamic pruning of "false negatives" in triplet objectives based on computed motion or text similarities prevents penalization of semantically similar samples and increases retrieval accuracy (Yan et al., 2023); a simplified sketch follows this list.
- Arbitrary-to-Action Decomposition: LLM-based text2action modules extract action labels from arbitrary descriptions, which are then converted to motion tokens by transformers—this modularizes and regularizes generation from more ambiguous prompts (Wang et al., 23 Apr 2024).
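The sketch below illustrates the tokenization and code-masking ideas from the list above: nearest-neighbour vector quantization maps latent frame vectors to discrete codebook indices, and a fraction of those indices is stochastically replaced during training. It is a simplified illustration under assumed tensor shapes, not the RVQ-VAE used by Mogo.

```python
import torch

def quantize(latents, codebook):
    """Nearest-neighbour vector quantization: map each latent frame vector
    to the index of its closest codebook entry (its motion token)."""
    # latents: (T, D) encoder outputs; codebook: (K, D) learned code vectors
    d = torch.cdist(latents, codebook)     # (T, K) pairwise distances
    return d.argmin(dim=1)                 # (T,) discrete token indices

def code_mask_augment(tokens, codebook_size, p=0.1):
    """Stochastically replace a fraction p of token indices with random codes,
    a simple form of code-masking data augmentation."""
    noise = torch.rand(tokens.shape) < p
    random_codes = torch.randint(0, codebook_size, tokens.shape)
    return torch.where(noise, random_codes, tokens)
```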
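The idea behind DropTriple-style false-negative pruning can be sketched as follows: candidate negatives whose similarity to the positive exceeds a threshold are treated as semantically overlapping and excluded from the triplet objective. The threshold, the cosine-similarity criterion, and the margin value are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def drop_false_negative_triplet(anchor, positive, negatives, margin=0.2, drop_thresh=0.8):
    """Triplet loss that ignores likely false negatives.

    anchor, positive: (D,) embeddings; negatives: (N, D) candidate negatives.
    Negatives that are too similar to the positive are dropped before the
    margin-based loss is computed.
    """
    sim_to_pos = F.cosine_similarity(negatives, positive.unsqueeze(0), dim=1)
    keep = sim_to_pos < drop_thresh                 # retain only "true" negatives
    if keep.sum() == 0:
        return torch.tensor(0.0)
    d_pos = 1 - F.cosine_similarity(anchor, positive, dim=0)
    d_neg = 1 - F.cosine_similarity(negatives[keep], anchor.unsqueeze(0), dim=1)
    return F.relu(d_pos - d_neg + margin).mean()
```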
A plausible implication is that the benchmark’s strong adoption across architectures has fostered methodological convergence and cross-pollination between generative and retrieval-focused communities.
6. Extensions and Future Prospects
HumanML3D’s extensibility is evidenced by HumanML3D++ (Wang et al., 23 Apr 2024), which augments original action-labeled data with tens of thousands of "scene texts" generated and filtered via LLMs, enabling benchmarks for the more challenging task of arbitrary text–to–motion translation. This is accompanied by enhanced metrics evaluating diversity and multimodality of generated outputs, reflecting the one-to-many mapping between scene descriptions and plausible motion outcomes.
Moving forward, further annotation (multi-subject, richer semantic context), introduction of multimodal inputs (e.g. audio, scene graphs), and community standardization of metrics for multimodal diversity are anticipated. Empirical studies show that while current approaches generalize to near-domain text, true understanding of abstract or compositional scene context remains an open challenge.
7. Significance and Limitations
HumanML3D has rapidly established itself as the cornerstone for progress in natural language to 3D motion research, with its task structure, evaluation practices, and extensibility underpinning advances in generation, retrieval, and generalization. Despite this, limitations persist in action diversity, lack of multimodal conditioning, and in evaluation protocols for more open-ended or contextually rich queries. The benchmark’s pervasive adoption continues to motivate cross-pollination of ideas, rigorous ablation, and algorithmic advances that further the state-of-the-art in text-conditioned human motion understanding and synthesis (Zhang et al., 10 Mar 2025, Fu, 5 Dec 2024, Yan et al., 2023, Wang et al., 23 Apr 2024).