HumanML3D: Text-to-Motion Dataset

Updated 13 February 2026
  • HumanML3D is a large-scale, text-annotated human motion dataset with 14,616 motion sequences and 44,970 action captions, designed for text-to-motion generation.
  • It pairs each motion with free-form English descriptions and, in its HumanML3D++ extension, adds scene texts for arbitrary text-to-motion mapping.
  • Benchmarking protocols using metrics like R-Precision, FID, and MM-Dist enable robust evaluation of deep generative models in synthesizing realistic human motions.

HumanML3D is a large-scale, text-annotated human motion dataset specifically created for text-to-motion generation tasks, where the objective is to synthesize plausible human motions conditioned on natural-language descriptions. It originates from the AMASS motion collection and aims to address the lack of richly captioned and varied human action data necessary for training and benchmarking deep generative models that align motion with textual semantics. Recent extensions, notably HumanML3D++, have significantly broadened the dataset’s scope by introducing scenario-oriented “scene texts,” enabling evaluation on arbitrary text-to-motion grounding beyond explicit action labels (Wang et al., 2024, Zhang et al., 2024).

1. Dataset Construction and Core Features

HumanML3D comprises 14,616 distinct motion sequences sourced from the AMASS collection. Motion data are down-sampled or clipped to frame lengths ranging from 40 to 196 per sequence for training purposes, with a representational pipeline configured for compatibility with VQ-VAE tokenization (up to 50 discrete tokens per motion after temporal downsampling by a factor of $\ell = 4$). Technical metadata such as total wall-clock duration, sampling frequency, or skeleton hierarchy (joint count, naming, structure) is not specified in the source literature (Wang et al., 2024).
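
As a concrete illustration of these constraints, the minimal sketch below (hypothetical helper names, not part of any official toolkit) clips sequences to the 40–196 frame range and computes how many discrete tokens a motion would occupy under a temporal downsampling factor of $\ell = 4$:

```python
import math

# Hypothetical preprocessing sketch reflecting the constraints described above:
# sequences are kept/clipped to 40-196 frames, and a VQ-VAE with temporal
# downsampling factor l = 4 yields at most ceil(196 / 4) = 49 <= 50 tokens.
MIN_FRAMES, MAX_FRAMES = 40, 196
DOWNSAMPLE = 4          # temporal downsampling factor l
MAX_TOKENS = 50         # token budget per motion after downsampling

def clip_motion(frames):
    """Drop too-short sequences and clip overly long ones (assumption: simple truncation)."""
    if len(frames) < MIN_FRAMES:
        return None
    return frames[:MAX_FRAMES]

def num_tokens(num_frames):
    """Number of discrete VQ-VAE tokens a clipped motion would occupy."""
    return min(math.ceil(num_frames / DOWNSAMPLE), MAX_TOKENS)

assert num_tokens(MAX_FRAMES) == 49  # fits within the 50-token budget
```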

Unlike most motion datasets that assign static “action classes” or labels, HumanML3D pairs each motion with 3–5 free-form English descriptions—yielding a total of 44,970 textual action captions. These annotations are natural sentences (typically 7–15 words) designed to express the performed actions with no constraints on phrasing or structure. There exists no closed-set taxonomy or “class ID”; text-matching and model conditioning are conducted solely by natural-language prompts (Wang et al., 2024).
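
Conceptually, an annotation record therefore amounts to a motion reference plus a handful of free-form sentences. The layout below is purely illustrative (the identifiers and captions are invented) and does not reflect an official file format:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class TextMotionRecord:
    """Illustrative (unofficial) record layout: one motion, several free-form captions."""
    motion_id: str                                      # e.g. an AMASS-derived sequence identifier
    captions: List[str] = field(default_factory=list)   # typically 3-5 natural-language sentences

record = TextMotionRecord(
    motion_id="000123",
    captions=[
        "a person walks forward and waves with the right hand",
        "someone strolls ahead while raising an arm in greeting",
        "the subject walks a few steps and waves",
    ],
)
```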

No official train/validation/test split is published in the original dataset documentation. Prior works commonly use an 80/10/10 split, but this is not specifically restated for HumanML3D or HumanML3D++ in the core sources (Wang et al., 2024, Zhang et al., 2024).
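
Because these sources do not restate an official split, works typically draw their own partition. A minimal sketch of the commonly used 80/10/10 convention, assuming a seeded shuffle over sequence identifiers (function name and seed are illustrative):

```python
import random

def make_split(motion_ids, seed=42, ratios=(0.8, 0.1, 0.1)):
    """Seeded 80/10/10 partition over sequence IDs (a convention, not an official split)."""
    ids = list(motion_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train = int(ratios[0] * n)
    n_val = int(ratios[1] * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

train, val, test = make_split(f"{i:06d}" for i in range(14616))
```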

2. HumanML3D++: Expansion to Arbitrary Texts

HumanML3D++ extends HumanML3D by associating each existing action-text annotation with “scene texts”: sentences describing plausible event contexts without directly stating the performed action. Scene texts are automatically generated using an LLM (e.g., GPT-3), prompted to produce two antecedent sentences that do not reuse verbs from the corresponding action description. Each action caption thus receives two scene captions, generating 134,910 new scene sentences in total (i.e., $3 \times 44{,}970$) (Wang et al., 2024).
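
The exact prompt is not reproduced in the source; the sketch below only illustrates the stated constraints (produce antecedent sentences, avoid reusing the action's verbs), and the function name, wording, and default count are assumptions:

```python
# Illustrative sketch of scene-text generation with an LLM (exact prompt and model
# are assumptions; the source only states that antecedent sentences are requested
# and that verbs from the action caption must not be reused).
def build_scene_text_prompt(action_caption: str, num_scenes: int = 2) -> str:
    return (
        f"Write {num_scenes} short sentences describing a plausible situation that "
        f"happens just before the following action, without naming the action itself "
        f"and without reusing any of its verbs.\n"
        f"Action: {action_caption}"
    )

print(build_scene_text_prompt("a person crouches down and jumps forward"))
```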

Scene text quality is maintained through manual filtering: 15% of the generated scene texts were evaluated by 20 human raters, with a 94% pass rate for reasonable antecedence, followed by additional cleaning of abnormal generations. This expanded the dataset to 179,880 text–motion pairs, all referencing the same 14,616 motion sequences as HumanML3D. Each sequence now typically has 3–5 action texts and 6–10 scene texts as conditioning sources (Wang et al., 2024). No additional motion capture data were introduced; only the annotation diversity was increased.

This expansion enables the study of “arbitrary text-to-motion” mapping, supporting research into ambiguities and multimodal correspondences between free-form text and physical motion (Wang et al., 2024).

3. Motion and Pose Representation

Motion data within HumanML3D are encoded using the representational conventions established by T2M (Text2Motion) [Guo et al. 2022], specifically:

  • 3D joint rotations
  • 3D joint positions
  • Joint velocities
  • Foot-contact indicators

The specific dimensionality per joint or full pose vector is not enumerated in HumanML3D documentation; these details are adopted from the T2M pipeline. Motion data are typically encoded/decoded using a VAE architecture (as per MLD [Chen et al. 2023]) with latent space $z \in \mathbb{R}^{2 \times d}$, where the “two channels” usually refer to (positions, velocities) or (positions, rotations), but exact mappings are not specified. No nonstandard normalization, augmentation, or per-joint transformations are described beyond those in T2M and MLD (Zhang et al., 2024).
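
To make the $z \in \mathbb{R}^{2 \times d}$ convention concrete, the toy encoder below pools a motion sequence into a two-channel latent. The architecture, the 263-dimensional placeholder feature size, and all names are assumptions for illustration, not the MLD implementation:

```python
import torch
import torch.nn as nn

class ToyMotionVAEEncoder(nn.Module):
    """Minimal sketch (not the MLD architecture): pools a motion sequence into a
    latent z of shape (2, d), mirroring the z in R^{2 x d} convention noted above."""
    def __init__(self, feat_dim: int = 263, latent_dim: int = 256):
        super().__init__()
        # feat_dim is a placeholder; the exact per-frame dimensionality follows T2M.
        self.proj = nn.Linear(feat_dim, latent_dim)
        self.to_mu = nn.Linear(latent_dim, 2 * latent_dim)
        self.to_logvar = nn.Linear(latent_dim, 2 * latent_dim)

    def forward(self, motion):               # motion: (batch, frames, feat_dim)
        h = self.proj(motion).mean(dim=1)    # temporal pooling -> (batch, latent_dim)
        mu = self.to_mu(h).view(-1, 2, self.proj.out_features)
        logvar = self.to_logvar(h).view(-1, 2, self.proj.out_features)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return z                             # (batch, 2, latent_dim)

z = ToyMotionVAEEncoder()(torch.randn(4, 196, 263))
assert z.shape == (4, 2, 256)
```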

A specific “HumanML3D-LS” subset is defined in related work for evaluation on long sequences (motions longer than 190 frames) to stress-test temporal consistency (Zhang et al., 2024).
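
A trivial sketch of such a long-sequence filter, assuming per-sequence frame counts are available (identifiers and threshold handling are hypothetical):

```python
def select_long_sequences(lengths_by_id, threshold=190):
    """Illustrative HumanML3D-LS style filter: keep motions longer than `threshold` frames."""
    return [mid for mid, n_frames in lengths_by_id.items() if n_frames > threshold]

print(select_long_sequences({"000001": 120, "000002": 196, "000003": 64}))  # ['000002']
```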

4. Benchmarking Protocols and Evaluation Metrics

The predominant evaluation metrics applied on HumanML3D and HumanML3D++ include:

| Metric | Definition (when specified) | Purpose |
|---|---|---|
| R-Precision (R@$k$) | Fraction of samples for which the ground-truth text ranks in the top $k$ among 1 ground-truth + 31 (or 99) distractor texts, by feature similarity | Retrieval accuracy |
| FID | $\lVert\mu_r - \mu_g\rVert_2^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r\Sigma_g)^{1/2}\big)$ | Realism/diversity of generated motions |
| MM-Dist | $\mathbb{E}_{(T,M)}\big[\lVert f_{\mathrm{motion}}(M) - f_{\mathrm{text}}(T)\rVert_2\big]$ | Text–motion multimodal alignment |
| Diversity | $\frac{1}{300}\sum_{(i,j)}\lVert f_i - f_j\rVert_2$ across 300 generated pairs | Inter-sample variation |
| MModality | $\mathbb{E}_{T}\big[\frac{1}{10}\sum_{(i,j)}\lVert f_i - f_j\rVert_2\big]$ over 20 motions per text | Within-text multimodality |

R-Precision and MM-Dist assume a unique ground-truth text–motion pair, and thus may penalize legitimate variations admitted by ambiguous or arbitrary texts (e.g., scene texts that plausibly support multiple actions). FID compares the embedding distribution of real and generated motions, but does not fully characterize the diversity induced by generalized scene prompts (Wang et al., 2024, Zhang et al., 2024). Diversity and MModality specifically address sample and within-prompt variation, respectively.
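
As a concrete illustration, the sketch below computes FID, MM-Dist, and an R-Precision estimate from precomputed motion/text embedding arrays. It assumes NumPy/SciPy inputs and a simple Euclidean-distance ranking; it is not the standard evaluator implementation used in the cited benchmarks:

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    """FID between real and generated motion embeddings (rows = samples)."""
    mu_r, mu_g = real_feats.mean(0), gen_feats.mean(0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)
    covmean = covmean.real  # numerical noise can introduce tiny imaginary parts
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(sigma_r + sigma_g - 2 * covmean))

def mm_dist(motion_feats, text_feats):
    """Mean Euclidean distance between paired motion and text embeddings."""
    return float(np.linalg.norm(motion_feats - text_feats, axis=1).mean())

def r_precision(motion_feats, text_feats, k=3, pool=32):
    """R@k with 1 ground-truth + (pool-1) distractor texts per motion, ranked by distance."""
    hits, n = 0, len(motion_feats)
    for i in range(n):
        idx = np.random.choice([j for j in range(n) if j != i], pool - 1, replace=False)
        cand = np.concatenate([text_feats[i:i + 1], text_feats[idx]])
        d = np.linalg.norm(cand - motion_feats[i], axis=1)
        hits += int(0 in np.argsort(d)[:k])   # ground truth sits at index 0
    return hits / n
```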

All published baselines (e.g., Seq2Seq, JL2P, T2G, Hier, TEMOS, T2M, MDM, MotionDiffuse, MLD) use the same feature extraction pipeline and splits for reproducibility (Zhang et al., 2024).

5. Empirical Performance and Use in State-of-the-Art Research

HumanML3D serves as the primary testbed for recent advances in text-to-motion models, including the Motion Mamba architecture (Zhang et al., 2024). Quantitative results (summarized in the table below) benchmark R-Precision, FID, MM-Dist, Diversity, and MModality on the standard test set.

| Method | R@1 (↑, ±CI) | FID (↓, ±CI) | MM-Dist (↓, ±CI) | Diversity (→, ±CI) | MModality (↑, ±CI) | Avg. inference (s) |
|---|---|---|---|---|---|---|
| Real | 0.511±0.003 | 0.002±0.000 | 2.974±0.008 | 9.503±0.065 | – | – |
| MotionDiffuse | 0.491±0.001 | 0.630±0.001 | 3.113±0.001 | 9.410±0.049 | 1.553±0.042 | – |
| MLD | 0.481±0.003 | 0.473±0.013 | 3.196±0.010 | 9.724±0.082 | 2.413±0.079 | 0.217 |
| Motion Mamba | 0.502±0.003 | 0.281±0.009 | 3.060±0.058 | 9.871±0.084 | 2.294±0.058 | 0.058 |

Motion Mamba achieves state-of-the-art FID (0.281±0.009), approaching the realism and diversity of real samples, and delivers roughly a fourfold inference speedup relative to the previous best diffusion-based models. The HumanML3D-LS subset (motions longer than 190 frames) is used to further verify long-term dependency modeling, and user studies report higher subjective realism and text correspondence for motions produced by Motion Mamba.

6. Context, Significance, and Limitations

HumanML3D and its extension HumanML3D++ have become foundational resources for text-conditioned motion synthesis, virtual human interaction, and related areas, providing a benchmark that pairs extensive action and scene captions with realistic, fine-grained motion capture (Wang et al., 2024). The move from action-centric to arbitrary text-to-motion tasks, supported by the HumanML3D++ expansion, challenges existing methods by introducing annotation ambiguities and multimodal mappings, and thus calls for new evaluation protocols and diversity-sensitive metrics.

A major limitation is the lack of detailed metadata regarding skeleton topology, joint parameterization, and official split definitions, which impedes direct regression testing and structural ablation. The reliance on automated scene text generation, though human-filtered, introduces potential annotation noise. Furthermore, traditional single-solution metrics inadequately reward the legitimate diversity of outcomes when annotating implicit, scenario-based texts.

A plausible implication is that ongoing development of multimodal and distributional metrics, as well as datasets that encode explicit uncertainty or behavioral diversity, will be necessary to further advance the field. HumanML3D/3D++ offer a platform for these investigations in both academic and applied motion synthesis communities (Wang et al., 2024, Zhang et al., 2024).
