
Complex Motion Dataset (CompMo)

Updated 11 November 2025
  • CompMo is a large-scale dataset richly annotated for temporally grounded 3D human motion captioning with long sequences and multiple atomic actions.
  • It features an automated three-stage data generation pipeline that ensures precise temporal and linguistic annotations without relying on manual frame-level checks.
  • The dataset offers enhanced evaluation metrics and comprehensive benchmarks, driving research in dense captioning and fine-grained motion-language alignment.

The Complex Motion Dataset (CompMo) is a large-scale, richly annotated corpus designed for advanced research in 3D human motion understanding and dense captioning. CompMo addresses the limitations of prior motion–language datasets by providing long motion sequences (average 39.88 seconds) densely populated with multiple atomic actions (2–10 per sequence), each described in natural language with accurately annotated temporal boundaries. Through a rigorously automated data generation pipeline, CompMo provides a new benchmark for temporally grounded captioning and the study of compositional, context-rich human motion.

1. Dataset Structure and Composition

CompMo consists of 60,000 motion sequences, each uniformly sampled to contain between 2 and 10 atomic actions. The distribution of actions per sequence yields a roughly flat histogram, ensuring coverage across a wide variety of compositional movement. Each atomic action is assigned a distinct textual caption, resulting in approximately 11,000 unique atomic descriptions across the dataset. The number of captions per sequence equals the number of atomic actions present, and the average caption length is 37.74 words, contrasting sharply with the shorter, single-sentence annotations of previous datasets (HumanML3D: 12.0 words; BABEL: 11.06 words).
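For concreteness, a single CompMo entry can be pictured as the record below; the field names, file layout, and motion representation are illustrative assumptions rather than the dataset's published schema.

```python
# Illustrative sketch of one CompMo record. Field names, file layout, and the
# motion representation are hypothetical, not the dataset's published schema.
example_sequence = {
    "motion_file": "motion_000123.npy",   # 3D pose sequence (e.g., SMPL parameters)
    "duration_s": 41.2,                   # total length in seconds (dataset average: 39.88 s)
    "segments": [                         # one timestamped caption per atomic action (2-10 per sequence)
        {"start": "00:00:000", "end": "00:06:450",
         "caption": "a person walks forward slowly, then slows to a stop"},
        {"start": "00:06:450", "end": "00:13:900",
         "caption": "the person turns to the left and raises both arms overhead"},
        # ...
    ],
}
```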

2. Data Generation Pipeline

The construction of CompMo follows a three-stage automated pipeline:

  1. Atomic Actions Collection. Source motions are drawn from a simple subset of HumanML3D (captions containing at most one verb). Two generation strategies are employed:
    • Generation from scratch using the diffusion-based MDM-SMPL model (STMC), with each atomic text as conditioning. Cosine similarity in TMR embedding space is computed; samples with similarity < 0.5 are rejected.
    • For unsuccessful generations, the original HumanML3D motion is drawn directly. The finalized atomic pool comprises 7,503 purely generated atomics and 3,619 human-sourced atomics, all quality-controlled using a minimum TMR similarity threshold of 0.5.
  2. Textual Description Composition. Between 2 and 10 atomic descriptions are randomly sampled and concatenated into a sequence, with precise timestamp tags in “mm:ss:ms – text” format. The duration of each segment is sampled as:

T \sim \bigl[\,\beta\,T_{gt} + \alpha,\ \min\bigl((2-\beta)\,T_{gt} + \alpha,\ T_{gt} + \beta + 1\bigr)\bigr]

where $\alpha = 0.3$ and $\beta = 0.8$, introducing small perturbations to the ground-truth durations for realism (a minimal sampling sketch follows the pipeline list).

  3. Motion Sequence Generation. Denoising and sequencing use DiffCollage (for temporal stitching) and STMC’s body-part stitching. Generated atomics start from pure noise (100 denoising steps), while human-sourced segments undergo 30 forward-diffusion steps followed by 100 denoising steps. Transitions between consecutive actions are smoothed by a 0.5 s overlap. Body-part annotations are generated with GPT-4 to guide STMC.
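The duration-perturbation step in stage 2 can be sketched as below; a uniform draw over the stated interval is assumed, since the sampling distribution is not otherwise specified, and the function name is illustrative.

```python
import random

ALPHA = 0.3  # additive slack, seconds
BETA = 0.8   # multiplicative lower-bound factor

def sample_segment_duration(t_gt: float) -> float:
    """Perturb a ground-truth atomic duration t_gt (in seconds).

    Implements T ~ [beta*t_gt + alpha, min((2 - beta)*t_gt + alpha, t_gt + beta + 1)];
    a uniform draw over that interval is an assumption.
    """
    lo = BETA * t_gt + ALPHA
    hi = min((2 - BETA) * t_gt + ALPHA, t_gt + BETA + 1)
    return random.uniform(lo, hi)
```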

3. Temporal Annotation Protocol

Every atomic caption is paired with a start time $s_i$ and end time $e_i$, denoted in the metadata as “mm:ss:ms – caption.” Timestamps are programmatically generated during composition and are preserved through denoising. Quality validation is executed via adherence to the sampled duration distribution, atomic-level TMR similarity filtering, and temporal plausibility constraints (including overlap enforcement). Manual frame-by-frame annotation is not used; all temporal labeling is algorithmically derived, eliminating potential subjectivity and human error.
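A minimal formatting helper for the “mm:ss:ms – caption” convention is sketched below; the exact zero-padding and separator characters are assumptions about the metadata format.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in the "mm:ss:ms" style used by CompMo annotations.

    Zero-padded two-digit minutes/seconds and three-digit milliseconds are
    assumed; the dataset's exact field widths are not specified here.
    """
    total_ms = int(round(seconds * 1000))
    minutes, rem = divmod(total_ms, 60_000)
    secs, millis = divmod(rem, 1000)
    return f"{minutes:02d}:{secs:02d}:{millis:03d}"

def annotate(caption: str, start_s: float, end_s: float) -> str:
    """Attach start/end timestamps to an atomic caption."""
    return f"{format_timestamp(start_s)} - {format_timestamp(end_s)} - {caption}"

# Example: annotate("a person walks forward", 0.0, 6.45)
# -> "00:00:000 - 00:06:450 - a person walks forward"
```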

4. Comparative Dataset Analysis

CompMo substantially exceeds previous 3D motion–language datasets in coverage, granularity, and annotation precision. Table 1 from (Xu et al., 7 Nov 2025) summarizes key comparative statistics:

| Dataset | Sequences | Avg. Duration | Annotation Style |
|---|---|---|---|
| KIT-ML | 3,911 | 10.33 s | Single-sentence |
| HumanML3D | 14,616 | 7.1 s | Single-sentence |
| BABEL | 13,220 | 12.26 s | Label-only |
| MotionX | 81,084 | 6.4 s | Single-sentence |
| MotionX++ | 120,462 | 5.4 s | Single-sentence |
| FineMotion | 14,616 | 7.1 s | Fine-grained body-part |
| CompMo | 60,000 | 39.88 s | Dense, timestamped |

Action category frequencies in CompMo follow a long-tailed Zipf distribution: frequent actions (e.g., “walk forward”) are heavily represented, while rare actions appear only once.

5. Evaluation Metrics and Analytical Formulas

CompMo supports a comprehensive set of quantitative metrics for both temporal and linguistic evaluation:

  • Temporal Intersection over Union (IoU):

\mathrm{IoU}(t_p, t_g) = \frac{\max\bigl(0,\ \min(e_p, e_g) - \max(s_p, s_g)\bigr)}{\max(e_p, e_g) - \min(s_p, s_g)}

  • Mean tIoU and F1 Score (F1@τ): Greedy matching pairs predicted and ground-truth segments to maximize IoU; mean tIoU averages the resulting overlaps, while the F1 score is computed at IoU threshold $\tau$ (a minimal sketch appears after this list).
  • Captioning Metrics: BLEU@n, ROUGE-L, METEOR, CIDEr, and SODA (a temporally aligned METEOR penalized for caption redundancy; SODA(B) incorporates BERTScore).
  • Motion–Text Alignment: TMR similarity (cosine similarity in a shared embedding space) and CAR (Chronologically Accurate Retrieval), which measures retrieval accuracy under temporal caption shuffling.
  • Cross-Entropy Loss for Training:

p(\mathbf{y} \mid m, x_{\text{inst}}) = \prod_{i=1}^{L} p_\theta(y_i \mid m, x_{\text{inst}}, y_{<i}), \qquad \mathcal{L} = -\sum_{i=1}^{L} \log p_\theta(y_i \mid m, x_{\text{inst}}, y_{<i})
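The temporal metrics above can be computed roughly as in the sketch below; one-to-one greedy matching and normalizing mean tIoU by the number of ground-truth segments are assumptions about the benchmark's exact protocol.

```python
from typing import List, Tuple

Segment = Tuple[float, float]  # (start, end) in seconds

def temporal_iou(pred: Segment, gt: Segment) -> float:
    """IoU between predicted and ground-truth temporal segments."""
    (sp, ep), (sg, eg) = pred, gt
    inter = max(0.0, min(ep, eg) - max(sp, sg))
    union = max(ep, eg) - min(sp, sg)
    return inter / union if union > 0 else 0.0

def greedy_match(preds: List[Segment], gts: List[Segment]) -> List[Tuple[int, int, float]]:
    """Greedily pair predictions with ground truths by descending IoU (one-to-one assumed)."""
    candidates = sorted(
        ((temporal_iou(p, g), i, j) for i, p in enumerate(preds) for j, g in enumerate(gts)),
        reverse=True,
    )
    used_p, used_g, matches = set(), set(), []
    for iou, i, j in candidates:
        if iou > 0 and i not in used_p and j not in used_g:
            matches.append((i, j, iou))
            used_p.add(i)
            used_g.add(j)
    return matches

def mean_tiou_and_f1(preds: List[Segment], gts: List[Segment], tau: float = 0.5):
    """Mean tIoU (averaged over ground-truth segments) and F1 at IoU threshold tau."""
    matches = greedy_match(preds, gts)
    mean_tiou = sum(iou for _, _, iou in matches) / max(len(gts), 1)
    tp = sum(1 for _, _, iou in matches if iou >= tau)
    precision = tp / max(len(preds), 1)
    recall = tp / max(len(gts), 1)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return mean_tiou, f1
```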

6. Strengths, Limitations, and Recommendations

Strengths

  • Scale and Complexity: The dataset's large size (60,000 sequences) and multi-action composition facilitate detailed study of dense captioning tasks.
  • Temporal Precision: Timestamp boundaries are generated to millisecond-level accuracy and algorithmically validated.
  • Linguistic Richness: An average caption length of 37.74 words accommodates nuanced description of motion subtleties.
  • Pipeline Reproducibility: Automated generation with TMR filtering yields consistent data fidelity.

Limitations

  • Transitional Coherence: Random action composition can lead to semantically abrupt or implausible transitions (e.g., “swimming” followed by “basketball”).
  • Annotation Granularity: Absence of human-verified frame-level annotation may impose limitations on tasks requiring high-resolution action segmentation.
  • Semantic and Causal Constraints: No explicit enforcement of causality or context continuity across concatenated actions.

Recommendations for Future Research

  • Pretraining on simpler datasets (e.g., HumanML3D) improves motion–language alignment prior to dense-caption instruction tuning using CompMo.
  • Continuous motion adapters (linear + MLP) outperform VQ-VAE tokenization by mitigating discretization artifacts.
  • Sliding-window encoding (window size $W = 16$, stride $S = 8$) enhances efficiency when processing long sequences (see the sketch after this list).
  • LoRA is recommended for lightweight LLM adaptation, with focused updates on Q/V projections.
  • Consistent application of TMR-based filtering maintains motion–text alignment fidelity in any dataset extension or fine-tuning procedure.
  • Semantic coherence should be prioritized by clustering atomic actions from common activity contexts during sequence composition.
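As a rough illustration of the adapter and sliding-window recommendations above, the sketch below maps motion features into an LLM's embedding space with a linear + MLP adapter and chunks long sequences into overlapping windows with $W = 16$ and $S = 8$; the dimensions, activation, and module layout are assumptions, not a reference implementation.

```python
import torch
import torch.nn as nn

class ContinuousMotionAdapter(nn.Module):
    """Linear + MLP adapter from raw motion features to the LLM embedding space.

    Hidden size, activation, and feature dimensions are assumptions.
    """
    def __init__(self, motion_dim: int = 263, llm_dim: int = 4096, hidden: int = 1024):
        super().__init__()
        self.proj = nn.Linear(motion_dim, hidden)
        self.mlp = nn.Sequential(nn.GELU(), nn.Linear(hidden, llm_dim))

    def forward(self, motion: torch.Tensor) -> torch.Tensor:  # (T, motion_dim)
        return self.mlp(self.proj(motion))                    # (T, llm_dim)

def sliding_windows(features: torch.Tensor, window: int = 16, stride: int = 8) -> torch.Tensor:
    """Split a (T, D) feature sequence into overlapping (num_windows, window, D) chunks."""
    if features.shape[0] < window:  # zero-pad sequences shorter than one window
        pad = window - features.shape[0]
        features = torch.cat([features, features.new_zeros(pad, features.shape[1])])
    return features.unfold(0, window, stride).permute(0, 2, 1).contiguous()
```

Each window can then be passed through the adapter and prepended to the text instruction tokens before decoding; this ordering is one plausible arrangement, not a prescribed one.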

Researchers seeking advances in motion captioning or temporal localization will benefit from CompMo’s methodological rigor and extensive annotation, but should remain attentive to issues of semantic connectivity and annotation granularity.

7. Significance and Impact

By introducing a corpus characterized by scale, granularity, and linguistic variety, CompMo redefines the benchmarks for dense motion captioning and temporally resolved 3D motion understanding. Its automated, reproducible pipeline sets a precedent for quality control in dataset generation, while its evaluation framework enables precise quantitative assessment. As the field evolves towards increasingly compositional and context-driven models, CompMo provides the necessary substrate for advancing temporal localization, fine-grained captioning, and alignment between motion and language modalities. Researchers are encouraged to adopt best practices outlined above to extend dataset utility and efficacy in future motion understanding tasks.
