
FoundationMotion: Automated Motion Annotation

Updated 13 December 2025
  • FoundationMotion is a new framework that automatically extracts fine-grained motion annotations from videos using object-centric analysis and LLM-driven captioning.
  • It employs a multi-stage pipeline incorporating advanced detection, tracking, and motion-aware captioning to generate dense, semantically rich datasets.
  • By automating data curation, FoundationMotion significantly boosts vision-language model performance on motion reasoning benchmarks while reducing manual annotation costs.

FoundationMotion refers to a new class of models and datasets designed to serve as general-purpose, scalable, and semantically rich foundations for motion understanding, spatial reasoning, and downstream physical-world tasks across video, image, and human-centric domains. Motivated by the acute scarcity of large, fine-grained, and diverse motion annotations that stymies both spatial reasoning and physical scene modeling, FoundationMotion establishes fully automated pipelines that generate dense, high-quality motion datasets through object-centric video analysis and LLM-driven structured annotation. These resources enable robust fine-tuning of open-source vision-language and video-centric models, substantially improving performance on fine-grained motion understanding benchmarks and in many cases surpassing prior specialized and closed-source large models.

1. Automated Data Curation Pipeline

The FoundationMotion pipeline systematically transforms raw video streams into structured, high-value motion data at scale without manual annotation bottlenecks (Gan et al., 11 Dec 2025):

  • Temporal Clip Extraction: For a video $V$ of duration $t_v$, a randomly centered subclip of length $t_s \sim U(5, \min(10, t_v))$ is sampled near the temporal midpoint. Jitter $\epsilon \sim U(-0.2\,t_v, 0.2\,t_v)$ is added for diversity. Excessive camera motion is suppressed via a composite metric $s_m = \alpha \Delta_t + \beta \Delta_r + \gamma \max(\Delta_t) + \delta \max(\Delta_r)$, and clips with $s_m > 0.3$ are discarded to prioritize object-centric dynamics.
  • Multi-System Object and Hand Detection: Qwen2.5-VL-7B proposes per-frame open-vocabulary object categories $\mathcal{O}$ on $I_0$; Grounded-DINO performs bounding-box detection over these labels, while person regions are localized by Cascade Mask R-CNN (ViTDet-H, threshold $> 0.8$).
  • Whole-Body and Hand-Object Association: ViTPose+ infers 42 keypoints per person (expanded hand regions), while Hands23 tracks hand-object contact with semantic states (no_contact, self_contact, object_contact, other_contact) and outputs additional object boxes for interacting hands. Entities are cross-associated by IoU $> 0.3$.
  • Spatial Tracking/Propagation: Starting from detected masks $M_0$, SAM2 video propagation refines and tracks all object and part IDs across frames. New detections $\mathcal{B}_{new}$ can instantiate new tracklets. Resultant trajectories $b_i(t)$ are recorded as normalized bounding boxes; these data underpin all subsequent annotation.

This architecture establishes an end-to-end, dataset-scale system for extracting precise object- and agent-centric spatiotemporal structure from unconstrained video, eliminating reliance on manual labeling and greatly increasing scalability.
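
The clip-sampling and camera-motion filtering stage can be illustrated with a short sketch. This is not the authors' released code: the weights $\alpha, \beta, \gamma, \delta$ and the source of the per-frame translation/rotation magnitudes are assumptions, while the sampling ranges and the 0.3 threshold follow the description above.

```python
import random

def sample_clip(t_v: float, jitter_frac: float = 0.2) -> tuple[float, float]:
    """Sample a subclip of length t_s ~ U(5, min(10, t_v)) centered near the
    video midpoint, jittered by eps ~ U(-0.2*t_v, 0.2*t_v).
    Assumes t_v >= 5 s, consistent with the U(5, min(10, t_v)) range."""
    t_s = random.uniform(5.0, min(10.0, t_v))
    eps = random.uniform(-jitter_frac * t_v, jitter_frac * t_v)
    center = t_v / 2.0 + eps
    start = min(max(center - t_s / 2.0, 0.0), t_v - t_s)
    return start, start + t_s

def camera_motion_score(delta_t: list[float], delta_r: list[float],
                        alpha: float = 0.25, beta: float = 0.25,
                        gamma: float = 0.25, delta: float = 0.25) -> float:
    """Composite metric s_m = alpha*Dt + beta*Dr + gamma*max(Dt) + delta*max(Dr),
    reading Dt/Dr as mean per-frame translation/rotation magnitudes.
    The weight values here are placeholders; the paper does not report them."""
    mean_t = sum(delta_t) / len(delta_t)
    mean_r = sum(delta_r) / len(delta_r)
    return alpha * mean_t + beta * mean_r + gamma * max(delta_t) + delta * max(delta_r)

def keep_clip(delta_t: list[float], delta_r: list[float], threshold: float = 0.3) -> bool:
    """Discard clips whose camera-motion score exceeds the 0.3 threshold."""
    return camera_motion_score(delta_t, delta_r) <= threshold
```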

2. Automated Caption and QA Generation with LLMs

Once trajectories and interaction events are detected, FoundationMotion leverages a large language model (GPT-4o-mini) to generate both fine-grained captions and multiple-choice QA pairs grounded in the video frames and structured motion JSONs (Gan et al., 11 Dec 2025):

  • Motion-Focused Captioning: Frames sampled at 2 fps plus a JSON summary (“motion_info”) encoding per-object trajectories and hand/object interactions are passed to a system-primed LLM, which returns a single comprehensive caption structured to cover seven semantic axes: action/gesture, temporal sequencing, object-action association, spatial context, repetition, motion dynamics (e.g., direction, velocity, trajectory), and evolution of spatial relationships. Captions are thus dense, compositional, and tailored to motion-centric analysis.
  • Motion-Reasoning QA Generation: The generated caption and frames are passed to the LLM, which must return four-option multiple-choice questions (with the correct answer marked “A”) drawn from five reasoning categories: motion recognition, temporal ordering, action-object association, location-context motion, and repetition counting. Distractors must be plausible and sampled from video content. The result is a diverse, balanced QA dataset that stresses not only simple action recognition but also causal and spatial-temporal reasoning.

Table: Automated Annotation Modalities

| Data Type | Method/Model | Content/Focus |
|---|---|---|
| Caption | GPT-4o-mini | Fine-grained free text covering the seven motion/spatial-relationship axes |
| QA (multiple choice) | GPT-4o-mini | Five reasoning types: recognition, ordering, association, counting, location |

This process raises the density, diversity, and relevance of the annotation signal, supporting motion reasoning tasks far beyond conventional action classification.
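
The captioning and QA prompting described above can be sketched with the OpenAI chat API. The exact system prompts and request parameters used by the authors are not published, so the prompts, helper names, and frame handling below are illustrative assumptions; only the model choice (GPT-4o-mini), the 2 fps frames, the motion_info JSON, the seven caption axes, and the five QA categories come from the paper.

```python
import base64
import json
from openai import OpenAI  # assumes the official openai Python client is installed

client = OpenAI()

SEVEN_AXES = ("action/gesture, temporal sequencing, object-action association, "
              "spatial context, repetition, motion dynamics, and evolution of "
              "spatial relationships")

def encode_frame(path: str) -> dict:
    """Encode a frame (sampled at 2 fps) as an image_url content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

def motion_caption(frame_paths: list[str], motion_info: dict) -> str:
    """Request one dense caption covering the seven semantic axes, grounded in
    the frames and the structured motion_info JSON."""
    messages = [
        {"role": "system",
         "content": f"Write one comprehensive motion-focused caption covering: {SEVEN_AXES}."},
        {"role": "user",
         "content": [{"type": "text", "text": "motion_info: " + json.dumps(motion_info)}]
                    + [encode_frame(p) for p in frame_paths]},
    ]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content

def motion_qa(frame_paths: list[str], caption: str, n_questions: int = 10) -> str:
    """Request four-option multiple-choice questions spanning the five reasoning
    categories, with plausible distractors grounded in the video."""
    prompt = (
        f"Given the caption below, write {n_questions} four-option multiple-choice "
        "questions covering motion recognition, temporal ordering, action-object "
        "association, location-context motion, and repetition counting. Mark the "
        "correct option 'A' and keep distractors plausible and grounded in the video.\n\n"
        "Caption: " + caption
    )
    messages = [{"role": "user",
                 "content": [{"type": "text", "text": prompt}]
                            + [encode_frame(p) for p in frame_paths]}]
    resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
    return resp.choices[0].message.content
```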

3. Scale, Composition, and Structure of FoundationMotion Data

The resulting FoundationMotion dataset is one of the largest and most granular collections of object-motion–centric video annotation to date (Gan et al., 11 Dec 2025):

  • Scale:
    • 46.7K auto-labeled video clips (mean duration 17.51 s)
    • 467K multiple-choice questions (~10 per video, each with four options)
  • Coverage:
    • Categories encompass vehicles, whole bodies, hands (left/right), and physically interacted objects
    • 75% of questions are 30–80 characters; annotation density is 1.67 QA/s
    • Motion types span single/multi-agent, person-object, and articulated hand actions
  • Statistical Uniformity:
    • Correct answers are uniformly distributed across choice indices
    • Clip durations are tightly concentrated (80% 3–7s)

A key aspect is the tight coupling of trajectories (explicit, per-frame bounding boxes, hand-object associations) to semantically rich language, enabling high-precision evaluation and fine-grained model supervision.
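
To make this coupling concrete, a single record could look like the following Python dict. The field names and values are hypothetical, since the summary does not specify the released data schema; only the kinds of content (normalized per-frame boxes, hand-contact states, a dense caption, four-option QA) follow the description above.

```python
# Hypothetical layout of one FoundationMotion record; field names are illustrative.
example_record = {
    "clip_id": "video_000123_clip_02",            # assumed identifier format
    "duration_s": 6.4,
    "motion_info": {
        "objects": [
            {
                "id": 0,
                "category": "cup",
                # Per-frame normalized bounding boxes b_i(t): [x1, y1, x2, y2] in [0, 1]
                "trajectory": [[0.41, 0.55, 0.48, 0.66],
                               [0.43, 0.53, 0.50, 0.64]],
            }
        ],
        "hands": [
            {"id": 1, "side": "right",
             "contact_state": "object_contact",   # one of the Hands23 semantic states
             "contact_object_id": 0}
        ],
    },
    "caption": "A right hand reaches toward a cup, grasps it, and lifts it upward ...",
    "qa": [
        {
            "question": "In which direction does the cup move after being grasped?",
            "options": {"A": "Upward", "B": "Leftward", "C": "It stays still", "D": "Downward"},
            "answer": "A",
        }
    ],
}
```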

4. Model Training, Fine-Tuning, and Evaluation Protocols

Open-source vision-language and video-centric models (NVILA-Video-15B, Qwen2.5-VL-7B, etc.) are fine-tuned using FoundationMotion's large dataset of (video, caption, QA) triplets (Gan et al., 11 Dec 2025):

  • Loss Functions: $\mathcal{L}_{caption}$ (language modeling over captions) and $\mathcal{L}_{QA}$ (cross-entropy over answer choices); a minimal sketch of the combined objective follows this list
  • Batching / Optimization: Adam/AdamW, large-batch gradient accumulation, cosine-annealed learning rates ($\approx 1$–$1.5 \times 10^{-5}$), batch size 32, compute on 8$\times$A100 GPUs
  • Evaluation: Performance is measured as accuracy (%) on public motion-QA benchmarks (MotionBench, VLM4D, AV-Car, AV-Hand, “Daily” hand-action QA, Robotics QA) and zero-shot generalization tasks. All are formatted as multiple-choice selection on held-out (video, QA) pairs.
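
A minimal sketch of the combined objective, assuming the caption loss is standard next-token cross-entropy and the QA loss is cross-entropy over the four answer options; the model interface below is a hypothetical stand-in, not the actual NVILA or Qwen2.5-VL training API.

```python
import torch.nn.functional as F

def finetune_step(model, batch, optimizer, scheduler):
    """One fine-tuning step with the combined objective L = L_caption + L_QA,
    using AdamW and a cosine-annealed learning rate (~1e-5 to 1.5e-5 per the
    paper); gradient accumulation is omitted for brevity. The forward interfaces
    (model(...), model.answer_head(...)) are hypothetical placeholders."""
    # Caption loss: next-token language modeling over the motion-focused caption.
    caption_logits = model(batch["video"], batch["caption_ids"])         # (B, T, vocab)
    l_caption = F.cross_entropy(
        caption_logits[:, :-1].reshape(-1, caption_logits.size(-1)),
        batch["caption_ids"][:, 1:].reshape(-1),
        ignore_index=-100,                                               # ignore padding
    )

    # QA loss: cross-entropy over the four multiple-choice answer options.
    answer_logits = model.answer_head(batch["video"], batch["question_ids"])  # (B, 4)
    l_qa = F.cross_entropy(answer_logits, batch["answer_index"])

    loss = l_caption + l_qa
    loss.backward()
    optimizer.step()      # AdamW
    scheduler.step()      # cosine annealing schedule
    optimizer.zero_grad()
    return loss.item()
```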

Sample performance improvements when fine-tuning with FoundationMotion on NVILA-15B:

| Task | Pretrained (%) | +FoundationMotion (%) | Δ |
|---|---|---|---|
| AV-Car | 84.4 | 91.5 | +7.1 |
| AV-Hand | 58.1 | 58.7 | +0.6 |
| Daily (How-motion) | 76.2 | 78.6 | +2.4 |
| Robotics | 21.4 | 36.3 | +14.9 |
The fine-tuned open-source models frequently surpass much larger closed-source models (e.g., Gemini-2.5-Flash scores 84.1% on AV-Car, below the 91.5% reached here) and unlock higher accuracy on fine-grained, previously underexplored spatial-motion tasks.

5. Analysis of Annotation and Evaluation Protocols

Comprehensive ablations and quality checks reveal several substantive findings (Gan et al., 11 Dec 2025):

  • Annotation Modality Enhancement: Supplying LLMs with explicit per-object motion JSONs, in addition to images, boosts overall QA quality (+2.3/10), fine-grained action accuracy (+2.6), motion detail (+2.6), temporal coherence (+2.4), and relevance (+1.6), as independently rated by GPT-4.
  • Reasoning-Type Complementarity: Fine-tuning on diverse QA types (recognition, order, association, counting, spatial) yields improved and more stable generalization; the largest gains accrue from counting and spatial grounding.
  • Model Size vs Data Quality: Mid-sized models trained on FoundationMotion can outperform much larger base or closed-source models, highlighting the impact of annotation quality and motion-centric supervision.
  • Current Limitations: FoundationMotion's annotations are currently 2D; 3D articulated hand pose and trajectory reasoning remain an open target. Very fast or very small motions can cause detection and tracking failures. Proposed remedies include integrating multi-view or depth cues.

A plausible implication is that model performance on motion reasoning tasks is highly sensitive to the structure and richness of annotation, rather than just raw scale, and that motion-object–centric JSONs are critical for high-fidelity language grounding.

6. Impact, Limitations, and Future Directions

FoundationMotion advances the state of motion understanding by establishing a fully automated, scalable pipeline for generating rich, object-centric datasets, and by empirically demonstrating that these data can be used to fine-tune vision-LLMs for substantial performance gains across motion reasoning benchmarks (Gan et al., 11 Dec 2025). This approach resolves the previously prohibitive cost and subjectivity of manual annotation, greatly extends the scope and accuracy of spatial reasoning, and enables reliable benchmarking of both open-source and proprietary models.

Identified limitations include the current restriction to 2D spatial annotations, imperfect detection/tracking of fine-scale or ambiguous motions, and the challenge of extending these methods to articulated and three-dimensional contexts. Authors propose extensions in multi-view geometry, explicit 3D reasoning, and further automation of motion event decomposition. The pipeline's modularity and success indicate that analogous frameworks could support foundational motion models for domains beyond standard video, such as robotics, embodied AI, and human-object interaction research.

7. Relationship to Broader FoundationMotion Landscape

Within the emerging literature, FoundationMotion's fully automated annotation approach is highly complementary to other foundation motion initiatives, such as MotionBank's rule-based kinematic captions (Xu et al., 17 Oct 2024), MoFM's discrete variational representation with “Thermal Cubes” (Baharani et al., 8 Feb 2025), prototypical motion representation in ProMotion (Lu et al., 7 Jun 2024), and large-scale action video corpora. All center the notion that data curation (object tracking, fine-grained annotation, motion tokenization) is the core technical innovation required to bootstrap robust, generalizable, and interpretable motion models for a broad spectrum of downstream video-language and spatial reasoning tasks.

In summary, FoundationMotion operationalizes the construction of scalable, fine-grained, object- and part-centric motion datasets via fully automated video analysis and LLM-driven caption/QA synthesis, directly enabling state-of-the-art model performance, rigorous benchmarking, and new directions in interpretability and assessment of motion understanding capabilities (Gan et al., 11 Dec 2025).
