RMEG: Referring Motion Expression Generation
- RMEG is a multimodal generation task that produces unambiguous text or gestures describing the motion and spatial relations of dynamic objects.
- Systems operate on video, audio, scene, and textual inputs, using methods such as temporal convolutions and diffusion-based gesture decoding to enforce explicit grounding.
- Evaluation relies on metrics such as METEOR, CIDEr, and FID, with integrated multimodal fusion and physics-based supervision yielding measurable performance gains.
Referring Motion Expression Generation (RMEG) encompasses a class of multimodal generation tasks that synthesize temporally grounded, unambiguous linguistic or gestural outputs referring specifically to entities and their motion or spatial relations, given video, audio, scene, or textual inputs. The goal is to produce output—either as natural language (text or speech) or as physically plausible gesture sequences—that disambiguates a referent by encoding its characteristic movement or spatial configuration, as distinct from static descriptions. RMEG is foundational for creating embodied agents capable of situated, communicative interaction, and for advancing fine-grained video understanding that incorporates dynamic reference (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).
1. Formal Definitions and Problem Setup
RMEG is instantiated in two primary regimes:
- Language-centric RMEG: Given a video $V = \{f_t\}_{t=1}^{T}$ of $T$ frames and binary object masks $M = \{m_t\}_{t=1}^{T}$ tracking the target objects, generate a referring expression $E = (e_1, \dots, e_L)$ that highlights the (possibly collective) motion of those objects. Mathematically, the objective is to learn a generator $G_\theta$ that maximizes $p_\theta(E \mid V, M)$.
Training is conducted via maximum likelihood over reference pairs $(V, M, E^{*})$:
$$\mathcal{L}_{\text{gen}}(\theta) = -\sum_{k=1}^{L} \log p_\theta\!\left(e^{*}_{k} \mid e^{*}_{<k},\, V,\, M\right)$$
(a minimal sketch of this objective appears at the end of this subsection).
- Gesture-centric RMEG: For a given utterance $u$, 3D scene state $S$ (object positions, scene layout), and ground-truth HumanML3D motion sequence $x_{1:T}$, learn a mapping
$$g_\phi : (u, S) \mapsto \hat{x}_{1:T}$$
such that $\hat{x}_{1:T}$ is a physically plausible gesture sequence spatially referring to the intended object(s) (Deichler et al., 6 Jul 2025).
Key to both variants is the requirement for explicit grounding: the generated output must uniquely localize the target temporally and spatially by leveraging the motion context. This precludes "static" expressions and demands compositional language or bodily movement tightly coupled to observed or intended dynamics.
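The following is a minimal sketch of the language-centric maximum-likelihood objective, assuming a generic autoregressive text decoder conditioned on pooled video-and-mask features; the decoder signature and module names are illustrative, not taken from either paper.

```python
import torch
import torch.nn.functional as F

def rmeg_mle_loss(decoder, video_mask_features, target_tokens, pad_id=0):
    """Cross-entropy (maximum-likelihood) objective for language-centric RMEG.

    decoder             : autoregressive decoder returning per-token logits (assumed API)
    video_mask_features : (B, N, D) encoded video frames fused with the target masks
    target_tokens       : (B, L) token ids of the reference referring expression
    """
    # Teacher forcing: predict token k from tokens < k and the visual context.
    inputs, targets = target_tokens[:, :-1], target_tokens[:, 1:]
    logits = decoder(inputs, context=video_mask_features)  # (B, L-1, vocab)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,  # ignore padding positions
    )
```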
2. Datasets and Data Representation
2.1. MeViS: Multimodal Video Corpus for RMEG
The MeViS dataset supports RMEG via a large corpus of 2,006 videos, each annotated with binary masks for 8,171 distinct objects and paired with 33,072 motion-centric referring expressions. These annotations, distributed equally across train, validation, and test splits, are constructed under constraints that enforce motion-centric referencing (e.g., "flying away," "turning around"), systematically discouraging the use of static descriptors. For each example, the dataset provides both text and time-aligned speech forms, as well as per-frame pixel-level masks (approximately 443,000 frames in total) (Ding et al., 11 Dec 2025).
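For concreteness, one plausible per-example layout implied by the description above is sketched below; the field names and array shapes are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class MeViSSample:
    """Hypothetical per-example structure for motion-centric referring expression data."""
    frames: np.ndarray       # (T, H, W, 3) RGB video frames
    masks: np.ndarray        # (T, K, H, W) binary per-frame masks for K target objects
    expression_text: str     # motion-centric referring expression, e.g. "flying away"
    speech_path: str         # path to the time-aligned spoken form of the expression
```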
2.2. Grounded Gesture Generation Datasets
Two datasets underpin gesture-based RMEG:
- Synthetic Pointing Motions (ST): Contains 1,135 motion-capture clips of isolated pointing gestures, each with a ground-truth 3D target location and TTS-generated speech aligned to the motion using the Hungarian algorithm (an illustrative alignment sketch follows below). Clip duration averages 4.85 seconds.
- MM-Conv VR Dialog: Comprises 6.14 hours of dyadic VR dialogues in AI2-Thor environments, including 2,394 referential (gesture + object reference) and 2,721 non-referential interactions with full annotation of utterance, object IDs, reference types, and timestamps.
Both are standardized in HumanML3D, encoding each pose frame as a 263-dimensional real vector: root velocities, joint positions, rotations, angular velocities, and binary foot-contact flags (Deichler et al., 6 Jul 2025).
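The Hungarian-algorithm alignment mentioned for the ST data can be illustrated with `scipy.optimize.linear_sum_assignment`; the duration-mismatch cost used here is an assumption, chosen only to show the mechanics of the one-to-one matching.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_speech_to_motion(speech_durations, motion_durations):
    """Match each TTS utterance to a motion clip by minimizing total duration
    mismatch (illustrative cost; the dataset's actual criterion may differ).

    speech_durations : (N,) utterance lengths in seconds
    motion_durations : (M,) clip lengths in seconds, with M >= N
    """
    cost = np.abs(np.asarray(speech_durations)[:, None]
                  - np.asarray(motion_durations)[None, :])  # (N, M) pairwise mismatch
    rows, cols = linear_sum_assignment(cost)                # optimal one-to-one assignment
    return list(zip(rows.tolist(), cols.tolist()))          # (speech_idx, motion_idx) pairs
```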
| Dataset | Type | Annotation | Unique Feature |
|---|---|---|---|
| MeViS | Video+Text | Pixel masks, Audio | Strict motion-based language, objects |
| ST (Synthetic) | Gesture+3D | MoCap, Speech | Isolated pointing, ground-truth 3D loc |
| MM-Conv | VR Dialog | Full-body, Lang | Dyadic, object-referential gestures |
2.3. Scene and Motion Encoding
- Motion: HumanML3D encodes each pose frame as a vector $x_t \in \mathbb{R}^{263}$: root angular/linear velocities, pelvis-relative joint positions and rotations, joint angular velocities, and foot-contact flags, at 20 fps (see the unpacking sketch after this list).
- Scene: Each object $o_i$ in the scene is represented by a centroid $c_i \in \mathbb{R}^{3}$, a bounding box $b_i$, and a vision-language model (VLM) embedding $v_i$. The target set $\mathcal{T}_u$ is derived per utterance $u$ (Deichler et al., 6 Jul 2025).
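The 263-dimensional pose vector can be unpacked into named components; the sketch below assumes the standard 22-joint HumanML3D layout (the specific offsets come from that convention, not from the papers above).

```python
import numpy as np

NUM_JOINTS = 22  # standard HumanML3D skeleton

def split_humanml3d_frame(x: np.ndarray) -> dict:
    """Unpack one 263-d HumanML3D pose frame into named parts.

    Offsets follow the common 22-joint convention (an assumption):
    1 + 2 + 1 + 63 + 126 + 66 + 4 = 263.
    """
    assert x.shape[-1] == 263
    j = NUM_JOINTS
    layout = [
        ("root_rot_velocity", 1),           # angular velocity about the up axis
        ("root_linear_velocity", 2),        # planar root velocity
        ("root_height", 1),
        ("joint_positions", (j - 1) * 3),   # pelvis-relative joint positions
        ("joint_rotations", (j - 1) * 6),   # 6D joint rotations
        ("joint_velocities", j * 3),        # local joint velocities
        ("foot_contacts", 4),               # binary heel/toe contact flags
    ]
    parts, idx = {}, 0
    for name, size in layout:
        parts[name] = x[..., idx:idx + size]
        idx += size
    return parts
```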
3. Model Architectures and Training Objectives
3.1. Language-to-Motion: MM-Conv Framework
The MM-Conv architecture fuses language, scene, and motion in a modular structure:
- Language Encoder: Processes token or speech embeddings via 1D convolutions and optional self-attention.
- Scene Encoder: Aggregates object features or bounding boxes into a scene embedding $s$ via a GNN or MLP.
- Multimodal Fusion: Alternating language- and scene-conditional 1D temporal convolutions, propagating the fused state
$$h^{(l+1)} = \mathrm{Conv1D}\!\left(W_{\ell}\, h^{(l)} + W_{s}\, s\right),$$
where $\mathrm{Conv1D}$ denotes 1D convolution across temporal frames and $W_{\ell}, W_{s}$ are projection matrices (a PyTorch-style sketch follows after this list).
- Motion Decoder: Upsamples or uses a diffusion-based denoising strategy (e.g., OmniControl backbone) to synthesize the final HumanML3D sequence (Deichler et al., 6 Jul 2025).
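A minimal PyTorch-style sketch of one fusion step, following the additive conditioning form written above; the module structure, activation, and parameter names are assumptions, not the MM-Conv implementation.

```python
import torch.nn as nn

class FusionBlock(nn.Module):
    """One fusion step: condition a temporal feature stream on language,
    then on scene features, via 1D convolutions over time (illustrative)."""

    def __init__(self, dim, lang_dim, scene_dim, kernel=3):
        super().__init__()
        self.proj_lang = nn.Linear(lang_dim, dim)    # W_l
        self.proj_scene = nn.Linear(scene_dim, dim)  # W_s
        self.conv_lang = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.conv_scene = nn.Conv1d(dim, dim, kernel, padding=kernel // 2)
        self.act = nn.GELU()

    def forward(self, h, lang, scene):
        # h: (B, T, dim), lang: (B, T, lang_dim), scene: (B, scene_dim)
        h = h + self.proj_lang(lang)                             # language conditioning
        h = self.act(self.conv_lang(h.transpose(1, 2)).transpose(1, 2))
        h = h + self.proj_scene(scene).unsqueeze(1)              # broadcast scene over time
        h = self.act(self.conv_scene(h.transpose(1, 2)).transpose(1, 2))
        return h
```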
3.2. Video-to-Text: Sequence-to-Sequence Baselines
In MeViS, four standard video captioning models are repurposed:
- GIT: Frozen CLIP/Swin encoder + Transformer decoder with cross-attention on per-frame features.
- VAST: Multi-modal Transformer combining vision, audio, and text modalities.
- NarrativeBridge: Temporal narrative alignment with attention-based event-phrase mapping.
- VideoLLaMA 2: LLM backbone, vision encoder prepended to LLM, instruction tuning, LoRA adapters.
All models take as input the video $V$ and a binary target-mask overlay $M$, but the baseline architectures do not introduce explicit motion-aware modules (Ding et al., 11 Dec 2025). One plausible way of supplying the mask to a frozen captioner is sketched below.
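The sketch blends the target mask into each frame as a colored highlight so an off-the-shelf captioner can attend to the referent; this alpha-blend preprocessing is an assumption, not necessarily the benchmark's procedure.

```python
import numpy as np

def overlay_target_mask(frames, masks, color=(255, 0, 0), alpha=0.4):
    """Alpha-blend the binary target mask onto each frame.

    frames : (T, H, W, 3) uint8 video
    masks  : (T, H, W) binary masks for the target object(s)
    """
    frames = frames.astype(np.float32)
    tint = np.array(color, dtype=np.float32)
    m = masks[..., None].astype(np.float32)                    # (T, H, W, 1)
    blended = (1.0 - alpha * m) * frames + alpha * m * tint    # highlight masked pixels
    return blended.clip(0, 255).astype(np.uint8)
```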
3.3. Training Losses
- Cross-entropy generation loss for text outputs (the maximum-likelihood objective of Section 1); none of the MeViS baseline variants introduce motion-specific loss terms.
- For gesture synthesis, a composite objective is used:
$$\mathcal{L} = \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{phys}}\,\mathcal{L}_{\mathrm{phys}} + \lambda_{\mathrm{ground}}\,\mathcal{L}_{\mathrm{ground}},$$
where the terms respectively encode pose reconstruction, adversarial realism, physical plausibility (collision, acceleration, foot constraints), and a grounding alignment that enforces pointing toward the referred object (via wrist–elbow and object direction vectors); a minimal sketch of this objective follows after this list.
- Physics-based supervision is optionally provided by a simulator to check and penalize wrist collisions or unreachable trajectories (Deichler et al., 6 Jul 2025).
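Below is a minimal sketch of the composite objective, with the grounding term implemented as cosine alignment between the elbow-to-wrist (forearm) direction and the wrist-to-target direction; the weights, the acceleration-based plausibility proxy, and the function signatures are assumptions.

```python
import torch
import torch.nn.functional as F

def grounding_loss(wrist, elbow, target):
    """Penalize gestures whose forearm direction does not point at the target.

    wrist, elbow : (B, T, 3) predicted joint positions
    target       : (B, 3) centroid of the referred object
    """
    forearm = F.normalize(wrist - elbow, dim=-1)                 # pointing direction
    to_target = F.normalize(target[:, None, :] - wrist, dim=-1)  # direction to the object
    return (1.0 - (forearm * to_target).sum(-1)).mean()          # 1 - cosine similarity

def composite_gesture_loss(pred, gt, d_fake, wrist, elbow, target,
                           w=(1.0, 0.1, 0.1, 0.5)):
    """Weighted sum of reconstruction, adversarial, plausibility, and grounding
    terms (weights and the smoothness proxy are illustrative)."""
    l_rec = F.mse_loss(pred, gt)                          # pose reconstruction
    l_adv = F.binary_cross_entropy_with_logits(           # generator-side adversarial term
        d_fake, torch.ones_like(d_fake))
    accel = pred[:, 2:] - 2 * pred[:, 1:-1] + pred[:, :-2]
    l_phys = accel.pow(2).mean()                          # acceleration smoothness proxy
    l_ground = grounding_loss(wrist, elbow, target)
    return w[0] * l_rec + w[1] * l_adv + w[2] * l_phys + w[3] * l_ground
```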
4. Evaluation Protocols and Results
4.1. Metrics
- Language Output (MeViS):
- METEOR: unigram overlap with stemming/paraphrase.
- CIDEr: TF–IDF weighted n-gram similarity consensus.
- BLEU: modified n-gram precision (used but not always reported).
- No motion-alignment or temporal grounding metrics are applied (Ding et al., 11 Dec 2025).
- Gesture Output:
- Fréchet Inception Distance (FID): distributional distance between real and generated motion features (a reference computation is sketched after this list).
- Spatial accuracy: Euclidean error of the generated pointing direction relative to the target.
- Control error: distance between generated joint positions and the provided spatial hints.
- Language coherence: BLEU/METEOR if utterance is jointly generated (Deichler et al., 6 Jul 2025).
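For reference, a standard Fréchet-distance computation over motion features is sketched here; this is a generic implementation, not the evaluation code from either paper.

```python
import numpy as np
from scipy.linalg import sqrtm

def motion_fid(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated
    motion features of shape (N, D)."""
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```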
4.2. Quantitative Findings
RMEG (Text) Baselines—MeViS Validation Set
| Method | METEOR | CIDEr |
|---|---|---|
| GIT | 12.33 | 18.20 |
| VAST | 10.66 | 20.42 |
| NarrativeBridge | 14.99 | 25.68 |
| VideoLLaMA 2 | 15.68 | 27.10 |
LLM-based solutions (NarrativeBridge, VideoLLaMA 2) outperform traditional encoder-decoder approaches by 2–5 points METEOR and 5–9 CIDEr. However, absolute scores remain low (METEOR < 16, CIDEr < 30), underscoring the challenge of generating unambiguous, motion-rich referring language in complex scenes (Ding et al., 11 Dec 2025).
Gesture Synthesis—OmniControl Fine-Tuning
- Fine-tuned models achieve 50–70% lower FID and control errors than the pretrained OmniControl baseline.
- ST-only training yields the best wrist control (lowest mean wrist-position error) and the lowest FID.
- REF+ST training improves full-body metrics (lower pelvis FID) but slightly degrades wrist accuracy.
- The full combination (ALL) gives the best generalization, with a minor trade-off in per-limb control.
Removing the grounding loss yields gestures that reach in roughly the right direction but fail to orient coherently toward the correct target; omitting the adversarial term produces jittery, less natural motions (Deichler et al., 6 Jul 2025).
4.3. Common Failure Modes
- Baseline RMEG models often produce non-motion or ambiguous utterances given multiple similar targets.
- In gesture synthesis, naive conditioning or lack of fine-tuning leads to stereotyped "punch" motions or poor referent orientation.
- No current MeViS baseline demonstrates reliable motion disambiguation or the ability to abstain from generating an expression in "no-target" scenarios (Ding et al., 11 Dec 2025, Deichler et al., 6 Jul 2025).
5. Challenges, Limitations, and Future Directions
5.1. Known Limitations
- Most datasets rely on synthetic environments or tightly controlled motion capture, hampering direct transfer to unconstrained real-world settings.
- Expression coverage is limited: iconic, beat, or metaphoric gestures are largely excluded in current gesture-centric datasets.
- MeViS and gesture generation methods assume access to clean object masks or perfect detections, with no accommodation for perception noise.
- No explicit modeling of velocity, acceleration, or semantic event boundaries in current text generation backbones.
- No specialized loss or metric is yet established to maximize "unambiguity" or to penalize ambiguous referring expressions (Ding et al., 11 Dec 2025, Deichler et al., 6 Jul 2025).
5.2. Proposed Research Directions
- Develop motion-aware backbones (e.g., through explicit object trajectory encoding, graph neural networks) to improve temporal and spatial alignment of outputs.
- Introduce contrastive or re-grounding losses that reward distinctions between similar candidates, enabling generation of expressions with disambiguating modifiers (an illustrative sketch follows after this list).
- Unify perception and generation in joint training frameworks so that expression veracity is validated by attempted re-grounding to video or scene.
- Extend gesture corpora to cover a richer taxonomy (iconic, beat, metaphoric gestures) and to learn adaptive spatial representations (e.g., implicit neural scenes, occupancy grids).
- Integrate differentiable physics engines for feedback on collision, reachability, and biomechanical realism during training.
- Advance continual and few-shot adaptation methods for rapid domain or scene transfer.
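To make the contrastive re-grounding idea concrete, the following is a hedged InfoNCE-style sketch that rewards expressions whose embedding matches the intended target over distractor objects; all function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def regrounding_contrastive_loss(expr_emb, target_emb, distractor_embs, tau=0.07):
    """InfoNCE-style loss: the generated expression's embedding should be
    closer to the referred object's embedding than to distractor objects.

    expr_emb        : (B, D) embedding of the generated referring expression
    target_emb      : (B, D) embedding of the referred object/trajectory
    distractor_embs : (B, K, D) embeddings of similar candidate objects
    """
    expr = F.normalize(expr_emb, dim=-1)
    pos = F.normalize(target_emb, dim=-1)
    neg = F.normalize(distractor_embs, dim=-1)
    pos_sim = (expr * pos).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", expr, neg)       # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau
    labels = torch.zeros(expr.size(0), dtype=torch.long, device=expr.device)
    return F.cross_entropy(logits, labels)                # the positive sits at index 0
```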
6. Context and Significance in Broader Research
RMEG sits at the intersection of grounded language generation, motion understanding, video captioning, and embodied communication. It exposes foundational limitations of sequence-to-sequence captioning architectures and existing gesture synthesis pipelines with respect to spatial and temporal disambiguation. The introduction of large, systematically annotated datasets (MeViS for video-to-language, HumanML3D-based corpora for gesture) and evaluation protocols has enabled the quantification of specific failures in motion reference. Empirical findings highlight that incremental architectural improvements (e.g., multimodal fusion, LLM adaptation) advance performance but do not solve the core issue of grounded, discriminative reference. This suggests that future progress in human-computer interaction, video understanding, and embodied conversational agents will require new learning objectives and representations that fundamentally encode reference, motion, and the interplay between language, gesture, and scene context (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).