RMEG: Referring Motion Expression Generation

Updated 18 December 2025
  • The paper introduces RMEG as a multimodal task that generates unambiguous text or gestures to describe dynamic object motion and spatial relations.
  • It employs video, audio, scene, and textual inputs with methods like temporal convolutions and diffusion-based gesture decoding to ensure explicit grounding.
  • Evaluation uses metrics such as METEOR, CIDEr, and FID, demonstrating improved performance through integrated multimodal fusion and physics-based supervision.

Referring Motion Expression Generation (RMEG) encompasses a class of multimodal generation tasks that synthesize temporally grounded, unambiguous linguistic or gestural outputs referring to specific entities and their motion or spatial relations, given input video, audio, scene, or textual data. The goal is to produce output, either as natural language (text or speech) or as physically plausible gesture sequences, that disambiguates a referent by encoding its characteristic movement or spatial configuration, as distinct from static descriptions. RMEG is foundational for building embodied agents capable of situated, communicative interaction, and for advancing fine-grained video understanding that incorporates dynamic reference (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).

1. Formal Definitions and Problem Setup

RMEG is instantiated in two primary regimes:

  • Language-centric RMEG: Given a video $V$ of $T$ frames and binary object masks $M_1, \ldots, M_n$ tracking $n$ target objects, generate a referring expression $y$ that highlights the (possibly collective) motion of those objects. Mathematically, the objective is to learn a generator $f_\theta$ that maximizes

$$y^* = \arg\max_y P_\theta(y \mid V, M)$$

Training is conducted via maximum likelihood over reference pairs $(V_i, M_i; y_i)$:

$$\mathcal{L}_{\text{gen}} = -\sum_{i=1}^{N} \sum_{t=1}^{L} \log P_\theta(y_i^t \mid y_i^{<t}, V_i, M_i)$$

(Ding et al., 11 Dec 2025)

  • Gesture-centric RMEG: For a given utterance $L$, 3D scene state $S$ (object positions, scene layout), and ground-truth HumanML3D motion sequence $G = (x_1, \ldots, x_T)$, learn a mapping

$$f : (L, S) \rightarrow \hat{G}$$

such that $\hat{G}$ is a physically plausible gesture sequence spatially referring to the intended object(s) (Deichler et al., 6 Jul 2025).

Key to both variants is the requirement for explicit grounding: the generated output must uniquely localize the target temporally and spatially by leveraging the motion context. This precludes "static" expressions and demands compositional language or bodily movement tightly coupled to observed or intended dynamics.
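
As a concrete reference, below is a minimal PyTorch sketch of the $\mathcal{L}_{\text{gen}}$ objective above; the logits are assumed to come from some generator $f_\theta$ conditioned on $(V, M)$, and the padding convention is illustrative rather than taken from the paper.

```python
import torch
import torch.nn.functional as F

def generation_loss(logits: torch.Tensor, targets: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Summed token-level negative log-likelihood of a referring expression.

    logits:  (batch, seq_len, vocab) scores from a generator conditioned on (V, M).
    targets: (batch, seq_len) gold token ids y_i^t; positions equal to pad_id are ignored.
    """
    # Each non-padding token contributes one -log P_theta(y_i^t | y_i^{<t}, V_i, M_i) term.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=pad_id,
        reduction="sum",
    )

# Toy usage: a batch of 2 expressions, 5 tokens each, over a 100-token vocabulary.
logits = torch.randn(2, 5, 100)
targets = torch.randint(1, 100, (2, 5))
print(generation_loss(logits, targets))
```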

2. Datasets and Data Representation

2.1. MeViS: Multimodal Video Corpus for RMEG

The MeViS dataset supports RMEG via a large corpus of 2,006 videos, each annotated with binary masks for 8,171 distinct objects and paired with 33,072 motion-centric referring expressions. These annotations, distributed equally across train, validation, and test splits, are constructed under constraints that enforce motion-centric referencing (e.g., "flying away," "turning around"), systematically discouraging the use of static descriptors. For each example, the dataset provides both text and time-aligned speech forms, as well as per-frame pixel-level masks (approximately 443,000 frames in total) (Ding et al., 11 Dec 2025).
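
For illustration, a hypothetical sketch of how one MeViS-style example might be represented once loaded; the field names, shapes, and file layout below are assumptions for exposition, not the dataset's published schema.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RMEGSample:
    """Hypothetical container for one MeViS-style training example."""
    frames: np.ndarray   # (T, H, W, 3) RGB video frames
    masks: np.ndarray    # (n_objects, T, H, W) binary per-frame masks of the target objects
    expression: str      # motion-centric referring expression (text form)
    speech_path: str     # path to the time-aligned spoken form of the expression

# Illustrative instance with dummy tensors.
sample = RMEGSample(
    frames=np.zeros((16, 224, 224, 3), dtype=np.uint8),
    masks=np.zeros((1, 16, 224, 224), dtype=bool),
    expression="the bird flying away from the fence",
    speech_path="audio/000001.wav",  # hypothetical path
)
```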

2.2. Grounded Gesture Generation Datasets

Two datasets underpin gesture-based RMEG:

  • Synthetic Pointing Motions (ST): Contains 1,135 motion-capture clips of isolated pointing gestures, each with ground-truth 3D target locations, and TTS-generated speech aligned with motion using the Hungarian algorithm. Clip duration averages 4.85 seconds.
  • MM-Conv VR Dialog: Comprises 6.14 hours of dyadic VR dialogues in AI2-Thor environments, including 2,394 referential (gesture + object reference) and 2,721 non-referential interactions with full annotation of utterance, object IDs, reference types, and timestamps.

Both are standardized in the HumanML3D format, encoding each pose frame as a 263-dimensional real vector: root velocities, joint positions, rotations, angular velocities, and binary foot-contact flags (Deichler et al., 6 Jul 2025).

| Dataset | Type | Annotation | Unique Feature |
| --- | --- | --- | --- |
| MeViS | Video + Text | Pixel masks, audio | Strict motion-based language; multiple objects |
| ST (Synthetic) | Gesture + 3D | MoCap, speech | Isolated pointing; ground-truth 3D locations |
| MM-Conv | VR Dialog | Full-body motion, language | Dyadic, object-referential gestures |

2.3. Scene and Motion Encoding

  • Motion: HumanML3D encodes each pose frame as $x = [\omega_n, v_n, y_n, p, R, \dot{R}, c]$: root angular/linear velocities, pelvis-relative joint positions and rotations, joint angular velocities, and foot-contact flags, at 20 fps.
  • Scene: Each object $s_i$ in the scene $S$ is represented by a centroid $c_i \in \mathbb{R}^3$, a bounding box $B_i \in \mathbb{R}^6$, and a vision-LLM (VLM) embedding $v_i$. The target set $S_{\text{ref}} \subseteq S$ is derived per utterance (Deichler et al., 6 Jul 2025).
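
The sketch below mirrors these representations in code, assuming the standard HumanML3D layout for a 22-joint skeleton; the index boundaries are an assumption to be checked against the data loader actually in use.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SceneObject:
    """One object s_i of the scene state S (field names are illustrative)."""
    centroid: np.ndarray       # c_i in R^3
    bbox: np.ndarray           # B_i in R^6, e.g. (min_xyz, max_xyz)
    vlm_embedding: np.ndarray  # v_i, vision-language feature vector

def split_humanml3d_frame(x: np.ndarray) -> dict:
    """Split a 263-d HumanML3D pose frame into its named groups (22-joint skeleton assumed)."""
    assert x.shape == (263,)
    return {
        "root": x[:4],                   # root angular velocity, planar linear velocity, height
        "joint_positions": x[4:67],      # 21 * 3 pelvis-relative joint positions
        "joint_rotations": x[67:193],    # 21 * 6 continuous joint rotations
        "joint_velocities": x[193:259],  # 22 * 3 local joint velocities
        "foot_contacts": x[259:263],     # 4 binary foot-contact flags
    }

parts = split_humanml3d_frame(np.zeros(263))
print({name: v.shape for name, v in parts.items()})
```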

3. Model Architectures and Training Objectives

3.1. Language-to-Motion: MM-Conv Framework

The MM-Conv architecture fuses language, scene, and motion in a modular structure:

  1. Language Encoder: Processes token or speech embeddings $e_n \in \mathbb{R}^d$ via 1D convolutions and optional self-attention.
  2. Scene Encoder: Aggregates object features or bounding boxes into $h_S$ via a GNN or MLP.
  3. Multimodal Fusion: Alternating language- and scene-conditional 1D temporal convolutions, propagating state

$$H^{(l+1)} = \sigma(W * H^{(l)} + U_L H_L + U_S h_S + b)$$

where $*$ denotes 1D convolution across temporal frames and $U_L, U_S$ are projection matrices (a minimal sketch of this step follows the list).

  4. Motion Decoder: Upsamples or uses a diffusion-based denoising strategy (e.g., an OmniControl backbone) to synthesize the final HumanML3D sequence $\hat{G}$ (Deichler et al., 6 Jul 2025).
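
A minimal PyTorch sketch of the fusion step in item 3; the layer sizes and the GELU nonlinearity are assumptions, since $\sigma$ is not specified above.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """One step of H^(l+1) = sigma(W * H^(l) + U_L H_L + U_S h_S + b)."""

    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        # W * H^(l): 1D convolution over the temporal axis (the conv bias plays the role of b).
        self.temporal_conv = nn.Conv1d(d_model, d_model, kernel_size, padding=kernel_size // 2)
        self.proj_lang = nn.Linear(d_model, d_model, bias=False)   # U_L
        self.proj_scene = nn.Linear(d_model, d_model, bias=False)  # U_S
        self.act = nn.GELU()  # sigma (assumed nonlinearity)

    def forward(self, h: torch.Tensor, h_lang: torch.Tensor, h_scene: torch.Tensor) -> torch.Tensor:
        # h:       (batch, T, d) running multimodal state H^(l)
        # h_lang:  (batch, T, d) language features H_L aligned to the T frames
        # h_scene: (batch, d)    pooled scene embedding h_S
        conv = self.temporal_conv(h.transpose(1, 2)).transpose(1, 2)
        scene = self.proj_scene(h_scene).unsqueeze(1)  # broadcast over time
        return self.act(conv + self.proj_lang(h_lang) + scene)

block = FusionBlock(d_model=64)
out = block(torch.randn(2, 80, 64), torch.randn(2, 80, 64), torch.randn(2, 64))
print(out.shape)  # torch.Size([2, 80, 64])
```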

3.2. Video-to-Text: Sequence-to-Sequence Baselines

In MeViS, four standard video captioning models are repurposed:

  • GIT: Frozen CLIP/Swin encoder + Transformer decoder with cross-attention on per-frame features.
  • VAST: Multi-modal Transformer combining vision, audio, and text modalities.
  • NarrativeBridge: Temporal narrative alignment with attention-based event-phrase mapping.
  • VideoLLaMA 2: LLM backbone, vision encoder prepended to LLM, instruction tuning, LoRA adapters.

All take as input the video $V$ and a binary target mask overlay $M$, but the baseline architectures do not introduce explicit motion-aware modules (Ding et al., 11 Dec 2025).
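
One plausible way to expose the target mask to such off-the-shelf captioners is to highlight the masked pixels in every frame before encoding; the alpha-blend scheme below is an assumption, as the baselines do not prescribe a particular overlay.

```python
import numpy as np

def overlay_mask(frames: np.ndarray, mask: np.ndarray, alpha: float = 0.5,
                 color: tuple = (255, 0, 0)) -> np.ndarray:
    """Blend a highlight color into the masked pixels of every frame.

    frames: (T, H, W, 3) uint8 video
    mask:   (T, H, W) boolean per-frame target mask
    Returns a uint8 video with the referent visually highlighted, suitable as
    input to a frozen captioning model in place of the raw frames.
    """
    frames_f = frames.astype(np.float32)
    tint = np.asarray(color, dtype=np.float32)
    blended = np.where(mask[..., None], (1.0 - alpha) * frames_f + alpha * tint, frames_f)
    return blended.astype(np.uint8)
```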

3.3. Training Losses

  • Cross-entropy generation loss for text outputs (the $\mathcal{L}_{\text{gen}}$ objective in Section 1). No architectural variants in the MeViS baselines introduce motion-specific loss terms.
  • For gesture synthesis, a composite objective is used:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{rec}}\mathcal{L}_{\text{rec}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{phys}}\mathcal{L}_{\text{phys}} + \lambda_{\text{ground}}\mathcal{L}_{\text{ground}}$$

where the losses respectively encode pose reconstruction, adversarial realism, physical plausibility (collision, acceleration, foot constraints), and a grounding alignment term that enforces pointing toward the referred object via wrist–elbow and object-direction vectors (a sketch of this term follows the list).

  • Physics-based supervision is optionally provided by a simulator to check and penalize wrist collisions or unreachable trajectories (Deichler et al., 6 Jul 2025).
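
The following is a sketch of the grounding alignment idea, penalizing misalignment between the forearm's pointing direction (wrist minus elbow) and the wrist-to-object direction via a cosine distance; the exact functional form of $\mathcal{L}_{\text{ground}}$ in the paper may differ.

```python
import torch
import torch.nn.functional as F

def grounding_loss(wrist: torch.Tensor, elbow: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Cosine-distance alignment between pointing direction and target direction.

    wrist, elbow: (batch, T, 3) joint positions from the generated gesture
    target:       (batch, 3)    centroid of the referred object
    """
    pointing = F.normalize(wrist - elbow, dim=-1)                 # forearm direction per frame
    to_target = F.normalize(target.unsqueeze(1) - wrist, dim=-1)  # wrist-to-object direction
    cos = (pointing * to_target).sum(dim=-1)                      # per-frame cosine similarity
    return (1.0 - cos).mean()                                     # zero when perfectly aligned
```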

4. Evaluation Protocols and Results

4.1. Metrics

  • Language Output (MeViS):
    • METEOR: unigram overlap with stemming/paraphrase.
    • CIDEr: TF–IDF weighted n-gram similarity consensus.
    • BLEU: computed in some evaluations but not consistently reported.
    • No motion-alignment or temporal grounding metrics are applied (Ding et al., 11 Dec 2025).
  • Gesture Output:
    • Fréchet Inception Distance (FID): distributional distance over motion features (a sketch of this computation follows the list).
    • Spatial accuracy: Euclidean error in pointing direction to target.
    • Control $\mathrm{L}_2$: distance in joint positions against spatial hints.
    • Language coherence: BLEU/METEOR if utterance is jointly generated (Deichler et al., 6 Jul 2025).
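
For reference, a minimal sketch of the Fréchet distance computation that underlies the FID metric, applied to motion feature statistics; the motion feature extractor producing these vectors is not shown and its choice is left to the evaluation pipeline.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to real and generated motion features.

    feats_*: (num_clips, feat_dim) feature vectors from a motion feature extractor.
    """
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    covmean = covmean.real  # drop tiny imaginary parts caused by numerical error
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(cov_r + cov_g - 2.0 * covmean))
```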

4.2. Quantitative Findings

RMEG (Text) Baselines—MeViS Validation Set

| Method | METEOR | CIDEr |
| --- | --- | --- |
| GIT | 12.33 | 18.20 |
| VAST | 10.66 | 20.42 |
| NarrativeBridge | 14.99 | 25.68 |
| VideoLLaMA 2 | 15.68 | 27.10 |

LLM-based solutions (NarrativeBridge, VideoLLaMA 2) outperform traditional encoder-decoder approaches by 2–5 METEOR points and 5–9 CIDEr points. However, absolute scores remain low (METEOR < 16, CIDEr < 30), underscoring the challenge of generating unambiguous, motion-rich referring language in complex scenes (Ding et al., 11 Dec 2025).

Gesture Synthesis—OmniControl Fine-Tuning

  • Fine-tuned models achieve 50–70% lower FID and $\mathrm{L}_2$ errors compared to pretrained OmniControl.
  • ST-only training yields the best wrist control (mean $\mathrm{L}_2 \approx 0.058$ m) and the lowest FID ($\approx 0.65$).
  • REF+ST improves full-body metrics (pelvis FID $\rightarrow 1.03$) but slightly degrades wrist accuracy.
  • Full combination (ALL) gives best generalization with minor trade-off in per-limb control.

Removing the grounding loss results in gestures that approximately reach but do not coherently orient to the correct target; omitting the adversarial term produces jittery, less natural motions (Deichler et al., 6 Jul 2025).

4.3. Common Failure Modes

  • Baseline RMEG models often produce non-motion or ambiguous utterances given multiple similar targets.
  • In gesture synthesis, naive conditioning or lack of fine-tuning leads to stereotyped "punch" motions or poor referent orientation.
  • No current baseline in MeViS demonstrates reliable motion disambiguation or abstains from generation in "no-target" scenarios (Ding et al., 11 Dec 2025, Deichler et al., 6 Jul 2025).

5. Challenges, Limitations, and Future Directions

5.1. Known Limitations

  • Most datasets rely on synthetic environments or tightly controlled motion capture, hampering direct transfer to unconstrained real-world settings.
  • Expression coverage is limited: iconic, beat, or metaphoric gestures are largely excluded in current gesture-centric datasets.
  • MeViS and gesture generation methods assume access to clean object masks or perfect detections, with no accommodation for perception noise.
  • No explicit modeling of velocity, acceleration, or semantic event boundaries in current text generation backbones.
  • No specialized loss or metric is yet established to maximize "unambiguity" or to penalize ambiguous referring expressions (Ding et al., 11 Dec 2025, Deichler et al., 6 Jul 2025).

5.2. Proposed Research Directions

  • Develop motion-aware backbones (e.g., through explicit object trajectory encoding, graph neural networks) to improve temporal and spatial alignment of outputs.
  • Introduce contrastive or re-grounding losses that reward distinctions between similar candidates, enabling generation of expressions with disambiguating modifiers (see the sketch after this list).
  • Unify perception and generation in joint training frameworks so that expression veracity is validated by attempted re-grounding to video or scene.
  • Extend gesture corpora to cover a richer taxonomy (iconic, beat, metaphoric gestures) and to learn adaptive spatial representations (e.g., implicit neural scenes, occupancy grids).
  • Integrate differentiable physics engines for feedback on collision, reachability, and biomechanical realism during training.
  • Advance continual and few-shot adaptation methods for rapid domain or scene transfer.
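
As one illustration of the contrastive re-grounding direction above, the sketch below scores a generated expression against its target object track and distractor tracks with an InfoNCE-style objective; the embedding functions, the temperature, and the overall formulation are assumptions, since no such loss exists in current RMEG baselines.

```python
import torch
import torch.nn.functional as F

def contrastive_regrounding_loss(expr_emb: torch.Tensor,
                                 target_emb: torch.Tensor,
                                 distractor_embs: torch.Tensor,
                                 temperature: float = 0.07) -> torch.Tensor:
    """The generated expression should match its target track more than any distractor.

    expr_emb:        (batch, d)    embedding of the generated referring expression
    target_emb:      (batch, d)    embedding of the referred object track
    distractor_embs: (batch, k, d) embeddings of k similar candidate tracks
    """
    expr = F.normalize(expr_emb, dim=-1)
    pos = (expr * F.normalize(target_emb, dim=-1)).sum(-1, keepdim=True)          # (batch, 1)
    neg = torch.einsum("bd,bkd->bk", expr, F.normalize(distractor_embs, dim=-1))  # (batch, k)
    logits = torch.cat([pos, neg], dim=1) / temperature
    labels = torch.zeros(expr.size(0), dtype=torch.long, device=expr.device)      # positive at index 0
    return F.cross_entropy(logits, labels)
```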

6. Context and Significance in Broader Research

RMEG sits at the intersection of grounded language generation, motion understanding, video captioning, and embodied communication. It exposes foundational limitations of sequence-to-sequence captioning architectures and existing gesture synthesis pipelines with respect to spatial and temporal disambiguation. The introduction of large, systematically annotated datasets (MeViS for video-to-language, HumanML3D-based corpora for gesture) and evaluation protocols has enabled the quantification of specific failures in motion reference. Empirical findings highlight that incremental architectural improvements (e.g., multimodal fusion, LLM adaptation) advance performance but do not solve the core issue of grounded, discriminative reference. This suggests that future progress in human-computer interaction, video understanding, and embodied conversational agents will require new learning objectives and representations that fundamentally encode reference, motion, and the interplay between language, gesture, and scene context (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).
