
MeViS Dataset: Video RMEG Resource

Updated 18 December 2025
  • MeViS Dataset is a large-scale corpus for video RMEG, featuring 2,006 videos paired with 33,072 motion-specific textual expressions and segmentation masks.
  • It provides precise frame-level annotations that combine motion trajectories with object references to support grounded language and motion synthesis.
  • The dataset enables rigorous evaluation using metrics like METEOR and CIDEr, driving advances in multimodal model architectures and disambiguation techniques.

Referring Motion Expression Generation (RMEG) encompasses a family of multimodal generation tasks where the system produces either motion trajectories (such as gestures) or time-extended linguistic descriptions, each unambiguously anchored to specific referents within a spatiotemporal context. RMEG thus situates itself at the intersection of grounded language understanding, embodied AI, and video segmentation, requiring the joint modeling of motion, scene geometry, and language to generate semantically and pragmatically precise outputs. There are two primary lines of work: motion synthesis conditioned on referential language and scene, and text generation that describes object motion referentially based on video/track inputs. Recent benchmark datasets and modeling architectures have been developed to address the unique multimodal and disambiguation challenges in this domain (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).

1. Formal Definition and Task Setting

In RMEG, the system’s goal is to generate either a human gesture sequence or a natural-language utterance that refers uniquely to a target object (or set of objects) in a complex spatial or video context, with particular emphasis on the object’s motion dynamics.

  • Gesture-conditional RMEG: Given a referential utterance $L = (w_1, \dots, w_N)$, a scene $S$ (object positions and features), and context, output a motion sequence $G = (x_1, \dots, x_T) \in \mathbb{R}^{T \times d}$ (e.g., pose vectors) such that the motion identifies the referent(s).
  • Video-captioning RMEG: Given a video $V = (v_1, \dots, v_T)$ and a mask sequence $M_1, \dots, M_n$ denoting tracked object(s), generate a sequence $y^* = (y^*_1, \dots, y^*_L)$ that not only refers unambiguously to the object(s) but also encodes motion-specific attributes over the $T$ frames.

Formally, for language generation:

$$y^* = \arg\max_{y} P_\theta(y \mid V, M)$$

with maximum-likelihood training:

$$\mathcal{L}_{\mathrm{gen}} = -\sum_{i=1}^{n} \sum_{t=1}^{L} \log P_\theta\!\left(y_i^{t} \mid y_i^{<t}, V_i, M_i\right)$$

(Ding et al., 11 Dec 2025).
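The objective above is a standard teacher-forced cross-entropy over the referring expression, conditioned on video and mask features. A minimal sketch, assuming a hypothetical captioner that exposes `encode(video, masks)` and `decode(memory, input_ids)` (not the specific architecture of any baseline):

```python
import torch
import torch.nn.functional as F

def referring_caption_loss(model, video, masks, target_ids, pad_id=0):
    """Per-sample L_gen: teacher-forced cross-entropy on the expression tokens.

    video:      (T, C, H, W) frames; masks: (T, H, W) binary masks of the referent
    target_ids: (L,) token ids of the ground-truth referring expression
    `model.encode` / `model.decode` are hypothetical interfaces used for illustration.
    """
    memory = model.encode(video, masks)          # fused video + mask representation
    inputs = target_ids[:-1].unsqueeze(0)        # y^{<t} (teacher forcing)
    labels = target_ids[1:].unsqueeze(0)         # y^{t} to predict
    logits = model.decode(memory, inputs)        # (1, L-1, vocab_size)
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        labels.reshape(-1),
        ignore_index=pad_id,                     # padded positions do not contribute
    )
```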

For gesture generation:

$$\hat{G} = f(L, S; \theta)$$

with total composite loss:

$$\mathcal{L}_{\mathrm{total}} = \lambda_{\mathrm{rec}}\,\mathcal{L}_{\mathrm{rec}} + \lambda_{\mathrm{adv}}\,\mathcal{L}_{\mathrm{adv}} + \lambda_{\mathrm{phys}}\,\mathcal{L}_{\mathrm{phys}} + \lambda_{\mathrm{ground}}\,\mathcal{L}_{\mathrm{ground}}$$

(Deichler et al., 6 Jul 2025).
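The composite objective is a plain weighted sum of the four terms described in Section 3; a sketch with illustrative weights (the source does not fix the lambda values here):

```python
def total_gesture_loss(l_rec, l_adv, l_phys, l_ground,
                       lam_rec=1.0, lam_adv=0.1, lam_phys=0.5, lam_ground=1.0):
    """Weighted sum of the four gesture-RMEG training terms.

    The individual losses are computed elsewhere; the lambda defaults are
    illustrative assumptions, not values reported in the source.
    """
    return (lam_rec * l_rec + lam_adv * l_adv
            + lam_phys * l_phys + lam_ground * l_ground)
```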

2. Datasets and Multimodal Data Encoding

RMEG for Gesture Generation

The work of (Deichler et al., 6 Jul 2025) introduces two principal datasets:

  • Synthetic Spatially Grounded Pointing Gestures (ST): 1,135 motion clips (mean duration 4.85 s) of isolated pointing, paired with synthesized utterances and precise 3D object targets. Speech-to-motion alignment is performed via the Hungarian algorithm (see the alignment sketch after this list).
  • MM-Conv VR Dialog Corpus (REF + NON-REF): 6.14 hours of dyadic VR conversations in the AI2-Thor simulator. Each clip encodes either referential (speech+gesture+object reference) or non-referential actions, annotated with object IDs, timestamps, and reference types.
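The Hungarian-algorithm alignment mentioned above can be sketched with SciPy's `linear_sum_assignment`; using an onset-time cost matrix is an assumption about how utterance segments and gesture strokes are matched:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def align_speech_to_motion(speech_onsets, gesture_onsets):
    """Match utterance segments to gesture strokes by onset-time proximity.

    speech_onsets, gesture_onsets: 1-D arrays of start times in seconds.
    Returns (speech_idx, gesture_idx) pairs minimising the total time offset.
    """
    cost = np.abs(speech_onsets[:, None] - gesture_onsets[None, :])
    rows, cols = linear_sum_assignment(cost)  # Hungarian / Kuhn-Munkres assignment
    return list(zip(rows.tolist(), cols.tolist()))
```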

Motion is encoded per frame as a 263-dimensional HumanML3D vector comprising root velocities, joint positions and rotations, angular velocities, and binary foot-contact flags. Scenes are encoded as object centroids, bounding boxes, and learned vision-language model (VLM) embeddings.
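For orientation, the 263-dimensional per-frame vector follows the standard HumanML3D convention for a 22-joint skeleton; the split below restates that convention and is offered as an assumption about the exact encoding used here:

```python
import numpy as np

def split_humanml3d_frame(vec):
    """Unpack one 263-d HumanML3D frame into named components (22-joint layout).

    Component sizes follow the standard HumanML3D convention:
    1 root angular velocity, 2 root linear velocity, 1 root height,
    63 local joint positions (21 x 3), 126 joint rotations in 6-D form (21 x 6),
    66 joint velocities (22 x 3), 4 binary foot-contact flags.
    """
    assert vec.shape[-1] == 263
    sizes = [("root_ang_vel", 1), ("root_lin_vel", 2), ("root_height", 1),
             ("joint_pos", 63), ("joint_rot6d", 126),
             ("joint_vel", 66), ("foot_contact", 4)]
    parts, i = {}, 0
    for name, n in sizes:
        parts[name] = vec[..., i:i + n]
        i += n
    return parts
```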

RMEG for Video-to-Text

The MeViSv2 dataset (Ding et al., 11 Dec 2025) provides a large-scale corpus for video-based RMEG:

  • 2,006 videos (1,712 train, 140 val, 154 test) with 33,072 annotated motion expressions in both text and speech formats.
  • Each expression is paired with frame-level binary masks marking the referred object(s), yielding ≈443,000 segmentation masks overall.
  • The annotation protocol enforces the use of motion-specific cues and precludes reliance on static object attributes.

Both corpora strictly align referent, language, and dynamic context, providing a high-quality resource for controlled multimodal modeling.
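A loader for such aligned (video, expression, mask) triples might look as follows; the directory layout and file names are hypothetical, since the release format is not described in this overview:

```python
import json
from pathlib import Path

import numpy as np
from PIL import Image

class ReferringMotionClips:
    """Minimal loader for (frames, per-frame masks, expression) triples.

    Assumes a hypothetical layout:
      root/expressions.json                  {video_id: {expr_id: "text", ...}, ...}
      root/videos/<video_id>/*.jpg           RGB frames
      root/masks/<video_id>/<expr_id>/*.png  binary masks for that expression
    """

    def __init__(self, root):
        self.root = Path(root)
        self.expressions = json.loads((self.root / "expressions.json").read_text())
        self.items = [(vid, eid) for vid, exprs in self.expressions.items()
                      for eid in exprs]

    def __len__(self):
        return len(self.items)

    def __getitem__(self, idx):
        vid, eid = self.items[idx]
        frames = np.stack([np.asarray(Image.open(p))
                           for p in sorted((self.root / "videos" / vid).glob("*.jpg"))])
        masks = np.stack([np.asarray(Image.open(p)) > 0
                          for p in sorted((self.root / "masks" / vid / eid).glob("*.png"))])
        return frames, masks, self.expressions[vid][eid]
```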

3. Model Architectures and Training Objectives

Gesture Generation: MM-Conv Framework

The MM-Conv framework (Deichler et al., 6 Jul 2025) implements a multimodal mapping $f: (L, S) \rightarrow \hat{G}$ as follows:

  • Language Encoder: Stacked 1D convolutions (optionally self-attentive) transform input token embeddings, yielding $H_L \in \mathbb{R}^{N \times H}$.
  • Scene Encoder: Graph NN or MLP aggregates object features/bboxes into a global scene vector $h_S \in \mathbb{R}^{H}$.
  • Multimodal Convolutional Fusion: Interleaves temporal convolutions, scene context, and language context (see the sketch after this list) via:

$$H^{(l+1)} = \sigma\!\left(W * H^{(l)} + U_L H_L + U_S h_S + b\right)$$

  • Motion Decoder: Applies upsampling or diffusion-based denoising (e.g., OmniControl backbone) to produce $\hat{G}$, progressively refining noise to a realistic HumanML3D pose sequence.
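The fusion update in the list above can be sketched as a single PyTorch layer; pooling the token-level language features into one context vector is an assumption made here to keep the broadcast simple, not a detail taken from the source:

```python
import torch
import torch.nn as nn

class ConvFusionLayer(nn.Module):
    """One fusion step H^(l+1) = sigma(W * H^(l) + U_L H_L + U_S h_S + b).

    A sketch of the update above, not the released implementation.
    h: (B, T, dim) motion features, h_lang: (B, N, dim) language features
    (mean-pooled here), h_scene: (B, dim) global scene vector.
    """
    def __init__(self, dim, kernel_size=3):
        super().__init__()
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)
        self.u_lang = nn.Linear(dim, dim, bias=False)
        self.u_scene = nn.Linear(dim, dim, bias=False)
        self.bias = nn.Parameter(torch.zeros(dim))

    def forward(self, h, h_lang, h_scene):
        conv_out = self.conv(h.transpose(1, 2)).transpose(1, 2)   # temporal conv, W * H
        lang_ctx = self.u_lang(h_lang.mean(dim=1, keepdim=True))  # pooled language context
        scene_ctx = self.u_scene(h_scene).unsqueeze(1)            # broadcast scene vector
        return torch.relu(conv_out + lang_ctx + scene_ctx + self.bias)
```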

The loss combines:

  • Pose reconstruction ($\mathcal{L}_{\mathrm{rec}}$),
  • Adversarial sequence realism ($\mathcal{L}_{\mathrm{adv}}$),
  • Physical plausibility ($\mathcal{L}_{\mathrm{phys}}$; includes penalties for collision, acceleration, and foot sliding),
  • Grounding alignment ($\mathcal{L}_{\mathrm{ground}}$), minimizing misalignment between the wrist-elbow pointing vector and the direction to the object centroid.
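One plausible instantiation of the grounding term penalises the cosine misalignment between the generated elbow-to-wrist pointing direction and the wrist-to-centroid direction; a sketch under that assumption (the exact formulation in the source may differ):

```python
import torch
import torch.nn.functional as F

def grounding_loss(elbow, wrist, target_centroid, eps=1e-8):
    """Penalise misalignment of the pointing vector with the target direction.

    elbow, wrist:    (B, T, 3) generated joint positions
    target_centroid: (B, 3) referent object centre
    Returns 1 - cosine similarity, averaged over frames (a common choice,
    stated here as an assumption).
    """
    pointing = F.normalize(wrist - elbow, dim=-1, eps=eps)
    to_target = F.normalize(target_centroid.unsqueeze(1) - wrist, dim=-1, eps=eps)
    cos = (pointing * to_target).sum(dim=-1)   # per-frame directional alignment
    return (1.0 - cos).mean()
```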

Video-to-Text Generation

In (Ding et al., 11 Dec 2025), four baselines are adapted:

  • GIT: Frozen ViT/Swin video encoder, autoregressive Transformer with cross-attention.
  • VAST: Unified Transformer fusing vision, audio, and text.
  • NarrativeBridge: Causal-temporal narrative module with event-phrase alignment.
  • VideoLLaMA2: Instruction-tuned video LLM with a frozen vision encoder and LoRA adapters.

All baselines integrate mask overlays to focus attention on the tracked object(s). Training employs cross-entropy sequence generation, with no explicit motion loss or temporal-grounding term.
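Mask overlaying can be as simple as alpha-blending the referred region into each frame before it enters the (frozen) visual encoder; a sketch, not the exact conditioning used by any particular baseline:

```python
import numpy as np

def overlay_mask(frame, mask, color=(255, 0, 0), alpha=0.4):
    """Alpha-blend a binary object mask onto an RGB frame.

    frame: (H, W, 3) uint8, mask: (H, W) bool.
    Returns a uint8 frame with the referred object tinted, which can then be
    fed to the video encoder in place of the raw frame.
    """
    out = frame.astype(np.float32)
    tint = np.array(color, dtype=np.float32)
    out[mask] = (1 - alpha) * out[mask] + alpha * tint
    return out.clip(0, 255).astype(np.uint8)
```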

4. Evaluation Metrics and Results

Gesture RMEG

Evaluations in (Deichler et al., 6 Jul 2025) include:

  • Spatial Accuracy: Mean Euclidean distance between generated pointing direction and target centroid.
  • Fréchet Inception Distance (FID): Assesses motion naturalness using pretrained motion Inception embeddings.
  • Control Accuracy: $\mathrm{L}_2$ error between the conditioning target (e.g., the wrist) and the generated trajectory.
  • Language Coherence: BLEU or METEOR scores when the utterance is also synthesized.

Fine-tuning on the ST set achieves a wrist $\mathrm{L}_2$ error of approximately $0.058\,\mathrm{m}$ and FID $\approx 0.65$, a 50–70% improvement over the base models. Ablations reveal that removing the grounding loss degrades directional fidelity, while omitting the adversarial loss increases jitter and foot sliding.
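The spatial- and control-accuracy metrics reduce to distance computations over the generated trajectories; a sketch assuming per-frame wrist and elbow positions in metres, with spatial accuracy read as the distance from the object centroid to the pointing ray (one reasonable interpretation of the metric above):

```python
import numpy as np

def control_accuracy(pred_wrist, target_wrist):
    """Mean L2 error (metres) between generated and conditioned wrist trajectories."""
    return float(np.linalg.norm(pred_wrist - target_wrist, axis=-1).mean())

def spatial_accuracy(elbow, wrist, centroid):
    """Mean distance from the object centroid to the ray cast along the
    elbow-to-wrist pointing direction; an assumed reading of the metric."""
    d = wrist - elbow
    d = d / np.linalg.norm(d, axis=-1, keepdims=True)      # unit pointing direction
    to_obj = centroid[None, :] - wrist                      # wrist-to-object vectors
    proj = (to_obj * d).sum(-1, keepdims=True) * d          # component along the ray
    return float(np.linalg.norm(to_obj - proj, axis=-1).mean())
```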

Video RMEG

Standard caption metrics are used in (Ding et al., 11 Dec 2025):

  • METEOR: Unigram overlap with stemming/paraphrasing.
  • CIDEr: TF-IDF weighted $n$-gram consensus.
  • BLEU: (Not reported numerically.)

On the MeViS val split:

Method            METEOR   CIDEr
GIT                12.33   18.20
VAST               10.66   20.42
NarrativeBridge    14.99   25.68
VideoLLaMA2        15.68   27.10

LLM-fusion methods (NarrativeBridge, VideoLLaMA2) lead all baselines, but absolute scores remain low, indicating difficulty in producing precise, unambiguous, and motion-aware referring utterances (Ding et al., 11 Dec 2025).
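These scores can be reproduced with the standard COCO caption evaluation tools; a sketch assuming the `pycocoevalcap` package (METEOR additionally requires a Java runtime):

```python
from pycocoevalcap.cider.cider import Cider
from pycocoevalcap.meteor.meteor import Meteor

def score_referring_captions(references, hypotheses):
    """Corpus-level METEOR and CIDEr for generated referring expressions.

    references: {clip_id: [ref_text, ...]}; hypotheses: {clip_id: [generated_text]}
    Both dicts must share the same keys. Raw METEOR lies in [0, 1]; tables such
    as the one above typically report the scores scaled by 100.
    """
    meteor, _ = Meteor().compute_score(references, hypotheses)
    cider, _ = Cider().compute_score(references, hypotheses)
    return {"METEOR": meteor, "CIDEr": cider}
```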

5. Common Failure Modes and Model Limitations

Reported issues include:

  • Ambiguity: Systems often default to static or under-specified expressions (e.g., "the elephant in the back") even with enforced motion-centric annotation protocols.
  • Lack of Explicit Motion Encoding: Vanilla cross-modal attention is insufficient for temporal or velocity disambiguation.
  • Discriminability: Models struggle when multiple objects exhibit similar appearance or trajectories.
  • No-Target Handling: The RMEG setup remains ill-posed in cases with no unambiguous referent, lacking abstain or “no-referring” outputs.
  • Physics and Robustness: Gesture models trained in synthetic/MoCap or VR scenes do not generalize to uncontrolled environments; real deployment would require robust 2D/3D object detection and adaptation to perception noise (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).

6. Prospective Directions and Open Challenges

Recommendations for future research include:

  • Explicit Motion-Structure Modeling: Incorporate velocity, acceleration, and event-boundary signals—possibly via motion graphs or dynamic neural representations.
  • Contrastive Language Losses: To reward generation of disambiguating (e.g., sequential or directional) language reflecting referent uniqueness.
  • Joint Perception and Generation: Models should “close the loop,” verifying output by re-grounding generated expressions back onto inputs.
  • Temporal Alignment Metrics: Go beyond $n$-gram overlap to metrics such as average temporal IoU between described event intervals and ground-truth motion intervals (see the sketch after this list).
  • Multimodal End-to-End Training and Differentiable Physics: For gesture, future work suggests learning directly from raw RGB-D and audio, leveraging differentiable simulators for stability and plausibility constraints.
  • Richer Gesture and Expression Taxonomies: Extend beyond pointing and basic deictic forms to iconic, metaphoric, and beat gestures, covering broader communicative intent (Deichler et al., 6 Jul 2025, Ding et al., 11 Dec 2025).
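The temporal-alignment metric referenced above compares a described event interval with a ground-truth motion interval; a minimal sketch:

```python
def temporal_iou(pred_interval, gt_interval):
    """IoU between a described event interval and a ground-truth motion interval.

    Intervals are (start_s, end_s) pairs in seconds; the result lies in [0, 1].
    """
    (ps, pe), (gs, ge) = pred_interval, gt_interval
    inter = max(0.0, min(pe, ge) - max(ps, gs))      # overlap duration
    union = (pe - ps) + (ge - gs) - inter            # combined duration
    return inter / union if union > 0 else 0.0
```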

7. Significance and Broader Impact

RMEG stands as a pivotal testbed for grounded multimodal intelligence, requiring systems to bridge perceptual, geometric, and pragmatic language reasoning. The convergence of datasets such as those from (Deichler et al., 6 Jul 2025) and (Ding et al., 11 Dec 2025) provides the community with standardized, large-scale corpora and robust evaluation recipes. Progress in RMEG is likely to directly impact embodied conversational agents, robotic assistants operating in ambiguous, dynamic scenes, and next-generation video understanding—provided the above modeling and evaluation challenges can be addressed with more explicit, fine-grained, and context-aware approaches.

References (2)
