
Hybrid Interaction Video Dataset

Updated 23 December 2025
  • Hybrid interaction video datasets are large-scale, multimodal corpora that capture complex behaviors among humans, objects, and agents using synchronized video, audio, and sensor data.
  • They employ advanced hardware setups and automated annotation pipelines to ensure high precision in motion capture, semantic labeling, and interaction segmentation.
  • These datasets drive research in embodied AI, human–object interactions, and video generation, supporting benchmarks for control policies and interactive model evaluation.

A hybrid interaction video dataset is a large-scale, multimodal corpus designed to capture, annotate, and benchmark complex interactive behaviors between entities—human, object, or agent—by synchronizing and integrating multiple sensing modalities (e.g., video, audio, motion capture, language, simulation logs). These datasets underpin research in embodied AI, human–object–human interaction modeling, video understanding, and generative modeling of behaviorally rich phenomena. Recent exemplars include InterVLA (Xu et al., 6 Aug 2025), SpeakerVid-5M (Zhang et al., 14 Jul 2025), ViMo (Yu et al., 11 Mar 2025), Hunyuan-GameCraft (Li et al., 20 Jun 2025), QUB-PHEO (Adebayo et al., 23 Sep 2024), and InterRVOS-8K (Jin et al., 3 Jun 2025). Collectively, they reflect methodological advances in data acquisition, semantic annotation, and benchmarking for hybrid, multimodal, and interaction-driven tasks.

1. Dataset Architectures, Modalities, and Acquisition Paradigms

Hybrid interaction video datasets employ a range of hardware configurations and annotation schemas to capture multimodal interaction episodes (a schematic per-frame record is sketched after this list):

  • InterVLA integrates a hybrid RGB–MoCap system: 20 OptiTrack Prime 13 infrared cameras capture millimeter-precise human and object motion, five exocentric 1080p RGB video streams provide third-person views, and two egocentric GoPro Hero10 cameras (head, chest) supply first-person high-resolution streams. All streams are temporally aligned (< 1 ms).
  • SpeakerVid-5M sources 153,000 YouTube videos, stratifies them into four branches—monologue, dialogue (dyadic), listening, and multi-turn—capturing single/two-person interactional behaviors at 720p–1080p, with synchronized audio downsampled to 16 kHz. Preprocessing includes scene segmentation, diarization, face and hand keypoints, and alignment via SyncNet.
  • ViMo collects 3,500 video clips (derived from public and generated sources) paired with 3D reaction motions (real and synthetic), covering human–human, animal–human, and scene–human interactions, uniformly sampled and preprocessed for HumanML3D representation.
  • Hunyuan-GameCraft aggregates over 1,000,000 short gameplay recordings (across 100+ AAA titles), combining real gameplay (with 6-DoF camera/action logging) and synthetic sequences (rendered with fully known camera/motion trajectories). All modalities are precisely indexed (JSON) to enable synchronized autoregressive training and evaluation.
  • QUB-PHEO uses a 5-camera multi-view GoPro Hero10 rig at 4K/60 fps for dyadic assembly, with MediaPipe facial/hand/body landmarks, gaze estimation, object detection, and timestamped subtask labels, complemented by detailed camera calibration.
  • InterRVOS-8K employs automatic annotation pipelines (GPT-4o, LLaMA-Instruct) to produce dense referring expressions and interaction labels per video, with each object and interaction segment associated with high-quality segmentation masks for both actor and target roles.
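
These acquisition setups differ in hardware, but each ultimately yields time-aligned multimodal records. The sketch below illustrates, under purely hypothetical field names, how a per-frame record of such an episode might be structured, together with a trivial monotonicity check on the merged timeline; it is an illustration of the general pattern, not any dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

import numpy as np


@dataclass
class MultimodalFrame:
    """Hypothetical per-frame record for a hybrid RGB–MoCap capture.

    Field names are illustrative only; each dataset defines its own schema.
    """
    timestamp_s: float                      # shared clock for all sensors
    exo_rgb: Dict[str, np.ndarray]          # camera_id -> HxWx3 third-person frame
    ego_rgb: Dict[str, np.ndarray]          # "head"/"chest" -> first-person frame
    body_pose: Optional[np.ndarray] = None  # (J, 3) MoCap joint positions, meters
    object_poses: Dict[str, np.ndarray] = field(default_factory=dict)  # object_id -> 4x4 pose
    command: Optional[str] = None           # frame-aligned atomic language command


def timestamps_monotonic(frames: List[MultimodalFrame]) -> bool:
    """Sanity-check that merged records are strictly increasing in time."""
    ts = np.array([f.timestamp_s for f in frames])
    return bool(np.all(np.diff(ts) > 0.0))
```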

2. Data Composition, Annotation, and Quality Control

Data schema in hybrid interaction video datasets is characterized by intricate multimodal linkages and fine-grained semantic supervision:

  • InterVLA encodes human motion in SMPL-formatted 6D pose, global orientation, joint positions, object pose/mesh, and temporally aligned language commands at the frame level. Scripts are GPT-generated, with each scenario involving multiple objects/furniture and 8±2 atomic commands.
  • SpeakerVid-5M annotates clips with multimodal features: face/hand blur scores (Laplacian variance), DOVER video quality, motion magnitude (Qwen2.5-VL), ASR confidence (Whisper), speaker identity (ArcFace), and structured captions. Dialogue pairs are explicitly marked as initiator→responder, enabling direct evaluation of bidirectional dyadic interaction.
  • ViMo supports paired video–motion supervision, with HumanML3D pose arrays including joint positions/velocities, root velocities, binary foot-contact flags, and emotion tags, permitting both action- and affect-conditioned generation.
  • Hunyuan-GameCraft stores per-frame 6-DoF extrinsics, keyboard/mouse actions normalized to a continuous 4D vector (d_trans, d_rot, α, β), and per-clip VLM-generated natural language descriptions. Synthetic sequences provide full ground-truth control trajectories for development and benchmarking; a hypothetical encoding of this action vector is sketched after this list.
  • QUB-PHEO supplies dense per-frame pose (face, hand, upper-body), object detection (YOLOv8, test mAP ≈ 98%), 2D/3D gaze vectors, and 36-class subtask labels for every atomic assembly operation, with all views calibrated to enable 3D reconstruction.
  • InterRVOS-8K’s annotation pipeline yields both actor-centric and target-centric referring expressions for each interaction, with dense mask supervision and role-specific tokenization for actor/target in segmentation tasks.
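
Hunyuan-GameCraft's normalized 4D action vector (d_trans, d_rot, α, β) is described above only at the schema level. The sketch below shows one hypothetical way raw keyboard/mouse events could be flattened into such a vector and stored next to the camera extrinsics; the conventions used here (key-to-heading mapping, mouse-motion scaling) are invented for illustration and are not the dataset's actual encoding.

```python
import json

import numpy as np


def encode_action(keys: set, mouse_dx: float, mouse_dy: float,
                  max_mouse: float = 200.0) -> list:
    """Hypothetical mapping of raw inputs to (d_trans, d_rot, alpha, beta)."""
    pressed = keys & {"w", "a", "s", "d"}
    d_trans = 1.0 if pressed else 0.0
    # Arbitrary convention: heading angle of the alphabetically first pressed key.
    headings = {"w": 0.0, "d": 0.5 * np.pi, "s": np.pi, "a": 1.5 * np.pi}
    alpha = headings[min(pressed)] if pressed else 0.0
    # Rotation magnitude from mouse motion (clipped to [0, 1]) and its direction.
    d_rot = float(np.clip(np.hypot(mouse_dx, mouse_dy) / max_mouse, 0.0, 1.0))
    beta = float(np.arctan2(mouse_dy, mouse_dx))
    return [d_trans, d_rot, alpha, beta]


# A per-frame record might then pair the action with the camera pose and caption:
record = {
    "frame": 120,
    "extrinsics": np.eye(4).tolist(),            # 6-DoF camera pose as a 4x4 matrix
    "action": encode_action({"w"}, 12.0, -3.0),  # [d_trans, d_rot, alpha, beta]
    "caption": "the player walks forward while panning right",
}
print(json.dumps(record))
```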

Quality control typically involves automated and manual filters (e.g., for image blur, video fidelity, action plausibility), algorithmically enforced synchronization, and rigorous train/val/test splits to prevent subject or scenario overlap.
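
A minimal example of one such automated filter, the Laplacian-variance blur score that SpeakerVid-5M reports for faces and hands, is sketched below; the acceptance threshold is a placeholder, not a value documented for any of these datasets.

```python
import cv2  # OpenCV; any Laplacian implementation works equally well


def laplacian_sharpness(image_path: str) -> float:
    """Variance of the Laplacian response: higher means sharper."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())


def passes_blur_filter(image_path: str, threshold: float = 100.0) -> bool:
    """Keep a frame crop if its sharpness exceeds a placeholder threshold."""
    return laplacian_sharpness(image_path) >= threshold
```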

3. Benchmarks, Metrics, and Evaluation Protocols

Hybrid interaction video datasets define comprehensive benchmark suites to rigorously assess model performance in realistic multi-agent or multi-entity tasks:

| Dataset | Task Domain | Key Metrics/Benchmarks |
| --- | --- | --- |
| InterVLA (Xu et al., 6 Aug 2025) | Egocentric action/motion estimation, text→motion | MPJPE, PA-MPJPE, PVE, Accel (body); R-Precision, FID, Diversity, contact accuracy |
| SpeakerVid-5M (Zhang et al., 14 Jul 2025) | Dyadic video chat synthesis | FID, FVD, ArcFace (identity), CLIP_dialog (dialogue coherence), SyncNet (AV sync), emotion FID, SIM-o (audio timbre) |
| ViMo (Yu et al., 11 Mar 2025) | 3D human reaction generation | FID (motion), Diversity, Multi-Modality |
| Hunyuan-GameCraft (Li et al., 20 Jun 2025) | Game video generation with action history | FVD, dynamic (optical-flow) score, image/aesthetic/temporal scores, RPE (translation/rotation) |
| QUB-PHEO (Adebayo et al., 23 Sep 2024) | HRI/assembly intention inference | Object detection mAP, subtask classification, gaze estimation |
| InterRVOS-8K (Jin et al., 3 Jun 2025) | Referring video object segmentation | Region similarity ($\mathcal{J}$), contour accuracy ($\mathcal{F}$), combined actor–target score, role-modeling ablations |

For instance, InterVLA’s motion estimation is evaluated with MPJPE, PA-MPJPE, PVE, and Accel; its interaction synthesis relies on R-Precision, FID, and Diversity. SpeakerVid-5M’s VidChatBench examines video, audio, and semantic metrics for dyadic AV chat. ViMo leverages FID and Diversity for 3D reaction plausibility. Hunyuan-GameCraft uses FVD and action-aligned evaluation. InterRVOS-8K formalizes interaction segmentation with actor/target $\mathcal{J}$, $\mathcal{F}$, and combined scores.
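
For concreteness, the sketch below gives generic reference implementations of MPJPE and PA-MPJPE for (T, J, 3) joint-position arrays, using a standard Umeyama-style similarity alignment for the Procrustes variant. It is not code released with any of the datasets above, and per-dataset conventions (root alignment, units) may differ.

```python
import numpy as np


def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-joint position error over (T, J, 3) arrays, in input units."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def pa_mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    """MPJPE after per-frame similarity (Procrustes) alignment to ground truth."""
    errs = []
    for p, g in zip(pred, gt):
        mu_p, mu_g = p.mean(axis=0), g.mean(axis=0)
        p0, g0 = p - mu_p, g - mu_g
        # Umeyama alignment: optimal rotation and scale from the cross-covariance SVD.
        U, S, Vt = np.linalg.svd(p0.T @ g0)
        d = np.sign(np.linalg.det(Vt.T @ U.T))
        D = np.diag([1.0, 1.0, d])          # reflection correction
        R = Vt.T @ D @ U.T
        s = (S * np.diag(D)).sum() / (p0 ** 2).sum()
        aligned = s * p0 @ R.T + mu_g
        errs.append(np.linalg.norm(aligned - g, axis=-1).mean())
    return float(np.mean(errs))
```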

4. Research Applications and Impact

Hybrid interaction video datasets underpin a variety of central research areas:

  • Egocentric perception and embodied action: InterVLA enables benchmarking of first-person visual grounding of physical interactions, crucial for in-hand manipulation, shared workspace collaboration, and assistive robotics.
  • Audio-visual conversational agents: SpeakerVid-5M facilitates end-to-end training of virtual humans and avatar-driven dialogue systems that remain robust in multi-turn, multimodal interaction and preserve identity for out-of-distribution speakers.
  • Multi-type reaction modeling: ViMo supports investigation into cross-domain reaction generation (human–human, animal–human, scene–human) and emotion-conditioned behavior synthesis.
  • Controllable video generation and RL world-modeling: Hunyuan-GameCraft’s integration of action logs with high-fidelity video supports pretraining for RL/control policies, cinematic camera planning, and evaluation of fine-grained action controllability.
  • Intent and collaborative task recognition: QUB-PHEO’s multi-view, multi-label HRI annotation enables intention inference, prediction of subtask transitions, and detailed study of gaze/engagement in dyadic collaboration.
  • Multimodal segmentation with interaction awareness: InterRVOS-8K and its ReVIOSa benchmark advance joint actor–target mask prediction in semantics-rich visual grounding, directly supporting multi-entity video understanding.

5. Usage Protocols, Preprocessing, and Access

Each dataset provides protocols to ensure reproducibility and ease of integration:

  • Data are often organized by session, interaction, or clip. Directory structures are standardized and splits (train/val/test) are held consistent across subtasks; a minimal loading sketch follows this list.
  • Preprocessing pipelines include spatial/temporal alignment, normalization (coordinate/frame rates), anonymization (face masking in InterVLA), and conversion to modeling-friendly formats.
  • Annotation tools and calibration pipelines are sometimes released alongside (e.g., QubVidCalib for QUB-PHEO).
  • Public access is mediated through dedicated project pages, licenses (often for non-commercial research), and supporting code for loading, preprocessing, and baseline benchmarking.
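
The sketch below shows a minimal loader for a hypothetical standardized layout of the kind described above (`<root>/<split>/<session>/<clip>/annotations.json` plus a `frames/` directory). The directory scheme and annotation fields are assumptions for illustration, not any dataset's actual release format.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple


def iter_clips(root: str, split: str = "train") -> Iterator[Tuple[Path, dict]]:
    """Yield (clip_directory, annotations) for a hypothetical session/clip layout."""
    for ann_path in sorted(Path(root, split).glob("*/*/annotations.json")):
        with open(ann_path) as f:
            yield ann_path.parent, json.load(f)


# Usage (paths and fields are placeholders):
# for clip_dir, ann in iter_clips("/data/hybrid_dataset", split="val"):
#     frames = sorted((clip_dir / "frames").glob("*.jpg"))
#     commands = ann.get("commands", [])
```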

6. Limitations and Prospects for Extension

Hybrid interaction video datasets are subject to inherent limitations:

  • Modality gaps: Many egocentric datasets, such as InterVLA, lack explicit hand/finger MoCap and rely on post-hoc vision-based estimation.
  • Scope/breadth constraints: Datasets are often confined to specific environments or domains (e.g., indoor scenes for InterVLA, AAA game footage for Hunyuan-GameCraft).
  • Annotation bottlenecks: Manual and semi-automatic pipelines may limit scale and diversity or introduce systematic biases.
  • Action space coverage: Several datasets track only a subset of possible actions (e.g., Hunyuan-GameCraft covers camera/viewpoint motion but not discrete environmental interactions).
  • Representation bias: Datasets sampled from online video or game sources may skew toward prevalent genres, lighting, or social interaction modes.

Opportunities for extension include increasing temporal scale, expanding object/scene diversity, integrating higher-fidelity sensors (e.g., IMUs, gloves), and constructing richer multimodal/multiactor settings for next-generation embodied AI and video understanding tasks (Xu et al., 6 Aug 2025, Zhang et al., 14 Jul 2025, Yu et al., 11 Mar 2025, Li et al., 20 Jun 2025, Adebayo et al., 23 Sep 2024, Jin et al., 3 Jun 2025).
