Refer360: Multimodal HRI Dataset

Updated 13 December 2025
  • Refer360 is a multimodal, multi-view human–robot interaction dataset designed to capture natural verbal and nonverbal behaviors for embodied referring expression comprehension.
  • It comprises nearly 14,000 interactions with synchronized exocentric and egocentric streams including RGB, depth, IR, audio, gaze, skeletal, and IMU data collected in diverse environments.
  • The dataset supports advanced tasks such as predicting object bounding boxes and visual question answering, with baseline improvements demonstrated using guided fusion models like MuRes.

Refer360 is a large-scale, multimodal, and multi-view human–robot interaction (HRI) dataset designed to advance research in embodied referring expression comprehension. It addresses key shortcomings of prior corpora—including perspective bias, limited gesture coverage, single-view collection, and an overrepresentation of scripted indoor scenes—by capturing natural verbal and nonverbal object-referencing behaviors from diverse perspectives in both laboratory and real-world environments. Refer360 comprises approximately 14,000 synchronized interactions, each involving speech, gesture, gaze, and motion, with comprehensive annotations for downstream machine learning tasks. The dataset serves as both a benchmark and a research substrate for embodied language understanding, multimodal fusion, and robotic perception (Islam et al., 6 Dec 2025).

1. Scale, Modalities, and Environmental Coverage

Refer360 contains 13,990 referring-expression interactions spanning 3.2 million frames recorded over 17.62 hours. Each sample includes multi-perspective synchronized streams:

  • Exocentric RGB video (Azure Kinect DK)
  • Egocentric RGB video and gaze/pupil tracking (Pupil Invisible eye tracker)
  • Depth video (Azure Kinect time-of-flight)
  • Infrared (IR) video
  • Audio (Kinect microphone array)
  • 3D skeletal joints (32-joint skeleton)
  • Inertial measurement unit (accelerometer + gyroscope)

Every interaction is captured from at least three temporally synchronized viewpoints: exocentric RGB (“exo”), egocentric RGB/gaze (“ego”), and exocentric depth (additionally providing IR and skeleton data aligned to that frame). Each frame and event is timestamped with UNIX time, enabling precise cross-modal and cross-view alignment.
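Because all streams share UNIX timestamps, cross-view alignment reduces to nearest-timestamp matching. The following is a minimal sketch of such alignment; the function, argument names, and 50 ms tolerance are illustrative assumptions rather than part of the released tooling.

```python
import bisect

def align_streams(exo_times, ego_times, tolerance=0.05):
    """Pair each exocentric frame index with the nearest egocentric frame in time.

    Both arguments are sorted lists of UNIX timestamps in seconds; pairs whose
    time gap exceeds `tolerance` are dropped. Names and the 50 ms default are
    illustrative assumptions, not part of the dataset specification.
    """
    pairs = []
    for i, t in enumerate(exo_times):
        j = bisect.bisect_left(ego_times, t)
        # Candidate neighbours: the egocentric frame just before and just after t.
        candidates = [k for k in (j - 1, j) if 0 <= k < len(ego_times)]
        if not candidates:
            continue
        best = min(candidates, key=lambda k: abs(ego_times[k] - t))
        if abs(ego_times[best] - t) <= tolerance:
            pairs.append((i, best))
    return pairs
```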

The corpus spans 392 recording sessions: 198 indoor (“laboratory”) sessions (10,814 interactions, 2.47 million frames, 13.48 hours) and 194 outside-lab (homes, outdoor, workplace) sessions (3,176 interactions, 759,000 frames, 4.14 hours). The dataset encompasses 66 participants (mean age 26.66 ± 3.36 years, 53% male, 47% female), all operating an Ohmni telepresence robot equipped with the recording apparatus and following both constrained (explicit verbal and gestural instructions) and unconstrained (natural instruction) conditions. Post-task survey results show 96.97% of participants preferred a multimodal (speech + gesture) referencing style, while only 3.03% used speech alone; none relied solely on gestures.

2. Data Collection, Synchronization, and Annotation Workflow

Data acquisition involves a coordinated hardware setup: the Azure Kinect DK is mounted on a teleoperated Ohmni mobile robot to simulate the robot’s viewpoint, while participants wear the Pupil Invisible eye tracker to record egocentric video and gaze. Data streams are captured using custom Python software (the pyKinectAzure SDK for the Kinect and the Pupil Labs Real-Time Python API for the eye tracker).

Recording and segmentation are event-driven: keyboard signals (“Space” for start/end of interaction, “G” for canonical reference event, “Q” for session termination) are synchronized across all streams, enabling post hoc alignment via UNIX timestamps.
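A minimal sketch of how such an event log could be turned into per-interaction segments is given below; the event-tuple format is an assumption for illustration and does not reproduce the authors' recording software.

```python
def segment_interactions(events):
    """Convert an ordered list of (unix_time, key) events into interaction segments.

    "Space" toggles interaction start/end, "G" marks the canonical reference
    event, "Q" terminates the session. The event-log structure is an
    illustrative assumption.
    """
    segments, start, canonical = [], None, None
    for t, key in events:
        if key == "Q":
            break
        if key == "Space":
            if start is None:
                start, canonical = t, None          # interaction begins
            else:
                segments.append({"start": start, "end": t,
                                 "canonical": canonical})
                start = None                        # interaction ends
        elif key == "G" and start is not None:
            canonical = t                           # canonical reference event
    return segments
```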

The annotation pipeline comprises:

  • Audio transcription via OpenAI Whisper, validated by five expert annotators
  • Referent target bounding boxes in exocentric RGB, as well as labeling of multimodal cues (pointing onsets, gaze fixations)
  • Perspective labels (“speaker-centric” vs. “robot-centric”)
  • Manual verification of all segmentations and transcriptions

No inter-annotator agreement statistics are reported. Preprocessing uses FFmpeg to separate RGB, depth, and IR streams; extracts MP3 audio; and segments video and skeletal data into per-interaction clips and canonical-frame images for focused model supervision.
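As a rough illustration of this preprocessing step, the snippet below wraps standard FFmpeg invocations for audio extraction and clip cutting; file names and offsets are placeholders, and the exact commands used by the authors are not specified.

```python
import subprocess

def extract_audio(session_video, out_mp3):
    """Extract the audio track of a session recording as MP3 (paths are placeholders)."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", session_video, "-vn",
         "-acodec", "libmp3lame", out_mp3],
        check=True,
    )

def cut_interaction_clip(session_video, start_sec, end_sec, out_clip):
    """Cut a per-interaction clip between two offsets (seconds) without re-encoding."""
    subprocess.run(
        ["ffmpeg", "-y", "-ss", str(start_sec), "-to", str(end_sec),
         "-i", session_video, "-c", "copy", out_clip],
        check=True,
    )
```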

3. Data Organization, Formats, and Access Patterns

The dataset follows a session-centric directory hierarchy:

refer360/
  session_<ID>/
    metadata.json
    audio.mp3
    transcription.txt
    imu_accel.json
    imu_gyro.json
    skeleton.json
    Videos/
      exo.mp4
      ego.mp4
      depth.mp4
      infrared.mp4
    Frames/
      interactions/
        interaction_0001/
          exo_frame_0001.png
          ego_frame_0001.png
          ...
        ...
      canonical_frames/
        ...

Media files use MP4 (H.264) for video, MP3 for audio, and JSON for structured sensor data. Each session includes all primary data streams, transcriptions, and per-interaction or canonical frame snapshots. The public dataset does not enforce a standard split; experiments in the original paper used random-seed splits, but users are encouraged to define splits by session or participant (e.g., 70/15/15%) for systematic benchmarking.
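A sketch of a participant-level 70/15/15 split along these lines is shown below; it assumes metadata.json exposes a participant identifier (here called "participant_id"), which is an illustrative guess at the schema.

```python
import json
import random
from pathlib import Path

def split_by_participant(root, seed=0, ratios=(0.70, 0.15, 0.15)):
    """Group sessions by participant and split participants 70/15/15.

    Assumes each session's metadata.json contains a "participant_id" field;
    the key name is an illustrative assumption.
    """
    by_participant = {}
    for meta_path in Path(root).glob("session_*/metadata.json"):
        meta = json.loads(meta_path.read_text())
        by_participant.setdefault(meta["participant_id"], []).append(meta_path.parent)

    participants = sorted(by_participant)
    random.Random(seed).shuffle(participants)
    n_train = int(ratios[0] * len(participants))
    n_val = int(ratios[1] * len(participants))
    groups = {
        "train": participants[:n_train],
        "val": participants[n_train:n_train + n_val],
        "test": participants[n_train + n_val:],
    }
    # Expand participant groups back into lists of session directories.
    return {name: [s for p in pids for s in by_participant[p]]
            for name, pids in groups.items()}
```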

4. Supported Research Tasks and Evaluation Benchmarks

The principal benchmark task is Embodied Referring Expression Comprehension, formulated as: given an interaction’s multimodal data, predict the 2D bounding box of the referred object in the exocentric RGB frame. The learning signal is L2 regression on box coordinates:

$$\mathcal{L}_{bb} = \frac{1}{N} \sum_{n=1}^{N} \left\| \hat{b}_n - b_n \right\|^2$$

Performance is evaluated using Intersection over Union (IoU) at multiple thresholds:

$$\mathrm{IoU}(\hat{b}, b) = \frac{|\hat{b} \cap b|}{|\hat{b} \cup b|}$$

with reported metrics IoU-25 and IoU-50.
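These quantities can be computed directly from predicted and ground-truth boxes; the NumPy sketch below assumes (x1, y1, x2, y2) box coordinates, which is a common convention rather than a documented choice of the benchmark.

```python
import numpy as np

def box_l2_loss(pred, target):
    """Mean squared L2 distance between predicted and ground-truth boxes.

    pred, target: arrays of shape (N, 4) in (x1, y1, x2, y2) format
    (the coordinate convention is an illustrative assumption).
    """
    return float(np.mean(np.sum((pred - target) ** 2, axis=1)))

def iou(box_a, box_b):
    """Intersection over Union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def iou_at(preds, targets, threshold=0.25):
    """Fraction of predictions whose IoU exceeds `threshold`
    (IoU-25 with threshold=0.25, IoU-50 with threshold=0.50)."""
    hits = [iou(p, t) >= threshold for p, t in zip(preds, targets)]
    return float(np.mean(hits))
```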

Visual Question Answering (VQA) is also supported using ScienceQA and A-OKVQA formats (multiple choice, cross-entropy loss, top-1 accuracy). Typical losses and metrics are:

$$\mathcal{L}_{ce} = -\sum_{n} y_n \log p_n, \qquad \mathrm{Acc} = \frac{1}{N} \sum_{n=1}^{N} \mathbb{1}\{\hat{y}_n = y_n\}$$
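A corresponding NumPy sketch of the multiple-choice cross-entropy and top-1 accuracy, assuming an (N, C) array of predicted class probabilities and integer answer labels:

```python
import numpy as np

def cross_entropy(probs, labels):
    """Mean cross-entropy over N multiple-choice samples.

    probs: (N, C) predicted class probabilities; labels: (N,) integer answers.
    """
    eps = 1e-12  # guard against log(0)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels] + eps)))

def top1_accuracy(probs, labels):
    """Fraction of samples whose argmax prediction matches the answer."""
    return float(np.mean(np.argmax(probs, axis=1) == labels))
```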

5. Baseline Methods and the MuRes Adapter

Baseline models incorporate leading vision-language (VL) architectures: CLIP, Dual-Encoder, ViLT, and BLIP-2 for detection; VisualBERT, CLIP, ViLT, and Dual-Encoder for VQA. Feature fusion is examined across several variants: direct fusion (no residual), vanilla residual (additive skip connection), and the guided residual adapter MuRes, applied to visual tokens ("MuRes(V)"), language tokens ("MuRes(L)"), or both ("MuRes(V+L)").
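Since the paper's architectural details are not reproduced in this summary, the PyTorch sketch below illustrates only the general shape of the three fusion families (direct, vanilla residual, gated/guided residual); it is a generic adapter, not the authors' MuRes implementation.

```python
import torch
import torch.nn as nn

class GuidedResidualFusion(nn.Module):
    """Generic fusion of visual and language tokens with optional residual paths.

    An illustrative sketch of the variant families named above, not MuRes itself.
    """
    def __init__(self, dim, mode="guided"):
        super().__init__()
        self.mode = mode
        self.proj = nn.Linear(2 * dim, dim)                       # fuses concatenated tokens
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, visual, language):
        # visual, language: (batch, tokens, dim); assumed already aligned in token count.
        fused = self.proj(torch.cat([visual, language], dim=-1))
        if self.mode == "direct":
            return fused                                          # no residual path
        if self.mode == "vanilla":
            return fused + visual                                 # additive skip connection
        # "guided": a learned gate decides how much of the visual residual to keep.
        return fused + self.gate(fused) * visual
```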

Sample detection results on Refer360 highlight the performance improvements with MuRes:

| Model | IoU-25 | IoU-50 |
|---|---|---|
| CLIP | 25.80% | 7.67% |
| + MuRes(V) | 29.20% | 9.15% |
| ViLT | 36.53% | 14.03% |
| + MuRes(V+L) | 37.05% | 14.66% |
| Dual-Encoder | 31.08% | 9.83% |
| + MuRes(V+L) | 31.08% | 10.68% |

Similarly, VQA results on ScienceQA reveal substantial gains:

| Model | Accuracy |
|---|---|
| CLIP | 21.31% |
| + MuRes(V+L) | 51.85% |
| VisualBERT | 34.95% |
| + MuRes(V+L) | 39.03% |
| ViLT | 44.52% |
| + MuRes(V+L) | 49.33% |

These results indicate that off-the-shelf VL backbones underachieve on embodied reference tasks, with MuRes providing consistent improvements via guided reinforcement of salient modality-specific features.

6. Design Rationale, Limitations, and Recommendations

Refer360 directly addresses gaps in prior datasets: lack of viewpoint diversity, scripted and exclusively indoor environments, limited gesture and gaze annotation, and a predominance of speaker-centric language. Its multi-view, multi-modality, and broad environmental coverage reduce perspective and viewpoint biases, supporting more ecologically valid HRI modeling.

Empirical findings emphasize that humans overwhelmingly rely on multimodal referencing, integrating speech, gesture, and gaze; as such, models must adopt fusion strategies that go beyond mere alignment of generic features, and dedicated architectures for modality-specific cues are required.

The absence of prescribed train/val/test splits may impede systematic benchmarking; the authors recommend defining splits by environment or subject identity. Future extensions could include embodied QA (E-QA), multimodal dialogue, comprehensive 360° perception (e.g., with multiple Kinects), and modular adapter-based fine-tuning for integrating large VL models with lightweight HRI modules.

This suggests that Refer360 will serve as both a challenging benchmark and a toolkit for modeling real-world, multimodal human–robot communication, applicable to a broad spectrum of embodied AI tasks (Islam et al., 6 Dec 2025).
