HARMONIC Dataset: Human–Robot Shared Autonomy

Updated 31 January 2026
  • HARMONIC Dataset is a comprehensive multimodal corpus capturing synchronized signals from human, robot, and environment, designed for studying shared autonomy in assistive tasks.
  • It integrates diverse sensor modalities such as eye tracking, video, joystick, EMG, and robot state to enable precise intent inference and cognitive state assessment.
  • The dataset features 480 high-fidelity trials with controlled assistance levels and rigorous synchronization techniques, making it ideal for advanced human–robot interaction research.

The HARMONIC Dataset is a comprehensive multimodal corpus focused on human–robot collaboration in shared autonomy scenarios, specifically targeting assistive eating tasks using a 6 degree-of-freedom (DOF) robotic arm. Encompassing synchronized human, robot, and environment data from 24 participants, the dataset has been assembled to enable in-depth study of intent prediction, cognitive state modeling, and the dynamics of shared human–robot control. All primary signals—including eye gaze, egocentric and third-person video, joystick control, electromyography (EMG), and full robot state—are time-aligned and accompanied by derived features (body pose, hand pose, facial landmarks), providing a rich substrate for machine learning and human–robot interaction (HRI) research (Newman et al., 2018).

1. Experiment Design and Task Protocol

The experimental paradigm is an assistive eating task: each participant is seated before three marshmallows arranged on a plate and operates a Kinova Mico robotic arm via a 2-axis joystick with three discrete mode switches (x–y, z–yaw, pitch–roll). Each trial has two principal stages: the user teleoperates the fork-tipped arm into position above a chosen morsel, after which a long press of the mode switch triggers an autonomous fork "spearing" and serving action.

Crucially, shared autonomy is instantiated as a POMDP over a finite goal set $G = \{g_1, g_2, g_3\}$ corresponding to the morsels. Online intent inference maintains a belief $b(g)$, used to blend the human joystick input $u$ and the robot's computed assistive action $a$ via

$$a_\mathrm{applied} = (1-\gamma)\,u + \gamma\,a,$$

with $\gamma \in [0,1]$ controlling the autonomy level (teleoperation: $\gamma = 0$; low: $\gamma = 0.33$; high: $\gamma = 0.67$; autonomous: $\gamma = 1.0$). Each of the 24 naïve participants performs 5 trials at each of the 4 assistance levels, yielding 480 total trials and about 5 hours of recorded multimodal data.
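The blending rule and belief update can be sketched in a few lines. This is a minimal illustration of the arbitration scheme, not the dataset's actual controller; the function names are ours:

```python
import numpy as np

def blend(u, a, gamma):
    """Linear arbitration between the human twist u and the assistive twist a.

    gamma = 0.0 reproduces pure teleoperation; gamma = 1.0 is fully autonomous.
    """
    gamma = float(np.clip(gamma, 0.0, 1.0))
    return (1.0 - gamma) * np.asarray(u) + gamma * np.asarray(a)

def update_belief(belief, likelihoods):
    """One Bayesian update of the belief over the goal set G = {g1, g2, g3}.

    likelihoods[i] is p(observation | g_i) under any intent model.
    """
    posterior = np.asarray(belief) * np.asarray(likelihoods)
    return posterior / posterior.sum()

# Example: low assistance (gamma = 0.33) nudges the commanded twist
u = np.array([1.0, 0.0])   # human joystick input (x-y mode)
a = np.array([0.0, 1.0])   # robot's assistive action toward its inferred goal
print(blend(u, a, 0.33))   # weighted combination of the two twists
```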

2. Sensor Modalities and Feature Set

The dataset captures multimodal streams with high temporal precision and spatial accuracy for each trial:

  • Binocular Eye Tracking: Pupil Labs near-IR dark-pupil system (120 Hz, 640×480 px), with synchronized raw pupil center extraction, per-eye confidence scores, and manual AprilTag-based calibration for egocentric gaze mapping.
  • Egocentric (Scene) Camera: Pupil Labs RGB at 30 Hz (1280×720 px), with timestamped frame indexing.
  • Third-person Stereo Video: Stereolabs ZED (left/right, 1920×1080 px, 30 Hz) for whole-body movement capture; no published calibration, but rectification via ZED SDK possible.
  • Joystick Control: Real-time logging of raw x/y axis inputs, mode state, and assistance information at a nominal 120 Hz, resampled to a uniform time grid.
  • Surface Electromyography (EMG): Myo armband, 8 channels (50 Hz), with concurrent IMU (accelerometer, gyroscope, quaternion orientation), present in 21% of trials with >99% coverage when available.
  • Robot State: Kinova Mico 6-DOF joint positions and velocities (80 Hz); derived forward-kinematics Cartesian positions of all links, with $T_{0\to 6}(\theta)$ and $p_\mathrm{tool}$ available at each timestamp.
  • Shared Autonomy Metadata: Assistance blending coefficients, inferred POMDP goal belief distributions $b(g_i)$, and the applied assistive twist $a$.
  • Environmental Ground Truth: Homogeneous transforms (AprilTag markers) giving the global position of each morsel in the robot frame.
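The environmental ground-truth transforms can be applied with ordinary homogeneous-coordinate arithmetic. A minimal sketch, using an illustrative transform matrix rather than values from the dataset:

```python
import numpy as np

# Illustrative homogeneous transform T (robot frame <- marker frame):
# identity rotation with a translation of (0.40, -0.05, 0.10) m.
# The actual per-morsel transforms come from the AprilTag ground truth.
T = np.array([
    [1.0, 0.0, 0.0,  0.40],
    [0.0, 1.0, 0.0, -0.05],
    [0.0, 0.0, 1.0,  0.10],
    [0.0, 0.0, 0.0,  1.00],
])

def to_robot_frame(T, p_marker):
    """Map a 3D point from the marker frame into the robot frame."""
    p_h = np.append(np.asarray(p_marker, dtype=float), 1.0)  # homogeneous coords
    return (T @ p_h)[:3]

print(to_robot_frame(T, [0.0, 0.0, 0.0]))  # marker origin in the robot frame
```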

Derived high-level features incorporate:

  • 2D human pose (25 OpenPose joints), left/right hand (21 keypoints each), facial landmarks (70 points), computed offline from ZED video streams.

3. Data Formats, Organization, and Synchronization

Directory hierarchy is organized per participant and run, with strict subfolder separation:

  • pXXX/ (participant root)
    • calib/: camera and pupil calibration CSVs
    • check/: inter-block calibration verification
    • run/run_NNN/ (one folder per run)
      • text_data/: CSV/YAML for all raw and processed streams (e.g., gaze_positions.csv, ada_joy.csv, joint_states.csv, robot_position.csv, EMG, pose, assistance info)
      • videos/: all original and processed MP4s, timestamp arrays (*_timestamps.npy)
      • stats/: YAML with nominal frequency, frame drops, coverage stats
      • processed/: overlays and derived feature streams

Synchronization leverages nanosecond-precision timestamps and two video index mappings: world_index and world_index_corrected for accurate alignment across asynchronous streams. All major time series are aligned to scene video, with code templates for resampling, frame extraction, and overlay.
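Assuming the *_timestamps.npy arrays hold one nanosecond timestamp per scene frame, nearest-frame lookup for any other stream can be sketched as follows (nearest_frame is our illustrative helper, not part of the dataset tooling):

```python
import numpy as np

def nearest_frame(world_ts, t_ns):
    """Index of the scene-video frame whose timestamp is closest to each t_ns.

    world_ts: sorted 1-D array of per-frame nanosecond timestamps.
    """
    t_ns = np.atleast_1d(np.asarray(t_ns, dtype=np.int64))
    idx = np.searchsorted(world_ts, t_ns)               # insertion points
    idx = np.clip(idx, 1, len(world_ts) - 1)            # stay within bounds
    left_closer = (t_ns - world_ts[idx - 1]) < (world_ts[idx] - t_ns)
    return np.where(left_closer, idx - 1, idx)

# Example with a synthetic 30 Hz timestamp array
ts = np.arange(0, 10) * (10**9 // 30)
print(nearest_frame(ts, [0, 40_000_000, 70_000_000]))   # -> [0 1 2]
```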

Example headers (abridged):

File                 Key columns (sample)
gaze_positions.csv   timestamp, norm_pos_x, norm_pos_y, confidence, world_index, world_index_corrected
ada_joy.csv          timestamp, mode, joy_x, joy_y
myo_emg.csv          timestamp, emg0–emg7
joint_states.csv     timestamp, joint_i_pos, joint_i_vel
pose.csv             timestamp, joint_1_x, joint_1_y, ..., joint_25_x, joint_25_y
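A small loader for these per-run CSVs might look as follows. load_run is an illustrative helper; it assumes the filenames above live under text_data/ and tolerates the EMG file being absent (as it is in many trials):

```python
from pathlib import Path

import pandas as pd

def load_run(run_dir):
    """Load the principal CSV streams of one run into a dict of DataFrames."""
    text = Path(run_dir) / "text_data"
    streams = {}
    for name in ("gaze_positions", "ada_joy", "myo_emg", "joint_states"):
        path = text / f"{name}.csv"
        if path.exists():                 # EMG is missing in many trials
            streams[name] = pd.read_csv(path)
    return streams
```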

4. Experiment Coverage and Data Quality

  • Population: 24 participants (13 female, ages 18–45), all non-expert with respect to robotics and teleoperation.
  • Trial Structure: 4 assistance levels × 5 trials per level × 24 participants = 480 trials.
  • Coverage: Full video and robot/joystick state in all trials; eye-tracking and EMG have ≤1% frame loss where present. EMG present in approximately 21% of trials due to initialization failures (when present, >99% temporal coverage).
  • Frame Drop Monitoring: Per-run YAML stats document expected vs. actual data frames and dropped indices; coverage typically ≈95%.
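A coverage check against the per-run YAML stats could be sketched as below; the key names expected_frames and actual_frames are illustrative guesses, so consult the actual stats/ schema before use:

```python
import yaml

def coverage(stats_path):
    """Fraction of expected data frames actually recorded for one run.

    Assumes a flat YAML with 'expected_frames' and 'actual_frames' keys
    (hypothetical names; the real per-run schema may differ).
    """
    with open(stats_path) as f:
        stats = yaml.safe_load(f)
    return stats["actual_frames"] / stats["expected_frames"]
```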

All data signals are intended for high-fidelity multimodal behavioral analysis, with careful timestamping and alignment mechanisms to permit cross-modal integration.

5. Example Code and User Guidance

To support downstream analysis, practical code fragments are provided. Typical pre-processing or alignment tasks include:

```python
import cv2
import pandas as pd

gaze = pd.read_csv('text_data/gaze_positions.csv')
vid = cv2.VideoCapture('videos/world.mp4')

for idx, row in gaze.iterrows():
    f = int(row['world_index_corrected'])
    vid.set(cv2.CAP_PROP_POS_FRAMES, f)
    ret, frame = vid.read()
    if not ret:                      # skip dropped or out-of-range frames
        continue
    # Pupil Labs norm_pos coordinates have their origin at the bottom-left,
    # so the y coordinate is flipped for pixel indexing.
    x_px = int(row['norm_pos_x'] * frame.shape[1])
    y_px = int((1.0 - row['norm_pos_y']) * frame.shape[0])
    cv2.circle(frame, (x_px, y_px), 5, (0, 0, 255), -1)

vid.release()
```
For resampling to a uniform 30 Hz grid aligned with the scene video:

```python
import numpy as np
import pandas as pd

world_ts = np.load('videos/world_timestamps.npy')    # nanosecond timestamps

t0 = world_ts[0]
n_frames = int((world_ts[-1] - t0) / 1e9 * 30)
t_common = (t0 + np.arange(n_frames) * 1e9 / 30).astype(np.int64)

emg = pd.read_csv('text_data/myo_emg.csv')
emg['timestamp'] = emg['timestamp'].astype(np.int64)
emg = emg.set_index('timestamp').sort_index()
emg_rs = emg.reindex(t_common, method='nearest')     # nearest-sample alignment
```

Best practice is to use the original nanosecond timestamps for all cross-modal synchronization, and to forward-fill control and EMG streams to mask short gaps.
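The forward-filling step can be sketched as follows (align_ffill is our illustrative helper, assuming a sorted integer-nanosecond timestamp column):

```python
import numpy as np
import pandas as pd

def align_ffill(df, t_common):
    """Align a control/EMG stream onto a common timeline by forward-filling.

    df must have a sorted integer-nanosecond 'timestamp' column; each sample
    is carried forward so short dropouts do not surface as NaNs.
    """
    s = df.set_index("timestamp").sort_index()
    return s.reindex(s.index.union(t_common)).ffill().loc[t_common]

# Example: a 2-sample joystick stream resampled onto a 4-point grid
joy = pd.DataFrame({"timestamp": [0, 100], "joy_x": [0.5, -0.5]})
grid = np.array([0, 50, 100, 150], dtype=np.int64)
print(align_ffill(joy, grid)["joy_x"].tolist())  # -> [0.5, 0.5, -0.5, -0.5]
```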

6. Research Applications and Baseline Results

The HARMONIC Dataset is designed for several core research uses:

  • Intention Prediction: Fusion of gaze and EMG predicts goal probabilities in shared autonomy POMDP belief update frameworks.
  • Human Policy Modeling: Data-driven characterization of human adaptation under different autonomy levels.
  • Cognitive State Assessment: Pupil dynamics in teleoperation are leveraged for cognitive load estimation.
  • Learning Eye–Hand–Control Couplings: Modeling and imitation learning of tightly coupled eye, hand, and control device signals for assistive robotics.
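As a toy illustration of gaze-driven goal inference (not the model used in the baseline analyses), goal likelihoods can be scored by gaze-to-goal distance and folded into a recursive Bayesian belief update:

```python
import numpy as np

def gaze_likelihoods(gaze_xy, goals_xy, sigma=0.05):
    """p(gaze | g_i) under an isotropic Gaussian around each goal (toy model)."""
    d2 = np.sum((np.asarray(goals_xy) - np.asarray(gaze_xy)) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma**2))

def infer_goal(gaze_trace, goals_xy):
    """Recursive Bayesian update of the goal belief from a gaze trace."""
    belief = np.full(len(goals_xy), 1.0 / len(goals_xy))
    for gaze in gaze_trace:
        belief *= gaze_likelihoods(gaze, goals_xy)
        belief /= belief.sum()
    return belief

goals = [(0.3, 0.5), (0.5, 0.5), (0.7, 0.5)]   # three morsels (normalized coords)
trace = [(0.31, 0.49), (0.29, 0.51), (0.30, 0.50)]
print(infer_goal(trace, goals).argmax())        # the belief locks onto goal 0
```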

Prior analysis using subsets of HARMONIC demonstrated intention inference accuracy ≈85% (three-way goal prediction) and F1 ≈0.75 for manipulation error detection from gaze signals.

7. Access, Licensing, and Community Use

Multiple dataset subsets are provided for ease of access:

  • Full dataset: ~68 GB (harmonic_data.tar.gz)
  • Minimal (CSV+video+stats): ~15 GB (harmonic_minimal.tar.gz)
  • Text only: ~4 GB (harmonic_text.tar.gz)
  • Single-participant sample: ~300 MB

The data is publicly available at http://harp.ri.cmu.edu/harmonic. No specific licensing information is present in the primary publication, but data is supplied in standard, human-readable formats for broad reuse.


The HARMONIC Dataset constitutes an unparalleled resource for empirical HRI research, enabling fine-grained analysis and modeling of shared autonomy, intent inference, and behavioral coordination in the context of assistive robotics (Newman et al., 2018).

References

  1. Newman, B. A., Aronson, R. M., Srinivasa, S. S., Kitani, K., and Admoni, H. (2018). HARMONIC: A Multimodal Dataset of Assistive Human–Robot Collaboration. arXiv:1807.11154.
