HARMONIC Dataset: Human–Robot Shared Autonomy
- The HARMONIC Dataset is a comprehensive multimodal corpus capturing synchronized signals from the human, the robot, and the environment, designed for studying shared autonomy in assistive tasks.
- It integrates diverse sensor modalities such as eye tracking, video, joystick, EMG, and robot state to enable precise intent inference and cognitive state assessment.
- The dataset features 480 high-fidelity trials with controlled assistance levels and rigorous synchronization techniques, making it ideal for advanced human–robot interaction research.
The HARMONIC Dataset is a comprehensive multimodal corpus focused on human–robot collaboration in shared autonomy scenarios, specifically targeting assistive eating tasks using a 6 degree-of-freedom (DOF) robotic arm. Encompassing synchronized human, robot, and environment data from 24 participants, the dataset has been assembled to enable in-depth study of intent prediction, cognitive state modeling, and the dynamics of shared human–robot control. All primary signals—including eye gaze, egocentric and third-person video, joystick control, electromyography (EMG), and full robot state—are time-aligned and accompanied by derived features (body pose, hand pose, facial landmarks), providing a rich substrate for machine learning and human–robot interaction (HRI) research (Newman et al., 2018).
1. Experiment Design and Task Protocol
The experimental paradigm is an assistive eating task: each participant sits before three marshmallows arranged on a plate and operates a Kinova Mico robotic arm via a 2-axis joystick with three discrete control modes (x–y, z–yaw, pitch–roll). Each trial involves two principal stages: the user teleoperates the fork-tipped arm into position above a chosen morsel, followed by an autonomous fork "spearing" and serving action, triggered by a long press of the mode switch.
Crucially, shared autonomy is instantiated as a POMDP framework over a finite goal set $\mathcal{G} = \{g_1, g_2, g_3\}$, corresponding to the three morsels. Online intent inference maintains a belief $b(g)$, used to blend the human joystick input $u_H$ and the robot's computed assistive action $u_R$ via

$$u = (1 - \alpha)\, u_H + \alpha\, u_R,$$

with $\alpha$ controlling the autonomy level (teleoperation: $0$, low: $0.33$, high: $0.67$, autonomous: $1.0$). Each of the 24 naïve participants performs 5 trials at each of the 4 assistance levels, yielding 480 total trials and about 5 hours of recorded multimodal data.
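The linear blending rule can be sketched in a few lines; a minimal illustration, in which the `blend` helper and the sample commands are hypothetical, not part of the dataset's tooling:

```python
import numpy as np

def blend(u_human, u_robot, alpha):
    """Blend user and assistive commands; alpha in [0, 1] sets the autonomy level."""
    return (1.0 - alpha) * np.asarray(u_human) + alpha * np.asarray(u_robot)

# Hypothetical 2-axis joystick command and assistive command toward the inferred goal.
u_h = [0.8, 0.1]
u_r = [0.5, 0.5]
print(blend(u_h, u_r, 0.0))   # pure teleoperation: returns u_h unchanged
print(blend(u_h, u_r, 0.33))  # low assistance: weighted mix of the two commands
```

At $\alpha = 1.0$ the human input is ignored entirely, matching the fully autonomous condition.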
2. Sensor Modalities and Feature Set
The dataset captures multimodal streams with high temporal precision and spatial accuracy for each trial:
- Binocular Eye Tracking: Pupil Labs near-IR dark-pupil system (120 Hz, 640×480 px), with synchronized raw pupil center extraction, per-eye confidence scores, and manual AprilTag-based calibration for egocentric gaze mapping.
- Egocentric (Scene) Camera: Pupil Labs RGB at 30 Hz (1280×720 px), with timestamped frame indexing.
- Third-person Stereo Video: Stereolabs ZED (left/right, 1920×1080 px, 30 Hz) for whole-body movement capture; no published calibration, but rectification via ZED SDK possible.
- Joystick Control: Real-time logging of raw x/y axis inputs, mode state, and assistance information at a nominal 120 Hz, resampled to a uniform time grid.
- Surface Electromyography (EMG): Myo armband, 8 channels (50 Hz), with concurrent IMU (accelerometer, gyroscope, quaternion orientation), present in 21% of trials with >99% coverage when available.
- Robot State: Kinova Mico 6-DOF joint positions and velocities (80 Hz); derived forward-kinematics Cartesian positions of all links, with joint-space and Cartesian-space states available at each timestamp.
- Shared Autonomy Metadata: Assistance blending coefficients, inferred POMDP goal belief distributions, and the applied assistive twist command.
- Environmental Ground Truth: Homogeneous transforms (AprilTag markers) giving the global position of each morsel in the robot frame.
Derived high-level features incorporate:
- 2D human pose (25 OpenPose joints), left/right hand (21 keypoints each), facial landmarks (70 points), computed offline from ZED video streams.
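Applying the environmental ground-truth transforms is a single homogeneous-transform multiplication; a minimal sketch, in which the `apply_transform` helper and the example transform values are made up for illustration:

```python
import numpy as np

def apply_transform(T, p):
    """Map a 3-D point p through a 4x4 homogeneous transform T."""
    p_h = np.append(np.asarray(p, dtype=float), 1.0)  # homogeneous coordinates
    return (T @ p_h)[:3]

# Hypothetical transform: a morsel 10 cm in front of and 5 cm above the robot base.
T = np.eye(4)
T[:3, 3] = [0.10, 0.0, 0.05]
print(apply_transform(T, [0.0, 0.0, 0.0]))  # morsel position in the robot frame
```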
3. Data Formats, Organization, and Synchronization
Directory hierarchy is organized per participant and run, with strict subfolder separation:
```
pXXX/                    (participant root)
├── calib/               camera and pupil calibration CSVs
├── check/               inter-block calibration verification
└── run/run_NNN/
    ├── text_data/       CSV/YAML for all raw and processed streams
    │                    (e.g., gaze_positions.csv, ada_joy.csv,
    │                    joint_states.csv, robot_position.csv, EMG, pose,
    │                    assistance info)
    ├── videos/          all original and processed MP4s, timestamp arrays
    │                    (*_timestamps.npy)
    ├── stats/           YAML with nominal frequency, frame drops, coverage stats
    └── processed/       overlays and derived feature streams
```
Synchronization leverages nanosecond-precision timestamps and two video index mappings: world_index and world_index_corrected for accurate alignment across asynchronous streams. All major time series are aligned to scene video, with code templates for resampling, frame extraction, and overlay.
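Given a `*_timestamps.npy` array, any event timestamp can be mapped to its nearest scene-video frame; a sketch, in which the `nearest_frame` helper is illustrative and not part of the released code templates:

```python
import numpy as np

def nearest_frame(world_ts, t_query):
    """Index of the scene-video frame whose timestamp (ns) is closest to t_query."""
    i = np.searchsorted(world_ts, t_query)
    i = int(np.clip(i, 1, len(world_ts) - 1))
    # Choose the closer of the two neighboring frames.
    if abs(world_ts[i - 1] - t_query) <= abs(world_ts[i] - t_query):
        return i - 1
    return i

world_ts = np.array([0, 33_333_333, 66_666_667, 100_000_000])  # synthetic 30 Hz grid
print(nearest_frame(world_ts, 40_000_000))  # -> 1
```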
Example headers (abridged):
| File | Key Columns/sample |
|---|---|
| gaze_positions.csv | timestamp, norm_pos_x, norm_pos_y, confidence, world_index, world_index_corrected |
| ada_joy.csv | timestamp, mode, joy_x, joy_y |
| myo_emg.csv | timestamp, emg0–emg7 |
| joint_states.csv | timestamp, joint_i_pos, joint_i_vel |
| pose.csv | timestamp, joint_1_x, joint_1_y, ..., joint_25_x, joint_25_y |
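Given the headers above, two asynchronous streams can be joined on nearest timestamps with `pandas.merge_asof`; a sketch using synthetic stand-in data with the same column names as the table:

```python
import pandas as pd

# Synthetic stand-ins for gaze_positions.csv and ada_joy.csv (same column names).
gaze = pd.DataFrame({'timestamp': [0, 10, 20, 30], 'norm_pos_x': [0.5, 0.5, 0.6, 0.6]})
joy = pd.DataFrame({'timestamp': [2, 12, 22], 'joy_x': [0.0, 0.3, 0.3], 'mode': [0, 0, 1]})

# Nearest-timestamp join of the two asynchronous streams (both sorted by timestamp).
merged = pd.merge_asof(gaze, joy, on='timestamp', direction='nearest')
print(merged[['timestamp', 'norm_pos_x', 'joy_x', 'mode']])
```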
4. Experiment Coverage and Data Quality
- Population: 24 participants (13 female, ages 18–45), all non-expert with respect to robotics and teleoperation.
- Trial Structure: 4 assistance levels × 5 trials per level × 24 participants = 480 trials.
- Coverage: Full video and robot/joystick state in all trials; eye-tracking and EMG have ≤1% frame loss where present. EMG present in approximately 21% of trials due to initialization failures (when present, >99% temporal coverage).
- Frame Drop Monitoring: Per-run YAML stats document expected vs. actual data frames and dropped indices; coverage typically ≈95%.
All data signals are intended for high-fidelity multimodal behavioral analysis, with careful timestamping and alignment mechanisms to permit cross-modal integration.
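As a cross-check against the shipped stats, frame drops can also be detected directly from a `*_timestamps.npy` array; a sketch, in which the `dropped_frames` helper and its gap tolerance are illustrative:

```python
import numpy as np

def dropped_frames(ts, nominal_hz, tol=1.5):
    """Indices after which a gap longer than tol nominal periods occurs (ts in ns)."""
    period_ns = 1e9 / nominal_hz
    gaps = np.diff(ts)
    return np.nonzero(gaps > tol * period_ns)[0]

# Synthetic 30 Hz timestamps with one multi-frame gap after frame 1.
ts = np.array([0, 33_333_333, 133_333_333, 166_666_667])
print(dropped_frames(ts, 30))  # indices where a drop occurred
```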
5. Example Code and User Guidance
To support downstream analysis, practical code fragments are provided. Typical pre-processing or alignment tasks include:
Overlaying the gaze point on scene-video frames via the corrected frame index:

```python
import cv2
import numpy as np
import pandas as pd

gaze = pd.read_csv('text_data/gaze_positions.csv')
world_ts = np.load('videos/world_timestamps.npy')
vid = cv2.VideoCapture('videos/world.mp4')

for idx, row in gaze.iterrows():
    f = int(row['world_index_corrected'])
    vid.set(cv2.CAP_PROP_POS_FRAMES, f)
    ret, frame = vid.read()
    if not ret:
        continue  # skip dropped or out-of-range frames
    # Map normalized gaze coordinates to pixels (Pupil Labs norm_pos uses a
    # bottom-left origin, hence the y flip into image coordinates).
    x_px = int(row['norm_pos_x'] * frame.shape[1])
    y_px = int((1.0 - row['norm_pos_y']) * frame.shape[0])
    cv2.circle(frame, (x_px, y_px), 5, (0, 0, 255), -1)
```
Resampling EMG onto a uniform 30 Hz grid aligned to the scene video (reusing `world_ts` from above):

```python
# Build a uniform 30 Hz nanosecond time grid spanning the scene video.
t0 = world_ts[0]
n_frames = int((world_ts[-1] - t0) / 1e9 * 30)
t_common = t0 + np.arange(n_frames) * 1e9 / 30

emg = pd.read_csv('text_data/myo_emg.csv')
emg['timestamp'] = emg['timestamp'].astype(np.int64)
emg = emg.set_index('timestamp')
emg_rs = emg.reindex(t_common, method='nearest')  # nearest-sample resampling
```
Best practice is to use the original nanosecond timestamps for all cross-modal synchronization and to forward-fill control/EMG modalities to mask gaps.
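The forward-fill recommendation, sketched on a synthetic control stream with a gap (names are illustrative):

```python
import numpy as np
import pandas as pd

# Synthetic low-rate control stream with a gap (NaN) left by resampling.
joy = pd.Series([0.2, np.nan, np.nan, 0.5], index=[0, 1, 2, 3], name='joy_x')

# Forward-fill masks the gap with the last observed command.
joy_ff = joy.ffill()
print(joy_ff.tolist())  # [0.2, 0.2, 0.2, 0.5]
```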
6. Research Applications and Baseline Results
The HARMONIC Dataset is designed for several core research uses:
- Intention Prediction: Fusion of gaze and EMG predicts goal probabilities in shared autonomy POMDP belief update frameworks.
- Human Policy Modeling: Data-driven characterization of human adaptation under different autonomy levels.
- Cognitive State Assessment: Pupil dynamics in teleoperation are leveraged for cognitive load estimation.
- Learning Eye–Hand–Control Couplings: Modeling and imitation learning of tightly coupled eye, hand, and control device signals for assistive robotics.
Prior analysis using subsets of HARMONIC demonstrated intention inference accuracy ≈85% (three-way goal prediction) and F1 ≈0.75 for manipulation error detection from gaze signals.
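Three-way goal-prediction accuracy of the kind reported above can be computed from belief snapshots by taking the argmax goal; a sketch with synthetic beliefs, not the published evaluation code:

```python
import numpy as np

def goal_accuracy(beliefs, true_goals):
    """Accuracy of argmax-of-belief goal prediction (three-way in HARMONIC)."""
    pred = np.argmax(beliefs, axis=1)
    return float(np.mean(pred == np.asarray(true_goals)))

# Synthetic belief distributions over the three morsels, one row per trial.
b = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4],
              [0.5, 0.4, 0.1]])
print(goal_accuracy(b, [0, 1, 2, 1]))  # 3 of 4 correct -> 0.75
```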
7. Access, Licensing, and Community Use
Multiple dataset subsets are provided for ease of access:
- Full dataset: ~68 GB (`harmonic_data.tar.gz`)
- Minimal (CSV + video + stats): ~15 GB (`harmonic_minimal.tar.gz`)
- Text only: ~4 GB (`harmonic_text.tar.gz`)
- Single-participant sample: ~300 MB
The data is publicly available at http://harp.ri.cmu.edu/harmonic. No specific licensing information is present in the primary publication, but data is supplied in standard, human-readable formats for broad reuse.
The HARMONIC Dataset constitutes an unparalleled resource for empirical HRI research, enabling fine-grained analysis and modeling of shared autonomy, intent inference, and behavioral coordination in the context of assistive robotics (Newman et al., 2018).