ForceVLA2-Dataset: Hybrid Force-Position Control
- The dataset provides synchronized multi-modal streams (RGB, proprioception, force-torque) for robust hybrid force–position control.
- It comprises 1,000 real-robot demonstration trajectories with detailed subtask annotations and stage-level force-aware prompts.
- The dataset supports evaluation of VLA models: ForceVLA2 reaches a 66% success rate across tasks, compared with 18% for a position-only baseline.
ForceVLA2-Dataset is a multi-modal, real-robot dataset expressly constructed for end-to-end learning of hybrid force–position control in contact-rich manipulation tasks. It provides synchronized streams of multi-view RGB images, proprioceptive states, force-torque measurements, comprehensive subtask annotations, and stage-level force-aware prompts, supporting the training and evaluation of force-aware vision-language-action (VLA) models. The dataset was introduced alongside the ForceVLA2 framework to address limitations in position-only or naïvely force-augmented VLA architectures, establishing a foundation for robust, closed-loop physical intelligence in complex manipulation scenarios (Li et al., 16 Mar 2026).
1. Dataset Composition and Scope
ForceVLA2-Dataset comprises 1,000 real-robot demonstration trajectories, totaling approximately 500,000 time-steps. Each demonstration captures the complete execution of a contact-rich task, segmented into 3–5 discrete stages (subtasks) based on salient transitions in force-torque signals and visual inspection. The five canonical tasks included are:
- Press bottle (vertical actuation of a pump)
- Clean vase (surface wiping to remove stains)
- Clean board (erasing chalk from a blackboard)
- Retrieve plate (searching and grasping a plate occluded within a foam sandbox)
- Assemble gears (aligning and inserting mechanical gears)
Each task is deliberately designed to require significant contact dynamics and hybrid force–position control. Every demonstration in the dataset is a successful human-teleoperated run; failed trials are not included in the dataset but are addressed in evaluation of learned policies.
2. Sensor Modalities and Data Acquisition
The dataset integrates multiple sensory streams synchronized via a common clock:
- Visual. Three RGB camera streams (no depth): two static third-person views (Intel RealSense D455, 1280×720 px @ 30 Hz) and a wrist-mounted egocentric perspective (Intel RealSense D435, 640×480 px @ 30 Hz). Each frame is preprocessed (resized to 480×640, pixel-value normalized) and precisely timestamp-aligned.
- Proprioceptive. End-effector 6D pose (Cartesian position and orientation as quaternion), logged at 300 Hz, with joint positions/velocities recorded synchronously.
- Force/Torque. A 6-axis force-torque sensor at the end-effector flange, capturing six channels ($F_x, F_y, F_z, \tau_x, \tau_y, \tau_z$) at 300 Hz. Signals are bias-corrected and normalized per axis (±100 N for forces, ±15 N·m for torques).
All streams are timestamped and may be interpolated onto a shared grid (typically matching image frames) to ensure sample-accurate multimodal alignment.
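The normalization and resampling steps described above can be sketched as follows. This is a minimal illustration, not the dataset's released preprocessing code: the function names are hypothetical, clipping to [-1, 1] is an assumption, and linear interpolation is one plausible choice for placing 300 Hz channels on the image-frame grid.

```python
from bisect import bisect_right

# Per-axis ranges stated in the text: +/-100 N (forces), +/-15 N*m (torques).
FORCE_RANGE_N = 100.0
TORQUE_RANGE_NM = 15.0


def normalize_wrench(wrench, bias):
    """Subtract a bias reading, then scale forces and torques to [-1, 1].

    `wrench` and `bias` are 6-element sequences (Fx, Fy, Fz, Tx, Ty, Tz).
    Clipping out-of-range values is an assumption of this sketch.
    """
    scales = [FORCE_RANGE_N] * 3 + [TORQUE_RANGE_NM] * 3
    out = []
    for value, offset, scale in zip(wrench, bias, scales):
        out.append(max(-1.0, min(1.0, (value - offset) / scale)))
    return out


def resample_linear(src_times, src_values, query_times):
    """Linearly interpolate one scalar channel onto `query_times`.

    `src_times` must be sorted ascending. Queries outside the source
    range are clamped to the first or last sample.
    """
    result = []
    for t in query_times:
        if t <= src_times[0]:
            result.append(src_values[0])
            continue
        if t >= src_times[-1]:
            result.append(src_values[-1])
            continue
        hi = bisect_right(src_times, t)  # first index with src_times[hi] > t
        lo = hi - 1
        weight = (t - src_times[lo]) / (src_times[hi] - src_times[lo])
        result.append(src_values[lo] + weight * (src_values[hi] - src_values[lo]))
    return result
```

In use, `query_times` would be the camera frame timestamps and each force, torque, or position channel would be resampled separately. Quaternion orientations would need spherical interpolation rather than per-component linear interpolation.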
3. Structure, Organization, and Annotation
Each trajectory is stored in its own directory, which contains:
- images/: Synchronized RGB frames for each camera as PNGs, indexed by timestamp.
- sensors/:
  - state.csv: Timestamp, Cartesian pose, quaternion orientation, and all joint angles.
  - force.csv: Timestamped 6-axis force-torque readings.
- annotations/:
  - task_prompt.txt: Natural-language description of the primary task.
  - force_prompts.json: Per-stage textual cues encoding the current subtask and expected contact (used for prompt engineering with the VLM expert).
  - subtask_labels.csv: Start/end timestamps for each stage/subtask.
Timestamps are UNIX-style or relative in milliseconds, supporting direct cross-modal association.
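A loader for the non-image files of one trajectory might look like the sketch below. The file names come from the layout described above, but the CSV column names are not given in the text: `stage_id`, `start_ms`, and `end_ms` are hypothetical, as are the function names.

```python
import csv
import json
from pathlib import Path


def read_csv_rows(path):
    """Read a CSV file with a header row into a list of dicts."""
    with open(path, newline="") as handle:
        return list(csv.DictReader(handle))


def load_trajectory(root):
    """Load the sensor logs and annotations of one trajectory directory.

    CSV rows are returned as-is because the column names are not
    specified in the text.
    """
    root = Path(root)
    annotations = root / "annotations"
    return {
        "state": read_csv_rows(root / "sensors" / "state.csv"),
        "force": read_csv_rows(root / "sensors" / "force.csv"),
        "task_prompt": (annotations / "task_prompt.txt").read_text().strip(),
        "force_prompts": json.loads((annotations / "force_prompts.json").read_text()),
        "subtask_labels": read_csv_rows(annotations / "subtask_labels.csv"),
    }


def stage_at(subtask_labels, timestamp_ms):
    """Return the stage whose [start, end) interval contains the timestamp.

    The column names `stage_id`, `start_ms`, and `end_ms` are hypothetical.
    """
    for row in subtask_labels:
        if float(row["start_ms"]) <= timestamp_ms < float(row["end_ms"]):
            return row["stage_id"]
    return None
```

Looking up the stage for a given sample timestamp in this way is what allows a per-stage force prompt to be paired with each frame or force reading.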
4. Annotation Methodology and Progress Metrics
Subtasks are annotated by transitions in the norm of the raw force signal, corroborated by visual cues. For each stage, discrete stage IDs (typically 3–5 per trajectory) specify intervals of coherent contact dynamics, such as "approach," "make contact," "manipulate," and "retract."
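One simple way to propose stage boundaries from force-norm transitions is hysteresis thresholding, sketched below. This illustrates the general idea only and is not the authors' annotation procedure: the two thresholds are hypothetical values in newtons, and the resulting intervals would still need the visual check mentioned above.

```python
import math


def force_norm(fx, fy, fz):
    """Euclidean norm of the force vector."""
    return math.sqrt(fx * fx + fy * fy + fz * fz)


def contact_intervals(timestamps, forces, on_threshold=5.0, off_threshold=2.0):
    """Find (start, end) timestamp pairs where the force norm indicates contact.

    Contact begins when the norm rises above `on_threshold` and ends when
    it drops below `off_threshold`. Using two thresholds avoids chattering
    around a single value. Both defaults are hypothetical.
    """
    intervals = []
    start = None
    for t, (fx, fy, fz) in zip(timestamps, forces):
        norm = force_norm(fx, fy, fz)
        if start is None and norm > on_threshold:
            start = t
        elif start is not None and norm < off_threshold:
            intervals.append((start, t))
            start = None
    if start is not None:
        # Contact persisted until the end of the recording.
        intervals.append((start, timestamps[-1]))
    return intervals
```

The gaps between the returned intervals correspond to free-space phases such as "approach" and "retract", while the intervals themselves cover the contact phases.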
A probabilistic model is employed to quantify subtask progress. It is defined over three quantities:
- a directional deviation
- a spatial distance
- a force envelope
Under distributional assumptions on each quantity, the stage-completion probability is defined through their joint event, with parameter values fixed in the released implementation [(Li et al., 16 Mar 2026), Eq. 10]. This probabilistic definition supports both annotation (offline segmentation) and real-time learning of progress-aware policies.
5. Technical Specifications and Coordinate Relations
Force and torque signals are zeroed and normalized for uniformity across demonstrations. The coordinate mapping between end-effector wrenches and joint torques follows the standard Jacobian transpose law:
$$\tau = J^\top F$$
where $J$ is the manipulator Jacobian and $F$ is the spatial wrench. This enables platform-agnostic interpretation and facilitates policy transfer to other robot arms with known kinematics, assuming access to $J$ and compatible sensor streams.
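The Jacobian transpose relation can be checked on a small example. The sketch below uses a planar two-link arm with a two-dimensional force, which is a hypothetical stand-in chosen for brevity. A real manipulator would use its full 6×n Jacobian with the 6-axis wrench.

```python
import math


def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """2x2 position Jacobian of a planar two-link arm (hypothetical example arm)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [
        [-l1 * s1 - l2 * s12, -l2 * s12],
        [l1 * c1 + l2 * c12, l2 * c12],
    ]


def joint_torques(jacobian, wrench):
    """Compute tau = J^T F for a Jacobian given as a list of rows."""
    n_joints = len(jacobian[0])
    return [
        sum(jacobian[row][joint] * wrench[row] for row in range(len(wrench)))
        for joint in range(n_joints)
    ]


# Elbow bent 90 degrees: the end-effector sits at (1, 1) and joint 2 at (1, 0).
J = planar_2link_jacobian(0.0, math.pi / 2)
tau = joint_torques(J, [0.0, 1.0])  # unit force along +y at the end-effector
```

Here the force's line of action passes through the second joint, so `tau[1]` is zero up to rounding, while `tau[0]` equals the moment of the force about the base (a lever arm of 1 times a unit force).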
6. Benchmarks, Evaluation, and Baselines
The published experiments do not prescribe a fixed train/val/test split; all 1,000 trajectories are used for offline policy learning, with evaluation performed on 20 independent real-robot trials per task. The main evaluation metric is success rate (%) by task, defined as completion of the nominal objective under specified constraints. Baseline results across all tasks are as follows:
| Model Variant | Success Rate (%) |
|---|---|
| Base policy (position-only) | 18.0 |
| Base policy (co-trained) | 31.0 |
| ACP (admittance) | 16.0 |
| Base policy + force concat | 17.0 |
| ForceVLA | 35.0 |
| ForceVLA2 | 66.0 |
Task-level success rates show consistent, substantial gains for ForceVLA2 over the base policy in all scenarios (e.g., "assemble gears": 70% vs. 0%; "retrieve plate": 35% vs. 0%). This improvement is attributed to closed-loop hybrid force–position control and explicit force-aware concept fusion enabled by the dataset's structure and annotation (Li et al., 16 Mar 2026).
7. Significance and Empirical Impact
ForceVLA2-Dataset provides a thorough empirical foundation for training and evaluating force-aware VLA models in contact-rich settings. By offering synchronized, richly annotated multimodal streams—including stage-level force prompts, visually and mechanically segmented subtasks, and complete proprioceptive/force logs—it enables systematic study of hybrid control, subtask transition modeling, and multi-stage manipulation. The explicit per-sample alignment and hardware-agnostic mapping further support transfer to diverse robotic platforms.
The adoption of ForceVLA2-Dataset has advanced the empirical study of physically intelligent agents, directly facilitating improvements in stability, precision, robustness, and reliability of closed-loop manipulation, while mitigating failure modes such as arm overload and unstable contact that remain problematic for position-only and naïvely augmented architectures (Li et al., 16 Mar 2026).