
ForceVLA2-Dataset: Hybrid Force-Position Control

Updated 3 July 2026
  • The dataset provides synchronized multi-modal streams (RGB, proprioception, force-torque) for robust hybrid force–position control.
  • It comprises 1,000 real-robot demonstration trajectories with detailed subtask annotations and stage-level force-aware prompts.
  • It supports evaluation of force-aware VLA models: ForceVLA2 reaches a 66% success rate across tasks, compared with 18% for a position-only baseline.

ForceVLA2-Dataset is a multi-modal, real-robot dataset expressly constructed for end-to-end learning of hybrid force–position control in contact-rich manipulation tasks. It provides synchronized streams of multi-view RGB images, proprioceptive states, force-torque measurements, comprehensive subtask annotations, and stage-level force-aware prompts, supporting the training and evaluation of force-aware vision-language-action (VLA) models. The dataset was introduced alongside the ForceVLA2 framework to address limitations in position-only or naïvely force-augmented VLA architectures, establishing a foundation for robust, closed-loop physical intelligence in complex manipulation scenarios (Li et al., 16 Mar 2026).

1. Dataset Composition and Scope

ForceVLA2-Dataset comprises 1,000 real-robot demonstration trajectories, totaling approximately 500,000 time-steps. Each demonstration captures the complete execution of a contact-rich task, segmented into 3–5 discrete stages (subtasks) based on salient transitions in force-torque signals and visual inspection. The five canonical tasks included are:

  • Press bottle (vertical actuation of a pump)
  • Clean vase (surface wiping to remove stains)
  • Clean board (erasing chalk from a blackboard)
  • Retrieve plate (searching and grasping a plate occluded within a foam sandbox)
  • Assemble gears (aligning and inserting mechanical gears)

Each task is deliberately designed to require significant contact dynamics and hybrid force–position control. Every demonstration in the dataset is a successful human-teleoperated run; failed trials are not included in the dataset but are addressed in evaluation of learned policies.

2. Sensor Modalities and Data Acquisition

The dataset integrates multiple sensory streams synchronized via a common clock:

  • Visual. Three RGB camera streams (no depth): two static third-person views (Intel RealSense D455, 1280×720 px @ 30 Hz) and a wrist-mounted egocentric perspective (Intel RealSense D435, 640×480 px @ 30 Hz). Each frame is preprocessed (resized to 480×640, pixel-value normalized) and precisely timestamp-aligned.
  • Proprioceptive. End-effector 6D pose $p(t) \in \mathbb{R}^6$ (Cartesian position and orientation as quaternion), logged at 300 Hz, with joint positions/velocities recorded synchronously.
  • Force/Torque. A 6-axis force-torque sensor at the end-effector flange, capturing $f_{\mathrm{raw}}(t) \in \mathbb{R}^6$ (channels: $[F_x, F_y, F_z, \tau_x, \tau_y, \tau_z]$) at 300 Hz. Force signals are bias-corrected and normalized per axis (±100 N for forces, ±15 N·m for torques).
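
The bias correction and per-axis normalization described for the force-torque stream can be sketched as follows. This is a minimal illustration, not the dataset's released preprocessing: it assumes the bias is estimated as the mean of the first few samples (taken before contact), that normalization means dividing by the stated full-scale range, and that out-of-range values are clipped. The function name `preprocess_wrench` and the parameter `n_bias_samples` are hypothetical.

```python
import numpy as np

# Full-scale ranges stated in the text: ±100 N for forces, ±15 N·m for torques.
FULL_SCALE = np.array([100.0, 100.0, 100.0, 15.0, 15.0, 15.0])

def preprocess_wrench(raw, n_bias_samples=50):
    """Bias-correct and normalize an (N, 6) wrench array to [-1, 1].

    Assumptions (not confirmed by the source): the bias is the mean of the
    first `n_bias_samples` rows, recorded without contact, and values
    beyond full scale are clipped.
    """
    raw = np.asarray(raw, dtype=float)
    bias = raw[:n_bias_samples].mean(axis=0)      # per-channel sensor offset
    return np.clip((raw - bias) / FULL_SCALE, -1.0, 1.0)
```

With this scaling, a 50 N change along one force axis maps to 0.5 and a torque change larger than 15 N·m saturates at 1.0.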

All streams are timestamped and may be interpolated onto a shared grid (typically matching image frames) to ensure sample-accurate multimodal alignment.
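
One way to place a 300 Hz sensor stream on the 30 Hz image grid is per-channel linear interpolation. The sketch below uses synthetic data and a hypothetical helper name (`resample_to_frames`); the interpolation scheme is an assumption, since the source does not say which one is used.

```python
import numpy as np

def resample_to_frames(sensor_t, sensor_values, frame_t):
    """Linearly interpolate an (N, C) sensor stream onto image timestamps.

    `sensor_t` must be increasing; all times share one clock and unit.
    """
    sensor_values = np.asarray(sensor_values, dtype=float)
    return np.stack(
        [np.interp(frame_t, sensor_t, sensor_values[:, c])
         for c in range(sensor_values.shape[1])],
        axis=1,
    )

# One second of synthetic data: 300 Hz force-torque, 30 Hz camera frames.
sensor_t = np.arange(300) / 300.0
frame_t = np.arange(30) / 30.0
wrench = np.outer(sensor_t, np.ones(6))   # each channel ramps linearly with time
aligned = resample_to_frames(sensor_t, wrench, frame_t)
print(aligned.shape)  # (30, 6)
```

Linear interpolation is reasonable for forces, torques, and Cartesian positions. Quaternion orientations would need spherical interpolation instead, which this sketch does not cover.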

3. Structure, Organization, and Annotation

Data for each trajectory is stored in its own directory. The contents include:

  • images/: Synchronized RGB frames for each camera as PNGs, indexed by timestamp.
  • sensors/:
    • state.csv: Contains timestamp, Cartesian pose, quaternion orientation, and all joint angles.
    • force.csv: Timestamped 6-axis force-torque readings.
  • annotations/:
    • task_prompt.txt: Natural-language description of the primary task.
    • force_prompts.json: Per-stage textual cues encoding current subtask and expected contact (used for prompt engineering with the VLM expert).
    • subtask_labels.csv: Start/end timestamps for each stage/subtask.

Timestamps are UNIX-style or relative in milliseconds, supporting direct cross-modal association.
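
Because every file carries timestamps, a sample from any stream can be assigned to a stage by looking up its timestamp in the subtask intervals. The sketch below parses interval rows of the kind stored in subtask_labels.csv. The column names `stage_id`, `start_ms`, and `end_ms` are hypothetical (the source does not list them), and the example rows are made up for illustration.

```python
import csv
import io

def load_stages(csv_text):
    """Parse subtask intervals from CSV text into (stage_id, start, end) tuples."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(int(r["stage_id"]), float(r["start_ms"]), float(r["end_ms"]))
            for r in reader]

def stage_at(stages, t_ms):
    """Return the stage whose [start, end) interval contains t_ms, else None."""
    for stage_id, start, end in stages:
        if start <= t_ms < end:
            return stage_id
    return None

# Hypothetical file contents; real column names and values may differ.
example = "stage_id,start_ms,end_ms\n0,0,1200\n1,1200,3400\n2,3400,5000\n"
stages = load_stages(example)
print(stage_at(stages, 2000.0))  # 1
```

The same lookup works for image frames, state rows, and force rows, provided all timestamps use the same reference (absolute or relative).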

4. Annotation Methodology and Progress Metrics

Subtasks are annotated by transitions in the norm of the raw force signal $\|f_{\mathrm{raw}}(t)\|$, corroborated by visual cues. For each stage, discrete stage IDs (typically 3–5 per trajectory) specify intervals of coherent contact dynamics, such as "approach," "make contact," "manipulate," and "retract."
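
As a rough illustration of segmenting on force-norm transitions, the sketch below flags the samples where the wrench norm crosses a fixed threshold. This is a simple stand-in, not the annotation procedure used for the dataset, which also relies on visual inspection. The function name `contact_transitions` and the threshold value are hypothetical.

```python
import numpy as np

def contact_transitions(wrench, threshold):
    """Return (index, kind) pairs where the wrench norm crosses `threshold`.

    `wrench` is an (N, 6) array. The norm is taken over all six channels,
    following the text; `kind` is "onset" or "release".
    """
    norm = np.linalg.norm(np.asarray(wrench, dtype=float), axis=1)
    in_contact = norm > threshold
    change = np.flatnonzero(np.diff(in_contact.astype(int)))
    return [(int(i) + 1, "onset" if in_contact[i + 1] else "release")
            for i in change]

# Synthetic trace: no contact, 5 N along z for ten samples, then no contact.
trace = np.zeros((30, 6))
trace[10:20, 2] = 5.0
print(contact_transitions(trace, threshold=1.0))  # [(10, 'onset'), (20, 'release')]
```

On real data the norm would usually be smoothed first, since sensor noise near the threshold produces spurious crossings.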

A probabilistic model is employed to quantify subtask progress. Let

  • $\Theta = 1 - \dfrac{|\langle E_{\mathrm{current}}, E_{\mathrm{target}} \rangle|}{\|E_{\mathrm{current}}\| \, \|E_{\mathrm{target}}\|} \in (0,1)$ (directional deviation)
  • $L = \|p_{\mathrm{target}} - p_{\mathrm{current}}\| \in [0, \infty)$ (spatial distance)
  • $F = \|f_{\mathrm{raw}}\| \in [n, m]$ (force envelope)

Assuming $\Theta \sim \operatorname{Beta}(\alpha, 1)$, $L \sim \operatorname{Exp}(\lambda)$, and $F \sim \operatorname{Uniform}(n, m)$, the stage-completion probability is defined over the joint event of these three quantities. Its closed form, and the parameter values used in the released implementation, are given in [(Li et al., 16 Mar 2026), Eq. 10]. This probabilistic definition supports both annotation (offline segmentation) and real-time learning of progress-aware policies.
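
The three progress quantities can be computed directly from their definitions, and each assumed distribution has a standard cumulative distribution function. The sketch below shows both. It assumes the rate parameterization of the exponential distribution, and all parameter values in the example are invented. How the marginals are combined into the stage-completion probability is specified by Eq. 10 of the paper and is deliberately not reproduced here.

```python
import numpy as np

def progress_features(e_current, e_target, p_current, p_target, f_raw):
    """Compute (Theta, L, F) as defined in the text."""
    e_current = np.asarray(e_current, dtype=float)
    e_target = np.asarray(e_target, dtype=float)
    cos = abs(e_current @ e_target) / (
        np.linalg.norm(e_current) * np.linalg.norm(e_target))
    theta = 1.0 - cos                                  # directional deviation
    dist = np.linalg.norm(np.asarray(p_target, dtype=float)
                          - np.asarray(p_current, dtype=float))  # spatial distance
    force = np.linalg.norm(np.asarray(f_raw, dtype=float))       # force envelope
    return theta, dist, force

def marginal_cdfs(theta, dist, force, alpha, lam, n, m):
    """Standard CDFs of Beta(alpha, 1), Exp(rate=lam), and Uniform(n, m)."""
    cdf_theta = theta ** alpha
    cdf_dist = 1.0 - np.exp(-lam * dist)
    cdf_force = np.clip((force - n) / (m - n), 0.0, 1.0)
    return cdf_theta, cdf_dist, cdf_force

# Illustrative inputs and parameters only; none come from the paper.
theta, dist, force = progress_features(
    [1, 0, 0], [1, 1, 0], [0, 0, 0], [3, 4, 0], [0, 0, 6, 0, 0, 8])
cdfs = marginal_cdfs(theta, dist, force, alpha=2.0, lam=0.5, n=0.0, m=20.0)
```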

5. Technical Specifications and Coordinate Relations

Force and torque signals are zeroed and normalized for uniformity across demonstrations. The coordinate mapping between end-effector wrenches and joint torques follows the standard Jacobian transpose law:

$$\tau_{\mathrm{joint}} = J(q)^{\top} w$$

where $J(q)$ is the manipulator Jacobian and $w$ is the spatial wrench. This enables platform-agnostic interpretation and facilitates policy transfer to other robot arms with known kinematics, assuming access to $J(q)$ and compatible sensor streams.
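
In its standard form, the Jacobian transpose law maps a 6-D end-effector wrench $w$ to joint torques as $J(q)^{\top} w$. The sketch below applies it with a random placeholder Jacobian for a 7-joint arm (a real Jacobian would come from the robot's kinematic model) and verifies the virtual-work identity that underlies the law. The symbol names and the joint count are illustrative assumptions.

```python
import numpy as np

def joint_torques(jacobian, wrench):
    """Map a 6-D end-effector wrench to joint torques via tau = J^T w."""
    jacobian = np.asarray(jacobian, dtype=float)   # shape (6, n_joints)
    wrench = np.asarray(wrench, dtype=float)       # shape (6,)
    return jacobian.T @ wrench

# Placeholder Jacobian for a 7-joint arm; not taken from any real robot.
rng = np.random.default_rng(0)
J = rng.standard_normal((6, 7))
w = np.array([0.0, 0.0, -10.0, 0.0, 0.0, 0.5])
tau = joint_torques(J, w)

# Virtual-work check: joint power equals end-effector power for any joint velocity.
q_dot = rng.standard_normal(7)
print(np.isclose(tau @ q_dot, w @ (J @ q_dot)))  # True
```

The wrench and the Jacobian must be expressed in the same frame. A wrench measured in the sensor frame has to be transformed first if the Jacobian is given in the base frame.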

6. Benchmarks, Evaluation, and Baselines

The published experiments do not prescribe a fixed train/val/test split; all 1,000 trajectories are used for offline policy learning, with evaluation performed on 20 independent real-robot trials per task. The main evaluation metric is success rate (%) by task, defined as completion of the nominal objective under specified constraints. Baseline results across all tasks are as follows:

| Model Variant | Success Rate (%) |
|---|---|
| Position-only baseline | 18.0 |
| Co-trained baseline | 31.0 |
| ACP (admittance) | 16.0 |
| Baseline + force concat | 17.0 |
| ForceVLA | 35.0 |
| ForceVLA2 | 66.0 |

Task-level rates for ForceVLA2 compared with the baseline indicate consistent, substantial gains in all scenarios (e.g., "assemble gears": 70% vs. 0%, "retrieve plate": 35% vs. 0%). This improvement is attributed to closed-loop hybrid force–position control and explicit force-aware concept fusion enabled by the dataset's structure and annotation (Li et al., 16 Mar 2026).
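
For reference, a per-task success rate under the 20-trial protocol is the percentage of successful trials. The sketch below is a minimal illustration with a hypothetical helper name; the example outcome list is constructed to match the reported 70% for "assemble gears" and is not actual trial data.

```python
def success_rate(outcomes):
    """Percentage of successful trials in a list of booleans."""
    if not outcomes:
        raise ValueError("no trials recorded")
    return 100.0 * sum(outcomes) / len(outcomes)

# 20 trials per task, as in the evaluation protocol; 14 successes give 70%.
trials = [True] * 14 + [False] * 6
print(success_rate(trials))  # 70.0
```

With 20 trials, each additional success changes the rate by 5 percentage points, which sets the resolution of the per-task comparisons.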

7. Significance and Empirical Impact

ForceVLA2-Dataset provides a thorough empirical foundation for training and evaluating force-aware VLA models in contact-rich settings. By offering synchronized, richly annotated multimodal streams—including stage-level force prompts, visually and mechanically segmented subtasks, and complete proprioceptive/force logs—it enables systematic study of hybrid control, subtask transition modeling, and multi-stage manipulation. The explicit per-sample alignment and hardware-agnostic mapping further support transfer to diverse robotic platforms.

The adoption of ForceVLA2-Dataset has advanced the empirical study of physically intelligent agents, directly facilitating improvements in stability, precision, robustness, and reliability of closed-loop manipulation, while mitigating failure modes such as arm overload and unstable contact that remain problematic for position-only and naïvely augmented architectures (Li et al., 16 Mar 2026).
