ForceVLA2-Dataset: Hybrid Force-Position Control

Updated 3 July 2026

The dataset provides synchronized multi-modal streams (RGB, proprioception, force-torque) for robust hybrid force–position control.
It comprises 1,000 real-robot demonstration trajectories with detailed subtask annotations and stage-level force-aware prompts.
In the reported evaluation, ForceVLA2 reaches a 66% success rate across tasks, compared with 18% for the position-only baseline.

ForceVLA2-Dataset is a multi-modal, real-robot dataset expressly constructed for end-to-end learning of hybrid force–position control in contact-rich manipulation tasks. It provides synchronized streams of multi-view RGB images, proprioceptive states, force-torque measurements, comprehensive subtask annotations, and stage-level force-aware prompts, supporting the training and evaluation of force-aware vision-language-action (VLA) models. The dataset was introduced alongside the ForceVLA2 framework to address limitations in position-only or naïvely force-augmented VLA architectures, establishing a foundation for robust, closed-loop physical intelligence in complex manipulation scenarios (Li et al., 16 Mar 2026).

1. Dataset Composition and Scope

ForceVLA2-Dataset comprises 1,000 real-robot demonstration trajectories, totaling approximately 500,000 time-steps. Each demonstration captures the complete execution of a contact-rich task, segmented into 3–5 discrete stages (subtasks) based on salient transitions in force-torque signals and visual inspection. The five canonical tasks included are:

Press bottle (vertical actuation of a pump)
Clean vase (surface wiping to remove stains)
Clean board (erasing chalk from a blackboard)
Retrieve plate (searching and grasping a plate occluded within a foam sandbox)
Assemble gears (aligning and inserting mechanical gears)

Each task is deliberately designed to require significant contact dynamics and hybrid force–position control. Every demonstration in the dataset is a successful human-teleoperated run; failed trials are not included in the dataset but are addressed in evaluation of learned policies.

2. Sensor Modalities and Data Acquisition

The dataset integrates multiple sensory streams synchronized via a common clock:

Visual. Three RGB camera streams (no depth): two static third-person views (Intel RealSense D455, 1280×720 px @ 30 Hz) and a wrist-mounted egocentric perspective (Intel RealSense D435, 640×480 px @ 30 Hz). Each frame is preprocessed (resized to 480×640, pixel-value normalized) and precisely timestamp-aligned.
Proprioceptive. End-effector 6D pose $p(t) \in \mathbb{R}^6$ (Cartesian position and orientation as quaternion), logged at 300 Hz, with joint positions/velocities recorded synchronously.
Force/Torque. A 6-axis force-torque sensor at the end-effector flange, capturing $f_{\mathrm{raw}}(t) \in \mathbb{R}^6$ (channels: $[F_x, F_y, F_z, \tau_x, \tau_y, \tau_z]$ ) at 300 Hz. Force signals are bias-corrected and normalized per axis (±100 N for forces, ±15 N·m for torques).
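
The bias correction and per-axis normalization of the force-torque channel can be sketched as follows. This is a minimal illustration, assuming the bias is estimated as the mean of a few contact-free samples and that normalization means dividing by the stated full-scale ranges (100 N, 15 N·m); the function names, the bias-estimation method, and the clipping to [-1, 1] are assumptions, not the dataset's released preprocessing.

```python
from statistics import fmean

# Full-scale ranges from the text: ±100 N for forces, ±15 N·m for torques.
FULL_SCALE = (100.0, 100.0, 100.0, 15.0, 15.0, 15.0)


def estimate_bias(free_space_samples):
    """Per-axis mean of wrench samples recorded without contact (assumed method)."""
    return tuple(fmean(axis) for axis in zip(*free_space_samples))


def normalize_wrench(raw, bias):
    """Subtract the bias, scale each axis by its full-scale range, clip to [-1, 1]."""
    out = []
    for value, offset, scale in zip(raw, bias, FULL_SCALE):
        scaled = (value - offset) / scale
        out.append(max(-1.0, min(1.0, scaled)))  # clipping is an assumption
    return tuple(out)


# Made-up readings in [Fx, Fy, Fz, tau_x, tau_y, tau_z] order.
bias = estimate_bias([
    (0.4, -0.2, 1.0, 0.00, 0.03, 0.0),
    (0.6, -0.2, 1.0, 0.02, 0.03, 0.0),
])
normalized = normalize_wrench((10.5, -0.2, -49.0, 1.51, 0.03, 30.0), bias)
```

In this toy input the last channel (30 N·m) exceeds the ±15 N·m range and is therefore clipped to 1.0.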

All streams are timestamped and may be interpolated onto a shared grid (typically matching image frames) to ensure sample-accurate multimodal alignment.
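
As an illustration of interpolating a 300 Hz stream onto the 30 Hz image grid, the sketch below linearly resamples wrench samples at camera frame timestamps. The function names and toy data are hypothetical, and linear interpolation is one reasonable choice rather than the documented procedure.

```python
from bisect import bisect_right


def interp_channel(t_query, t_src, values):
    """Linearly interpolate one sensor channel at time t_query.

    t_src must be sorted ascending; queries outside the range are clamped
    to the first or last sample.
    """
    if t_query <= t_src[0]:
        return values[0]
    if t_query >= t_src[-1]:
        return values[-1]
    hi = bisect_right(t_src, t_query)
    lo = hi - 1
    w = (t_query - t_src[lo]) / (t_src[hi] - t_src[lo])
    return (1.0 - w) * values[lo] + w * values[hi]


def resample_wrench(frame_times, sensor_times, wrenches):
    """Resample 6-axis wrench samples onto camera frame timestamps."""
    channels = list(zip(*wrenches))  # six per-axis sequences
    return [
        tuple(interp_channel(t, sensor_times, ch) for ch in channels)
        for t in frame_times
    ]


# Toy data: a 300 Hz ramp on F_z, resampled at 30 Hz frame times.
sensor_times = [i / 300.0 for i in range(31)]  # 0.0 .. 0.1 s
wrenches = [(0.0, 0.0, 10.0 * t, 0.0, 0.0, 0.0) for t in sensor_times]
frame_times = [0.0, 1.0 / 30.0, 2.0 / 30.0, 0.1]
aligned = resample_wrench(frame_times, sensor_times, wrenches)
```

Linear interpolation suits forces, torques, and Cartesian positions; quaternion orientations would need spherical interpolation (slerp) instead.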

3. Structure, Organization, and Annotation

Data for each trajectory is stored in its own directory. The contents include:

images/: Synchronized RGB frames for each camera as PNGs, indexed by timestamp.
sensors/:
- state.csv: Contains timestamp, Cartesian pose, quaternion orientation, and all joint angles.
- force.csv: Timestamped 6-axis force-torque readings.
annotations/:
- task_prompt.txt: Natural-language description of the primary task.
- force_prompts.json: Per-stage textual cues encoding current subtask and expected contact (used for prompt engineering with the VLM expert).
- subtask_labels.csv: Start/end timestamps for each stage/subtask.

Timestamps are UNIX-style or relative in milliseconds, supporting direct cross-modal association.
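
Cross-modal association by timestamp can be sketched as a lookup from a frame or sensor timestamp to the stage that contains it. The column names `stage_id`, `start_ms`, and `end_ms` and the sample rows are hypothetical; the actual layout of subtask_labels.csv may differ.

```python
import csv
import io
from bisect import bisect_right

# Hypothetical contents of subtask_labels.csv; real column names may differ.
SAMPLE_LABELS = """stage_id,start_ms,end_ms
0,0,1200
1,1200,4800
2,4800,6000
"""


def load_stages(text):
    """Parse stage intervals into (start_ms, end_ms, stage_id) tuples sorted by start."""
    rows = csv.DictReader(io.StringIO(text))
    stages = [(int(r["start_ms"]), int(r["end_ms"]), int(r["stage_id"])) for r in rows]
    return sorted(stages)


def stage_at(stages, t_ms):
    """Return the stage ID whose [start, end) interval contains t_ms, else None."""
    starts = [s[0] for s in stages]
    idx = bisect_right(starts, t_ms) - 1
    if idx >= 0 and t_ms < stages[idx][1]:
        return stages[idx][2]
    return None


stages = load_stages(SAMPLE_LABELS)
frame_stage_ids = [stage_at(stages, t) for t in (0, 1199, 1200, 5999, 6000)]
# frame_stage_ids == [0, 0, 1, 2, None]
```

Treating intervals as half-open avoids assigning a boundary timestamp to two stages at once.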

4. Annotation Methodology and Progress Metrics

Subtasks are annotated by transitions in the norm of the raw force signal $\|f_{\mathrm{raw}}(t)\|$ , corroborated by visual cues. For each stage, discrete stage IDs (typically 3–5 per trajectory) specify intervals of coherent contact dynamics, such as "approach," "make contact," "manipulate," and "retract."
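
One way to propose candidate stage boundaries from the force norm is a threshold with hysteresis, sketched below. The thresholds, event names, and toy readings are hypothetical, and the annotation described above additionally relies on visual confirmation, which this sketch does not model.

```python
import math


def wrench_norm(wrench):
    """Euclidean norm over all six channels, matching ||f_raw(t)|| in the text."""
    return math.sqrt(sum(c * c for c in wrench))


def contact_transitions(norms, on_threshold=5.0, off_threshold=2.0):
    """Return (index, event) pairs where the norm crosses the hysteresis band.

    Contact starts when the norm rises above on_threshold and ends when it
    falls below off_threshold; both thresholds are hypothetical values.
    """
    events = []
    in_contact = False
    for i, value in enumerate(norms):
        if not in_contact and value > on_threshold:
            in_contact = True
            events.append((i, "contact_start"))
        elif in_contact and value < off_threshold:
            in_contact = False
            events.append((i, "contact_end"))
    return events


wrenches = [
    (0.0, 0.0, 0.5, 0.0, 0.0, 0.0),
    (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),
    (0.0, 0.0, 6.0, 0.0, 0.0, 0.0),  # rises above on_threshold
    (3.0, 0.0, 4.0, 0.0, 0.0, 0.0),  # norm 5.0, still in contact
    (0.0, 0.0, 3.0, 0.0, 0.0, 0.0),  # inside the band, no event
    (0.0, 0.0, 1.0, 0.0, 0.0, 0.0),  # falls below off_threshold
]
norms = [wrench_norm(w) for w in wrenches]
events = contact_transitions(norms)
# events == [(2, "contact_start"), (5, "contact_end")]
```

The two-threshold band keeps sensor noise near a single threshold from producing spurious start/end pairs.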

A probabilistic model is employed to quantify subtask progress in terms of three quantities:

$\Theta = (1 - |\langle E_{\mathrm{current}}, E_{\mathrm{target}} \rangle| / (\|E_{\mathrm{current}}\| \|E_{\mathrm{target}}\|)) \in (0,1)$ (directional deviation)
$L = \|p_{\mathrm{target}} - p_{\mathrm{current}}\| \in [0, \infty)$ (spatial distance)
$F = \|f_{\mathrm{raw}}\| \in [n, m]$ (force envelope)

Assuming $\Theta \sim \operatorname{Beta}(\alpha, 1)$, $L \sim \operatorname{Exp}(\lambda)$, and $F \sim \operatorname{Uniform}(n, m)$, the stage-completion probability is defined through the joint event over these three quantities [(Li et al., 16 Mar 2026), Eq. 10], with parameter values set in the released implementation. This probabilistic definition supports both annotation (offline segmentation) and real-time learning of progress-aware policies.
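
The three progress quantities can be computed directly from their definitions, as in the sketch below. The `completion_score` function is purely illustrative and is not the paper's Eq. 10: it assumes the three variables are independent, uses the survival functions of Beta(α, 1) and Exp(λ), and gates on the force envelope. All parameter values are hypothetical.

```python
import math


def norm(v):
    """Euclidean norm of a sequence of numbers."""
    return math.sqrt(sum(c * c for c in v))


def directional_deviation(e_current, e_target):
    """Theta = 1 - |<E_current, E_target>| / (|E_current| |E_target|)."""
    dot = sum(a * b for a, b in zip(e_current, e_target))
    return 1.0 - abs(dot) / (norm(e_current) * norm(e_target))


def spatial_distance(p_current, p_target):
    """L = |p_target - p_current|."""
    return norm([t - c for c, t in zip(p_current, p_target)])


def completion_score(theta, dist, force, alpha, lam, n, m):
    """Illustrative score only (NOT the paper's Eq. 10).

    Assumes independence and multiplies P(Theta >= theta) under Beta(alpha, 1),
    P(L >= dist) under Exp(lam), and an indicator that force lies in [n, m].
    """
    p_theta = 1.0 - theta ** alpha   # survival function of Beta(alpha, 1)
    p_dist = math.exp(-lam * dist)   # survival function of Exp(lam)
    in_envelope = 1.0 if n <= force <= m else 0.0
    return p_theta * p_dist * in_envelope


theta = directional_deviation((1.0, 0.0, 0.0), (1.0, 0.0, 0.0))  # aligned
dist = spatial_distance((0.0, 0.0, 0.0), (0.0, 0.0, 0.0))        # at target
score_done = completion_score(theta, dist, force=8.0, alpha=2.0, lam=1.0, n=5.0, m=20.0)
score_far = completion_score(0.5, 2.0, force=8.0, alpha=2.0, lam=1.0, n=5.0, m=20.0)
```

Under these assumptions the score equals 1 when the end-effector is aligned, at the target, and inside the force envelope, and it decays as the deviation or distance grows.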

5. Technical Specifications and Coordinate Relations

Force and torque signals are zeroed and normalized for uniformity across demonstrations. The coordinate mapping between end-effector wrenches and joint torques follows the standard Jacobian transpose law:

$\tau = J(q)^{\top} w$

where $J(q)$ is the manipulator Jacobian and $w \in \mathbb{R}^6$ is the spatial wrench. This enables platform-agnostic interpretation and facilitates policy transfer to other robot arms with known kinematics, assuming access to $J(q)$ and compatible sensor streams.
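
The Jacobian transpose mapping amounts to one matrix-vector product: the joint torques are the transposed Jacobian applied to the end-effector wrench. The sketch below uses plain nested lists and a made-up 6 × 2 Jacobian; the function name and the numbers are hypothetical and do not describe the robot used for the dataset.

```python
def joint_torques(jacobian, wrench):
    """Map a 6-D end-effector wrench to joint torques: tau = J^T w.

    jacobian is a 6 x n nested list (rows: wrench axes, columns: joints);
    wrench is [Fx, Fy, Fz, tau_x, tau_y, tau_z] in the Jacobian's frame.
    """
    if len(jacobian) != len(wrench):
        raise ValueError("Jacobian must have one row per wrench component")
    n_joints = len(jacobian[0])
    return [
        sum(jacobian[i][j] * wrench[i] for i in range(len(wrench)))
        for j in range(n_joints)
    ]


# Toy 6 x 2 Jacobian (made-up numbers, not a real arm).
J = [
    [0.0, 0.0],
    [0.5, 0.2],
    [0.0, 0.0],
    [0.0, 0.0],
    [0.0, 0.0],
    [1.0, 1.0],
]
w = [0.0, 10.0, 0.0, 0.0, 0.0, 2.0]
tau = joint_torques(J, w)  # one torque per joint
```

The wrench and the Jacobian must be expressed in the same reference frame, so a sensor-frame reading generally needs to be transformed before this product is taken.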

6. Benchmarks, Evaluation, and Baselines

The published experiments do not prescribe a fixed train/val/test split; all 1,000 trajectories are used for offline policy learning, with evaluation performed on 20 independent real-robot trials per task. The main evaluation metric is success rate (%) by task, defined as completion of the nominal objective under specified constraints. Baseline results across all tasks are as follows:

Model Variant	Success Rate (%)
VLA baseline (position-only)	18.0
VLA baseline (co-trained)	31.0
ACP (admittance)	16.0
VLA baseline + force concat	17.0
ForceVLA	35.0
ForceVLA2	66.0

Task-level rates for ForceVLA2 compared to the baseline VLA model indicate consistent, substantial gains in all scenarios (e.g., "assemble gears": 70% vs. 0%, "retrieve plate": 35% vs. 0%). This improvement is attributed to closed-loop hybrid force–position control and explicit force-aware concept fusion enabled by the dataset's structure and annotation (Li et al., 16 Mar 2026).
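
The success-rate metric is simple arithmetic over the 20 trials per task: 70% corresponds to 14 successes out of 20, and 35% to 7 out of 20. The sketch below shows the computation on hypothetical trial logs; the unweighted mean across tasks is an assumed aggregation, not one confirmed by the source.

```python
def success_rate(outcomes):
    """Percentage of successful trials in a list of booleans."""
    return 100.0 * sum(outcomes) / len(outcomes)


def mean_success_rate(per_task_outcomes):
    """Unweighted mean of per-task success rates (an assumed aggregation)."""
    rates = [success_rate(o) for o in per_task_outcomes.values()]
    return sum(rates) / len(rates)


# Hypothetical trial logs with 20 trials per task, as in the evaluation protocol.
trials = {
    "assemble gears": [True] * 14 + [False] * 6,  # 14/20 = 70%
    "retrieve plate": [True] * 7 + [False] * 13,  # 7/20 = 35%
}
rates = {task: success_rate(o) for task, o in trials.items()}
average = mean_success_rate(trials)
```

With 20 trials, each task-level rate moves in steps of 5 percentage points, so small differences between models on a single task should be read with that resolution in mind.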

7. Significance and Empirical Impact

ForceVLA2-Dataset provides a thorough empirical foundation for training and evaluating force-aware VLA models in contact-rich settings. By offering synchronized, richly annotated multimodal streams—including stage-level force prompts, visually and mechanically segmented subtasks, and complete proprioceptive/force logs—it enables systematic study of hybrid control, subtask transition modeling, and multi-stage manipulation. The explicit per-sample alignment and hardware-agnostic mapping further support transfer to diverse robotic platforms.

The adoption of ForceVLA2-Dataset has advanced the empirical study of physically intelligent agents, directly facilitating improvements in stability, precision, robustness, and reliability of closed-loop manipulation, while mitigating failure modes such as arm overload and unstable contact that remain problematic for position-only and naïvely augmented architectures (Li et al., 16 Mar 2026).


References

Li et al. (16 Mar 2026). ForceVLA2: Unleashing Hybrid Force-Position Control with Force Awareness for Contact-Rich Manipulation.
