ARCTIC: Bimanual Interaction Dataset
- ARCTIC is a large-scale dataset capturing dexterous two-hand manipulation of articulated, real objects with exhaustive 3D annotations.
- It comprises 2.1 million video frames from multiple RGB views, paired with high-fidelity MoCap data detailing hand, body, and object states.
- The dataset supports the benchmark tasks of Consistent Motion Reconstruction and Interaction Field Estimation, with quantitative metrics for dynamic interaction analysis.
The ARCTIC dataset is a large-scale human hand-object interaction corpus designed for analyzing dexterous bimanual manipulation of articulated objects. It provides a unique multimodal benchmark for spatio-temporally consistent, physically grounded reconstruction of hand–object states and contacts in the context of two-handed, real-object manipulation, with exhaustive 3D annotation at the frame level and detailed metrics for interaction understanding (Fan et al., 2022).
1. Dataset Composition and Statistics
ARCTIC consists of video sequences capturing ten subjects (five female, five male) as they interact with everyday articulated objects using both hands. The capture system records synchronized multi-view RGB and high-fidelity motion capture (MoCap) data. In total, the dataset comprises approximately 2.1 million video frames distributed over 339 interaction sequences (∼34 per subject) and covering 11 distinct object classes, including notebook, box, espresso machine, waffle iron, laptop, phone, capsule machine, mixer, ketchup bottle, scissors, and microwave.
Two interaction intents are defined per object: “Use”, in which subjects articulate the object, and “Grasp”, in which they hold the object without articulating it. The frame breakdown by intent and object is summarized as follows:
| Object | “Use” (k frames) | “Grasp” (k frames) | Total (k frames) |
|---|---|---|---|
| Notebook | 163 | 27 | 190 |
| Box | 152 | 36 | 187 |
| Scissors | 128 | 44 | 172 |
| Microwave | 144 | 43 | 187 |
| ... | ... | ... | ... |
| Total | 1,700 | 400 | 2,100 |
Video data is captured at 60 Hz from eight static (allocentric/third-person) RGB views plus a moving egocentric (first-person) view, with a native image resolution of 2800×2000 px. Each RGB frame is precisely aligned with 3D marker measurements from a 54-camera Vicon MoCap system, providing per-frame 3D annotation.
2. 3D Annotation Pipeline
The annotation pipeline leverages marker-based MoCap and mesh-model fitting for joint 3D estimation of full-body, hand, and object states, as well as contact.
Hand and Body Modeling:
Hand articulation is parameterized using the MANO model, with pose parameters $\theta \in \mathbb{R}^{48}$ and shape parameters $\beta \in \mathbb{R}^{10}$, producing hand meshes $V_h = \mathcal{M}(\theta, \beta) \in \mathbb{R}^{778 \times 3}$. Full-body pose is parameterized using the SMPL-X model.
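As a concrete illustration of the MANO parameterization, the sketch below poses a hand mesh with the open-source `smplx` Python package; the model path and zero-valued inputs are illustrative, and the MANO model files must be obtained separately.

```python
# Minimal sketch: posing a MANO hand with the open-source `smplx` package.
# The model path is illustrative; MANO model files are obtained separately.
import torch
import smplx

mano = smplx.create(
    model_path="models",       # directory containing the MANO files (illustrative)
    model_type="mano",
    is_rhand=True,
    use_pca=False,             # use the full 45-D articulation, not a PCA pose space
    batch_size=1,
)

global_orient = torch.zeros(1, 3)   # 3-D global rotation (axis-angle)
hand_pose = torch.zeros(1, 45)      # 45-D per-joint articulation
betas = torch.zeros(1, 10)          # 10-D shape coefficients

out = mano(global_orient=global_orient, hand_pose=hand_pose, betas=betas)
print(out.vertices.shape)           # (1, 778, 3) hand mesh vertices
print(out.joints.shape)             # hand joints, usable for MPJPE-style metrics
```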
Object Representation:
Objects are scanned into watertight mesh models, split into a rigid base and a lid part articulated about a single rotational degree of freedom (1-DoF). Object pose is given by $(\omega, \tau, \alpha)$, with global rotation $\omega \in \mathbb{R}^{3}$ (axis-angle), translation $\tau \in \mathbb{R}^{3}$, and articulation angle $\alpha \in \mathbb{R}$. Object reposing is

$$V_o(\omega, \tau, \alpha) = R(\omega)\, A(\alpha)\, \bar{V}_o + \tau,$$

where $\bar{V}_o$ are the canonical object vertices and $A(\alpha)$ rotates the lid vertices about the estimated articulation axis (acting as the identity on the base).
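A minimal numpy/scipy sketch of this reposing step follows; the array names (`verts`, `lid_mask`, `axis_dir`, `pivot`) are illustrative assumptions, not identifiers from the dataset toolkit.

```python
# Minimal sketch of 1-DoF object reposing, assuming illustrative arrays:
# `verts` (N,3) canonical mesh, `lid_mask` (N,) booleans marking lid vertices,
# and an articulation axis through `pivot` with unit direction `axis_dir`.
import numpy as np
from scipy.spatial.transform import Rotation as R

def repose_object(verts, lid_mask, axis_dir, pivot, alpha, omega, tau):
    """Articulate the lid by angle `alpha`, then apply the global rigid pose."""
    v = verts.copy()
    # 1-DoF articulation: rotate lid vertices about the axis through `pivot`.
    A = R.from_rotvec(alpha * axis_dir).as_matrix()
    v[lid_mask] = (v[lid_mask] - pivot) @ A.T + pivot
    # Global rigid transform: rotation `omega` (axis-angle) and translation `tau`.
    Rg = R.from_rotvec(omega).as_matrix()
    return v @ Rg.T + tau
```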
Marker-to-Mesh Fitting:
Spherical (2 mm radius) markers are placed on the skin, object, and body to align mesh models to 3D MoCap data. Fitting is performed via least-squares minimization against observed marker locations:

$$\min_{\theta,\, \beta}\; \sum_i \left\| m_i - \hat{m}_i(\theta, \beta) \right\|_2^2$$

for body and hand, and

$$\min_{\omega,\, \tau,\, \alpha}\; \sum_i \left\| m_i - \hat{m}_i(\omega, \tau, \alpha) \right\|_2^2$$

for object meshes, where $m_i$ are observed marker positions and $\hat{m}_i(\cdot)$ the corresponding locations predicted from the posed mesh.
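The rigid-object case of such a fit can be realized with `scipy.optimize.least_squares`, as in the sketch below; marker correspondences are assumed given, and the 6-DoF parameterization mirrors the notation above.

```python
# Minimal sketch of marker-to-mesh fitting for the rigid object pose, assuming
# illustrative arrays: `model_pts` (M,3) marker positions on the canonical mesh
# and `obs_pts` (M,3) the corresponding observed MoCap marker positions.
import numpy as np
from scipy.optimize import least_squares
from scipy.spatial.transform import Rotation as R

def residuals(params, model_pts, obs_pts):
    omega, tau = params[:3], params[3:]
    pred = model_pts @ R.from_rotvec(omega).as_matrix().T + tau
    return (pred - obs_pts).ravel()

def fit_rigid_pose(model_pts, obs_pts):
    """Least-squares fit of a 6-DoF rigid pose (axis-angle + translation)."""
    x0 = np.zeros(6)
    sol = least_squares(residuals, x0, args=(model_pts, obs_pts))
    return sol.x[:3], sol.x[3:]   # omega, tau
```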
Axis Estimation:
The axis of object articulation is estimated by calibrating over the range of the moving part, fitting planar circle trajectories to marker paths, projecting to 3D, then recovering axis parameters and articulation angles via nonlinear least squares:

$$\min_{a,\, c,\, \{\alpha_t\}}\; \sum_t \sum_i \left\| R(a, \alpha_t)\left( m_i^{0} - c \right) + c - m_i^{t} \right\|_2^2,$$

where $a$ is the unit axis direction, $c$ a point on the axis, $\alpha_t$ the per-frame articulation angle, and $m_i^{t}$ the markers on the moving part at frame $t$.
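One simple way to realize this procedure for a single marker trajectory is sketched below (plane fit via SVD, then an algebraic in-plane circle fit); the function is an illustrative reconstruction, not the dataset's released code.

```python
# Minimal sketch of articulation-axis estimation from one marker's trajectory
# `traj` (T,3): fit the plane of motion via SVD, then fit a circle in-plane.
# Axis direction = plane normal; axis point = circle center (illustrative).
import numpy as np

def estimate_axis(traj):
    centroid = traj.mean(axis=0)
    # Plane of the circular marker path: normal = smallest singular vector.
    _, _, vt = np.linalg.svd(traj - centroid)
    normal = vt[-1]
    # Project the path into the plane and fit a circle (algebraic least squares:
    # x^2 + y^2 = 2*cx*x + 2*cy*y + d, linear in the unknowns cx, cy, d).
    u, v = vt[0], vt[1]
    xy = np.stack([(traj - centroid) @ u, (traj - centroid) @ v], axis=1)
    A = np.column_stack([2 * xy, np.ones(len(xy))])
    b = (xy ** 2).sum(axis=1)
    (cx, cy, _), *_ = np.linalg.lstsq(A, b, rcond=None)
    center = centroid + cx * u + cy * v   # circle center: a point on the axis
    return normal, center
```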
Contact Annotation:
Contact is defined at the mesh-vertex level using proximity criteria based on the GRAB protocol. Binary labels assign contact for under-shooting (vertex within 3 mm of object) or over-shooting (vertex penetration). Dense signed distances are annotated for each hand/object vertex pair to provide scalar “interaction fields.”
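A minimal sketch of proximity-based contact labeling is given below; it covers only the under-shooting (proximity) case, since detecting penetration requires inside/outside tests against the watertight mesh. Array names are illustrative.

```python
# Minimal sketch of proximity-based contact labeling, assuming illustrative
# arrays `hand_verts` (H,3) and `obj_verts` (O,3) in meters.
import numpy as np
from scipy.spatial import cKDTree

CONTACT_THRESH = 0.003  # 3 mm proximity threshold, as in the GRAB-style protocol

def contact_labels_and_field(hand_verts, obj_verts):
    """Per-hand-vertex shortest distance to the object and binary contact.
    Note: this covers the under-shooting (proximity) case only; over-shooting
    (penetration) needs signed-distance or inside/outside tests."""
    dist, _ = cKDTree(obj_verts).query(hand_verts)  # interaction field (H,)
    in_contact = dist < CONTACT_THRESH              # binary contact labels (H,)
    return in_contact, dist
```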
3. Benchmark Tasks and Evaluation Metrics
ARCTIC supports two benchmark tasks: Consistent Motion Reconstruction and Interaction Field Estimation, both addressing per-frame as well as spatio-temporal consistency.
3.1 Consistent Motion Reconstruction:
The objective is to reconstruct the 3D pose of both MANO hands and the articulated object from a monocular video stream. Metrics include:
- Smoothness prior:
  $$E_{\text{smooth}} = \frac{1}{T-1} \sum_{t=1}^{T-1} \left\| \Theta_{t+1} - \Theta_{t} \right\|_2^2,$$
  where $\Theta_t$ is the concatenated pose (both hands and object) at frame $t$.
- Contact Deviation (CDev):
  $$\text{CDev} = \frac{1}{|C|} \sum_{(i,j) \in C} \left\| \hat{h}_i - \hat{o}_j \right\|_2,$$
  measuring reconstructed hand–object separation over the set $C$ of hand/object vertex pairs in contact in the ground truth.
- Motion Deviation (MDev):
  $$\text{MDev} = \frac{1}{|W|} \sum_{t \in W} \left\| \left( \hat{h}_{t+1} - \hat{h}_{t} \right) - \left( \hat{o}_{t+1} - \hat{o}_{t} \right) \right\|_2,$$
  comparing relative hand–object motions over windows $W$ of sustained contact during articulation.
- Acceleration Error (ACC):
  $$\text{ACC} = \frac{1}{T-2} \sum_{t=2}^{T-1} \left\| \hat{a}_t - a_t \right\|_2,$$
  with the finite-difference acceleration $a_t = x_{t+1} - 2x_t + x_{t-1}$ (see the numpy sketches after this list).
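The following numpy sketch implements the temporal metrics above under simplifying assumptions (matched vertex trajectories, precomputed contact indices); it is illustrative, not the official evaluation code.

```python
# Minimal numpy sketch of the temporal metrics above, assuming illustrative
# inputs: `pred`/`gt` vertex trajectories of shape (T, V, 3) in meters.
import numpy as np

def acceleration_error(pred, gt):
    """ACC: mean L2 error of finite-difference accelerations."""
    acc_p = pred[2:] - 2 * pred[1:-1] + pred[:-2]
    acc_g = gt[2:] - 2 * gt[1:-1] + gt[:-2]
    return np.linalg.norm(acc_p - acc_g, axis=-1).mean()

def contact_deviation(hand_pred, obj_pred, contact_pairs):
    """CDev: mean distance between reconstructed hand/object vertex pairs that
    are in contact in the ground truth. `contact_pairs` is assumed to be a list
    of (frame, hand_vertex, object_vertex) index triplets (illustrative layout)."""
    d = [np.linalg.norm(hand_pred[t, i] - obj_pred[t, j])
         for t, i, j in contact_pairs]
    return float(np.mean(d))

def motion_deviation(hand_pred, obj_pred, window):
    """MDev: deviation between hand and object per-frame displacements over a
    contact window `window` = (start, end). Assumes the inputs are already
    matched in-contact vertex trajectories of identical shape."""
    s, e = window
    dh = hand_pred[s + 1:e] - hand_pred[s:e - 1]
    do = obj_pred[s + 1:e] - obj_pred[s:e - 1]
    return np.linalg.norm(dh - do, axis=-1).mean()
```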
Standard metrics, including Mean Per Joint Position Error (MPJPE) for hand keypoints, Average Articulation Error (AAE) for the object articulation angle (in degrees), and object vertex success rates, are also supported.
3.2 Interaction Field Estimation:
The goal is to estimate per-vertex shortest distances between hand and object mesh surfaces from RGB alone. Metrics:
- Mean Absolute Error (MAE):
  $$\text{MAE} = \frac{1}{N} \sum_{i=1}^{N} \left| \hat{d}_i - d_i \right|,$$
  where $d_i$ is the ground-truth shortest distance for vertex $i$, averaged over hand and object vertices.
- ACC (field): acceleration error, as defined above, applied to predicted field sequences (a sketch follows this list).
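A corresponding sketch for the field metrics, assuming dense per-vertex distance sequences as inputs:

```python
# Minimal sketch of the interaction-field metrics, assuming illustrative
# inputs: predicted/ground-truth field sequences of shape (T, V) in meters.
import numpy as np

def field_mae(pred_field, gt_field):
    """Mean absolute error of per-vertex distance fields, in meters."""
    return np.abs(pred_field - gt_field).mean()

def field_acc(pred_field, gt_field):
    """Acceleration error on field sequences (second finite differences)."""
    acc_p = pred_field[2:] - 2 * pred_field[1:-1] + pred_field[:-2]
    acc_g = gt_field[2:] - 2 * gt_field[1:-1] + gt_field[:-2]
    return np.abs(acc_p - acc_g).mean()
```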
4. Baseline Models and Results
Two baseline architectures are proposed: ARCTICNet for motion reconstruction and InterField for distance field estimation. Both are implemented in single-frame (SF) and recurrent (LSTM) variants.
| Model (Variant) | CDev (mm) | MRRPE (mm) | MDev (mm) | ACC (m/s²) | MPJPE (mm) | AAE (°) | Success Rate (%) | MAE (mm) | ACC, field (m/s²) |
|---|---|---|---|---|---|---|---|---|---|
| ARCTICNet-LSTM | 38.9 | 49.2/37.7 | 9.3 | 5.0/6.1 | 21.5 | 5.2 | 73.5 | – | – |
| InterField-LSTM | – | – | – | – | – | – | – | 8.7 / 9.1 | 1.9 / 1.9 |
Architectural Summary:
- ARCTICNet uses a ResNet-50 backbone (2048-D features); MLP hand and object branches iteratively refine MANO/SMPL-X and object parameters; LSTM variant aggregates features across an 11-frame window.
- InterField uses the same backbone, projects to 512-D, concatenates with sub-sampled mesh vertices, passes through PointNet for per-point inference, and upsamples scalar outputs to full meshes; LSTM aggregates features temporally.
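To make the single-frame design concrete, the following PyTorch sketch shows an ARCTICNet-style regressor with a ResNet-50 backbone and per-output MLP branches; layer widths, output parameterizations, and the omission of iterative refinement are simplifications, not the paper's exact architecture.

```python
# Illustrative single-frame regressor in the spirit of ARCTICNet-SF:
# ResNet-50 features feed small MLP branches for each output group.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class HOIRegressor(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50()
        self.features = nn.Sequential(*list(backbone.children())[:-1])  # 2048-D

        def branch(out_dim):  # one small MLP head per output group
            return nn.Sequential(nn.Linear(2048, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))

        self.left_hand = branch(48 + 10 + 3)    # MANO pose + shape + translation
        self.right_hand = branch(48 + 10 + 3)
        self.obj = branch(3 + 3 + 1)            # rotation, translation, angle

    def forward(self, img):
        f = self.features(img).flatten(1)
        return self.left_hand(f), self.right_hand(f), self.obj(f)

model = HOIRegressor()
left, right, obj = model(torch.randn(2, 3, 224, 224))  # RGB crops (illustrative)
```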
Training objectives combine multiple losses, including 2D/3D joint error, articulation regression, contact deviation, and acceleration terms, implemented with MSE and distance-based penalties. For InterField, a masked loss on per-vertex fields is applied, with distances thresholded at 0.10 m for tractability (a sketch follows).
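A minimal sketch of such a masked field loss, assuming metric-scale distance fields; supervising only vertices below the cutoff is one plausible realization of the thresholding.

```python
# Minimal sketch of a masked per-vertex field loss with a 0.10 m cutoff,
# assuming predicted and ground-truth fields of shape (B, V) in meters.
import torch

def masked_field_loss(pred, gt, max_dist=0.10):
    """Supervise only vertices whose ground-truth distance is below `max_dist`;
    far-away vertices contribute nothing, keeping the regression tractable."""
    mask = (gt < max_dist).float()
    per_vertex = (pred - gt) ** 2 * mask
    return per_vertex.sum() / mask.sum().clamp(min=1.0)
```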
5. Dataset Capabilities and Limitations
Unique Capabilities
ARCTIC is the first real-world, large-scale dataset focused on dexterous bimanual manipulation of articulated objects, with the following properties:
- 339 free-form sequences (~23 s each) captured at 60 Hz with 9 synchronized views at 2800×2000 px.
- Full body (SMPL-X), hand (MANO), and object mesh reconstructions per frame.
- Dynamic, per-contact binary and dense field annotations supporting spatio-temporal interaction modeling.
- Enables evaluation of physically consistent, spatio-temporal 3D hand-object interaction with articulated objects, advancing beyond existing grasp or static-pose datasets (Fan et al., 2022).
Annotated Coverage Gaps and Future Extensions
Constraints identified include:
- Objects are limited to 1-DoF hinged/rotating parts; future work may expand to multi-DoF, deformable, or shape-varying categories.
- Recordings are confined to a single laboratory environment with constant illumination; introducing varied backgrounds, lighting conditions, or in-the-wild/mobile capture is suggested for broader generalization.
- All baselines assume known object meshes; learning at the category level or from unknown/novel objects remains an open challenge.
- Body/hand meshes are rigid; contact-driven deformation modeling (for skin, etc.) could be explored using the provided marker and image data.
6. Scientific Context and Significance
ARCTIC enables a new level of quantitative analysis for physically consistent spatio-temporal hand-object reconstruction. By providing high-fidelity bimanual sequences with mesh-level annotation and contact, ARCTIC forms a benchmark for evaluating both geometric and interactional aspects of 3D scene understanding, with potential implications for robotics, graphics, computational neuroscience, and embodied AI (Fan et al., 2022). The dataset’s rigorous annotation pipeline, comprehensive task definitions, and metric suite set a new standard for dexterous hand–object interaction research.