
MANIAC: Hand–Object Interaction Dataset

Updated 17 December 2025
  • The MANIAC dataset is a publicly available corpus for fine-grained analysis of atomic hand–object interactions, with clearly defined states such as approaching, grabbing, and holding.
  • It processes synchronized RGB video and dense segmentation masks into fixed-length statistical–kinematic vectors over short temporal windows.
  • The dataset enables benchmarking with state-of-the-art classifiers, achieving up to 97.60% accuracy and a grabbing F1-score of 0.90.

The MANIAC dataset is a publicly available corpus designed for fine-grained analysis of atomic hand–object interactions. Originating from recordings of human subjects manipulating everyday objects, it provides a rich foundation for the study and classification of elementary interaction states such as “approaching,” “grabbing,” and “holding.” The dataset encompasses synchronized RGB video and per-frame segmentation masks, enabling the extraction of structured statistical–kinematic feature vectors over short temporal windows. The MANIAC dataset has been central to recent benchmarking efforts, notably for the development and evaluation of lightweight classifiers and interpretable machine learning architectures for hand–object interaction recognition (Movahed et al., 10 Dec 2025).

1. Source Data and Interaction Taxonomy

The MANIAC recordings are composed of sessions in which volunteers manipulate diverse everyday objects—such as bottles, mugs, and screwdrivers—recorded by a fixed RGB camera at 30 Hz with 640 × 480 resolution. Each frame is accompanied by dense segmentation masks delineating both the hand and object. Interaction episodes are segmented into short "windows" comprising 10 keyframes, with action labels determined by the atomic state present at the 11th frame. The defined states are:

  • Approaching: The hand moves toward the object (support = 377 samples in the test set).
  • Grabbing: The transitional, near-contact state (support = 260).
  • Holding: The object is stably grasped (support = 2,774).
  • Releasing and Unknown: Auxiliary states (405 and 306, respectively).

Table 1: Class distribution in the held-out test set (4,122 windows)

State          Support (samples)
Approaching      377
Grabbing         260
Holding        2,774
Releasing        405
Unknown          306

These distributions follow a typical long-tail pattern, where the steady-state classes predominate and transient “grabbing” is scarce, emphasizing the challenge of modeling brief transitional phenomena.
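
The windowing convention can be made concrete with a short sketch: each sample pairs ten consecutive keyframes with the atomic state of the keyframe that immediately follows them. The snippet below is illustrative only; the per-keyframe representation, the function name, and the array layout are assumptions rather than the dataset's released tooling.

import numpy as np

WINDOW = 10  # keyframes per window; the label is taken from the keyframe after the window

def make_windows(keyframes, states):
    """Pair each run of 10 keyframes with the state label of the 11th keyframe.

    keyframes: array of shape (T, ...) holding per-keyframe data (illustrative layout).
    states:    array of shape (T,) with integer-encoded atomic states.
    """
    xs, ys = [], []
    for start in range(len(states) - WINDOW):
        xs.append(keyframes[start:start + WINDOW])  # the 10-keyframe input window
        ys.append(states[start + WINDOW])           # label = state at keyframe N+1
    return np.stack(xs), np.asarray(ys)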

2. Feature Engineering and Data Processing Pipeline

A structured six-stage pipeline converts each 10-keyframe window into a single fixed-length statistical–kinematic vector, suitable for machine learning. The stages are as follows:

  1. Keyframe Extraction: Salient frames are selected using the Laplacian variance ($\sigma_L$) and frame-difference energy ($E_d$), retaining those exceeding manually chosen thresholds ($\tau_L$, $\tau_d$).
  2. Sliding Predictive Window: Input comprises $N = 10$ consecutive keyframes, with labels assigned based on the atomic interaction at keyframe $N + 1$.
  3. Kinematic and Relational Descriptor Computation:
    • Proximity: The minimum Euclidean distance ($d_t$) between hand and object, computed via the distance transform of the object mask.
    • Hand Dynamics: Velocity ($v_t$) and acceleration ($a_t$) of the 2D hand centroid, with windowed mean ($\mu_v$), standard deviation ($\sigma_v$), and analogous measures for acceleration.
    • Contact Metrics: Binary contact signal ($c_t = [d_t < \epsilon]$, $\epsilon = 10$ px), total contact count, and maximal duration of contiguous contact.
  4. Statistical Aggregation: For the sequences $\{d_t\}$, $\{\|v_t\|\}$, $\{\|a_t\|\}$, and $\{c_t\}$, the mean, variance, and slope of the trend are computed over $t = 1, \ldots, N$.
  5. Normalization and Filtering: All scalars are z-normalized on the training set only; no additional temporal smoothing is applied, as no empirical benefit was observed.
  6. Feature Concatenation: The process yields a single vector (dimension $D \approx 50$–$60$), aggregating all descriptors for classification tasks.

Repeating this pipeline across all raw MANIAC video yields a dataset of 27,476 labeled vectors, each summarizing approximately 0.25–0.5 s of interaction.
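
A compact sketch of the keyframe-selection and descriptor stages is given below. It is a minimal illustration under stated assumptions, not the released pipeline code: the use of OpenCV and SciPy, the function names, and the threshold defaults are illustrative choices; only the descriptor definitions (Laplacian variance and frame-difference energy for keyframe selection, minimum hand–object distance via a distance transform, centroid velocity/acceleration, the contact indicator with $\epsilon = 10$ px, and per-window mean/variance/trend-slope aggregation) follow the description above.

import cv2
import numpy as np
from scipy.ndimage import distance_transform_edt

EPS_PX = 10  # contact threshold in pixels, as in the descriptor definition above

def is_keyframe(prev_gray, gray, tau_L=100.0, tau_d=15.0):
    # Stage 1: keep frames whose Laplacian variance (sharpness) and
    # frame-difference energy (motion) exceed manually chosen thresholds.
    sigma_L = cv2.Laplacian(gray, cv2.CV_64F).var()
    E_d = float(np.mean((gray.astype(np.float32) - prev_gray.astype(np.float32)) ** 2))
    return sigma_L > tau_L and E_d > tau_d

def window_features(hand_masks, obj_masks):
    """Stages 3-4: per-frame descriptors over a 10-keyframe window, then aggregation.

    hand_masks, obj_masks: boolean arrays of shape (N, H, W); the hand mask is
    assumed non-empty in every keyframe of the window.
    """
    d, centroids = [], []
    for hm, om in zip(hand_masks, obj_masks):
        dist_to_obj = distance_transform_edt(~om)   # distance of each pixel to the nearest object pixel
        d.append(dist_to_obj[hm].min())             # minimum hand-object distance d_t
        ys, xs = np.nonzero(hm)
        centroids.append((xs.mean(), ys.mean()))    # 2D hand centroid
    d = np.asarray(d)
    c = np.asarray(centroids)

    v = np.linalg.norm(np.diff(c, axis=0), axis=1)  # centroid speed per frame
    a = np.abs(np.diff(v))                          # magnitude of acceleration
    contact = (d < EPS_PX).astype(float)            # binary contact signal c_t

    def agg(x):
        # mean, variance, and slope of a least-squares linear trend over the window
        t = np.arange(len(x))
        slope = np.polyfit(t, x, 1)[0] if len(x) > 1 else 0.0
        return [float(np.mean(x)), float(np.var(x)), float(slope)]

    def longest_run(b):
        # maximal duration of contiguous contact
        best = run = 0
        for flag in b:
            run = run + 1 if flag else 0
            best = max(best, run)
        return best

    return np.array(agg(d) + agg(v) + agg(a) + agg(contact)
                    + [contact.sum(), longest_run(contact)])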

3. Data Storage, Partitioning, and Access

The feature corpus is distributed in a stratified 80/20 train–test split, preserving the rarity of transition classes. Arrays are stored as NumPy ".npz" archives with the following structure:

Directory                  Contents
mani_ac_features/train     features_00.npz ("X": [N, D], "y": [N,])
mani_ac_features/test      features_test.npz

Labels are integer-encoded: 0 (approaching), 1 (grabbing), 2 (holding), 3 (releasing), 4 (unknown). A Python script (“load_dataset.py”) automates loading, normalization, and TensorFlow dataset conversion:

from load_dataset import load_npz  # helper assumed to be provided by the distributed load_dataset.py

data = load_npz("mani_ac_features/train/features_00.npz")
X_train, y_train = data["X"], data["y"]  # feature matrix [N, D] and integer labels [N]
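
Since load_dataset.py is described as also handling normalization and TensorFlow conversion, those two steps can be sketched with plain NumPy and TensorFlow calls, continuing from the snippet above; the script's exact API is not reproduced here, and the batch size and the 1e-8 stabilizer are arbitrary illustrative choices.

import tensorflow as tf

# z-normalize with statistics computed on the training split only
mu, sigma = X_train.mean(axis=0), X_train.std(axis=0) + 1e-8
X_train_norm = (X_train - mu) / sigma

# wrap the normalized features and labels as a shuffled, batched tf.data pipeline
train_ds = (tf.data.Dataset.from_tensor_slices((X_train_norm.astype("float32"), y_train))
            .shuffle(buffer_size=len(y_train))
            .batch(64)
            .prefetch(tf.data.AUTOTUNE))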

4. Benchmarking and Model Evaluation

An 8-fold cross-validation on the training partition (using tools such as KerasTuner and Optuna) is employed for hyperparameter search and selection. Several classifier paradigms are compared:

  • Static MLPs: Achieve ≈89.45% accuracy; perform poorly on “grabbing” (F1 ≈ 0.45).
  • Standard RNNs (seq_length=5): Similar overall performance (89%) and F1 ≈ 0.55 on “grabbing.”
  • Bidirectional RNN (seq_length=1): Repurposed as a “static encoder,” this model achieves a marked improvement:
    • Overall accuracy: 97.60%
    • Weighted F1-score: 0.98
    • Grabbing F1-score: 0.90 (Precision = 0.90, Recall = 0.90)

The findings indicate that the handcrafted feature windows provide sufficient temporal context; repurposing a gated RNN as a static encoder best exploits this representation and achieves a new state-of-the-art on atomic hand–object interaction recognition (Movahed et al., 10 Dec 2025).
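
The best-performing configuration, a bidirectional gated RNN applied to a length-1 sequence so that it acts as a static encoder over one feature vector, can be approximated with a short Keras sketch. Only the overall shape of the design (bidirectional recurrent layer over a single feature vector, five-way softmax output) follows the description above; the GRU cell choice, layer widths, dropout, optimizer, and the assumed feature dimension are illustrative guesses.

import tensorflow as tf
from tensorflow.keras import layers

NUM_FEATURES = 56  # assumed value of D (the text reports D ≈ 50-60)
NUM_CLASSES = 5    # approaching, grabbing, holding, releasing, unknown

# seq_length = 1: the recurrent layer sees a single feature vector and acts as a static encoder
model = tf.keras.Sequential([
    layers.Input(shape=(1, NUM_FEATURES)),
    layers.Bidirectional(layers.GRU(64)),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# Feature vectors must be reshaped to (num_windows, 1, D) before fitting, e.g.:
# model.fit(X_train_norm[:, None, :], y_train, epochs=30)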

5. Limitations and Best Practices

Important caveats and recommendations are as follows:

  • Class Imbalance: “Grabbing” constitutes less than 5% of samples; class weighting (class_weight='balanced') is used, and data augmentation is identified as a direction for future work.
  • Recording Modality: Single-view RGB and masks inherently limit occlusion robustness; incorporation of multi-view or depth data would enhance performance.
  • Feature Engineering Constraints: The focus on handcrafted descriptors may omit subtle spatial features; transition to end-to-end learned (e.g., GNN-based) representations is suggested for capturing finer relational dynamics.

Recommended usage protocols include retaining stratified data splits, applying z-normalization based solely on training data, evaluating both static (seq_length=1 RNN) and true temporal models, and using interpretability techniques (e.g., SHAP) to analyze which features the model leverages.
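
These protocols map directly onto standard scikit-learn utilities. The sketch below is an illustrative reconstruction, not the authors' evaluation code; it assumes the full feature matrix X and label vector y are already in memory, and it omits the SHAP analysis step.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.utils.class_weight import compute_class_weight

# Stratified 80/20 split preserves the rarity of transition classes such as "grabbing".
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)

# z-normalization fitted on the training data only, then applied to both splits.
scaler = StandardScaler().fit(X_tr)
X_tr, X_te = scaler.transform(X_tr), scaler.transform(X_te)

# Balanced class weights counter the <5% share of "grabbing" samples.
classes = np.unique(y_tr)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_tr)
class_weights = dict(zip(classes, weights))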

6. Impact and Future Directions

By curating and transforming the raw MANIAC video and mask data into 27,476 structured statistical–kinematic vectors, this benchmark establishes a new reference point (97.60% accuracy; grabbing F1 = 0.90) for recognition of fundamental hand–object interaction states. The pipeline’s interpretable descriptors and lightweight model design advance objectives in both transparency and efficiency.

A plausible implication is that, for atomic interaction classification, handcrafted temporal summaries encode the salient structural information, obviating the need for sequential RNN modeling in the network itself. Future research directions include addressing class imbalance, extending to multi-modal or multi-view datasets, and integrating learned spatial–relational representations for nuanced interaction understanding (Movahed et al., 10 Dec 2025).

