- The paper introduces a novel dataset with synchronized multi-view RGB-D and egocentric recordings that capture fine-grained bimanual actions and state annotations in industrial assembly.
- It details a comprehensive annotation hierarchy covering procedural steps, assembly states, and explicit anomaly and recovery labels to mirror real-world complexities.
- Benchmark results expose significant gaps in hand-specific action segmentation, cross-view retrieval, and forecasting, emphasizing challenges like occlusion and dynamic task execution.
IMPACT: A Dataset for Multi-Granularity Human Procedural Action Understanding in Industrial Assembly
Understanding procedural actions in industrial assembly presents challenges that prior datasets do not address: severe occlusions, dynamic viewpoint shifts, multi-route execution, and anomaly-recovery cycles. Existing benchmarks largely focus on toy artifacts or simplified tasks, lack explicit state-level and anomaly annotations, and treat errors as side effects rather than explicit targets. In real industrial deployments, intelligent systems must reason jointly about procedural structure, compliance, state evolution, and recovery, all under occlusion and partial observability. IMPACT addresses these requirements directly by providing synchronized multi-view RGB-D capture (one egocentric, four exocentric), explicit bimanual and state-level annotation, and anomaly-recovery supervision within a commercial angle grinder assembly workflow.
Figure 1: Overview of the IMPACT dataset and benchmark, including data collection setup, task taxonomy, and hierarchical annotation scheme for procedural actions, assembly states, and anomaly phases.
Dataset Construction and Annotation Hierarchy
IMPACT consists of 112 trials from 13 operators (4 experts, 9 novices), totaling 39.5 hours of recordings across two commercial angle grinder models. Data acquisition employs a multi-modal setup: four network-synchronized Intel RealSense D455 RGB-D cameras at top, front, left, and right positions; one Tobii Pro Glasses 3 egocentric stream (paired with gaze and audio); and NASA-TLX cognitive load measures per trial. Together, these streams cover both the perceptual and cognitive dimensions of task execution.
Annotation is systematically hierarchical:
- Fine-grained bimanual actions: 137 valid action classes per hand, formed from 22 verbs and 19 nouns (a subset of the 418 possible verb-noun pairs), capturing left/right hand coordination and decoupling stabilization from tool-driven manipulation.
- Procedural steps: 26 step categories for coarse segmentation and 51 for completion-event recognition, built upon a partial-order prerequisite graph to model real-world flexibility.
- Assembly states: 17 component instances with ternary annotations (misassembled, unassembled, correctly assembled) to structure state tracking.
- Phases: Explicit labeling of normal, anomaly, and recovery, with recovery annotated rather than inferred.
- Anomaly taxonomy: Six non-exclusive types (temporal, spatial, handling, wrong part, wrong tool, procedural), reflecting compound error modes.
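The partial-order prerequisite graph behind the procedural steps can be illustrated with a small sketch. The step names and the `PREREQS` mapping below are hypothetical and are not taken from the IMPACT annotation files; the sketch only shows how a partial order admits several valid execution routes while still flagging steps performed too early.

```python
from typing import Dict, List, Set

# Hypothetical prerequisite graph: step -> steps that must be completed first.
# These names are illustrative only, not the dataset's actual step labels.
PREREQS: Dict[str, Set[str]] = {
    "insert_spindle": set(),
    "mount_guard": set(),
    "place_inner_flange": {"insert_spindle"},
    "place_disc": {"place_inner_flange", "mount_guard"},
    "tighten_nut": {"place_disc"},
}


def order_violations(sequence: List[str], prereqs: Dict[str, Set[str]]) -> List[str]:
    """Return the steps in `sequence` that were executed before all of
    their prerequisites had been completed."""
    done: Set[str] = set()
    violations: List[str] = []
    for step in sequence:
        # A step is valid only if every prerequisite is already done.
        if not prereqs.get(step, set()) <= done:
            violations.append(step)
        done.add(step)
    return violations


# Two different routes that both respect the partial order.
route_a = ["insert_spindle", "mount_guard", "place_inner_flange", "place_disc", "tighten_nut"]
route_b = ["mount_guard", "insert_spindle", "place_inner_flange", "place_disc", "tighten_nut"]
# A route that places the disc too early.
route_c = ["insert_spindle", "place_disc", "mount_guard", "place_inner_flange", "tighten_nut"]

print(order_violations(route_a, PREREQS))
print(order_violations(route_b, PREREQS))
print(order_violations(route_c, PREREQS))
```

Because only prerequisites are constrained, `route_a` and `route_b` are both accepted, which is the kind of multi-route flexibility a strict linear step order would not allow.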
Figure 2: Data acquisition and annotation pipeline, illustrating synchronized multi-view setup, protocol-driven recording sessions, and multi-stage annotation with expert validation.
Benchmark Suite: Multi-granularity Evaluation
IMPACT defines a unified benchmark spanning four task families, each mapping distinct deployment-oriented capabilities:
- Temporal Understanding: Action segmentation at step and fine-grained hand level (frame-wise Accuracy, Edit score, segmental F1@IoU).
- Cross-View Understanding: Instance alignment and semantic matching across synchronized views (Recall@K, Median Rank, Top-1/classification accuracy) to evaluate retrieval and compositional semantics under occlusion.
- Action Forecasting: Short-term anticipation (mean Top-5 Recall, hand-specific evaluation) and long-horizon forecasting (Edit Distance, Area Under Edit Distance, step prediction accuracy) over multi-route procedures.
- State Reasoning: Assembly state recognition (Macro-F1, Trans-F1, Final-State Accuracy), procedure step recognition (completion F1, detection delay, order similarity), procedural phase recognition (Accuracy, Macro-F1, and per-phase F1 for anomaly and recovery), and anomaly type recognition (multi-label mAP).
Split protocols control for label imbalance, subject variability, configuration generalization, and cross-view transfer.
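As an illustration of the temporal-understanding metrics listed above, the following is a minimal sketch of segmental F1@IoU and the Edit score as they are commonly defined in the action segmentation literature. It is a simplified variant written for this summary, not the benchmark's official evaluation code; the function names and the greedy matching rule are assumptions.

```python
from typing import List, Tuple

Segment = Tuple[str, int, int]  # (label, start_frame, end_frame_exclusive)


def to_segments(frames: List[str]) -> List[Segment]:
    """Collapse a frame-wise label sequence into contiguous segments."""
    segments: List[Segment] = []
    start = 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[start]:
            segments.append((frames[start], start, i))
            start = i
    return segments


def segmental_f1(pred: List[str], gt: List[str], iou_threshold: float = 0.5) -> float:
    """F1 over segments: a predicted segment is a true positive if it overlaps
    an unmatched ground-truth segment of the same label with IoU >= threshold."""
    p_segs, g_segs = to_segments(pred), to_segments(gt)
    matched = [False] * len(g_segs)
    tp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label or matched[j]:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            iou = inter / union
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_j >= 0 and best_iou >= iou_threshold:
            matched[best_j] = True
            tp += 1
    fp = len(p_segs) - tp
    fn = len(g_segs) - tp
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def edit_score(pred: List[str], gt: List[str]) -> float:
    """100 * (1 - normalized Levenshtein distance) over segment label orders."""
    p = [s[0] for s in to_segments(pred)]
    g = [s[0] for s in to_segments(gt)]
    if not p and not g:
        return 100.0
    prev = list(range(len(g) + 1))
    for i, p_label in enumerate(p, start=1):
        cur = [i]
        for j, g_label in enumerate(g, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (p_label != g_label)))
        prev = cur
    return 100.0 * (1 - prev[-1] / max(len(p), len(g)))


gt = ["a"] * 4 + ["b"] * 4
shifted = ["a"] * 5 + ["b"] * 3                         # boundary off by one frame
fragmented = ["a", "a", "b", "a", "b", "b", "b", "b"]   # over-segmented prediction
print(segmental_f1(shifted, gt), edit_score(shifted, gt))
print(segmental_f1(fragmented, gt), edit_score(fragmented, gt))
```

The two toy predictions show why both metrics are reported alongside frame-wise accuracy: a small boundary shift leaves both scores intact, while over-segmentation is penalized even though most frames are labeled correctly.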
Baseline Models and Evaluation
Baseline selection follows task structure and covers diverse architectural biases: sequence models (LTContext [bahrami2023much], ASQuery [gan2024asquery], DiffAct [liu2023diffusion], FACT [lu2024fact]), backbone encoders (I3D [carreira2017quo], VideoMAEv2 [wang2023videomaev2], MViTv2 [li2022mvitv2]), forecasting models (AVT [girdhar2021anticipative], ScalAnt [zhong2026scalable]), vision-LLMs (Gemini 3.1 Pro [team2023gemini], Qwen3VL-8B [bai2025qwen3], AntGPT [qi2024antgpt], PALM [kim2024palm]), and specialized procedural step recognizers. All models are evaluated with standardized splits to control for confounds in trial assignment and task leakage.
Numerical Results and Key Findings
IMPACT's benchmarks expose several persistent, deployment-critical gaps:
- Observability gap: Moving from coarse procedural segmentation to hand-specific atomic action segmentation produces a 2.6× drop in F1@50, with a systematic left/right hand asymmetry: dominant-hand manipulation is easier to segment, whereas stabilization persists longer in time but offers weaker visual cues.
- Cross-view limitation: Instance-level alignment lags semantic retrieval (MViTv2 global CV-TA R@1 of 1.40 vs. CV-SM retrieval R@1 of 6.24), quantifying the deficiency of view-invariant representation learning under occlusion. Exocentric-to-egocentric transfer collapses (R@1 falls by 30–60%), indicating that some evidence is structurally missing from the egocentric perspective.
- Forecasting ceiling: Anticipation performance differs between hands (ScalAnt+I3D: left 18.36 vs. right 15.70 mR@5); the object class is predictable from appearance, but manipulation intent is not. Graph-structural ambiguity dominates long-horizon action forecasting: Markov chain step prediction outperforms all visual models, and LLM augmentation yields no improvement unless step recognition is perfect.
- Knowledge-execution gap: Vision-LLMs (Qwen3VL-8B, Gemini 3.1 Pro) lead in order similarity but incur more than double the detection delay of discriminative baselines; strong procedural priors are not grounded in temporal completion events.
- Operational bottlenecks: Recovery phase recognition is nearly unresolved across all models (left hand F1_rec ~0, right hand below 7), with rare-phase imbalance (recovery spans 2.22% of frames) being a primary limiting factor rather than model capacity.
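To make the Markov chain step-prediction baseline mentioned in the forecasting finding concrete, here is a minimal sketch that assumes a first-order chain fitted on training step sequences. The summary does not specify the order or smoothing of the actual baseline, so `fit_transitions`, `predict_next`, and the toy step labels are illustrative assumptions.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional


def fit_transitions(sequences: List[List[str]]) -> Dict[str, Counter]:
    """Count step-to-next-step transitions over the training sequences."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for seq in sequences:
        for cur, nxt in zip(seq, seq[1:]):
            counts[cur][nxt] += 1
    return dict(counts)


def predict_next(counts: Dict[str, Counter], current: str) -> Optional[str]:
    """Return the most frequent successor of `current`, or None if the
    step was never observed with a successor during training."""
    if current not in counts:
        return None
    return counts[current].most_common(1)[0][0]


# Toy training data with hypothetical step labels.
train = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["A", "B", "C"],
]
model = fit_transitions(train)
print(predict_next(model, "A"))
print(predict_next(model, "B"))
print(predict_next(model, "Z"))
```

Such a predictor uses no visual input at all, which is why its reported advantage over visual models points to graph-structural ambiguity rather than perception as the limiting factor in long-horizon forecasting.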
Implications and Future Directions
IMPACT reframes the procedural action understanding problem: real-world deployment demands joint reasoning over actions, states, and anomalies and recovery, under the combined pressure of occlusion, flexible execution, and incomplete observations. The egocentric observability gap, forecasting ceiling, and knowledge-execution gap all concentrate at transitions, anomalies, and recovery, the moments that matter most for system performance.
IMPACT enables measurable progress in:
- Cross-view temporal and semantic representation learning robust to occlusion and missing evidence.
- Graph-aware long-horizon forecasting and planning in non-linear industrial SOPs, beyond appearance- and language-driven priors.
- Explicit anomaly/recovery detection and compliance monitoring integrated with state trajectory evolution.
- Human-centric procedural intelligence incorporating cognitive load, operator metadata, and multimodal signals (audio, gaze).
Because these challenges converge at operationally critical moments, the results call for joint, unified approaches that span multiple reasoning axes rather than more powerful single-task models. Future advances in AI for embodied industrial settings will require models capable of structured reasoning and adaptive correction, informed by benchmarks like IMPACT.
Conclusion
IMPACT constitutes the first dataset to jointly deliver synchronized ego–exo RGB-D capture, multi-granularity bimanual annotation, explicit compliance-aware state tracking, and anomaly–recovery supervision in an authentic commercial assembly workflow (2604.10409). Its benchmark surfaces challenges invisible to prior resources and provides a foundation for deployment-grade procedural intelligence. Progress on IMPACT is expected to benefit action segmentation, cross-view representation learning, procedural video understanding, anomaly detection, and human–robot collaboration. The dataset, codebase, and full annotation suite are publicly released for reproducible research.