
CaptainCook4D: Egocentric Cooking Dataset

Updated 2 December 2025
  • CaptainCook4D is a large-scale, multi-modal egocentric dataset with dense, hierarchical annotations and explicit error labeling for cooking procedures.
  • It synchronizes data from HoloLens2 and GoPro cameras to benchmark tasks like error recognition, temporal localization, and live guidance using supervised and weakly supervised learning.
  • Graph-based modeling techniques applied to the dataset enable precise temporal action localization, zero-shot error detection, and effective instructional feedback.

CaptainCook4D is a large-scale, multi-modal egocentric dataset specifically designed to advance research in procedural activity understanding, error recognition, temporal localization, and graph-based modeling of real-world cooking tasks. It provides dense, hierarchical annotations at both the step and fine-grained action level, with explicit error labeling and comprehensive benchmarks for supervised and weakly supervised learning. The dataset is foundational for evaluating and developing neural architectures oriented toward live guidance, error feedback, and structured understanding of complex sequential activities.

1. Dataset Scope and Acquisition

CaptainCook4D was constructed from 384 egocentric video recordings spanning 94.5 hours, involving eight participants who performed 24 cooking procedures in natural kitchens (Peddi et al., 2023). Each session was instrumented with HoloLens2 (capturing RGB, depth, hand and head poses, IMU data at 30 Hz) and head-mounted GoPro Hero 11 (4K RGB video at 30 fps), producing synchronized multi-modal streams. Participants followed two experimental conditions: strict adherence to recipe steps (“normal” runs) or deliberate deviation to induce errors (“error” runs), resulting in 173 error-free and 211 error-inclusive sessions.

The dataset includes:

  • 5,300 step-level segments (mean ≈13.8 per session, ≈53 seconds per step).
  • 10,000 fine-grained action-level segments annotated on 20% of sessions.
  • Seven error categories: Preparation, Measurement, Technique, Timing, Temperature, Missing-step, and Order-error.
  • Step annotations referencing recipe-specific directed acyclic graphs (DAGs) encoding valid ordering constraints.

Each recording is aligned to a specific WikiHow-style recipe, enabling analysis across a taxonomy of cooking sub-tasks such as “Chop Tomato,” “Add Salt,” and more.
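
The recipe-specific DAGs listed above support a simple precondition check: a step executed before all of its parents constitutes an ordering violation. The sketch below illustrates this with a hypothetical mini-recipe; the graph, step names, and helper function are ours and not part of the dataset's tooling.

```python
# Minimal sketch: a recipe DAG as predecessor sets, plus a check that flags
# order errors when a step is executed before its preconditions are done.
# The recipe, step names, and observed sequence are hypothetical.
RECIPE_DAG = {
    "Chop Tomato": set(),                  # no preconditions
    "Add Salt": {"Chop Tomato"},           # salt goes on the chopped tomato
    "Serve": {"Chop Tomato", "Add Salt"},
}

def order_errors(observed_steps, dag):
    """Return (position, step, missing preconditions) for out-of-order steps."""
    done, errors = set(), []
    for i, step in enumerate(observed_steps):
        missing = dag.get(step, set()) - done
        if missing:
            errors.append((i, step, sorted(missing)))
        done.add(step)
    return errors

print(order_errors(["Add Salt", "Chop Tomato", "Serve"], RECIPE_DAG))
# -> [(0, 'Add Salt', ['Chop Tomato'])]
```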

2. Annotation Schema and Benchmarks

The step-level annotation schema assigns temporal start/end boundaries to each procedural step, with optional linkage to its associated DAG node. At the action level, annotators segment each individual motor act, providing temporal boundaries for fine-grained analysis. When errors occur, annotators select one of the seven error categories, guided by a pre-defined taxonomy and mind-maps to ensure consistency.
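
Under this schema, a single step-level record can be pictured as a small structured object. The example below is purely illustrative: the field names and values are assumptions, not the dataset's released JSON layout.

```python
# Hypothetical step-level annotation record (field names are illustrative,
# not CaptainCook4D's exact schema).
step_annotation = {
    "recording_id": "recipe12_participant3_error_run",
    "step_id": 7,                       # index of the corresponding recipe DAG node
    "description": "Add Salt",
    "start_time": 412.3,                # seconds from recording start
    "end_time": 447.9,
    "has_error": True,
    "error_category": "Measurement",    # one of the seven categories
    "actions": [                        # fine-grained segments (subset of sessions)
        {"verb": "pick up", "object": "salt shaker", "start": 413.0, "end": 415.2},
        {"verb": "pour", "object": "salt", "start": 415.2, "end": 419.8},
    ],
}
```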

Benchmarks established on CaptainCook4D are as follows (Peddi et al., 2023, Bhattacharyya et al., 27 Nov 2025):

  • Supervised Error Recognition: Binary classification of trimmed, step-level video segments, using embeddings from pretrained video backbones (e.g., Omnivore, VideoMAE) and a lightweight MLP classifier trained with binary cross-entropy loss. Quantitative metrics are F1-score and AUC (a minimal classifier sketch follows the summary table below).
  • Multi-Step Localization: Temporal action localization on untrimmed video, aiming to detect all step instances (center, duration, class label). Proposals match ground truth if class labels agree and temporal IoU exceeds a pre-set threshold; mAP is reported at IoU thresholds of 0.1, 0.3, and 0.5 (a tIoU matching sketch appears after the benchmark results below).
  • Procedure Learning: Given test videos without step boundaries, recover the sequence of K essential steps. Evaluated on seen/unseen recipes using mean prefix coverage and precision@1 for step classification.
  • Zero-Shot Error Recognition: Framewise anomaly detection using methods such as SSMCTB and SSPCAB, with AUC and EER as evaluation metrics.

A summary of the benchmark design is provided in the following table:

| Task | Data Used | Metric |
| --- | --- | --- |
| Error Recognition | Step segments | F1, AUC |
| Multi-Step Localization | Untrimmed video | mAP@IoU |
| Procedure Learning | Untrimmed video | Prefix coverage, P@1 |
| Zero-Shot Recognition | Framewise | AUC, EER |
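
The supervised error-recognition benchmark reduces to training a small classifier on precomputed segment embeddings. The PyTorch sketch below illustrates that pipeline; the embedding dimension, layer sizes, and optimizer settings are assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# Sketch of supervised error recognition: precomputed step-segment embeddings
# (e.g., from Omnivore or VideoMAE) -> lightweight MLP -> binary cross-entropy.
# Dimensions and hyperparameters below are assumed, not the published config.
class ErrorClassifier(nn.Module):
    def __init__(self, embed_dim=1024, hidden_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.ReLU(), nn.Dropout(0.3),
            nn.Linear(hidden_dim, 1),           # single logit: error vs. no error
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)

model = ErrorClassifier()
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# One dummy training step on random "embeddings" and labels.
embeddings = torch.randn(32, 1024)           # one embedding per trimmed step segment
labels = torch.randint(0, 2, (32,)).float()  # 1 = erroneous step, 0 = correct step
optimizer.zero_grad()
loss = criterion(model(embeddings), labels)
loss.backward()
optimizer.step()
```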

Aggregate results show, for example, that the Omnivore backbone achieves F1 = 53.9% for supervised error recognition and mAP = 41.2% for multi-step temporal localization (Peddi et al., 2023). Performance on early error recognition, error category recognition, and zero-shot detection remains limited, exposing open challenges in occlusion handling and complex error typologies.
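
For multi-step localization, the matching criterion described above is a per-class temporal-IoU test. The helper below is a minimal sketch of that criterion; the function and field names are ours, not the benchmark code.

```python
def temporal_iou(a_start, a_end, b_start, b_end):
    """Temporal IoU between two segments given as (start, end) in seconds."""
    inter = max(0.0, min(a_end, b_end) - max(a_start, b_start))
    union = (a_end - a_start) + (b_end - b_start) - inter
    return inter / union if union > 0 else 0.0

def is_match(proposal, ground_truth, iou_threshold=0.3):
    """A proposal is correct if the class labels agree and tIoU clears the threshold."""
    return (proposal["label"] == ground_truth["label"]
            and temporal_iou(proposal["start"], proposal["end"],
                             ground_truth["start"], ground_truth["end"]) >= iou_threshold)

print(is_match({"label": "Add Salt", "start": 410.0, "end": 450.0},
               {"label": "Add Salt", "start": 412.3, "end": 447.9}))  # -> True
```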

3. Directed Graph Structure and Task Graph Mining

Each recipe scenario in CaptainCook4D is accompanied by a manually designed ground-truth DAG formalizing valid key-step orderings and precondition relationships. Task graph mining on CaptainCook4D is typically performed only on error-free (“correct”) sequences, with repeated steps collapsed to their first occurrence (Seminara et al., 25 Feb 2025, Seminara et al., 3 Jun 2024).

Procedural activity understanding on this dataset is advanced by gradient-based task graph learning methods, framed as maximum-likelihood estimation of edge weights in a continuous adjacency matrix Z ∈ [0,1]^{(n+2)×(n+2)}, where n is the number of key-steps and the two extra nodes are special START and END markers (Seminara et al., 25 Feb 2025):

  • The likelihood of observed sequences is a product over positions, comparing the sum of edge weights from the current step to past steps versus future steps.
  • A loss function ℒ combines a log-likelihood term (pulling up correct edges) and a contrastive penalty (pushing down spurious edges).
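
Concretely, for a key-step sequence y augmented with START and END nodes, write P_t for the set of steps already observed before position t and F_t for the steps not yet observed. One natural way to formalize the likelihood described above (our notation; the published formulation may differ in normalization and weighting details) is

P(y | Z) = ∏_t [ Σ_{j ∈ P_t} Z_{y_t, j} ] / [ Σ_{j ∈ P_t} Z_{y_t, j} + Σ_{k ∈ F_t} Z_{y_t, k} ],

with ℒ(Z) combining −log P(y | Z) and a contrastive penalty on the Σ_{k ∈ F_t} Z_{y_t, k} mass assigned to not-yet-observed steps.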

Two principal approaches are instantiated:

  • Direct Optimization (DO): Real-valued edge score matrix A is learned via gradient descent on ℒ, masking illegal edges (e.g., cycles and self-loops), then transforming by row-wise softmax to obtain Z.
  • Task Graph Transformer (TGT): D-dimensional embeddings (from EgoVLPv2) of step names or video segments are fed to a transformer encoder, with relation-MLP heads computing pairwise scores A_{(i,j)}, forming Z as above. A distinctiveness loss encourages separation between different step embeddings.

Both methods leverage the graph Z both as a procedural representation and as a probabilistic model of step feasibility, enabling downstream inference such as P(K_i | K_j, Z) for precondition satisfaction; a minimal optimization sketch follows.
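
The following is a minimal sketch of the Direct Optimization variant under the likelihood written above. The masking scheme (self-loops only), the contrastive weight, the optimizer settings, and the toy sequences are simplifying assumptions rather than the published configuration.

```python
import torch

# Direct Optimization (DO) sketch: learn edge weights Z over n key-steps plus
# START (index 0) and END (index n+1) from error-free key-step sequences.
n = 4                                       # number of key-steps (dummy)
N = n + 2                                   # plus START and END
A = torch.zeros(N, N, requires_grad=True)   # real-valued edge scores
mask = ~torch.eye(N, dtype=torch.bool)      # forbid self-loops (other illegal edges omitted)
beta = 0.5                                  # contrastive weight (assumed)

def row_softmax(scores):
    """Masked row-wise softmax mapping scores to edge weights Z."""
    return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=1)

def sequence_loss(Z, seq):
    """Negative log-likelihood of one correct sequence plus a contrastive penalty.

    `seq` lists key-step indices (1..n) in observed order; START/END are added here.
    Edges from the current step back to already-observed steps are pulled up,
    while edges toward not-yet-observed steps are pushed down.
    """
    order = [0] + list(seq) + [N - 1]
    loss = torch.zeros(())
    for t in range(1, len(order)):
        cur, past = order[t], order[:t]
        future = [k for k in range(1, N) if k not in order[: t + 1]]
        past_mass = Z[cur, past].sum()
        future_mass = Z[cur, future].sum() if future else torch.zeros(())
        loss = loss - torch.log(past_mass / (past_mass + future_mass) + 1e-8)
        loss = loss + beta * future_mass
    return loss

sequences = [[1, 2, 3, 4], [2, 1, 3, 4]]    # toy error-free key-step orderings
optimizer = torch.optim.Adam([A], lr=0.1)
for _ in range(300):
    optimizer.zero_grad()
    Z = row_softmax(A)
    sum(sequence_loss(Z, s) for s in sequences).backward()
    optimizer.step()

Z = row_softmax(A).detach()
# Feasibility-style query: relative weight of the edge from step 3 back to step 1,
# i.e., how strongly the learned graph treats step 1 as preceding step 3.
print(float(Z[3, 1]))
```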

4. Quantitative Performance and Qualitative Analysis

Comprehensive evaluation of graph learning methods on CaptainCook4D shows substantial improvement over classical baselines (Seminara et al., 25 Feb 2025, Seminara et al., 3 Jun 2024):

| Method | Precision (%) | Recall (%) | F1 (%) |
| --- | --- | --- | --- |
| MSGI | 11.9 | 14.0 | 12.8 |
| ChatGPT (LLM) | 52.9 | 57.4 | 55.0 |
| Count-Based | 66.9 | 56.1 | 61.0 |
| MSG² | 70.9 | 71.6 | 71.1 |
| TGT-text | 71.7 ± 3.6 | 72.9 ± 2.8 | 72.1 ± 3.2 |
| DO (TGML) | 86.4 ± 0.8 | 89.7 ± 0.5 | 87.8 ± 0.6 |

DO achieves a +16.7 F1 improvement over MSG², with low run-to-run variance. Qualitative analysis confirms that DO can recover >90% of true edges, with rare errors arising from systematic corpus biases (steps always co-occurring in identical order) or limited diversity in training sequences. Post-processing further reduces hallucinated dependencies and enforces acyclicity and full connectivity.
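
The post-processing mentioned above can be approximated with simple graph operations: keep only confident edges, then remove the weakest remaining edges until the result is acyclic. The sketch below is one plausible realization (it omits re-attaching isolated nodes to START/END for full connectivity), not the procedure used in the papers.

```python
from collections import defaultdict

def is_dag(n, edges):
    """Kahn's algorithm: True if the directed graph on nodes 0..n-1 is acyclic."""
    indeg, adj = defaultdict(int), defaultdict(list)
    for (i, j) in edges:
        adj[i].append(j)
        indeg[j] += 1
    queue = [v for v in range(n) if indeg[v] == 0]
    visited = 0
    while queue:
        u = queue.pop()
        visited += 1
        for v in adj[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                queue.append(v)
    return visited == n

def postprocess(Z, threshold=0.25):
    """Threshold edge weights, then drop the globally weakest edge until acyclic."""
    n = len(Z)
    edges = {(i, j): Z[i][j] for i in range(n) for j in range(n)
             if i != j and Z[i][j] >= threshold}
    while not is_dag(n, edges):
        edges.pop(min(edges, key=edges.get))
    return sorted(edges)

Z_demo = [[0.0, 0.9, 0.1],
          [0.3, 0.0, 0.8],
          [0.7, 0.0, 0.0]]
print(postprocess(Z_demo))   # -> [(0, 1), (1, 2)]
```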

Task graph learning with textual or video features (TGT) also demonstrates “emergent” video understanding: models trained on feature-based key-steps improve pairwise ordering and future step prediction accuracy beyond chance.

5. Role in Live Guidance, Instruction, and Error Feedback

Extensions such as the Qualcomm Interactive Cooking benchmark build directly on CaptainCook4D by integrating timestamped instructions and error/success feedback for each action segment (Bhattacharyya et al., 27 Nov 2025). This enables the development and evaluation of streaming multi-modal LLMs capable of asynchronous, step-by-step task guidance and real-time mistake detection.

Capabilities enabled by the dataset:

  • Dense alignment between video, instructions, and feedback: every step receives an associated instruction, success outcome, or mistake alert aligned to ground-truth timestamps.
  • Comprehensive coverage: average 0.9 instructions per minute, mean 3.22 mistakes per video, with an extensive taxonomy of mistake types.
  • Evaluation protocols: Instruction Completion Accuracy (IC-Acc), mistake detection precision/recall/F1, and fluency metrics (e.g., ROUGE-L, BERTScore) for feedback generation.
  • Scenario-specific graphs for dynamic re-planning, leveraging topological sorting and LLM-based (Qwen3-32B) step-sequence re-computation when divergent actions occur.
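
The scenario-graph re-planning in the last item amounts to re-running a topological sort over the steps not yet completed. The sketch below shows that core idea with a hypothetical recipe, omitting the benchmark's LLM-based re-computation.

```python
import graphlib  # standard library (Python 3.9+)

# Hypothetical scenario DAG: step -> set of prerequisite steps.
recipe_dag = {
    "chop tomato": set(),
    "heat pan": set(),
    "add oil": {"heat pan"},
    "fry tomato": {"chop tomato", "add oil"},
    "add salt": {"fry tomato"},
}

def replan(dag, completed):
    """Recompute a valid order for the remaining steps after some are done."""
    remaining = {step: {p for p in prereqs if p not in completed}
                 for step, prereqs in dag.items() if step not in completed}
    return list(graphlib.TopologicalSorter(remaining).static_order())

# After the user has chopped the tomato and heated the pan:
print(replan(recipe_dag, completed={"chop tomato", "heat pan"}))
# -> ['add oil', 'fry tomato', 'add salt']
```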

A notable challenge lies in detecting high-granularity mistakes (e.g., mixing up teaspoon and tablespoon), which requires both long-horizon temporal modeling and precise visual discrimination.

6. Limitations, Open Challenges, and Prospects

While CaptainCook4D provides an indispensable testbed for procedural activity understanding—from symbolic graph mining to live multi-modal coaching—several boundaries remain:

  • Domain specificity: The data is restricted to egocentric cooking; applicability to laboratory, medical, or industrial procedural domains remains untested (Bhattacharyya et al., 27 Nov 2025).
  • Participant demographics: Only eight individuals contributed, potentially limiting generalization across broader populations (Peddi et al., 2023).
  • Error space: While incorporating seven well-defined error types, the incidence and diversity of complex or compound deviations are limited; annotation of order errors/missing steps in the main set is incomplete due to practical challenges.
  • Reactive guidance: The current benchmark design is “non-reactive,” replaying user actions offline and evaluating system instructions/feedback post hoc; live human–robot loop closure is not directly supported (Bhattacharyya et al., 27 Nov 2025).

However, the dataset’s integration with new neural architectures—particularly differentiable graph learning (Seminara et al., 3 Jun 2024, Seminara et al., 25 Feb 2025) and multi-modal LLMs (Bhattacharyya et al., 27 Nov 2025)—suggests considerable promise in transfer learning across domains and tasks. The explicit procedural graphs learned from CaptainCook4D have been shown to enhance mistake detection on external datasets (e.g., Assembly101-O, EPIC-Tent-O (Seminara et al., 3 Jun 2024)), supporting the broader thesis that structural representations underpin robust, generalizable activity understanding.

7. Impact on Procedural Activity Understanding Research

CaptainCook4D has rapidly established itself as a canonical dataset for procedural video understanding and error recognition, supporting advances in:

  • Differentiable task graph learning and symbolic-to-neural integration (Seminara et al., 3 Jun 2024, Seminara et al., 25 Feb 2025).
  • Streaming, feedback-driven instructional models for AI assistants (Bhattacharyya et al., 27 Nov 2025).
  • Systematic benchmarking of step localization, error detection, and scenario-based sequence prediction.
  • Hierarchical video annotation, enabling precise evaluation of both coarse procedural structure and fine-grained action segmentation.

Its detailed, mistake-inclusive annotations and rigorous evaluation framework continue to shape expectations for research in egocentric procedural activity, structured prediction, and interactive AI guidance.
