Qualcomm Interactive Cooking Benchmark

Updated 4 April 2026

Qualcomm Interactive Cooking Benchmark is a multimodal framework that evaluates live, step-by-step task guidance in egocentric cooking videos.
It leverages densely annotated data from the CaptainCook4D dataset to measure real-time instruction delivery, completion detection, and mistake diagnosis.
The benchmark employs rigorous evaluation protocols and advanced baselines like LiveMamba to drive improvements in interactive AI coaching performance.

Qualcomm Interactive Cooking Benchmark is a large-scale, multimodal evaluation framework for assessing the capabilities of models—especially multi-modal LLMs—in delivering live, step-by-step task guidance during egocentric cooking activities. It is constructed atop the CaptainCook4D dataset and incorporates dense, time-aligned human annotations including instructions, success acknowledgments, and precisely timestamped mistake alerts. Its primary goal is to enable rigorous evaluation and advancement of streaming AI assistants capable of situated coaching in real time, with particular emphasis on instruction delivery, completion detection, and mistake diagnosis under asynchronous, non-turn-based interaction constraints (Bhattacharyya et al., 27 Nov 2025).

1. Dataset Construction and Structure

The benchmark extends CaptainCook4D, which consists of 384 egocentric videos of individuals preparing various recipes (e.g., bruschetta, scrambled eggs, tuna wraps). Each video is annotated with a graph-structured recipe and temporally segmented actions; segments average approximately 52.8 seconds, with video processed at 2 frames per second. The Qualcomm extensions introduce:

Densely annotated, time-aligned instructional steps (“plan steps”) grouped from action segments, sorted by start time, and collapsed into single instructions for parallel tasks.
Feedback annotations comprising both success messages (“You’ve successfully ...”) at action end times and mistake alerts (“You made a mistake ...”) precisely timestamped to error occurrence.
Removal of “order errors” and “missing steps” from the mistake taxonomy, which concentrates on five categories: preparation, technique, measurement, temperature, and timing errors.
Two distinct subsets:
- Main Set (non-divergent): Users largely follow step sequence.
- Advanced Planning Set (divergent): Out-of-order steps necessitate re-planning.
Replanning subset statistics: Among videos requiring replanning (about one third in Advanced), divergence occurs roughly every 2.7 instructions or every 2.6 minutes; advanced set videos average ≈13 steps with 4–6 re-plan events each.
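
The grouping of action segments into plan steps can be sketched as follows. This is a minimal illustration, not the authors' annotation pipeline: the `Segment` type, the rule that temporally overlapping segments count as parallel tasks, and the " and "-joined instruction text are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    start: float  # seconds from the start of the video
    end: float
    text: str


def group_plan_steps(segments):
    """Sort segments by start time and merge overlapping ones into one step.

    Overlap is used here as a stand-in for "parallel tasks" (an assumption).
    """
    steps = []
    for seg in sorted(segments, key=lambda s: s.start):
        if steps and seg.start < steps[-1].end:
            # Overlaps the previous step: collapse into a single instruction.
            prev = steps[-1]
            steps[-1] = Segment(prev.start, max(prev.end, seg.end),
                                f"{prev.text} and {seg.text}")
        else:
            steps.append(Segment(seg.start, seg.end, seg.text))
    return steps


segments = [
    Segment(60.0, 120.0, "fry the eggs"),
    Segment(0.0, 40.0, "slice the bread"),
    Segment(50.0, 110.0, "toast the bread"),
]
steps = group_plan_steps(segments)
for step in steps:
    print(step)
```

In this toy input, toasting and frying overlap in time, so they collapse into one instruction while slicing remains its own step.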

Data Scale

Split	Videos	Length (hrs)	Instructions	Success feedback	Mistake feedback
Training	213	52.4	2,913	2,394	686
Validation	62	15.7	861	659	257
Testing	109	26.4	1,489	1,135	445

The average training video is about 15 minutes long. Tasks range from simple operations (e.g., slicing) to intricate, concurrent steps (e.g., toasting bread and frying eggs in parallel).
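
As a quick arithmetic check on the table, the short sketch below recomputes the totals and the average training-video length from the listed values. The dictionary layout is only an illustrative container; the numbers are copied from the table.

```python
# Values copied from the Data Scale table.
splits = {
    "training":   {"videos": 213, "hours": 52.4, "instructions": 2913},
    "validation": {"videos": 62,  "hours": 15.7, "instructions": 861},
    "testing":    {"videos": 109, "hours": 26.4, "instructions": 1489},
}

total_videos = sum(s["videos"] for s in splits.values())
total_hours = sum(s["hours"] for s in splits.values())

train = splits["training"]
avg_train_minutes = train["hours"] * 60 / train["videos"]
avg_train_instructions = train["instructions"] / train["videos"]

print(total_videos, round(total_hours, 1))
print(round(avg_train_minutes, 1), round(avg_train_instructions, 1))
```

The split sizes sum to the 384 videos of CaptainCook4D, and the training average comes out just under 15 minutes per video.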

2. Benchmark Task Design

The benchmark is structured to emulate live, asynchronous guidance rather than traditional turn-based dialog. A model receives a continuous video stream (2 fps) and a structured recipe plan, and is evaluated simultaneously across three interleaved capabilities:

Instruction Delivery: Emit the next instruction at the planned step’s start timestamp, autonomously (no external prompt).
Completion Detection: Signal step completion by detecting correct action execution via egocentric visual input.
Mistake Detection & Feedback: Immediately generate natural language alerts for any observed mistake segment as events occur in real time.

Notably, the system must determine “when to speak” without user prompting and react promptly to dynamic visual cues.
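
To make the interleaving of the three capabilities concrete, here is a toy state machine over a 2 fps stream. It is only a sketch under stated assumptions: the per-frame labels (`None`, `"done"`, `"mistake"`) stand in for a hypothetical perception module, and the function name `guide` and the message templates are illustrative rather than taken from the benchmark.

```python
def guide(plan, observations, fps=2.0):
    """Emit timestamped instructions, acknowledgments, and mistake alerts.

    plan: ordered list of step descriptions.
    observations: one label per frame from a hypothetical perception module.
    """
    events = [(0.0, f"Instruction: {plan[0]}")]
    step = 0
    for i, obs in enumerate(observations):
        if step >= len(plan):
            break  # recipe finished
        t = i / fps  # frame index to seconds
        if obs == "mistake":
            events.append((t, f"You made a mistake while: {plan[step]}"))
        elif obs == "done":
            events.append((t, f"You've successfully completed: {plan[step]}"))
            step += 1
            if step < len(plan):
                # Next instruction is issued without any user prompt.
                events.append((t, f"Instruction: {plan[step]}"))
    return events


plan = ["slice the bread", "toast the bread"]
observations = [None, "mistake", None, "done", None, "done"]
events = guide(plan, observations)
for t, message in events:
    print(f"{t:4.1f}s  {message}")
```

No user turn appears anywhere in the loop: every utterance is triggered by the video stream alone, which is the asynchronous setting the benchmark targets.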

For the Advanced Planning Set, if an out-of-order step is detected, feedback begins with “You did not follow the instruction ...”, and an external replanner is invoked to update subsequent instructions.

3. Evaluation Protocols and Metrics

Multiple criteria assess performance under a tolerance window $\Delta = 30$ seconds (approximately 25% of typical step length):

Instruction Completion Accuracy (IC-Acc):

$\text{IC-Acc} = \frac{\#\{\text{correctly detected completions}\}}{N_\text{instr}}$

A completion is correct if $|t_\text{det} - t_\text{gt}| \leq 30$ s.
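
A minimal sketch of the IC-Acc computation, assuming completions are keyed by a step identifier and that a step with no detection counts as incorrect. The function name and data layout are hypothetical.

```python
def ic_acc(detected, ground_truth, delta=30.0):
    """Fraction of instructions whose completion is detected within delta seconds.

    detected / ground_truth: dicts mapping step id -> completion time (s).
    Steps missing from `detected` count as incorrect (an assumption).
    """
    correct = sum(
        1
        for step, t_gt in ground_truth.items()
        if detected.get(step) is not None
        and abs(detected[step] - t_gt) <= delta
    )
    return correct / len(ground_truth)


gt = {"s1": 40.0, "s2": 95.0, "s3": 160.0, "s4": 230.0}
det = {"s1": 48.0, "s2": 130.0, "s4": 200.0}  # s3 was never detected
score = ic_acc(det, gt)
print(score)
```

Here `s1` and `s4` fall inside the 30-second window (the latter exactly on the boundary), `s2` is 35 seconds late, and `s3` is missed, giving 2 of 4 correct.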

Mistake Detection Metrics: Within the tolerance window $\Delta$,
- Precision $P = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}$,
- Recall $R = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}$,
- F1 Score $\mathrm{F1} = 2\frac{P R}{P + R}$,
where TP, FP, and FN refer to true positives, false positives, and false negatives based on alignment of detections with ground truth.
Timing Error: For each TP, $\Delta t = t_\text{det} - t_\text{gt}$.
Feedback Fluency: Textual quality of correct detections is scored via BERTScore and ROUGE-L against ground truth.
Latency & Throughput: Token latency (frame-to-token delay, mean ≈1.1 s), real-time factor (processing at 8.1 fps against a 2 fps input, ≈4$\times$), and replanning latency (Qwen3-32B: ≈6.1 s).
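
The mistake-detection metrics can be illustrated with a small matching routine. The source states only that detections are aligned with ground truth within the tolerance window; the greedy, one-to-one, nearest-first matching used below is an assumption, as is the function name `match_alerts`.

```python
def match_alerts(pred_times, gt_times, delta=30.0):
    """Greedy one-to-one matching of predicted alerts to ground-truth mistakes."""
    unmatched = sorted(gt_times)
    timing_errors = []  # signed t_det - t_gt for each true positive
    fp = 0
    for t_pred in sorted(pred_times):
        candidates = [t for t in unmatched if abs(t_pred - t) <= delta]
        if candidates:
            t_gt = min(candidates, key=lambda t: abs(t_pred - t))
            unmatched.remove(t_gt)
            timing_errors.append(t_pred - t_gt)
        else:
            fp += 1  # alert with no ground-truth mistake nearby
    tp, fn = len(timing_errors), len(unmatched)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1, timing_errors


p, r, f1, errs = match_alerts([12.0, 100.0, 300.0],
                              [20.0, 110.0, 200.0, 410.0])
print(round(p, 3), round(r, 3), round(f1, 3), errs)
```

Two alerts land within 30 seconds of a real mistake, one is spurious, and two mistakes go undetected, so precision is 2/3, recall is 1/2, and the negative timing errors indicate alerts that fired slightly early.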

4. Baseline and Model Performance

A comprehensive evaluation covers both zero-shot and fine-tuned LLMs under streaming constraints. All streaming baselines use prompting or “helper LLM” queries at 5-second intervals.

Zero-Shot Results—Main Set

Method	IC-Acc↑	P↑	R↑	F1↑	BERT↑	ROUGE-L↑
Gemini-2.5-Flash	23.1	0.01	0.22	0.02	0.410	0.342
Qwen2.5-VL-7B	18.9	0.18	0.01	0.02	0.299	0.219
VideoLLM-online	0.03	0.02	0.98	0.04	0.332	0.248
(others)	<2.0	~0	<0.7	~0	0.0	0.0

No off-the-shelf video LLM achieves reliable performance: IC-Acc <2% for most, low mistake F1 throughout, and over-alerting yields high recall but negligible precision.

Fine-Tuned Performance—Main Set

Method	IC-Acc↑	P↑	R↑	F1↑	BERT↑	ROUGE-L↑
VideoLLM-online†	7.6	0.04	0.01	0.01	0.434	0.412
LiveMamba (w/o-ICAug)	7.8	0.05	0.01	0.01	0.605	0.542
LiveMamba (w/o-CFAug)	14.3	0.12	0.03	0.05	0.558	0.511
LiveMamba (full)	31.5	0.17	0.10	0.13	0.651	0.561

On the Advanced Set, LiveMamba with full data augmentation and iterative replanning achieves additional gains (IC-Acc 12.6%, F1 0.19, BERTScore 0.941, ROUGE-L 0.927).

Turn-Based Evaluation

Performance for all models improves when steps are independent: LiveMamba† achieves IC-Acc = 51.0%, F1 = 0.19; Qwen2.5-VL-7B obtains 38.9% and F1 = 0.06.

5. LiveMamba Model Architecture and Adaptations

LiveMamba is introduced as a streaming, low-latency, multi-modal LLM baseline, distinguished by several architectural and procedural innovations:

Vision Frontend: InternViT-300M-448px-V2_5 produces 1,024 tokens per frame; VisionZip pruning reduces these to 256 tokens per frame, and a Q-Former adapter with 4 cross-attention layers then distills them to 32 tokens per frame for the downstream LM.
Language Core: Recurrent Mamba-130M for linear-time context modeling over long video/text inputs, optimized for low memory and edge deployment.
Control Tokens: <vision> (request next frame), <response> (emit instruction, acknowledgment, or alert at precise timestep), supporting “when-to-say” autonomy.
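
The role of the two control tokens can be sketched as a decode loop in which the model either asks for the next frame or speaks. This is a simplified illustration, not LiveMamba's implementation: `decide` is a hypothetical stand-in for the language model, and returning a whole utterance at once replaces token-by-token generation.

```python
VISION, RESPONSE = "<vision>", "<response>"


def stream_decode(frames, decide, fps=2.0):
    """Interleave frame ingestion with autonomous text emission.

    decide(context) -> (control_token, text_or_None) is a hypothetical model call.
    """
    context, outputs = [], []
    for i, frame in enumerate(frames):
        context.append(frame)
        token, text = decide(context)
        if token == RESPONSE:
            outputs.append((i / fps, text))
            context.append(text)  # emitted text becomes part of the history
        # On VISION the loop simply moves on to the next frame.
    return outputs


def toy_decide(context):
    # Toy policy: acknowledge as soon as the latest frame shows the finished action.
    if context[-1] == "egg_cracked":
        return RESPONSE, "You've successfully cracked the egg."
    return VISION, None


outputs = stream_decode(["idle", "idle", "egg_cracked", "idle"], toy_decide)
print(outputs)
```

Because the decision to speak is itself a predicted token, the timing of each utterance is under the model's control rather than dictated by a fixed polling interval.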

Data Augmentation

Temporal Jitter: Randomize instruction starts within ±30 s windows to simulate natural drift.
Instruction-Completion Augmentation: Transform segments from EPIC-KITCHENS/Ego4D into successful step pairs (mistake-free) to enhance positive class detection.
Counterfactual Mistakes: Employ Qwen2.5-32B to generate plausible noun/verb swaps (e.g., “wash beans”→“wash carrots”) as synthetic mistakes, annotated with “point-of-no-return” timestamps.
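
The temporal jitter augmentation can be sketched in a few lines. The uniform sampling distribution, the clamp at zero, and the function name `jitter_starts` are assumptions; the source specifies only the ±30 s window.

```python
import random


def jitter_starts(start_times, max_shift=30.0, seed=None):
    """Shift each instruction start by a random offset in [-max_shift, max_shift].

    Offsets are drawn uniformly (an assumption) and clamped so that no
    start time becomes negative.
    """
    rng = random.Random(seed)
    return [max(0.0, t + rng.uniform(-max_shift, max_shift))
            for t in start_times]


starts = [0.0, 55.0, 120.0, 210.0]
jittered = jitter_starts(starts, seed=0)
print([round(t, 1) for t in jittered])
```

Passing a seed makes the augmentation reproducible across training runs; omitting it draws fresh offsets each time.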

Iterative Re-planning

Upon out-of-sequence step detection, LiveMamba invokes Qwen3-32B to (1) identify the unintended step, then (2) determine whether to repeat or skip instructions, pruning future tasks based on recipe graphs.
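
The graph-based pruning step can be illustrated with a rule-based stand-in. In the source the re-planner is an LLM (Qwen3-32B); the sketch below replaces it with a plain topological sort over a hypothetical recipe graph, so it shows only the idea of dropping already-completed steps while respecting dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def replan(recipe_graph, completed):
    """Return the remaining steps in a dependency-respecting order.

    recipe_graph maps each step to the set of steps it depends on.
    Steps the user has already done (possibly out of order) are skipped.
    """
    order = TopologicalSorter(recipe_graph).static_order()
    return [step for step in order if step not in completed]


# Hypothetical recipe graph for illustration.
recipe = {
    "slice bread": set(),
    "toast bread": {"slice bread"},
    "chop tomatoes": set(),
    "top bread": {"toast bread", "chop tomatoes"},
}

# The user chopped tomatoes early instead of slicing the bread first.
remaining = replan(recipe, completed={"chop tomatoes"})
print(remaining)
```

The early step is removed from the future plan, and the remaining instructions still respect the recipe's prerequisites.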

Performance Characteristics

LiveMamba (full fine-tuning) improves on the best zero-shot IC-Acc (31.5% vs. 23.1%, a relative gain of about 36%) and raises mistake F1 from the 0.02–0.04 range of zero-shot models to 0.13.
Maintains low average latency (~1.1 s per token) and operates at ~4$\times$ the video frame rate on a single H100.
With iterative re-planning, further gains on the Advanced Set are observed.
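
The comparisons in this section reduce to simple ratios, worked out below from the figures reported in the Main Set tables and the latency metrics. The helper name `relative_gain` is illustrative.

```python
def relative_gain(new, old):
    """Relative improvement of `new` over `old`, as a fraction."""
    return (new - old) / old


# IC-Acc (%) values from the Main Set tables.
livemamba_full = 31.5
best_zero_shot = 23.1         # Gemini-2.5-Flash
videollm_online_ft = 7.6      # fine-tuned VideoLLM-online

gain_vs_zero_shot = relative_gain(livemamba_full, best_zero_shot)
ratio_vs_streaming = livemamba_full / videollm_online_ft

# Real-time factor: processing rate over input rate.
real_time_factor = 8.1 / 2.0

print(round(gain_vs_zero_shot, 2), round(ratio_vs_streaming, 1),
      round(real_time_factor, 2))
```

On these numbers, the full model is roughly 36% better than the best zero-shot IC-Acc, about 4.1 times the fine-tuned VideoLLM-online score, and runs at about four times the input frame rate.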

6. Significance and Research Implications

Qualcomm Interactive Cooking Benchmark establishes the first dedicated, large-scale evaluation suite for live, asynchronous, stepwise task guidance in realistic, mistake-rich egocentric settings. It introduces precisely timestamped error conditions, enabling robust evaluation of models’ ability to deliver timely, non-prompted instructional interventions and feedback in open-ended visual scenarios. LiveMamba demonstrates that architectural efficiency, control over output timing, data augmentation strategies, and dynamic replanning jointly improve situated coaching performance, with a reported 360% improvement in step completion accuracy and a 500% increase in mistake F1 relative to prior streaming video-LLM solutions (Bhattacharyya et al., 27 Nov 2025).

A plausible implication is that this benchmark will catalyze development of interactive video assistants with real-time situational awareness and adaptive instructional capabilities, directly supporting the broader trajectory toward general-purpose AI coaching agents under unconstrained, asynchronous user behaviors.

References

Bhattacharyya et al. (2025). Can Multi-Modal LLMs Provide Live Step-by-Step Task Guidance?
