
Qualcomm Interactive Cooking

Updated 2 December 2025
  • Qualcomm Interactive Cooking is a comprehensive suite that combines datasets, benchmarks, and edge-optimized generative models for real-time culinary guidance.
  • It leverages a lightweight U-Net architecture with FiLM modulation and cosine similarity metrics to synthesize visually plausible cooked food states based on user-selected doneness.
  • The system incorporates streaming multimodal LLMs and a detailed benchmark to deliver live instructional feedback, mistake detection, and progress monitoring in interactive cooking workflows.

Qualcomm Interactive Cooking refers to a suite of datasets, benchmarks, and edge-deployable generative and streaming models enabling real-time, interactive cooking guidance and cooked food image synthesis, specifically designed for stepwise, user-situated coaching and visual progress monitoring in home cooking settings. This encompasses both (1) an edge-optimized system for generating cooked food state images and monitoring doneness on Qualcomm NPUs, and (2) the Qualcomm Interactive Cooking benchmark and task suite for evaluating multimodal LLMs on live, frame-based instructional guidance, mistake detection, and feedback generation within dense, temporally labeled cooking workflows (Gupta et al., 21 Nov 2025; Bhattacharyya et al., 27 Nov 2025).

1. Food Image Synthesis and Progress Monitoring on Qualcomm Edge Devices

The Qualcomm Interactive Cooking system for food synthesis formulates the prediction of visually plausible cooked states from raw images, conditioned on user-selected doneness and recipe, as an image-to-image generative modeling task. The generator $G_\theta$ adopts a single, lightweight U-Net–style architecture (≈8.7M parameters), accepting as input:

  • $I_\text{raw} \in \mathbb{R}^{224\times224\times3}$: the raw camera image,
  • $c$: a recipe identifier,
  • $d_s \in \{\text{cs}_1, \text{cs}_2, \ldots, \text{cs}_n\}$: a discrete doneness state (“basic,” “standard,” or “extended”).

Conditioning is achieved via learned sinusoidal positional embeddings $E_{p_i}$, with FiLM-based scaling and shifting of encoder/decoder features at each depth, enabling strong parameter efficiency and flexible per-recipe/state control. The forward computation interleaves convolutions and FiLM modulation, with skip-connections, and outputs a 3-channel image $\hat{I}_{d_s}$ (Gupta et al., 21 Nov 2025).
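
The following sketch illustrates FiLM-style conditioning of a convolutional block on recipe and doneness embeddings (PyTorch). It is a minimal illustration under stated assumptions, not the published architecture: plain learned embeddings stand in for the paper's sinusoidal positional embeddings, and the channel counts, group sizes, and module names are hypothetical.

```python
# Minimal sketch of FiLM conditioning in a U-Net-style block (PyTorch).
# Plain learned embeddings stand in for the paper's sinusoidal positional
# embeddings; channel counts, group sizes, and module names are assumptions.
import torch
import torch.nn as nn

class Conditioner(nn.Module):
    """Embeds recipe id and doneness state into one conditioning vector."""
    def __init__(self, n_recipes: int, n_states: int, cond_dim: int = 128):
        super().__init__()
        self.recipe_emb = nn.Embedding(n_recipes, cond_dim)
        self.state_emb = nn.Embedding(n_states, cond_dim)

    def forward(self, recipe_id: torch.Tensor, state_id: torch.Tensor) -> torch.Tensor:
        return self.recipe_emb(recipe_id) + self.state_emb(state_id)

class FiLMBlock(nn.Module):
    """Conv -> GroupNorm -> FiLM (per-channel scale/shift) -> ReLU, with a residual skip."""
    def __init__(self, channels: int, cond_dim: int = 128):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.GroupNorm(num_groups=8, num_channels=channels)
        self.to_film = nn.Linear(cond_dim, 2 * channels)  # produces gamma and beta

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        h = self.norm(self.conv(x))
        gamma, beta = self.to_film(cond).chunk(2, dim=-1)
        h = gamma[..., None, None] * h + beta[..., None, None]  # FiLM modulation
        return torch.relu(h) + x

# Example: condition a 32-channel feature map on recipe 4, doneness state 1.
cond = Conditioner(n_recipes=30, n_states=3)(torch.tensor([4]), torch.tensor([1]))
features = torch.randn(1, 32, 224, 224)        # e.g. output of an initial stem conv
out = FiLMBlock(channels=32)(features, cond)   # (1, 32, 224, 224)
```

In the full generator, blocks of this kind would be stacked with downsampling, upsampling, and skip connections to form the U-Net.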

The training objective is a composite generator loss:

$$L_\text{gen} = \lambda_1 L_\text{GAN} + \lambda_2 L_\text{LPIPS} + \lambda_3 L_\text{CIS}$$

with domain-aware weights $(\lambda_1, \lambda_2, \lambda_3) = (1, 50, 50)$. The loss combines adversarial realism (PatchGAN), deep perceptual similarity (LPIPS), and “Culinary Image Similarity” (CIS), a bespoke cosine-similarity metric defined for session-aligned cooked food progression images. The CIS metric enforces both temporal smoothness and culinary plausibility: a 128-dim Siamese embedding $f_\text{sim}$ projects each image to unit norm, and

$$F_\text{cul}(I_i, I_j) = \cos\big(f_\text{sim}(I_i), f_\text{sim}(I_j)\big) \in [0,1]$$

serves as both training signal (minimizing difference from a temporal label) and runtime indicator of doneness progression.
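
A hedged sketch of how the composite objective and the CIS term could be wired together is shown below; the PatchGAN discriminator, LPIPS network, and Siamese embedding are assumed callables rather than the authors' implementations.

```python
# Hedged sketch of L_gen = w1*L_GAN + w2*L_LPIPS + w3*L_CIS.
# `disc` (PatchGAN), `lpips_fn`, and `f_sim` are assumed callables,
# not the authors' released modules.
import torch
import torch.nn.functional as F

def culinary_image_similarity(f_sim, img_a, img_b):
    """F_cul: cosine similarity of unit-normalized 128-d Siamese embeddings."""
    za = F.normalize(f_sim(img_a), dim=-1)
    zb = F.normalize(f_sim(img_b), dim=-1)
    return (za * zb).sum(dim=-1)   # cosine in [-1, 1]; the paper reports [0, 1]

def generator_loss(fake, real, disc, lpips_fn, f_sim, target_sim,
                   weights=(1.0, 50.0, 50.0)):
    logits_fake = disc(fake)
    l_gan = F.binary_cross_entropy_with_logits(          # fool the PatchGAN
        logits_fake, torch.ones_like(logits_fake))
    l_lpips = lpips_fn(fake, real).mean()                # perceptual term
    l_cis = F.mse_loss(                                  # match the temporal label
        culinary_image_similarity(f_sim, fake, real), target_sim)
    w1, w2, w3 = weights
    return w1 * l_gan + w2 * l_lpips + w3 * l_cis
```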

Training occurs on a dataset comprising 1,708 oven sessions (30 recipes; 3 chef-annotated states per session; 70/10/20 split). Standard image augmentations are applied.

2. Edge Optimization and Deployment on Qualcomm Hardware

To achieve real-time operation on resource-constrained appliances, the entire pipeline employs hybrid quantization (float16 weights, int8 activations) for both the generator $G_\theta$ and the CIS network $f_\text{sim}$. Structured channel pruning removes ≈10–20% of channels in deeper ResNet blocks. All convolutions are 3×3 depthwise-separable with group normalization; heavy modules such as dilated convolutions and self-attention are omitted to minimize inference latency and memory footprint.
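
A minimal sketch of the depthwise-separable 3×3 convolution unit with group normalization is given below; the channel counts, group size, and SiLU activation are assumptions, not the released configuration.

```python
# Sketch of the 3x3 depthwise-separable convolution + group-norm unit.
# Channel counts, group size, and the SiLU activation are assumptions.
import torch.nn as nn

def ds_conv_block(in_ch: int, out_ch: int, norm_groups: int = 8) -> nn.Sequential:
    return nn.Sequential(
        # Depthwise 3x3: one spatial filter per input channel.
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1, groups=in_ch, bias=False),
        # Pointwise 1x1: mixes information across channels.
        nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
        nn.GroupNorm(norm_groups, out_ch),   # out_ch must be divisible by norm_groups
        nn.SiLU(),
    )
```

For in_ch = out_ch = C, the separable split replaces roughly 9C² multiply-accumulates per pixel with 9C + C², close to a 9× reduction for large C, which is the kind of saving that keeps the generator's per-pass cost in the low-GFLOPs range reported below.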

On a 5 TOPS Qualcomm NPU, the model footprint is ≈45 MB ROM (generator ≈40 MB, similarity net ≈5 MB), ≈200 MB peak RAM, and ≈4 GFLOPs per generation pass. Per-frame generation latency is ≈1.2 s (224×224), with CIS comparison at 0.3 s per pair. Three target images are synthesized in ≈3.6 s at session start, enabling user-interactive selection of doneness appearance. All computations are performed on-device using Snapdragon 8cx Gen 3/8c Gen 2, leveraging Hexagon DSP and Qualcomm AI Engine via ONNX/QNSDK/Hexagon NN libraries (Gupta et al., 21 Nov 2025).

3. Interactive Cooking Guidance: Dataset, Benchmark, and Task Suite

The Qualcomm Interactive Cooking benchmark, introduced in the context of situated, streaming multimodal LLM guidance (Bhattacharyya et al., 27 Nov 2025), builds upon CaptainCook4D, extending it with:

  • 384 egocentric cooking videos (~94 hours), each with temporal action and mistake annotations.
  • Densely timed, step-by-step natural-language “instructions” (inserted at each action segment’s start) and “feedback” (success or mistake alerts), each precisely timestamped.
  • Broad coverage of home-cooking tasks (e.g., slicing, mixing, heating) in both “linear” plan-following and “advanced” (out-of-order/divergent) planning regimes.

Mistake annotations are manually timestamped at the first visual indication of error. Categories include preparation, technique, measurement, temperature, and timing errors (order and missing-step errors omitted from timestamped feedback due to re-planning complexity). Dataset statistics and splits are structured as follows:

Set                Videos   Hours   Instructions   Feedbacks
Main Set: Train    213      52.4    2,913          3,080
Main Set: Val      62       15.7    861            916
Main Set: Test     109      26.4    1,489          1,580
Adv. Set: Train    209      51.5    2,888 (+244)   2,771

Instructions and feedback are streamed asynchronously: each instruction is presented at the step start; feedback is emitted at action end (for successes) or at mistake frame (for errors), without explicit user prompting.
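
An illustrative (not released) schema for such a timestamped instruction/feedback stream might look as follows; the field names and example events are hypothetical.

```python
# Illustrative schema (not the released annotation format) for the dense,
# timestamped instruction/feedback stream described above.
from dataclasses import dataclass
from typing import Literal, Optional

@dataclass
class StreamEvent:
    t: float                                  # seconds from video start
    kind: Literal["instruction", "feedback"]
    step_id: int
    text: str
    mistake_type: Optional[str] = None        # e.g. "measurement", "timing"

events = [
    StreamEvent(t=12.0, kind="instruction", step_id=1, text="Dice the onion."),
    StreamEvent(t=47.5, kind="feedback", step_id=1, text="Step 1 looks complete."),
    StreamEvent(t=63.2, kind="feedback", step_id=2,
                text="That appears to be too much salt.", mistake_type="measurement"),
]
# Instructions fire at step start; success feedback fires at action end,
# mistake feedback at the first visual indication of the error.
```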

4. Streaming LLM Architectures and Model Evaluation

LiveMamba is the reference streaming multimodal LLM developed for this benchmark. The architecture comprises:

  • Vision encoder: InternViT-300M-448px (≈1,025 tokens/frame at 2 fps).
  • Q-Former adapter (4 cross-attention layers) downselecting to 32 visual tokens.
  • Mamba-130M LLM backbone (selective-state recurrent LLM).
  • Special stream tokens: “<vision>” (request next frame), “<response>” (emit text).

Inference proceeds autoregressively: after each frame, the model decides whether to emit an instruction/feedback or continue reading frames. In the advanced set, re-planning is delegated to an external Qwen3-32B planner which, upon divergence, updates the plan and infers step alignment; LiveMamba itself does not maintain plan memory.
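
A conceptual sketch of this per-frame decision loop is shown below; encode_frame, lm_step, and decode_text are placeholder callables, not LiveMamba's actual API.

```python
# Conceptual sketch of the streaming decision loop: after each frame the model
# either requests the next frame ("<vision>") or emits text ("<response>").
# encode_frame, lm_step, and decode_text are placeholders, not LiveMamba's API.
VISION, RESPONSE = "<vision>", "<response>"

def stream_session(frames, encode_frame, lm_step, decode_text, fps: float = 2.0):
    outputs = []
    state = None                                  # recurrent (Mamba-style) state
    for i, frame in enumerate(frames):            # frames arrive at ~2 fps
        vis_tokens = encode_frame(frame)          # ViT -> Q-Former -> 32 tokens
        action, state = lm_step(vis_tokens, state)
        if action == RESPONSE:
            text, state = decode_text(state)      # autoregressive text emission
            outputs.append((i / fps, text))       # (timestamp in s, utterance)
        # action == VISION: silently continue to the next frame
    return outputs
```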

Benchmark evaluation comprises four principal tasks:

  1. Instruction completion detection (“has the user completed step $k$?”).
  2. Mistake detection (“has the user made a mistake in this step?”).
  3. Feedback generation (natural-language correction/acknowledgement).
  4. Advanced re-planning (upon divergence).

Metrics include instruction completion accuracy (IC-Acc), mistake detection precision/recall/F1, temporal alignment error, and feedback fluency (BERTScore, ROUGE-L). A correct completion or mistake alert must occur within ±15 s of ground-truth.
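
A small sketch of scoring timestamped predictions against ground truth under the ±15 s tolerance follows; greedy one-to-one matching is an assumption about the evaluation protocol, not a documented detail.

```python
# Sketch of matching timestamped predictions to ground truth within ±15 s;
# greedy one-to-one matching is an assumption about the protocol.
def match_within_window(pred_times, gt_times, window: float = 15.0):
    """Return (true positives, false positives, false negatives)."""
    unmatched = list(gt_times)
    tp = 0
    for p in sorted(pred_times):
        hit = next((g for g in unmatched if abs(p - g) <= window), None)
        if hit is not None:
            unmatched.remove(hit)
            tp += 1
    return tp, len(pred_times) - tp, len(unmatched)

tp, fp, fn = match_within_window(pred_times=[30.2, 95.0], gt_times=[28.0, 120.0])
precision = tp / (tp + fp) if tp + fp else 0.0   # 0.5
recall = tp / (tp + fn) if tp + fn else 0.0      # 0.5
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```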

5. Experimental Results and Failure Analysis

Zero-shot streaming LLMs (GPT-4o, LLaVA-NeXT, Qwen2-VL, Gemini-2.5-Flash) underperform in fine-grained, live progress tracking (IC-Acc <20%). VideoLLM-online and Qwen2-VL-7B achieve IC-Acc 0.03% and 6.3%, respectively. Fine-tuned LiveMamba achieves IC-Acc=31.5%, F1=0.13 (main set); performance drops for advanced planning (IC-Acc=12.6%) due to out-of-order step complexity. BERTScore and ROUGE-L for feedback fluency are higher for LiveMamba (full) than all baselines. Turn-based (step-wise) evaluation yields higher IC-Acc (LiveMamba†=51.0%), suggesting that streaming, asynchronous mode introduces significant difficulty for current models (Bhattacharyya et al., 27 Nov 2025).

Failure analysis indicates that subtle measurement errors (e.g., mismeasuring ingredients), small object manipulations, and temporally precise mistake alerts remain challenging. The need to rapidly align visual evidence to instruction steps and mistakes exposes model limitations in temporal segmentation and action understanding.

6. Interactive UX and Practical System Integration

On-device, the workflow involves capturing an initial raw frame, synthesizing all three target doneness states, and displaying them as selectable tiles. The user selects a preferred visual target; the oven then operates while the NPU periodically (every 30 s) scores live frames against the selected target with the CIS metric to indicate progression. The system halts cooking when the CIS-derived progress peaks locally and provides final visual overlays (real vs. target) as well as haptic/audio feedback (beeps at 75% and completion). Optionally, oven parameters (temperature, time) may be adjusted based on the pace of progress.
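
A minimal sketch of such a monitoring loop is shown below; the capture, scoring, oven, and notification interfaces are placeholders, and treating the 75% beep as a CIS-score threshold is an interpretation of the described UX rather than a documented rule.

```python
# Sketch of the on-device monitoring loop: score live frames against the
# user-selected target with the CIS metric every 30 s and stop at a local peak.
# capture_frame, cis_score, stop_oven, and notify are placeholder callables.
import time

def monitor_cooking(capture_frame, cis_score, target_img, stop_oven, notify,
                    poll_s: float = 30.0, patience: int = 2):
    best, worse_streak, beeped_75 = 0.0, 0, False
    while True:
        score = cis_score(capture_frame(), target_img)   # progress estimate in [0, 1]
        if score >= 0.75 and not beeped_75:
            notify("75% progress")                        # first beep
            beeped_75 = True
        if score > best:
            best, worse_streak = score, 0
        else:
            worse_streak += 1                             # progress no longer rising
        if worse_streak >= patience:                      # local peak reached
            stop_oven()
            notify("done")                                # completion beep
            return best
        time.sleep(poll_s)
```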

All inference, similarity, and monitoring are performed edge-only; responsiveness (<2 s synthesis, <0.5 s similarity comparison) enables real-time deployment. The infrastructure relies on Qualcomm Neural Processing SDK, NNAPI, and Hexagon NN, supporting both Android Things and embedded Linux environments.

This suggests that such real-time, in-situ culinary assistants are now feasible for consumer appliances where cloud compute is infeasible or privacy-sensitive. A plausible implication is that the core methodologies (domain-specific image similarity, temporally-dense annotation, streaming feedback) may generalize to other procedural domains such as assembly or fitness coaching.

7. Limitations and Future Research Directions

The Qualcomm Interactive Cooking ecosystem is, at present, domain-specific (cooking, egocentric view) and does not exploit audio or variable camera rates. Advanced planning, subtle mistake detection, and early corrective feedback remain open challenges. Future directions identified by the originating research include expansion to new domains (e.g., DIY tasks), incorporation of richer modalities (speech, hand-tracking), improved in-memory re-planning for LLMs, and development of more efficient streaming attention for extended edge deployment scenarios (such as AR glasses) (Bhattacharyya et al., 27 Nov 2025).

Qualcomm Interactive Cooking thus constitutes both a practical reference for embedded cooking guidance and a formal benchmark for real-time, live, multimodal instructional interaction.
