Edge-Efficient Recipe & Cooking State Generator
- The paper introduces a novel generative model that produces high-fidelity cooked-food images from raw inputs, integrating recipe and cooking-state guidance.
- It leverages a U-Net based generator with FiLM conditioning and an EfficientNet-B1 Siamese network to achieve real-time synthesis and culinary progression monitoring.
- Empirical results show significant improvements in FID and LPIPS with a lightweight architecture suitable for edge deployment, enhancing practical culinary image generation.
The Edge-Efficient Recipe and Cooking State Guided Generator is a generative model architecture designed for real-time synthesis of realistic cooked-food images from raw inputs on edge devices. The approach lets users specify a preferred target state, such as a particular doneness or texture, rather than choosing among static presets, and uses recipe and cooking-state conditioning while remaining computationally feasible for embedded deployment. It incorporates a novel culinary similarity metric that serves as both a perceptual loss and a runtime progress signal, achieving significant improvements in fidelity and perceptual similarity over prior methods, with a tightly integrated hardware-aware design and a domain-specific dataset annotated by culinary experts (Gupta et al., 21 Nov 2025).
1. Model Architecture
The generative framework consists of two principal networks: a U-Net–based generator and an EfficientNet-B1–based Siamese similarity network. The generator, denoted $G_e$, comprises 8.7M parameters and inputs a raw food image $I_{\mathrm{raw}}$, a categorical recipe label $c$ (one of 30 classes), and a discrete cooking state index $d_s$. It outputs a synthesized cooked-state image:

$$\hat{I}_{d_s} = G_e\big(I_{\mathrm{raw}}, c, d_s\big)$$
Conditioning on recipe and cooking state is implemented via FiLM (Feature-wise Linear Modulation) injected into every encoder/decoder block. The context encoding pipeline assigns each (recipe, state) pair a unique index $p_i$, computes a 32-dimensional sinusoidal positional embedding $\mathrm{SPE}(p_i)$, projects it to a context vector $E_p$ via an MLP, and maps $E_p$ to per-layer FiLM parameters $(\gamma_\ell, \beta_\ell)$ using a second MLP:

$$E_p = \mathrm{MLP}_{\mathrm{pos}}\big(\mathrm{SPE}(p_i)\big), \qquad (\gamma_\ell, \beta_\ell) = \mathrm{MLP}_\ell(E_p), \qquad h_\ell \mapsto \gamma_\ell \odot h_\ell + \beta_\ell$$
The U-Net backbone has a base channel width of 32 and channel multipliers (1, 2, 4, 8) across four down/up-sampling stages; each stage contains two ResNet blocks with eight-group GroupNorm. The final output is mapped to RGB via a 1×1 convolution.
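The following PyTorch-style sketch illustrates this context-encoding path. The 32-dimensional sinusoidal embedding and the per-block FiLM modulation follow the description above; the MLP hidden width (128), its activation, and the one-head-per-block layout are assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(index: torch.Tensor, dim: int = 32) -> torch.Tensor:
    """32-dim sinusoidal positional embedding of the (recipe, state) pair index."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = index.float().unsqueeze(-1) * freqs                     # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim)

class FiLMContext(nn.Module):
    """Maps a pair index to a context vector E_p and per-block FiLM params (gamma, beta)."""
    def __init__(self, block_channels, ctx_dim: int = 128):        # ctx_dim is an assumption
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(32, ctx_dim), nn.SiLU(),
                                  nn.Linear(ctx_dim, ctx_dim))
        # One linear head per U-Net block, emitting gamma and beta for its channel count.
        self.heads = nn.ModuleList(nn.Linear(ctx_dim, 2 * c) for c in block_channels)

    def forward(self, pair_index: torch.Tensor):
        ctx = self.proj(sinusoidal_embedding(pair_index))           # E_p
        return [head(ctx).chunk(2, dim=-1) for head in self.heads]  # [(gamma, beta), ...]

def film(h: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Feature-wise linear modulation of a conv feature map h with shape (B, C, H, W)."""
    return gamma[:, :, None, None] * h + beta[:, :, None, None]
```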
For adversarial training, a 70×70 PatchGAN discriminator processes concatenated raw/cooked images using a depth-doubling cascade up to 512 channels.
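A minimal sketch of such a 70×70 PatchGAN over the concatenated raw/cooked pair, following the common pix2pix layout; the kernel sizes, strides, InstanceNorm choice, and LeakyReLU slope are assumptions borrowed from that reference design rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class PatchGAN70(nn.Module):
    """70x70 PatchGAN: scores overlapping patches of the (raw, cooked) pair as real/fake."""
    def __init__(self, in_channels: int = 6, base: int = 64):
        super().__init__()
        def block(c_in, c_out, stride, norm=True):
            layers = [nn.Conv2d(c_in, c_out, 4, stride, 1)]
            if norm:
                layers.append(nn.InstanceNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_channels, base, 2, norm=False),   # 64 channels
            *block(base, base * 2, 2),                  # 128
            *block(base * 2, base * 4, 2),              # 256
            *block(base * 4, base * 8, 1),              # 512 (depth-doubling cap)
            nn.Conv2d(base * 8, 1, 4, 1, 1),            # per-patch real/fake logits
        )

    def forward(self, raw: torch.Tensor, cooked: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([raw, cooked], dim=1))
```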
2. Culinary Image Similarity Metric
Temporal consistency and culinary plausibility are achieved via the Culinary Image Similarity (CIS) metric. An EfficientNet-B1–based Siamese network $f_{\mathrm{sim}}$ outputs L2-normalized embeddings. The cosine similarity

$$\mathrm{CIS}(I_a, I_b) = \frac{f_{\mathrm{sim}}(I_a) \cdot f_{\mathrm{sim}}(I_b)}{\big\|f_{\mathrm{sim}}(I_a)\big\|_2\,\big\|f_{\mathrm{sim}}(I_b)\big\|_2}$$

provides a scalar indication of progression along the cooking trajectory.
Training uses session-timestamped frame pairs $(I_i, I_j)$ labeled

$$s_{ij} = 1 - \frac{|t_i - t_j|}{T},$$

where $t_i$ and $t_j$ are frame times and $T$ is the total session duration. Supervision employs mean squared error:

$$\mathcal{L}_{\mathrm{sim}} = \big(\cos\!\big(f_{\mathrm{sim}}(I_i), f_{\mathrm{sim}}(I_j)\big) - s_{ij}\big)^2$$
This embedding forms the basis for generator training and runtime monitoring, enabling inference-time comparison between live and target images to detect doneness peaks.
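A sketch of the similarity branch and its time-based supervision target, assuming a recent torchvision's EfficientNet-B1 with its classifier replaced by a projection head; the embedding width of 256 is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_b1

class CulinarySimilarityNet(nn.Module):
    """Siamese branch: EfficientNet-B1 features -> L2-normalized embedding."""
    def __init__(self, embed_dim: int = 256):              # embed_dim is an assumption
        super().__init__()
        backbone = efficientnet_b1(weights=None)
        backbone.classifier = nn.Linear(1280, embed_dim)   # replace the 1000-way head
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.backbone(x), dim=-1)       # unit-norm embedding

def cis(f_sim: nn.Module, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """Culinary Image Similarity: cosine of the two embeddings (a dot product, since unit-norm)."""
    return (f_sim(img_a) * f_sim(img_b)).sum(dim=-1)

def pair_target(t_i: float, t_j: float, T: float) -> float:
    """Label s_ij = 1 - |t_i - t_j| / T for frames at times t_i, t_j in a session of length T."""
    return 1.0 - abs(t_i - t_j) / T

# Training step (MSE between predicted cosine and the time-based label):
#   loss = F.mse_loss(cis(f_sim, I_i, I_j), torch.tensor([pair_target(t_i, t_j, T)]))
```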
3. Generator Training Objectives
The generator loss is a weighted sum of adversarial, perceptual, and culinary similarity terms. Given ground-truth cooked image $I_{d_s}$ and synthesized image $\hat{I}_{d_s}$:

- Adversarial (PatchGAN) Loss:

$$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}\big[\log D(I_{\mathrm{raw}}, I_{d_s})\big] + \mathbb{E}\big[\log\big(1 - D(I_{\mathrm{raw}}, \hat{I}_{d_s})\big)\big]$$

- LPIPS Perceptual Loss:

$$\mathcal{L}_{\mathrm{LPIPS}} = \mathrm{LPIPS}\big(\hat{I}_{d_s}, I_{d_s}\big)$$

using a fixed VGG-based LPIPS extractor.

- Culinary Image Similarity Loss:

$$\mathcal{L}_{\mathrm{CIS}} = 1 - \cos\big(f_{\mathrm{sim}}(\hat{I}_{d_s}),\, f_{\mathrm{sim}}(I_{d_s})\big)$$

The full objective:

$$\mathcal{L}_{G} = \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{CIS}}\,\mathcal{L}_{\mathrm{CIS}}$$

with $\lambda_{\mathrm{GAN}} = 1$, $\lambda_{\mathrm{LPIPS}} = 50$, and $\lambda_{\mathrm{CIS}} = 50$.
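A sketch of how these terms combine in a generator step, reusing the discriminator and similarity modules sketched above and the `lpips` package for the fixed VGG-based extractor; the non-saturating BCE form of the adversarial term is an assumption, since the exact GAN formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F
import lpips                                    # pip install lpips (VGG-based LPIPS)

lpips_vgg = lpips.LPIPS(net='vgg').eval()       # fixed perceptual extractor, not trained
for p in lpips_vgg.parameters():
    p.requires_grad_(False)

def generator_loss(D, f_sim, I_raw, I_hat, I_ds,
                   lam_lpips: float = 50.0, lam_cis: float = 50.0) -> torch.Tensor:
    """L_gen = L_GAN + 50 * L_LPIPS + 50 * L_CIS."""
    # Adversarial term: the generator wants D to label the synthesized pair as real.
    fake_logits = D(I_raw, I_hat)
    l_gan = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    # Perceptual term against the ground-truth cooked image.
    l_lpips = lpips_vgg(I_hat, I_ds).mean()
    # Culinary similarity term: 1 - cosine of the Siamese embeddings.
    e_hat = F.normalize(f_sim(I_hat), dim=-1)
    e_gt = F.normalize(f_sim(I_ds), dim=-1)
    l_cis = 1.0 - (e_hat * e_gt).sum(dim=-1).mean()
    return l_gan + lam_lpips * l_lpips + lam_cis * l_cis
```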
Temporal consistency arises from discrete state conditioning and the CIS term, which aligns the generator outputs to a continuous culinary progression in embedding space rather than using explicit 3D or recurrent modeling.
4. Dataset and Data Processing
The approach leverages a novel dataset consisting of 1,708 oven-based cooking sessions spanning 30 recipes, with each session producing a raw starting image and frames captured every 30 seconds, culminating in at least three chef-annotated edible states:
- cs1 = "basic cook" (edible)
- cs2 = "standard" (ideal)
- cs3 = "extended" (extra texture/browning)
Images are uniformly sized at 224×224, captured from a top-mounted camera with the oven closed. Session partitioning is 70/10/20 for train/val/test splits. Training employs on-the-fly augmentations including random horizontal flipping and ±60° rotation.
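An illustrative torchvision transform stack matching these augmentations; the transform ordering, interpolation mode, and the need to apply the same random flip/rotation to both frames of a raw/cooked pair are assumptions.

```python
from torchvision import transforms

# On-the-fly training augmentation: 224x224 resize, random horizontal flip,
# and uniform rotation in [-60°, +60°]. For paired raw/cooked frames the same
# random parameters would be applied to both images.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=60),
    transforms.ToTensor(),
])

# Validation/test: resize only, no augmentation.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```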
5. Training Protocols and Hyperparameters
Similarity Network $f_{\mathrm{sim}}$:
- Optimizer: Adam (lr=1e-4, weight decay=1e-5)
- Step decay: every 10 epochs
- Epochs: 100; batch size 32 (all session pairs per batch)
Generator and Discriminator:
- Optimizer: Adam (lr=2e-4)
- Batch size: 1
- Epochs: 100 (first 50 epochs constant lr, then linear decay)
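Assuming `f_sim`, `G_e`, and `D` are the similarity network, generator, and discriminator modules from the sketches above, these schedules can be set up as follows; the step-decay factor and the linear decay reaching zero at epoch 100 are assumptions, since the text does not state them.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR, LambdaLR

# Similarity network: Adam with step decay every 10 epochs (decay factor is an assumption).
opt_sim = Adam(f_sim.parameters(), lr=1e-4, weight_decay=1e-5)
sched_sim = StepLR(opt_sim, step_size=10, gamma=0.5)

# Generator / discriminator: constant lr for the first 50 epochs, then linear decay.
opt_g = Adam(G_e.parameters(), lr=2e-4)
opt_d = Adam(D.parameters(), lr=2e-4)

def linear_after_50(epoch: int) -> float:
    # Multiplier on the base lr: 1.0 for epochs 0-49, then linearly down to 0 by epoch 100.
    return 1.0 if epoch < 50 else max(0.0, 1.0 - (epoch - 50) / 50.0)

sched_g = LambdaLR(opt_g, lr_lambda=linear_after_50)
sched_d = LambdaLR(opt_d, lr_lambda=linear_after_50)
```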
The entire high-level training and inference pipeline is detailed in the following pseudocode:
```
# Phase 1: train the similarity network f_sim
for epoch in 1..100:
    for each session S:
        frames = [I_0, I_1, ..., I_T]
        for all pairs (i, j) in S:
            s_ij = 1 - |t_i - t_j| / T
            e_i = f_sim(I_i); e_j = f_sim(I_j)
            sim = cosine(e_i, e_j)
            L_sim += (sim - s_ij)^2
        update f_sim via Adam

# Phase 2: train the generator G_e and discriminator D
for epoch in 1..100:
    for each training example:
        I_raw, c, d_s, I_ds = load_batch()
        # Context embedding
        p_i = index(c, d_s)
        E_p = MLP_pos(SPE(p_i))
        # Forward G_e with FiLM at every layer using E_p
        I_hat = G_e(I_raw, E_p)
        # Adversarial (PatchGAN) loss
        L_GAN = AdvLoss(D, I_raw, I_ds, I_hat)
        # LPIPS loss
        L_LPIPS = LPIPS(I_hat, I_ds)
        # CIS loss
        L_CIS = 1 - cosine(f_sim(I_hat), f_sim(I_ds))
        L_gen = L_GAN + 50*L_LPIPS + 50*L_CIS
        update G_e via Adam
        update D via Adam

# Phase 3: inference and doneness monitoring
# Given I_raw, the user picks a desired state d_s*
I_targets = G_e(I_raw, c, cs1..cs3)
display I_targets → user picks I_goal
start oven; every 30s capture I_t:
    p = cosine(f_sim(I_t), f_sim(I_goal))
    update sliding window of p; if local peak → STOP
```
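The sliding-window doneness check in the final loop can be sketched as below; the window size and the exact local-peak criterion are assumptions.

```python
from collections import deque

class DonenessPeakDetector:
    """Flags when CIS(I_t, I_goal) stops rising: the middle of a short sliding
    window is its maximum and the trailing samples have started to fall."""
    def __init__(self, window: int = 5):           # window size is an assumption
        self.history = deque(maxlen=window)

    def update(self, similarity: float) -> bool:
        self.history.append(similarity)
        if len(self.history) < self.history.maxlen:
            return False                           # not enough samples yet
        values = list(self.history)
        mid = len(values) // 2
        return values[mid] == max(values) and values[-1] < values[mid]

# Usage inside the 30-second capture loop:
#   if detector.update(float(cis(f_sim, I_t, I_goal))):
#       stop_cooking()
```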
6. Edge Deployment and Efficiency
Deployment is targeted toward edge hardware. Only the generator ($8.68$M parameters) and similarity network ($5.4$M trainable parameters) are retained at inference, exported via PyTorch→ONNX→NPU runtime. Hybrid quantization (float16 convolutional weights, int8 activations) yields an on-disk footprint of approximately 45 MB. On a 5 TOPS embedded NPU, generation requires ~1.2 s/image (3.6 s for all target states) and the CIS comparison ~0.3 s/pair.
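The PyTorch→ONNX leg of this export path can be sketched with `torch.onnx.export`, assuming the trained `G_e` and `f_sim` modules from the earlier sketches; the input names, opset version, and 128-dimensional context input are assumptions, and the NPU-specific hybrid quantization is vendor-toolchain dependent and only indicated in a comment.

```python
import torch

# Export the trained generator and similarity network to ONNX for the NPU toolchain.
G_e.eval()
f_sim.eval()

dummy_img = torch.randn(1, 3, 224, 224)   # 224x224 RGB input
dummy_ctx = torch.randn(1, 128)           # context vector size is an assumption

torch.onnx.export(G_e, (dummy_img, dummy_ctx), "generator.onnx",
                  input_names=["raw_image", "context"],
                  output_names=["cooked_image"], opset_version=17)
torch.onnx.export(f_sim, (dummy_img,), "similarity.onnx",
                  input_names=["image"], output_names=["embedding"],
                  opset_version=17)

# The hybrid quantization step (fp16 conv weights, int8 activations) and the final
# NPU runtime conversion are performed with the vendor toolchain and are not shown here.
```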
7. Empirical Evaluation
Performance is benchmarked against established baselines on the in-domain cooking dataset (715 test pairs) as well as the public edge2shoes and edge2handbags datasets. The following table summarizes key findings:
| Method | #Params | FID↓ | LPIPS↓ |
|---|---|---|---|
| Pix2Pix [9] | 163 M | 153.00 | 0.4711 |
| Pix2Pix-Turbo [19] | 1290 M | 75.42 | 0.2523 |
| Ours | 8.68 M | 52.18 | 0.2145 |
| Ours (no recipe guide) | 8.68 M | 58.74 | 0.2398 |
| Ours (no CIS loss) | 8.68 M | 54.98 | 0.2310 |
On the public datasets, the FID improvement over Pix2Pix is approximately 60%. The ablation rows show that removing either recipe guidance or the CIS loss degrades both metrics, indicating strong contributions from both explicit conditioning and the culinary similarity metric.
The Edge-Efficient Recipe and Cooking State Guided Generator thus combines domain-specific conditioning, an efficient architecture, and a culinary progression metric to achieve practical, high-fidelity generative food synthesis and progress monitoring under edge hardware constraints (Gupta et al., 21 Nov 2025).