Edge-Efficient Recipe & Cooking State Generator
- The paper introduces a novel generative model that produces high-fidelity cooked-food images from raw inputs, integrating recipe and cooking-state guidance.
- It leverages a U-Net based generator with FiLM conditioning and an EfficientNet-B1 Siamese network to achieve real-time synthesis and culinary progression monitoring.
- Empirical results show significant improvements in FID and LPIPS with a lightweight architecture suitable for edge deployment, enhancing practical culinary image generation.
The Edge-Efficient Recipe and Cooking State Guided Generator is a generative model architecture designed for real-time synthesis of realistic cooked-food images from raw inputs on edge devices. The approach lets users specify a preferred target state, such as a particular doneness or texture, rather than choosing among static presets, and uses recipe and cooking-state conditioning while remaining computationally feasible for embedded deployment. It incorporates a novel culinary similarity metric that serves as both a perceptual loss and a runtime progress signal, achieving significant improvements in fidelity and perceptual similarity over prior methods, with a tightly integrated hardware-aware design and a domain-specific dataset annotated by culinary experts (Gupta et al., 21 Nov 2025).
1. Model Architecture
The generative framework consists of two principal networks: a U-Net–based generator and an EfficientNet-B1–based Siamese similarity network. The generator, denoted $G_e$, comprises 8.7M parameters and inputs a raw food image $I_{\mathrm{raw}}$, a categorical recipe label $c$ (one of 30 classes), and a discrete cooking state index $d_s$. It outputs a synthesized cooked-state image:

$$\hat{I}_{d_s} = G_e\big(I_{\mathrm{raw}}, c, d_s\big)$$
Conditioning on recipe and cooking state is implemented via FiLM (Feature-wise Linear Modulation) injected into every encoder/decoder block. The context encoding pipeline assigns each (recipe, state) pair a unique index $p_i$, computes a 32-dimensional sinusoidal positional embedding $\mathrm{SPE}(p_i)$, projects it to a context vector $E_p$ via an MLP, and maps $E_p$ to per-layer FiLM parameters $(\gamma_\ell, \beta_\ell)$ using a second MLP:

$$E_p = \mathrm{MLP}_{\mathrm{pos}}\big(\mathrm{SPE}(p_i)\big), \qquad (\gamma_\ell, \beta_\ell) = \mathrm{MLP}_\ell(E_p), \qquad h_\ell \mapsto \gamma_\ell \odot h_\ell + \beta_\ell$$
The U-Net backbone has a base channel width of 32 and channel multipliers (1, 2, 4, 8) across four down/up-sampling stages; each stage contains two ResNet blocks with eight-group GroupNorm. The final output is mapped to RGB via a 1×1 convolution.
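The following PyTorch-style sketch illustrates this context-encoding path. The 32-dimensional sinusoidal embedding and the per-block FiLM modulation follow the description above; the MLP hidden width (128), its activation, and the one-head-per-block layout are assumptions.

```python
import math
import torch
import torch.nn as nn

def sinusoidal_embedding(index: torch.Tensor, dim: int = 32) -> torch.Tensor:
    """32-dim sinusoidal positional embedding of the (recipe, state) pair index."""
    half = dim // 2
    freqs = torch.exp(-math.log(10000.0) * torch.arange(half).float() / half)
    args = index.float().unsqueeze(-1) * freqs                     # (B, half)
    return torch.cat([torch.sin(args), torch.cos(args)], dim=-1)   # (B, dim)

class FiLMContext(nn.Module):
    """Maps a pair index to a context vector E_p and per-block FiLM params (gamma, beta)."""
    def __init__(self, block_channels, ctx_dim: int = 128):        # ctx_dim is an assumption
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(32, ctx_dim), nn.SiLU(),
                                  nn.Linear(ctx_dim, ctx_dim))
        # One linear head per U-Net block, emitting gamma and beta for its channel count.
        self.heads = nn.ModuleList(nn.Linear(ctx_dim, 2 * c) for c in block_channels)

    def forward(self, pair_index: torch.Tensor):
        ctx = self.proj(sinusoidal_embedding(pair_index))           # E_p
        return [head(ctx).chunk(2, dim=-1) for head in self.heads]  # [(gamma, beta), ...]

def film(h: torch.Tensor, gamma: torch.Tensor, beta: torch.Tensor) -> torch.Tensor:
    """Feature-wise linear modulation of a conv feature map h with shape (B, C, H, W)."""
    return gamma[:, :, None, None] * h + beta[:, :, None, None]
```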
For adversarial training, a 70×70 PatchGAN discriminator processes concatenated raw/cooked images using a depth-doubling cascade up to 512 channels.
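A minimal sketch of such a 70×70 PatchGAN over the concatenated raw/cooked pair, following the common pix2pix layout; the kernel sizes, strides, InstanceNorm choice, and LeakyReLU slope are assumptions borrowed from that reference design rather than details stated in the paper.

```python
import torch
import torch.nn as nn

class PatchGAN70(nn.Module):
    """70x70 PatchGAN: scores overlapping patches of the (raw, cooked) pair as real/fake."""
    def __init__(self, in_channels: int = 6, base: int = 64):
        super().__init__()
        def block(c_in, c_out, stride, norm=True):
            layers = [nn.Conv2d(c_in, c_out, 4, stride, 1)]
            if norm:
                layers.append(nn.InstanceNorm2d(c_out))
            layers.append(nn.LeakyReLU(0.2, inplace=True))
            return layers
        self.net = nn.Sequential(
            *block(in_channels, base, 2, norm=False),   # 64 channels
            *block(base, base * 2, 2),                  # 128
            *block(base * 2, base * 4, 2),              # 256
            *block(base * 4, base * 8, 1),              # 512 (depth-doubling cap)
            nn.Conv2d(base * 8, 1, 4, 1, 1),            # per-patch real/fake logits
        )

    def forward(self, raw: torch.Tensor, cooked: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([raw, cooked], dim=1))
```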
2. Culinary Image Similarity Metric
Temporal consistency and culinary plausibility are achieved via the Culinary Image Similarity (CIS) metric. An EfficientNet-B1–based Siamese network $f_{\mathrm{sim}}$ outputs L2-normalized embeddings. The cosine similarity

$$\mathrm{CIS}(I_a, I_b) = \frac{f_{\mathrm{sim}}(I_a) \cdot f_{\mathrm{sim}}(I_b)}{\big\|f_{\mathrm{sim}}(I_a)\big\|_2\,\big\|f_{\mathrm{sim}}(I_b)\big\|_2}$$

provides a scalar indication of progression along the cooking trajectory.
Training uses session-timestamped frame pairs $(I_i, I_j)$ labeled

$$s_{ij} = 1 - \frac{|t_i - t_j|}{T},$$

where $t_i$ and $t_j$ are frame times and $T$ is the total session duration. Supervision employs mean squared error:

$$\mathcal{L}_{\mathrm{sim}} = \big(\cos\!\big(f_{\mathrm{sim}}(I_i), f_{\mathrm{sim}}(I_j)\big) - s_{ij}\big)^2$$
This embedding forms the basis for generator training and runtime monitoring, enabling inference-time comparison between live and target images to detect doneness peaks.
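A sketch of the similarity branch and its time-based supervision target, assuming a recent torchvision's EfficientNet-B1 with its classifier replaced by a projection head; the embedding width of 256 is an assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import efficientnet_b1

class CulinarySimilarityNet(nn.Module):
    """Siamese branch: EfficientNet-B1 features -> L2-normalized embedding."""
    def __init__(self, embed_dim: int = 256):              # embed_dim is an assumption
        super().__init__()
        backbone = efficientnet_b1(weights=None)
        backbone.classifier = nn.Linear(1280, embed_dim)   # replace the 1000-way head
        self.backbone = backbone

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return F.normalize(self.backbone(x), dim=-1)       # unit-norm embedding

def cis(f_sim: nn.Module, img_a: torch.Tensor, img_b: torch.Tensor) -> torch.Tensor:
    """Culinary Image Similarity: cosine of the two embeddings (a dot product, since unit-norm)."""
    return (f_sim(img_a) * f_sim(img_b)).sum(dim=-1)

def pair_target(t_i: float, t_j: float, T: float) -> float:
    """Label s_ij = 1 - |t_i - t_j| / T for frames at times t_i, t_j in a session of length T."""
    return 1.0 - abs(t_i - t_j) / T

# Training step (MSE between predicted cosine and the time-based label):
#   loss = F.mse_loss(cis(f_sim, I_i, I_j), torch.tensor([pair_target(t_i, t_j, T)]))
```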
3. Generator Training Objectives
The generator loss is a weighted sum of adversarial, perceptual, and culinary similarity terms. Given ground-truth cooked image $I_{d_s}$ and synthesized image $\hat{I}_{d_s}$:

- Adversarial (PatchGAN) Loss:

$$\mathcal{L}_{\mathrm{GAN}} = \mathbb{E}\big[\log D(I_{\mathrm{raw}}, I_{d_s})\big] + \mathbb{E}\big[\log\big(1 - D(I_{\mathrm{raw}}, \hat{I}_{d_s})\big)\big]$$

- LPIPS Perceptual Loss:

$$\mathcal{L}_{\mathrm{LPIPS}} = \mathrm{LPIPS}\big(\hat{I}_{d_s}, I_{d_s}\big)$$

using a fixed VGG-based LPIPS extractor.

- Culinary Image Similarity Loss:

$$\mathcal{L}_{\mathrm{CIS}} = 1 - \cos\big(f_{\mathrm{sim}}(\hat{I}_{d_s}),\, f_{\mathrm{sim}}(I_{d_s})\big)$$

The full objective:

$$\mathcal{L}_{G} = \lambda_{\mathrm{GAN}}\,\mathcal{L}_{\mathrm{GAN}} + \lambda_{\mathrm{LPIPS}}\,\mathcal{L}_{\mathrm{LPIPS}} + \lambda_{\mathrm{CIS}}\,\mathcal{L}_{\mathrm{CIS}}$$

with $\lambda_{\mathrm{GAN}} = 1$, $\lambda_{\mathrm{LPIPS}} = 50$, and $\lambda_{\mathrm{CIS}} = 50$.
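A sketch of how these terms combine in a generator step, reusing the discriminator and similarity modules sketched above and the `lpips` package for the fixed VGG-based extractor; the non-saturating BCE form of the adversarial term is an assumption, since the exact GAN formulation is not reproduced here.

```python
import torch
import torch.nn.functional as F
import lpips                                    # pip install lpips (VGG-based LPIPS)

lpips_vgg = lpips.LPIPS(net='vgg').eval()       # fixed perceptual extractor, not trained
for p in lpips_vgg.parameters():
    p.requires_grad_(False)

def generator_loss(D, f_sim, I_raw, I_hat, I_ds,
                   lam_lpips: float = 50.0, lam_cis: float = 50.0) -> torch.Tensor:
    """L_gen = L_GAN + 50 * L_LPIPS + 50 * L_CIS."""
    # Adversarial term: the generator wants D to label the synthesized pair as real.
    fake_logits = D(I_raw, I_hat)
    l_gan = F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))
    # Perceptual term against the ground-truth cooked image.
    l_lpips = lpips_vgg(I_hat, I_ds).mean()
    # Culinary similarity term: 1 - cosine of the Siamese embeddings.
    e_hat = F.normalize(f_sim(I_hat), dim=-1)
    e_gt = F.normalize(f_sim(I_ds), dim=-1)
    l_cis = 1.0 - (e_hat * e_gt).sum(dim=-1).mean()
    return l_gan + lam_lpips * l_lpips + lam_cis * l_cis
```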
Temporal consistency arises from discrete state conditioning and the CIS term, which aligns the generator outputs to a continuous culinary progression in embedding space rather than using explicit 3D or recurrent modeling.
4. Dataset and Data Processing
The approach leverages a novel dataset consisting of 1,708 oven-based cooking sessions spanning 30 recipes, with each session producing a raw starting image and frames captured every 30 seconds, culminating in at least three chef-annotated edible states:
- cs1 = "basic cook" (edible)
- cs2 = "standard" (ideal)
- cs3 = "extended" (extra texture/browning)
Images are uniformly sized at 224×224, captured from a top-mounted camera with the oven closed. Session partitioning is 70/10/20 for train/val/test splits. Training employs on-the-fly augmentations including random horizontal flipping and ±60° rotation.
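An illustrative torchvision transform stack matching these augmentations; the transform ordering, interpolation mode, and the need to apply the same random flip/rotation to both frames of a raw/cooked pair are assumptions.

```python
from torchvision import transforms

# On-the-fly training augmentation: 224x224 resize, random horizontal flip,
# and uniform rotation in [-60°, +60°]. For paired raw/cooked frames the same
# random parameters would be applied to both images.
train_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=60),
    transforms.ToTensor(),
])

# Validation/test: resize only, no augmentation.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```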
5. Training Protocols and Hyperparameters
Similarity Network $f_{\mathrm{sim}}$:
- Optimizer: Adam (lr=1e-4, weight decay=1e-5)
- Step decay: every 10 epochs
- Epochs: 100; batch size 32 (all session pairs per batch)
Generator and Discriminator:
- Optimizer: Adam (lr=2e-4)
- Batch size: 1
- Epochs: 100 (first 50 epochs constant lr, then linear decay)
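Assuming `f_sim`, `G_e`, and `D` are the similarity network, generator, and discriminator modules from the sketches above, these schedules can be set up as follows; the step-decay factor and the linear decay reaching zero at epoch 100 are assumptions, since the text does not state them.

```python
from torch.optim import Adam
from torch.optim.lr_scheduler import StepLR, LambdaLR

# Similarity network: Adam with step decay every 10 epochs (decay factor is an assumption).
opt_sim = Adam(f_sim.parameters(), lr=1e-4, weight_decay=1e-5)
sched_sim = StepLR(opt_sim, step_size=10, gamma=0.5)

# Generator / discriminator: constant lr for the first 50 epochs, then linear decay.
opt_g = Adam(G_e.parameters(), lr=2e-4)
opt_d = Adam(D.parameters(), lr=2e-4)

def linear_after_50(epoch: int) -> float:
    # Multiplier on the base lr: 1.0 for epochs 0-49, then linearly down to 0 by epoch 100.
    return 1.0 if epoch < 50 else max(0.0, 1.0 - (epoch - 50) / 50.0)

sched_g = LambdaLR(opt_g, lr_lambda=linear_after_50)
sched_d = LambdaLR(opt_d, lr_lambda=linear_after_50)
```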
The entire high-level training and inference pipeline is detailed in the following pseudocode:
```
# Phase 1: train the similarity network f_sim
for epoch in 1..100:
    for each session S:
        frames = [I_0, I_1, ..., I_T]
        for all pairs (i, j) in S:
            s_ij = 1 - |t_i - t_j| / T
            e_i = f_sim(I_i); e_j = f_sim(I_j)
            sim = cosine(e_i, e_j)
            L_sim += (sim - s_ij)^2
        update f_sim via Adam

# Phase 2: train the generator G_e and discriminator D
for epoch in 1..100:
    for each training example:
        I_raw, c, d_s, I_ds = load_batch()
        # Context embedding
        p_i = index(c, d_s)
        E_p = MLP_pos(SPE(p_i))
        # Forward G_e with FiLM at every layer using E_p
        I_hat = G_e(I_raw, E_p)
        # Adversarial (PatchGAN) loss
        L_GAN = AdvLoss(D, I_raw, I_ds, I_hat)
        # LPIPS loss
        L_LPIPS = LPIPS(I_hat, I_ds)
        # CIS loss
        L_CIS = 1 - cosine(f_sim(I_hat), f_sim(I_ds))
        L_gen = L_GAN + 50*L_LPIPS + 50*L_CIS
        update G_e via Adam
        update D via Adam

# Phase 3: inference and doneness monitoring
# Given I_raw, the user picks a desired state d_s*
I_targets = G_e(I_raw, c, cs1..cs3)
display I_targets → user picks I_goal
start oven; every 30s capture I_t:
    p = cosine(f_sim(I_t), f_sim(I_goal))
    update sliding window of p; if local peak → STOP
```
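The sliding-window doneness check in the final loop can be sketched as below; the window size and the exact local-peak criterion are assumptions.

```python
from collections import deque

class DonenessPeakDetector:
    """Flags when CIS(I_t, I_goal) stops rising: the middle of a short sliding
    window is its maximum and the trailing samples have started to fall."""
    def __init__(self, window: int = 5):           # window size is an assumption
        self.history = deque(maxlen=window)

    def update(self, similarity: float) -> bool:
        self.history.append(similarity)
        if len(self.history) < self.history.maxlen:
            return False                           # not enough samples yet
        values = list(self.history)
        mid = len(values) // 2
        return values[mid] == max(values) and values[-1] < values[mid]

# Usage inside the 30-second capture loop:
#   if detector.update(float(cis(f_sim, I_t, I_goal))):
#       stop_cooking()
```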
6. Edge Deployment and Efficiency
Deployment is targeted toward edge hardware. Only the generator ($8.68$M parameters) and similarity network ($5.4$M trainable parameters) are retained at inference, exported via PyTorch→ONNX→NPU runtime. Hybrid quantization (float16 convolutional weights, int8 activations) yields an on-disk footprint of approximately 45 MB. On a 5 TOPS embedded NPU, generation requires ~1.2 s/image (3.6 s for all target states) and the CIS comparison ~0.3 s/pair.
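The PyTorch→ONNX leg of this export path can be sketched with `torch.onnx.export`, assuming the trained `G_e` and `f_sim` modules from the earlier sketches; the input names, opset version, and 128-dimensional context input are assumptions, and the NPU-specific hybrid quantization is vendor-toolchain dependent and only indicated in a comment.

```python
import torch

# Export the trained generator and similarity network to ONNX for the NPU toolchain.
G_e.eval()
f_sim.eval()

dummy_img = torch.randn(1, 3, 224, 224)   # 224x224 RGB input
dummy_ctx = torch.randn(1, 128)           # context vector size is an assumption

torch.onnx.export(G_e, (dummy_img, dummy_ctx), "generator.onnx",
                  input_names=["raw_image", "context"],
                  output_names=["cooked_image"], opset_version=17)
torch.onnx.export(f_sim, (dummy_img,), "similarity.onnx",
                  input_names=["image"], output_names=["embedding"],
                  opset_version=17)

# The hybrid quantization step (fp16 conv weights, int8 activations) and the final
# NPU runtime conversion are performed with the vendor toolchain and are not shown here.
```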
7. Empirical Evaluation
Performance is benchmarked against established baselines on the in-domain cooking dataset (715 test pairs) as well as the public edge2shoes and edge2handbags datasets. The following table summarizes key findings:
| Method | #Params | FID↓ | LPIPS↓ |
|---|---|---|---|
| Pix2Pix [9] | 163 M | 153.00 | 0.4711 |
| Pix2Pix-Turbo [19] | 1290 M | 75.42 | 0.2523 |
| Ours | 8.68 M | 52.18 | 0.2145 |
| Ours (no recipe guide) | 8.68 M | 58.74 | 0.2398 |
| Ours (no CIS loss) | 8.68 M | 54.98 | 0.2310 |
On the public datasets, the FID improvement over Pix2Pix is approximately 60%. The ablation rows show that removing either recipe guidance or the CIS loss degrades both metrics, indicating strong contributions from both explicit conditioning and the culinary similarity metric.
The Edge-Efficient Recipe and Cooking State Guided Generator thus combines domain-specific conditioning, an efficient architecture, and a culinary progression metric to achieve practical, high-fidelity generative food synthesis and progress monitoring under edge hardware constraints (Gupta et al., 21 Nov 2025).