Culinary Image Similarity (CIS) Metric

Updated 1 June 2026

Culinary Image Similarity (CIS) is a domain-adapted metric that quantifies cooking progress by mapping food images into a learned embedding space using a Siamese network.
CIS is integrated as a generative loss in image synthesis, combining adversarial, perceptual, and CIS losses to enforce realistic cooking transitions.
Empirical evaluations show CIS outperforms generic metrics like SSIM and LPIPS by clearly separating doneness stages and enabling real-time culinary monitoring.

The Culinary Image Similarity (CIS) metric is a domain-specific visual similarity measure designed to quantify the progression of food cooking in images and to serve as both a generative loss and a real-time cooking progress monitor. Introduced in the context of real-time cooked food image synthesis and doneness assessment, CIS leverages a learned embedding space to model the nuanced visual evolution of food during preparation, enabling applications in conditional generation and automated monitoring on edge devices (Gupta et al., 21 Nov 2025).

1. Mathematical Formulation

Let $\mathrm{fsim}(\cdot)$ denote a Siamese network-based embedding function that maps an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ to a $D$-dimensional unit-norm vector. The CIS score between two images $I_1$ and $I_2$ is the cosine similarity of their embeddings:

$$F_{\text{cul}}(I_1, I_2) = \cos(\mathrm{fsim}(I_1), \mathrm{fsim}(I_2)) = \frac{\mathrm{fsim}(I_1) \cdot \mathrm{fsim}(I_2)}{\|\mathrm{fsim}(I_1)\|_2 \cdot \|\mathrm{fsim}(I_2)\|_2}$$

Because $\mathrm{fsim}$ outputs $L_2$-normalized vectors, the denominator equals 1 and this reduces to:

$$F_{\text{cul}}(I_1, I_2) = \mathrm{fsim}(I_1)^\top \mathrm{fsim}(I_2)$$

where $F_{\text{cul}} \in [0, 1]$. Cosine similarity is bounded by $[-1, 1]$ in general, so the stated range presumes that embeddings of cooking frames have non-negative dot products.
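The score computation can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors are hypothetical stand-ins for the outputs of $\mathrm{fsim}$, and the helper names `l2_normalize` and `cis_score` are introduced here for illustration only.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def cis_score(e1, e2):
    """Cosine similarity of two embeddings.

    After L2 normalization the cosine is just the dot product.
    """
    u1, u2 = l2_normalize(e1), l2_normalize(e2)
    return sum(a * b for a, b in zip(u1, u2))

# Hypothetical embeddings (assumed values, not from the paper)
raw = [1.0, 0.0, 0.0]
cooked = [0.6, 0.8, 0.0]

print(round(cis_score(raw, raw), 6))     # identical frames -> 1.0
print(round(cis_score(raw, cooked), 6))  # -> 0.6
```

Normalizing inside `cis_score` keeps the function safe for embeddings that are not already unit-norm; with a network that normalizes its own output, the dot product alone would suffice.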

CIS is used in training a conditional image generator. For a generated image $\hat{I}_{ds}$ and its ground-truth counterpart $I_{ds}$, the CIS loss $\mathcal{L}_{\text{CIS}}$ penalizes low similarity $F_{\text{cul}}(\hat{I}_{ds}, I_{ds})$ between the two.

This loss is combined with adversarial ($\mathcal{L}_{\text{adv}}$) and perceptual ($\mathcal{L}_{\text{perc}}$) losses:

$$\mathcal{L}_{G} = \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{CIS}} \mathcal{L}_{\text{CIS}}$$

Here, $\lambda_{\text{adv}}$, $\lambda_{\text{perc}}$, and $\lambda_{\text{CIS}}$ are tunable weights.
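A weighted combination of the three generator losses might look like the sketch below. Two things are assumptions rather than facts from the source: the CIS loss is taken to be one minus the CIS score (a common way to turn a similarity into a loss), and the numeric weights and loss values are hypothetical. The adversarial and perceptual terms are passed in as precomputed scalars.

```python
def cis_loss(similarity):
    """Assumed form of the CIS loss: falls to 0 as the CIS score approaches 1."""
    return 1.0 - similarity

def total_generator_loss(l_adv, l_perc, similarity, w_adv, w_perc, w_cis):
    """Weighted sum of adversarial, perceptual, and CIS terms."""
    return w_adv * l_adv + w_perc * l_perc + w_cis * cis_loss(similarity)

# Hypothetical per-batch values and weights, for illustration only
loss = total_generator_loss(
    l_adv=0.7, l_perc=0.25, similarity=0.9,
    w_adv=1.0, w_perc=10.0, w_cis=10.0,
)
print(round(loss, 4))
```

Raising `w_cis` relative to the other weights makes the generator pay more for outputs whose embedding drifts away from the ground-truth cooking state.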

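For learning $\mathrm{fsim}$ itself, every cooking session comprises $N$ frames $\{I_1, \dots, I_N\}$ sampled, for example, every 30 seconds. Each pair $(I_i, I_j)$ is assigned a temporal proximity target $s_{ij}$ that is highest for frames captured close together in time. Training minimizes the MSE between predicted and target similarities:

$$\mathcal{L}_{\text{sim}} = \frac{1}{|P|} \sum_{(i,j) \in P} \left( F_{\text{cul}}(I_i, I_j) - s_{ij} \right)^2$$

where $P$ is the set of frame pairs used in training.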
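The pairwise training objective can be illustrated as follows. The exact form of the temporal proximity target is not reproduced here; the sketch assumes a linear target that equals 1 for identical frames and 0 for the first and last frame of a session. The function names and the four-frame session are hypothetical.

```python
from itertools import combinations

def temporal_target(i, j, n_frames):
    """Assumed linear target: 1 for identical frames, 0 for first-vs-last."""
    return 1.0 - abs(i - j) / (n_frames - 1)

def similarity_mse(pred, n_frames):
    """Mean squared error between predicted pair similarities and targets.

    pred maps index pairs (i, j) with i < j to a predicted CIS score.
    """
    pairs = list(combinations(range(n_frames), 2))
    errors = [(pred[(i, j)] - temporal_target(i, j, n_frames)) ** 2
              for i, j in pairs]
    return sum(errors) / len(errors)

# Hypothetical 4-frame session: perfect predictions except for one pair
n = 4
pred = {(i, j): temporal_target(i, j, n) for i, j in combinations(range(n), 2)}
pred[(0, 3)] = 0.3  # target is 0.0, so this pair contributes 0.3 ** 2
print(similarity_mse(pred, n))
```

With six pairs and a single squared error of 0.09, the mean lands near 0.015; any other monotone target (for example an exponential in the time gap) would slot into `temporal_target` without changing the rest.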

2. Domain Motivation and Theoretical Justification

CIS explicitly learns a food-cooking–aware visual similarity, diverging from generic metrics like SSIM or LPIPS. It is trained on real cooking session sequences, enabling the embedding space to encode visual cues (color, texture, and structure) that typify cooking transitions such as “doneness.” Supervising embeddings with the temporal proximity target forces the trajectory of cooking states to evolve smoothly through the embedding space, placing early raw frames near each other and fully cooked frames maximally far from them.

This domain-adapted design has two core uses:

- As a generative loss, the CIS term acts as a regularizer, enforcing culinary plausibility by matching synthetic cooking progressions to real ones.
- At inference, CIS functions as a “doneness gauge”: comparing a current frame to a user-selected target state yields a progress score that tracks the underlying culinary transformation.

3. Implementation and Training Protocols

Siamese Network Architecture and Training:

- Backbone: EfficientNet-B1 with random initialization.
- Projection: Two-layer MLP, followed by $L_2$ normalization.
- Loss: MSE between predicted CIS and the temporal proximity target.
- Optimizer: Adam with weight decay; the learning rate is reduced by a fixed factor every 10 epochs.
- Data Augmentation: Random rotations, horizontal flips.
- Epochs: 100.
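The step-decay schedule described for the Siamese network can be sketched as a small function. The base learning rate and decay factor used below are hypothetical placeholders, not the values from the paper; only the 10-epoch step and the 100-epoch horizon come from the text above.

```python
def step_decay_lr(base_lr, gamma, epoch, step_size=10):
    """Learning rate after multiplying by gamma once per completed step."""
    return base_lr * gamma ** (epoch // step_size)

# base_lr and gamma are assumed values chosen only for illustration
BASE_LR = 1e-4
GAMMA = 0.5

schedule = [step_decay_lr(BASE_LR, GAMMA, epoch) for epoch in range(100)]
print(schedule[0], schedule[10], schedule[99])
```

Over 100 epochs the rate is cut nine times, so a factor that is too small shrinks the final rate by several orders of magnitude; this is one way to see why aggressive decay can stall training of the embedding.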

Generator Integration:

- Generator: Text-conditioned U-Net (8 million parameters), with FiLM modulation conditioned on recipe and cooking state embeddings.
- Loss Weights: $\lambda_{\text{adv}}$ (adversarial), $\lambda_{\text{perc}}$ (LPIPS), $\lambda_{\text{CIS}}$ (CIS).
- Batch size: 1.
- Optimizer: Adam, 100 epochs (50 at a constant learning rate, then 50 with linear decay).
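The two-phase generator schedule (50 epochs constant, 50 epochs of linear decay) can be written as a short function. The sketch assumes the decay runs linearly to zero at the end of training, which is a common convention but is not stated in the source; the base learning rate is a hypothetical value.

```python
def generator_lr(base_lr, epoch, constant_epochs=50, decay_epochs=50):
    """Constant for the first phase, then linear decay toward zero.

    The zero endpoint is an assumption made for this sketch.
    """
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

BASE_LR = 2e-4  # hypothetical value, for illustration only

for epoch in (0, 49, 50, 75, 99):
    print(epoch, generator_lr(BASE_LR, epoch))
```

Under these assumptions the rate is unchanged through epoch 50, halves by epoch 75, and reaches one fiftieth of its starting value in the final epoch.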

Real-Time Inference and Monitoring:

- User selects a reference target $I_{\text{target}}$ from multiple generated states.
- Oven camera streams frames $I_t$ at 30-second intervals.
- On-device compute: $F_{\text{cul}}(I_t, I_{\text{target}})$ in 0.3 seconds on a 5 TOPS NPU.
- Maintain a sliding window (3–5 frames, 90–150 seconds) and detect peaks in $F_{\text{cul}}(I_t, I_{\text{target}})$ to trigger stop-cooking signals at optimal doneness.
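The monitoring loop above can be sketched as a sliding-window peak detector. This is one plausible reading of the procedure, not the authors' code: scores are smoothed with a moving average, and a stop is signalled once the smoothed curve has risen and then fallen. The class name `DonenessMonitor` and the score sequence are hypothetical.

```python
from collections import deque

class DonenessMonitor:
    """Smooths CIS scores over a sliding window and flags a peak."""

    def __init__(self, window=3):
        self.window = deque(maxlen=window)
        self.smoothed = []

    def update(self, score):
        """Add one CIS score; return True when the smoothed curve has just peaked."""
        self.window.append(score)
        if len(self.window) < self.window.maxlen:
            return False  # not enough frames to smooth yet
        self.smoothed.append(sum(self.window) / len(self.window))
        if len(self.smoothed) < 3:
            return False
        a, b, c = self.smoothed[-3:]
        return a < b and b > c  # b was a local maximum

# Hypothetical CIS scores against the target, one per 30-second frame
scores = [0.40, 0.55, 0.70, 0.82, 0.90, 0.93, 0.91, 0.86, 0.78]

monitor = DonenessMonitor(window=3)
stop_index = next((t for t, s in enumerate(scores) if monitor.update(s)), None)
print(stop_index)  # -> 7
```

In this example the raw maximum sits at frame 5 but the trigger fires at frame 7, because the detector needs a smoothed value on each side of the peak to confirm it. A wider window suppresses more noise at the cost of a longer confirmation delay.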

4. Hyperparameterization and Tuning Strategies

- Loss Weighting ($\lambda_{\text{adv}}$, $\lambda_{\text{perc}}$, $\lambda_{\text{CIS}}$): Empirically, weighting LPIPS and CIS equally ($\lambda_{\text{perc}} = \lambda_{\text{CIS}}$) gives the best trade-off between FID/LPIPS and semantic faithfulness. Adjusting $\lambda_{\text{CIS}}$ shifts that balance: higher values prioritize temporal consistency, lower values favor fine texture detail.
- Embedding Dimension ($D$): Controls the capacity of CIS to encode cooking variability. Increasing $D$ beyond 128 provides diminishing returns relative to computational cost.
- Optimizer Configuration: A gradually decaying learning rate is recommended for $\mathrm{fsim}$; aggressive decay can collapse embedding spread and reduce metric effectiveness.
- Inference Smoothing: A moving average window of 3–5 frames is robust against transient noise; shorter windows increase reactivity but raise the risk of early or noisy trigger events due to occlusion or lighting fluctuation.

5. Comparative Quantitative Analysis

CIS provides significantly greater discrimination among cooking stages than SSIM or LPIPS. On a test set of 715 image pairs:

| Comparison      | SSIM  | LPIPS | CIS   |
|-----------------|-------|-------|-------|
| Raw vs Basic    | 0.722 | 0.209 | 0.519 |
| Raw vs Standard | 0.663 | 0.235 | 0.366 |
| Raw vs Extended | 0.628 | 0.260 | 0.214 |

SSIM and LPIPS span only ≈0.1–0.15 between raw and cooked states, while CIS covers ≈0.8, enabling clear separation of doneness stages.
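To make the comparison concrete, the spread of each metric across the three table rows (Raw vs Basic, Standard, Extended) can be recomputed directly from the reported values. The snippet below only does arithmetic on the numbers in the table; the dictionary layout and the `spread` helper are introduced here for illustration.

```python
# Values copied from the comparison table (Raw vs Basic, Standard, Extended)
table = {
    "SSIM":  [0.722, 0.663, 0.628],
    "LPIPS": [0.209, 0.235, 0.260],
    "CIS":   [0.519, 0.366, 0.214],
}

def spread(values):
    """Difference between the largest and smallest reported value."""
    return max(values) - min(values)

spreads = {name: round(spread(values), 3) for name, values in table.items()}
for name, value in spreads.items():
    print(f"{name}: {value}")
```

Across these three rows the metrics vary by 0.094 (SSIM), 0.051 (LPIPS), and 0.305 (CIS), so CIS separates the doneness stages roughly three to six times more widely than the generic metrics on this table.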

Ablation analysis further indicates:
- Full model (with CIS): FID = 52.18, LPIPS = 0.2145
- Without $\mathcal{L}_{\text{CIS}}$: FID = 54.98 (+5.3%), LPIPS = 0.2310 (+7.8%)

On dataset-wide baselines:
- Pix2Pix (per-state): FID = 153.00
- Pix2Pix-Turbo: FID = 75.42
- With CIS: FID = 52.18 (30% lower than Turbo, 66% lower than Pix2Pix)

Public datasets (Edge2Shoes, Edge2Handbags) show 27–40% FID reduction compared to strong Pix2Pix baselines (Gupta et al., 21 Nov 2025).
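The relative changes quoted in this section can be rechecked from the reported FID and LPIPS values. The snippet below works only with the rounded figures given above, so its results may differ from the quoted percentages by up to roughly a percentage point; the dictionary keys and the `relative_change` helper are names chosen here for illustration.

```python
def relative_change(new, baseline):
    """Signed percentage change of new relative to baseline."""
    return 100.0 * (new - baseline) / baseline

# Reported values from the ablation and baseline comparisons
fid = {"full": 52.18, "no_cis": 54.98, "pix2pix": 153.00, "pix2pix_turbo": 75.42}
lpips = {"full": 0.2145, "no_cis": 0.2310}

changes = {
    "FID without CIS vs full": relative_change(fid["no_cis"], fid["full"]),
    "LPIPS without CIS vs full": relative_change(lpips["no_cis"], lpips["full"]),
    "FID full vs Pix2Pix-Turbo": relative_change(fid["full"], fid["pix2pix_turbo"]),
    "FID full vs Pix2Pix": relative_change(fid["full"], fid["pix2pix"]),
}
for label, pct in changes.items():
    print(f"{label}: {pct:+.1f}%")
```

Positive values mean the metric got worse (higher FID or LPIPS) and negative values mean it improved, which matches the sign convention used in the ablation list.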

6. Temporal Dynamics and Case Studies

Empirical visualizations demonstrate CIS's alignment with culinary change:
- Raw-to-cooked Progression: Synthetic images progress from pale dough (raw) through golden crust (basic) to deep browning (extended), as observed in Figure 1 of (Gupta et al., 21 Nov 2025).
- Trajectory Plots: Measured against the raw starting frame, CIS decreases monotonically from 1.0 to ≈0.1 throughout cooking, with inflection points corresponding to major transitions (e.g., rapid surface browning). SSIM/LPIPS remain largely flat, failing to capture these transitions.
- Stop Detection: Measured against the selected target state, only CIS shows a distinct peak at the chef-annotated doneness point; generic metrics do not display reliable extrema.
- Salmon Steak Example: Raw vs Basic CIS ≈ 0.72, with the CIS peak corresponding to chef-defined doneness; system auto-stop is triggered within one frame of ground truth.
- Chocolate Chip Cookie Example: Initial rapid CIS drop (0–8 min) as batter sets, followed by gradual flattening; generic metrics do not show sensitivity to these transitions.

7. Summary and Broader Implications

Culinary Image Similarity is a learned, culinary-domain–aware metric whose embedding space encodes cooking progression as a continuous, temporally smooth trajectory. Its application enables:
- Improved image synthesis and visual fidelity as a generator loss.
- Real-time, chef-aligned doneness monitoring for automated cooking systems.
- Substantially enhanced discrimination of cooking stages compared to SSIM or LPIPS, yielding both improved FID/LPIPS scores and higher practical interpretability for culinary tasks (Gupta et al., 21 Nov 2025).

This suggests that domain-adapted similarity metrics such as CIS offer clear advantages in generative and monitoring applications where visual semantics evolve along constrained, meaning-rich trajectories and generic perceptual metrics are insufficiently discriminative.


References

1. Gupta et al. Real-Time Cooked Food Image Synthesis and Visual Cooking Progress Monitoring on Edge Devices (2025).
