Culinary Image Similarity (CIS) Metric

Updated 1 June 2026
  • Culinary Image Similarity (CIS) is a domain-adapted metric that quantifies cooking progress by mapping food images into a learned embedding space using a Siamese network.
  • CIS is integrated as a generative loss in image synthesis, combining adversarial, perceptual, and CIS losses to enforce realistic cooking transitions.
  • Empirical evaluations show CIS outperforms generic metrics like SSIM and LPIPS by clearly separating doneness stages and enabling real-time culinary monitoring.

The Culinary Image Similarity (CIS) metric is a domain-specific visual similarity measure designed to quantify the progression of food cooking in images and to serve as both a generative loss and a real-time cooking progress monitor. Introduced in the context of real-time cooked food image synthesis and doneness assessment, CIS leverages a learned embedding space to model the nuanced visual evolution of food during preparation, enabling applications in conditional generation and automated monitoring on edge devices (Gupta et al., 21 Nov 2025).

1. Mathematical Formulation

Let $\mathrm{fsim}(\cdot)$ denote a Siamese network-based embedding function that maps an RGB image $I \in \mathbb{R}^{H \times W \times 3}$ to a $D$-dimensional unit-norm vector. The CIS score between two images $I_1$ and $I_2$ is the cosine similarity of their embeddings:

$$F_{\text{cul}}(I_1, I_2) = \cos(\mathrm{fsim}(I_1), \mathrm{fsim}(I_2)) = \frac{\mathrm{fsim}(I_1) \cdot \mathrm{fsim}(I_2)}{\|\mathrm{fsim}(I_1)\|_2 \cdot \|\mathrm{fsim}(I_2)\|_2}$$

Given that $\mathrm{fsim}$ outputs $L_2$-normalized vectors, this reduces to:

$$F_{\text{cul}}(I_1, I_2) = \mathrm{fsim}(I_1)^\top \mathrm{fsim}(I_2)$$

where $F_{\text{cul}} \in [0, 1]$.
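The score above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the helper names `l2_normalize` and `cis_score` and the three-dimensional stand-in embeddings are assumptions, since the real embeddings would come from the trained Siamese network.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector to unit L2 norm (eps guards against division by zero)."""
    v = np.asarray(v, dtype=np.float64)
    return v / max(np.linalg.norm(v), eps)

def cis_score(emb_a, emb_b):
    """Cosine similarity of two embeddings: a dot product once both are unit-norm."""
    return float(np.dot(l2_normalize(emb_a), l2_normalize(emb_b)))

# Stand-in embeddings (hypothetical values, not outputs of the real network).
raw = np.array([0.9, 0.1, 0.2])
cooked = np.array([0.2, 0.8, 0.5])

same = cis_score(raw, raw)        # identical inputs give a score of 1 (up to rounding)
different = cis_score(raw, cooked)
print(same, different)
```

Because both vectors are normalized first, rescaling an embedding leaves the score unchanged, which is why the cosine form and the dot-product form agree.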

CIS is used in training a conditional image generator: for generated image $\hat{I}_{ds}$ and ground-truth $I_{ds}$, the CIS loss is

$$\mathcal{L}_{\text{CIS}} = 1 - F_{\text{cul}}(\hat{I}_{ds}, I_{ds})$$

This loss is incorporated with adversarial ($\mathcal{L}_{\text{adv}}$) and perceptual ($\mathcal{L}_{\text{perc}}$) losses:

$$\mathcal{L}_{G} = \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{CIS}} \mathcal{L}_{\text{CIS}}$$

Here, $\lambda_{\text{adv}}$, $\lambda_{\text{perc}}$, and $\lambda_{\text{CIS}}$ are tunable weights.

For learning $\mathrm{fsim}$ itself, every cooking session comprises $T$ frames $\{I_1, \dots, I_T\}$ sampled, for example, every 30 seconds. Each pair $(I_i, I_j)$ is assigned a temporal proximity target $s_{ij} \in [0, 1]$ that is largest for frames close in time. Training minimizes MSE between predicted and true similarities:

$$\mathcal{L}_{\text{sim}} = \mathbb{E}_{(i,j)}\left[\left(F_{\text{cul}}(I_i, I_j) - s_{ij}\right)^2\right]$$
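To make the training signal concrete, the sketch below builds pairwise temporal proximity targets for one session and evaluates the MSE against cosine similarities. The linear target (1 for identical frames, falling to 0 at the largest gap) is an assumption chosen for illustration; the function names and the random stand-in embeddings are likewise hypothetical.

```python
import numpy as np

def temporal_targets(num_frames):
    """Assumed target: 1 for identical frames, falling linearly to 0 at the largest gap."""
    idx = np.arange(num_frames)
    gap = np.abs(idx[:, None] - idx[None, :])
    return 1.0 - gap / max(num_frames - 1, 1)

def similarity_mse(embeddings, targets):
    """MSE between pairwise cosine similarities of the embeddings and the targets."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    predicted = unit @ unit.T  # cosine similarity for every frame pair
    return float(np.mean((predicted - targets) ** 2))

targets = temporal_targets(5)          # one session of 5 frames
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 8))          # stand-in for network outputs
loss = similarity_mse(emb, targets)
print(targets[0], loss)
```

With five frames the first row of targets is 1, 0.75, 0.5, 0.25, 0, so frames further apart in time are pushed toward lower similarity.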

2. Domain Motivation and Theoretical Justification

CIS explicitly learns a food-cooking–aware visual similarity, diverging from generic metrics like SSIM or LPIPS. It is trained on real cooking session sequences, enabling the embedding space to encode visual cues (color, texture, and structure) that typify cooking transitions such as “doneness.” Supervising embeddings with the temporal proximity target forces the trajectory of cooking states to evolve smoothly through the embedding space, mapping early raw frames near each other and fully cooked frames maximally far from them.

This domain-adapted design has two core uses:

  • As a generative loss, the CIS loss acts as a regularizer, enforcing culinary plausibility by matching synthetic cooking progressions to real ones.
  • At inference, CIS functions as a “doneness gauge”: comparing a current frame to a user-selected target state yields a progress score that tracks the underlying culinary transformation.

3. Implementation and Training Protocols

Siamese Network Architecture and Training:

  • Backbone: EfficientNet-B1 with random initialization.
  • Projection: Two-layer MLP, followed by $L_2$ normalization.
  • Loss: MSE between predicted CIS and the temporal proximity target.
  • Optimizer: Adam with weight decay; step learning-rate decay every 10 epochs.
  • Data Augmentation: Random rotations, horizontal flips.
  • Epochs: 100.
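The step decay mentioned in the optimizer settings can be illustrated as follows. This is a sketch only: the base learning rate and decay factor below are placeholder assumptions, not values taken from the source, and `step_decay_lr` is a hypothetical helper.

```python
def step_decay_lr(base_lr, epoch, factor, step_size=10):
    """Learning rate after multiplying by `factor` once per completed `step_size` epochs."""
    return base_lr * factor ** (epoch // step_size)

BASE_LR = 1e-4   # assumption for illustration only
FACTOR = 0.5     # assumption for illustration only

# Learning rate for each of the 100 training epochs.
schedule = [step_decay_lr(BASE_LR, epoch, FACTOR) for epoch in range(100)]
print(schedule[0], schedule[10], schedule[99])
```

The rate stays constant within each block of 10 epochs and drops only at block boundaries, so over 100 epochs the factor is applied nine times.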

Generator Integration:

  • Generator: Text-conditioned U-Net (8 million parameters), with FiLM modulation conditioned on recipe and cooking state embeddings.
  • Loss Weights: $\lambda_{\text{adv}}$ (adversarial), $\lambda_{\text{perc}}$ (LPIPS), $\lambda_{\text{CIS}}$ (CIS).
  • Batch size: 1.
  • Optimizer: Adam, 100 epochs (50 at a constant learning rate, 50 with linear decay).
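A minimal sketch of how the weighted generator objective and the two-phase learning-rate schedule might be written is given below. All names are hypothetical, the numeric weights and loss values are placeholders, and taking the CIS loss as one minus the CIS score is an assumption rather than a confirmed detail of the original method.

```python
def cis_loss(similarity):
    """Assumed CIS loss: one minus the CIS score, so a perfect match costs nothing."""
    return 1.0 - similarity

def generator_loss(adv, perc, cis, w_adv, w_perc, w_cis):
    """Weighted sum of adversarial, perceptual (LPIPS), and CIS loss terms."""
    return w_adv * adv + w_perc * perc + w_cis * cis

def linear_decay_lr(base_lr, epoch, constant_epochs=50, decay_epochs=50):
    """Hold base_lr for `constant_epochs`, then decay linearly, reaching zero at the end."""
    if epoch < constant_epochs:
        return base_lr
    remaining = constant_epochs + decay_epochs - epoch
    return base_lr * max(remaining, 0) / decay_epochs

# Placeholder loss values and weights, chosen only to exercise the functions.
total = generator_loss(adv=0.7, perc=0.2, cis=cis_loss(0.9),
                       w_adv=1.0, w_perc=10.0, w_cis=10.0)
print(total, linear_decay_lr(2e-4, 0), linear_decay_lr(2e-4, 75))
```

Setting the LPIPS and CIS weights equal, as in the placeholder call, mirrors the equal-weighting configuration discussed in the tuning section.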

Real-Time Inference and Monitoring:

  • User selects a reference target $I_{\text{target}}$ from multiple generated states.
  • Oven camera streams frames $I_t$ at 30-second intervals.
  • On-device compute: $F_{\text{cul}}(I_t, I_{\text{target}})$ in 0.3 seconds on a 5 TOPS NPU.
  • Maintain a sliding window (3–5 frames, 90–150 seconds) and detect peaks in $F_{\text{cul}}(I_t, I_{\text{target}})$ to trigger stop-cooking signals at optimal doneness.
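The sliding-window peak detection described above could look roughly like the following. This is a sketch under stated assumptions: the `DonenessMonitor` class, the `min_drop` threshold, and the score stream are all hypothetical, and the source does not specify the exact peak-detection rule.

```python
from collections import deque

class DonenessMonitor:
    """Smooths per-frame CIS scores and flags a stop once the smoothed score has peaked."""

    def __init__(self, window=3, min_drop=0.01):
        self.scores = deque(maxlen=window)  # most recent CIS scores
        self.best = float("-inf")           # highest smoothed score seen so far
        self.min_drop = min_drop            # assumed drop needed to confirm a peak

    def update(self, score):
        """Add one CIS score; return True when the smoothed score has fallen from its peak."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False                    # not enough frames to smooth yet
        smoothed = sum(self.scores) / len(self.scores)
        if smoothed > self.best:
            self.best = smoothed
            return False
        return self.best - smoothed >= self.min_drop

# Hypothetical stream of CIS scores against the target, one per 30-second frame.
stream = [0.30, 0.40, 0.55, 0.70, 0.85, 0.92, 0.88, 0.80]
monitor = DonenessMonitor(window=3)
stop_frame = next(i for i, s in enumerate(stream) if monitor.update(s))
print(stop_frame)  # → 7
```

In this example the raw score peaks at frame 5 but the stop fires at frame 7, which illustrates the trade-off noted in the tuning section: a moving average suppresses transient noise at the cost of a delayed trigger.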

4. Hyperparameterization and Tuning Strategies

  • Loss Weighting ($\lambda_{\text{adv}}$, $\lambda_{\text{perc}}$, $\lambda_{\text{CIS}}$): Empirically, equal weighting of LPIPS and CIS ($\lambda_{\text{perc}} = \lambda_{\text{CIS}}$) achieves the best trade-off between FID/LPIPS and semantic faithfulness. Adjusting $\lambda_{\text{CIS}}$ allows trade-offs: higher values prioritize temporal consistency, lower values favor fine texture detail.
  • Embedding Dimension ($D$): Controls the capacity of CIS to encode cooking variability. Increasing $D$ beyond 128 provides diminishing returns relative to computational cost.
  • Optimizer Configuration: A gradual learning-rate decay is recommended for $\mathrm{fsim}$; aggressive decay can collapse embedding spread and reduce metric effectiveness.
  • Inference Smoothing: A moving average window of 3–5 frames is robust against transient noise; shorter windows increase reactivity but also the risk of early or noisy trigger events due to occlusion or lighting fluctuation.

5. Comparative Quantitative Analysis

CIS provides significantly greater discrimination among cooking stages than SSIM or LPIPS. On a test set of 715 image pairs:

| Comparison      | SSIM  | LPIPS | CIS   |
|-----------------|-------|-------|-------|
| Raw vs Basic    | 0.722 | 0.209 | 0.519 |
| Raw vs Standard | 0.663 | 0.235 | 0.366 |
| Raw vs Extended | 0.628 | 0.260 | 0.214 |

SSIM and LPIPS span only ≈0.1–0.15 between raw and cooked states, while CIS covers ≈0.8, enabling clear separation of doneness stages.
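As a quick arithmetic check on the table, the snippet below computes how far each metric moves across the three tabulated comparisons. It uses only the numbers from the table; the `spread` helper is a hypothetical name, and the spread here covers the three listed rows only, not the full raw-to-cooked range.

```python
# Values copied from the comparison table (Raw vs Basic / Standard / Extended).
table = {
    "Raw vs Basic":    {"SSIM": 0.722, "LPIPS": 0.209, "CIS": 0.519},
    "Raw vs Standard": {"SSIM": 0.663, "LPIPS": 0.235, "CIS": 0.366},
    "Raw vs Extended": {"SSIM": 0.628, "LPIPS": 0.260, "CIS": 0.214},
}

def spread(metric):
    """Difference between the largest and smallest value of a metric across the rows."""
    values = [row[metric] for row in table.values()]
    return round(max(values) - min(values), 3)

spreads = {metric: spread(metric) for metric in ("SSIM", "LPIPS", "CIS")}
print(spreads)  # → {'SSIM': 0.094, 'LPIPS': 0.051, 'CIS': 0.305}
```

Across these three rows alone, CIS moves roughly three times as far as SSIM and six times as far as LPIPS.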

Ablation analysis further indicates:

  • Full model (with CIS): FID = 52.18, LPIPS = 0.2145
  • Without the CIS loss: FID = 54.98 (+5.3%), LPIPS = 0.2310 (+7.8%)

On dataset-wide baselines:

  • Pix2Pix (per-state): FID = 153.00
  • Pix2Pix-Turbo: FID = 75.42
  • With CIS: FID = 52.18 (30% lower than Turbo, 66% lower than Pix2Pix)

Public datasets (Edge2Shoes, Edge2Handbags) show 27–40% FID reduction compared to strong Pix2Pix baselines (Gupta et al., 21 Nov 2025).
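The relative FID reductions against the two baselines follow directly from the reported scores, as the short calculation below shows. Only the FID values come from the text; the `relative_reduction` helper is a hypothetical name.

```python
def relative_reduction(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return 100.0 * (baseline - value) / baseline

# FID scores as reported for the dataset-wide baselines.
fid = {"Pix2Pix (per-state)": 153.00, "Pix2Pix-Turbo": 75.42, "With CIS": 52.18}

vs_turbo = relative_reduction(fid["Pix2Pix-Turbo"], fid["With CIS"])
vs_pix2pix = relative_reduction(fid["Pix2Pix (per-state)"], fid["With CIS"])
print(round(vs_turbo, 1), round(vs_pix2pix, 1))  # → 30.8 65.9
```

These values are consistent with the roughly 30% and 66% reductions quoted above.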

6. Temporal Dynamics and Case Studies

Empirical visualizations demonstrate CIS's alignment with culinary change:

  • Raw-to-cooked Progression: Synthetic images progress from pale dough (raw) through golden crust (basic) to deep browning (extended), as observed in Figure 1 of (Gupta et al., 21 Nov 2025).
  • Trajectory Plots: CIS demonstrates a monotonically decreasing profile from 1.0 to ≈0.1 throughout cooking, with inflection points corresponding to major transitions (e.g., rapid surface browning). SSIM/LPIPS remain largely flat, failing to capture these transitions.
  • Stop Detection: Only CIS achieves a distinct peak at the precise chef-annotated doneness; generic metrics do not display reliable extrema.
  • Salmon Steak Example: Raw vs Basic CIS ≈ 0.72, with the CIS peak corresponding to chef-defined doneness; system auto-stop is triggered within one frame of ground truth.
  • Chocolate Chip Cookie Example: Initial rapid CIS drop (0–8 min) as batter sets, followed by gradual flattening; generic metrics do not show sensitivity to these transitions.

7. Summary and Broader Implications

Culinary Image Similarity is a learned, culinary-domain–aware metric whose embedding space encodes cooking progression as a continuous, temporally smooth trajectory. Its application enables:

  • Improved image synthesis and visual fidelity as a generator loss.
  • Real-time, chef-aligned doneness monitoring for automated cooking systems.
  • Substantially enhanced discrimination of cooking stages compared to SSIM or LPIPS, yielding both improved FID/LPIPS scores and higher practical interpretability for culinary tasks (Gupta et al., 21 Nov 2025).

This suggests that domain-adapted similarity metrics such as CIS provide critical advantages in generative and monitoring applications where visual semantics evolve along constrained, meaning-rich trajectories, and generic perceptual metrics are insufficiently discriminative.
