Culinary Image Similarity (CIS) Metric
- Culinary Image Similarity (CIS) is a domain-adapted metric that quantifies cooking progress by mapping food images into a learned embedding space using a Siamese network.
- CIS is integrated as a generative loss in image synthesis, combining adversarial, perceptual, and CIS losses to enforce realistic cooking transitions.
- Empirical evaluations show CIS outperforms generic metrics like SSIM and LPIPS by clearly separating doneness stages and enabling real-time culinary monitoring.
The Culinary Image Similarity (CIS) metric is a domain-specific visual similarity measure designed to quantify the progression of food cooking in images and to serve as both a generative loss and a real-time cooking progress monitor. Introduced in the context of real-time cooked food image synthesis and doneness assessment, CIS leverages a learned embedding space to model the nuanced visual evolution of food during preparation, enabling applications in conditional generation and automated monitoring on edge devices (Gupta et al., 21 Nov 2025).
1. Mathematical Formulation
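Let $\mathrm{fsim}$ denote a Siamese network-based embedding function that maps an RGB image to a $d$-dimensional unit-norm vector. The CIS score between two images $I_1$ and $I_2$ is the cosine similarity of their embeddings:

$$F_{\text{cul}}(I_1, I_2) = \cos(\mathrm{fsim}(I_1), \mathrm{fsim}(I_2)) = \frac{\mathrm{fsim}(I_1) \cdot \mathrm{fsim}(I_2)}{\|\mathrm{fsim}(I_1)\|_2 \cdot \|\mathrm{fsim}(I_2)\|_2}$$

Given that $\mathrm{fsim}$ outputs $\ell_2$-normalized vectors, this reduces to:

$$F_{\text{cul}}(I_1, I_2) = \mathrm{fsim}(I_1)^{\top} \mathrm{fsim}(I_2)$$

where $F_{\text{cul}}(I_1, I_2) \in [-1, 1]$.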
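The cosine-similarity definition above can be illustrated with a minimal sketch. The helper names `l2_normalize` and `cis_score` and the toy 3-dimensional vectors are assumptions made for illustration; in practice the embeddings would come from the trained Siamese network rather than hand-written lists.

```python
import math


def l2_normalize(vector):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def cis_score(emb_a, emb_b):
    """Cosine similarity of two embeddings.

    After l2 normalization the cosine reduces to a plain dot product.
    """
    a = l2_normalize(emb_a)
    b = l2_normalize(emb_b)
    return sum(x * y for x, y in zip(a, b))


# Hypothetical toy embeddings standing in for network outputs.
raw_frame = [1.0, 0.0, 0.0]
cooked_frame = [0.0, 1.0, 0.0]
print(cis_score(raw_frame, raw_frame))     # identical embeddings
print(cis_score(raw_frame, cooked_frame))  # orthogonal embeddings
```

Because the score depends only on the direction of each embedding, rescaling either vector leaves it unchanged.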
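CIS is used in training a conditional image generator: for a generated image $\hat{I}$ and its ground-truth counterpart $I$, the CIS loss $\mathcal{L}_{\text{CIS}}$ penalizes low similarity $F_{\text{cul}}(\hat{I}, I)$ between the two.
This loss is combined with adversarial ($\mathcal{L}_{\text{adv}}$) and perceptual ($\mathcal{L}_{\text{perc}}$) losses:

$$\mathcal{L}_{\text{total}} = \lambda_{\text{adv}} \mathcal{L}_{\text{adv}} + \lambda_{\text{perc}} \mathcal{L}_{\text{perc}} + \lambda_{\text{CIS}} \mathcal{L}_{\text{CIS}}$$

Here, $\lambda_{\text{adv}}$, $\lambda_{\text{perc}}$, and $\lambda_{\text{CIS}}$ are tunable weights.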
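The weighted combination of the three losses can be sketched as follows. Two things here are assumptions rather than details taken from the source: the CIS loss is written as one minus the similarity (a common way to turn a cosine similarity into a loss), and the default weights of 1.0 are placeholders, not the published settings.

```python
def cis_loss(similarity):
    """Assumed form of the CIS loss: zero when the embeddings coincide."""
    return 1.0 - similarity


def total_generator_loss(loss_adv, loss_perc, loss_cis,
                         w_adv=1.0, w_perc=1.0, w_cis=1.0):
    """Weighted sum of adversarial, perceptual, and CIS losses.

    The default weights are placeholders chosen for illustration.
    """
    return w_adv * loss_adv + w_perc * loss_perc + w_cis * loss_cis


# Hypothetical per-batch loss values.
loss = total_generator_loss(loss_adv=0.5, loss_perc=0.2,
                            loss_cis=cis_loss(0.9))
print(loss)
```

Raising `w_cis` relative to the other weights makes the generator pay more for drifting away from the ground-truth cooking state in the learned embedding space.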
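For learning $\mathrm{fsim}$ itself, every cooking session comprises $T$ frames $\{I_1, \dots, I_T\}$ sampled, for example, every 30 seconds. Each pair $(I_i, I_j)$ is assigned a temporal proximity target $s_{ij}$ determined by the temporal separation $|i - j|$ of the two frames. Training minimizes the MSE between predicted and target similarities over the set of sampled pairs $\mathcal{P}$:

$$\mathcal{L}_{\text{sim}} = \frac{1}{|\mathcal{P}|} \sum_{(i, j) \in \mathcal{P}} \left( F_{\text{cul}}(I_i, I_j) - s_{ij} \right)^2$$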
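The pairwise supervision described above can be sketched in a few lines. The exact target expression is not reproduced here, so the linear decay used in `temporal_targets` (1 for identical frames, 0 for the first and last frame of a session) is an assumption for illustration only, as are the function names.

```python
def temporal_targets(num_frames):
    """Assumed target: similarity decays linearly with the frame gap."""
    if num_frames < 2:
        raise ValueError("need at least two frames")
    span = num_frames - 1
    return {
        (i, j): 1.0 - abs(i - j) / span
        for i in range(num_frames)
        for j in range(num_frames)
    }


def mse_similarity_loss(predicted, targets):
    """Mean squared error between predicted similarities and targets."""
    return sum((predicted[pair] - targets[pair]) ** 2
               for pair in targets) / len(targets)


# A 5-frame session: adjacent frames get a high target, distant ones a low one.
targets = temporal_targets(5)
print(targets[(0, 1)], targets[(0, 4)])

# A degenerate "network" that calls every pair identical is penalized.
constant_prediction = {pair: 1.0 for pair in targets}
print(mse_similarity_loss(constant_prediction, targets))
```

Any target that decreases with the frame gap would play the same role; the linear form is simply the easiest to inspect.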
2. Domain Motivation and Theoretical Justification
CIS explicitly learns a cooking-aware visual similarity, diverging from generic metrics like SSIM or LPIPS. It is trained on real cooking session sequences, enabling the embedding space to encode visual cues (color, texture, and structure) that typify cooking transitions such as “doneness.” Supervising embeddings with the temporal proximity target forces the trajectory of cooking states to evolve smoothly through the embedding space, mapping temporally adjacent frames (such as early raw frames) near each other and placing raw and fully cooked frames maximally far apart.
This domain-adapted design has two core uses:
- As a generative loss, $\mathcal{L}_{\text{CIS}}$ acts as a regularizer, enforcing culinary plausibility by matching synthetic cooking progressions to real ones.
- At inference, CIS functions as a “doneness gauge”: comparing a current frame to a user-selected target state yields a progress score that tracks the underlying culinary transformation.
3. Implementation and Training Protocols
Siamese Network Architecture and Training:
- Backbone: EfficientNet-B1 with random initialization.
- Projection: Two-layer MLP, followed by $\ell_2$ normalization.
- Loss: MSE between predicted CIS and the temporal proximity target.
- Optimizer: Adam with weight decay and a step learning-rate decay every 10 epochs.
- Data Augmentation: Random rotations and horizontal flips.
- Epochs: 100.
Generator Integration:
- Generator: Text-conditioned U-Net (8 million parameters), with FiLM modulation conditioned on recipe and cooking state embeddings.
- Loss Weights: $\lambda_{\text{adv}}$ (adversarial), $\lambda_{\text{perc}}$ (LPIPS), $\lambda_{\text{CIS}}$ (CIS).
- Batch size: 1.
- Optimizer: Adam, 100 epochs (50 at a constant learning rate, 50 with linear decay).
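The generator's two-phase schedule (50 constant epochs, then 50 epochs of linear decay) can be sketched as a plain function. The name `generator_lr` and the `base_lr` value in the usage lines are placeholders, not the source's settings, and the exact decay endpoint is an assumption: this version divides by one more than the number of decay epochs so that the rate stays positive through the final epoch.

```python
def generator_lr(epoch, base_lr, total_epochs=100, constant_epochs=50):
    """Learning rate for a 0-indexed epoch.

    Epochs [0, constant_epochs) use base_lr; the remaining epochs
    decay linearly and would reach zero one step after the last epoch.
    """
    if not 0 <= epoch < total_epochs:
        raise ValueError("epoch out of range")
    if epoch < constant_epochs:
        return base_lr
    decay_epochs = total_epochs - constant_epochs
    return base_lr * (total_epochs - epoch) / (decay_epochs + 1)


# Placeholder base rate, used only to show the shape of the schedule.
for epoch in (0, 49, 50, 75, 99):
    print(epoch, generator_lr(epoch, base_lr=1.0))
```

In a training framework this function would typically be supplied as a per-epoch multiplier to the optimizer's scheduler.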
Real-Time Inference and Monitoring:
- User selects a reference target $I_{\text{target}}$ from multiple generated states.
- Oven camera streams frames $I_t$ at 30-second intervals.
- On-device compute: $F_{\text{cul}}(I_t, I_{\text{target}})$ in 0.3 seconds on a 5 TOPS NPU.
- Maintain a sliding window (3–5 frames, 90–150 seconds) and detect peaks in $F_{\text{cul}}(I_t, I_{\text{target}})$ to trigger stop-cooking signals at optimal doneness.
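The monitoring loop above can be sketched as a small sliding-window peak detector. The class name `DonenessMonitor`, the rule "the centre of the window is strictly larger than every other sample", and the restriction to odd window sizes are all assumptions made for this sketch; the source specifies only that peaks are detected over a 3–5 frame window. The scores in the usage example are invented.

```python
from collections import deque


class DonenessMonitor:
    """Flags a peak in a stream of CIS scores against the target image."""

    def __init__(self, window=3):
        if window < 3 or window % 2 == 0:
            raise ValueError("window must be an odd number >= 3")
        self.scores = deque(maxlen=window)

    def update(self, score):
        """Add the newest score; return True if the window centre is a peak."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough history yet
        mid = self.scores.maxlen // 2
        centre = self.scores[mid]
        return all(centre > s
                   for i, s in enumerate(self.scores) if i != mid)


# Hypothetical similarity-to-target scores, one per 30-second frame.
stream = [0.41, 0.55, 0.68, 0.80, 0.91, 0.87, 0.79]
monitor = DonenessMonitor(window=3)
stop_frames = [i for i, s in enumerate(stream) if monitor.update(s)]
print(stop_frames)
```

With a 3-frame window the detector can only confirm a peak one frame after it occurs, so a wider window trades extra latency for robustness to noisy scores.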
4. Hyperparameterization and Tuning Strategies
- Loss Weighting ($\lambda_{\text{adv}}$, $\lambda_{\text{perc}}$, $\lambda_{\text{CIS}}$): Empirically, equal weighting of LPIPS and CIS ($\lambda_{\text{perc}} = \lambda_{\text{CIS}}$) achieves the best trade-off between FID/LPIPS and semantic faithfulness. Adjusting $\lambda_{\text{CIS}}$ shifts this balance: higher values prioritize temporal consistency, lower values favor fine texture detail.
- Embedding Dimension ($d$): Controls the capacity of CIS to encode cooking variability. Increasing $d$ beyond 128 provides diminishing returns relative to computational cost.
- Optimizer Configuration: Adam is used for both $\mathrm{fsim}$ and the generator, each with its own recommended learning rate.
5. Empirical Evaluation
Generic metrics such as SSIM and LPIPS vary by only about 0.1–0.15 between raw and cooked states, while CIS covers a range of about 0.8, enabling clear separation of doneness stages.
Ablation analysis further indicates:
- Full model (with CIS): FID = 52.18, LPIPS = 0.2145
- Without the CIS loss ($\mathcal{L}_{\text{CIS}}$): FID = 54.98 (+5.3%), LPIPS = 0.2310 (+7.8%)
On dataset-wide baselines:
- Pix2Pix (per-state): FID = 153.00
- Pix2Pix-Turbo: FID = 75.42
- With CIS: FID = 52.18 (30% lower than Turbo, 66% lower than Pix2Pix)
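The relative FID reductions quoted in the baseline comparison can be checked with a short calculation. Only the FID values come from the list above; the helper name `relative_reduction` is an assumption for illustration.

```python
def relative_reduction(baseline, value):
    """Fractional reduction of `value` relative to `baseline`."""
    return (baseline - value) / baseline


fid_with_cis = 52.18
baselines = {"Pix2Pix (per-state)": 153.00, "Pix2Pix-Turbo": 75.42}

for name, fid in baselines.items():
    percent = 100 * relative_reduction(fid, fid_with_cis)
    print(f"{name}: {percent:.1f}% lower FID with CIS")
```

The results agree with the rounded figures in the list: roughly 66% lower than Pix2Pix and roughly 30% lower than Pix2Pix-Turbo.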
Public datasets (Edge2Shoes, Edge2Handbags) show 27–40% FID reduction compared to strong Pix2Pix baselines (Gupta et al., 21 Nov 2025).
6. Temporal Dynamics and Case Studies
Empirical visualizations demonstrate CIS's alignment with culinary change:
- Raw-to-cooked Progression: Synthetic images progress from pale dough (raw) through golden crust (basic) to deep browning (extended), as observed in Figure 1 of (Gupta et al., 21 Nov 2025).
- Trajectory Plots: CIS demonstrates a monotonically decreasing profile from 1.0 to ≈0.1 throughout cooking, with inflection points corresponding to major transitions (e.g., rapid surface browning). SSIM/LPIPS remain largely flat, failing to capture these transitions.
- Stop Detection: Only CIS achieves a distinct peak at the precise chef-annotated doneness; generic metrics do not display reliable extrema.
- Salmon Steak Example: Raw vs Basic CIS ≈ 0.72, with the CIS peak corresponding to chef-defined doneness; system auto-stop is triggered within one frame of ground truth.
- Chocolate Chip Cookie Example: Initial rapid CIS drop (0–8 min) as batter sets, followed by gradual flattening; generic metrics do not show sensitivity to these transitions.
7. Summary and Broader Implications
Culinary Image Similarity is a learned, culinary-domain–aware metric whose embedding space encodes cooking progression as a continuous, temporally smooth trajectory. Its application enables:
- Improved image synthesis and visual fidelity as a generator loss.
- Real-time, chef-aligned doneness monitoring for automated cooking systems.
- Substantially enhanced discrimination of cooking stages compared to SSIM or LPIPS, yielding both improved FID/LPIPS scores and higher practical interpretability for culinary tasks (Gupta et al., 21 Nov 2025).
This suggests that domain-adapted similarity metrics such as CIS provide critical advantages in generative and monitoring applications where visual semantics evolve along constrained, meaning-rich trajectories, and generic perceptual metrics are insufficiently discriminative.