
SeSS Metric: Semantic Image Similarity

Updated 7 November 2025
  • SeSS is a semantic similarity measure that assesses, by comparing scene graphs, whether the semantic content of an image is preserved after processing.
  • It uses deep learning for scene graph generation and iterative graph matching to align object and relationship similarities with human semantic judgment.
  • The metric outperforms traditional methods by focusing on meaning rather than pixel-level details, making it robust for evaluating semantic communication and imaging pipelines.

The Semantic Similarity Score (SeSS) is a metric designed to quantify the semantic-level similarity between images, specifically to address the shortcomings of traditional low-level similarity metrics in the context of visual semantic communication systems. It evaluates whether the meaning or semantic content of images has been preserved after processing pipelines such as compression, transmission, or generative modeling. SeSS is constructed using a combination of deep learning-based scene graph generation, iterative graph matching, and manual hyperparameter calibration to align with human semantic judgment. The metric is structured, interpretable, and robust to transformations that do not affect semantic content, enabling fair evaluation across both traditional and semantic communication architectures.

1. Motivation and Context

Visual semantic communication systems extract, compress, transmit, and reconstruct images at the semantic level, in contrast to traditional methods that operate at the symbol or pixel level. Existing image similarity metrics—including Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Multi-Scale Structural Similarity (MS-SSIM)—are sensitive to pixel- and structure-level differences but fail to capture semantic preservation. Modern semantic metrics such as LPIPS, ViTScore, and ClipScore partially bridge this gap but fall short in key areas: LPIPS and ViTScore lack holistic semantic comprehension, ClipScore may confuse semantics due to object role ambiguities, and all are insufficiently robust to transformations that are not semantically meaningful.

SeSS addresses these limitations by shifting the evaluation focus from raw pixel or patch similarity to structured graph-based semantic matching, directly reflecting how humans parse and compare scene content. Its explicit modeling of objects and their relationships enables robust, interpretable, and human-aligned semantic assessment.

2. Methodological Design

Scene Graph Construction

SeSS begins by segmenting each image into object-level masks using the Segment Anything Model (SAM). For an input image $I$:

$$\Pr(M \mid I) = \text{SAM}(I)$$

A PSG-based Scene Graph Generation (SGG) model then predicts semantic relationships among objects to build a structured scene graph $G$, with nodes for objects and edges for relationships:

$$\Pr(G \mid I) = \Pr(M \mid I)\,\Pr(R \mid M, I)$$
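
A minimal sketch of this two-stage pipeline is shown below. The wrapper functions `sam_segment` and `psg_predict_relations`, and the `SceneGraph` container, are hypothetical names introduced here for illustration; the paper specifies only that SAM produces the masks and a PSG-trained SGG model produces the relations.

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    nodes: list  # one entry per object, e.g. {"label": "dog", "mask": <bool array>}
    edges: list  # one entry per relation, e.g. (subject_idx, "chasing", object_idx)

def build_scene_graph(image, sam_segment, psg_predict_relations):
    # Pr(G|I) = Pr(M|I) * Pr(R|M, I): segment objects first, then predict relations.
    masks = sam_segment(image)                       # Pr(M | I): SAM object masks
    relations = psg_predict_relations(image, masks)  # Pr(R | M, I): PSG-based SGG model
    return SceneGraph(nodes=masks, edges=relations)
```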

Iterative Graph Matching

Given two images $I_1$ and $I_2$, each is converted into a scene graph ($G_1$, $G_2$). The initial node-pairwise visual similarity matrix $L$ is computed using ClipScore over masked object regions:

$$L_{i,j} = \text{ClipScore}\big(I_1[\text{mask}(u_i)],\, I_2[\text{mask}(v_j)]\big)$$
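
A sketch of this initialization, assuming `clip_score` is a stand-in for a CLIP-based image-to-image similarity and `crop_to_mask` extracts a masked object region (both names are assumptions, not from the paper):

```python
import numpy as np

def initial_similarity_matrix(img1, g1, img2, g2, clip_score, crop_to_mask):
    # L[i, j] = ClipScore between the masked region of object u_i in img1
    # and the masked region of object v_j in img2.
    L = np.zeros((len(g1.nodes), len(g2.nodes)))
    for i, u in enumerate(g1.nodes):
        for j, v in enumerate(g2.nodes):
            L[i, j] = clip_score(crop_to_mask(img1, u["mask"]),
                                 crop_to_mask(img2, v["mask"]))
    return L
```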

An iterative process refines similarities by incorporating neighborhood context and object relations. For each pair of nodes, their similarity update involves a convex combination of previous similarity and a relation-aware neighborhood score using ClipScore between relation labels:

$$\hat{L}_{k,l} = \alpha L_{k,l} + (1 - \alpha)\, R_{r_{1k}, r_{2l}}$$

$$R_{i,j} = \text{ClipScore}(r_i, r_j)$$

The Hungarian algorithm (also known as the Kuhn-Munkres, or KM, algorithm) then computes a maximum-weight bipartite matching over the refined similarity matrix, combining node and relational consistency:

$$L'_{u,v} = (1 - \beta)\, L_{u,v} + \beta\, \text{KM}(\hat{L})$$

Iterating this procedure propagates relational influence through the graph, so that well-matched objects reinforce the similarity of their neighbors. A sketch of one iteration follows.
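
The sketch below follows the update rules exactly as written above; `scipy.optimize.linear_sum_assignment` plays the role of the KM step (it solves the same assignment problem as the Hungarian algorithm), and the construction of the per-pair relation matrix `R_pair` from incident relation labels is an assumption about how the $R$ terms are aggregated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def km_matching_score(S):
    # Maximum-weight bipartite matching; linear_sum_assignment solves the
    # same assignment problem as the Hungarian/KM algorithm.
    rows, cols = linear_sum_assignment(S, maximize=True)
    return S[rows, cols].mean()  # normalized by match count (an assumption)

def refine_once(L, R_pair, alpha, beta):
    # One refinement iteration, following the update rules as written:
    #   L_hat = alpha * L + (1 - alpha) * R_pair   (relation-aware blend)
    #   L'    = (1 - beta) * L + beta * KM(L_hat)  (matching feedback)
    # R_pair[k, l] aggregates ClipScore similarities between the relation
    # labels incident to node k in G1 and node l in G2.
    L_hat = alpha * L + (1 - alpha) * R_pair
    return (1 - beta) * L + beta * km_matching_score(L_hat)
```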

Global Similarity Calculation and Aggregation

The final SeSS metric fuses the maximum bipartite matching score between objects with a global image-level semantic similarity (ClipScore):

$$\text{SeSS} = (1 - \gamma) \cdot \text{KM}(L) + \gamma \cdot \text{image\_score}$$

where

  • $\text{KM}(L)$ is the globally optimized matching score,
  • $\text{image\_score} = \text{ClipScore}(I_1, I_2)$,
  • $\gamma$ is a tunable hyperparameter.
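
Putting the pieces together, a minimal sketch of the final fusion (node-importance weighting, described next, is omitted here for brevity; `clip_score` is again an assumed stand-in):

```python
from scipy.optimize import linear_sum_assignment

def sess_score(L_final, img1, img2, gamma, clip_score):
    # SeSS = (1 - gamma) * KM(L) + gamma * ClipScore(I1, I2)
    rows, cols = linear_sum_assignment(L_final, maximize=True)  # KM step
    matching_score = L_final[rows, cols].mean()
    image_score = clip_score(img1, img2)  # global image-level ClipScore
    return (1 - gamma) * matching_score + gamma * image_score
```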

Node importance is weighted by visual significance, e.g., via saliency prediction or localized pixel variation:

$$\text{obj\_imp}(o_i) = \sum I_{\text{imp}}[\text{mask}(o_i)]$$

The weights are normalized so their sum is 1.
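
A sketch of this weighting, under the assumption that `I_imp` is a per-pixel importance (e.g., saliency) map with the same spatial shape as the image:

```python
import numpy as np

def object_importances(I_imp, masks):
    # obj_imp(o_i) = sum of the importance map over object i's mask,
    # normalized so the weights sum to 1.
    raw = np.array([I_imp[m].sum() for m in masks])  # m: boolean mask array
    return raw / raw.sum()
```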

3. Manual Annotation and Calibration

A dataset of 100,000 image pairs, each comprising an original and two transformed variants, was manually annotated with semantic similarity scores by human raters. These annotations served to fine-tune the SeSS hyperparameters ($\alpha$, $\beta$, $\gamma$, and node importance factors) using random hyperparameter search to maximize alignment with human assessment. Empirical findings indicate that object matching matters more to human raters than relationship matching, so the selected parameters prioritize direct object correspondences.
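
A minimal sketch of this random search, assuming `sess_with_params` evaluates SeSS on an image pair under given hyperparameters and that alignment is measured by Pearson correlation (the specific correlation statistic is an assumption, not stated here):

```python
import numpy as np

def calibrate(pairs, human_scores, sess_with_params, n_trials=1000, seed=0):
    # Random search over (alpha, beta, gamma), keeping the draw whose SeSS
    # outputs correlate best with the human annotations.
    rng = np.random.default_rng(seed)
    best_params, best_corr = None, -np.inf
    for _ in range(n_trials):
        params = dict(alpha=rng.uniform(), beta=rng.uniform(), gamma=rng.uniform())
        preds = [sess_with_params(a, b, **params) for a, b in pairs]
        corr = np.corrcoef(preds, human_scores)[0, 1]
        if corr > best_corr:
            best_params, best_corr = params, corr
    return best_params, best_corr
```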

Manual annotation ensures that output values from SeSS are well-calibrated against human judgment, as reflected in observed cross-validated correlational analyses (e.g., Fig. 6 in the source).

4. Evaluation Protocols and Empirical Performance

Datasets and Scenarios

  • COCO2017: The primary evaluation corpus, including images processed by both traditional and semantic communication systems.
  • Synthetic pairs: Generated via varying compression ratios, transmission through noisy channels, generative model synthesis with controlled noise, and application of non-semantic transformations (e.g., rotations, color shifts).

Baseline Comparisons

  • Low-level: MSE, PSNR, SSIM, MS-SSIM.
  • Semantic-level: LPIPS, ViTScore, ClipScore.

Experimental Configurations

  1. Compression Ratio Variation: JPEG, JPEG2000, and a semantic encoder (LSCI), evaluated across a range of bits per pixel (bpp).
  2. Channel SNR Variation: Image transmission over simulated noisy channels, both traditional (JPEG+LDPC) and semantic (LSCI+JSCC).
  3. Generation Noise: Images synthesized by models such as Unclip, with noise injected to alter semantic similarity.
  4. Non-Semantic Transformations: Translations, rotations, reflections, and color adjustments to probe robustness.

Observed Results

  • SeSS closely tracks human semantic similarity perception, with strong statistical correlation to manual annotations under all tested conditions.
  • Unlike PSNR and MS-SSIM, SeSS remains stable under non-semantic transformations, dropping only when semantics become ambiguous or unrecognizable.
  • In low-bitrate or low-SNR settings, semantic communication systems (e.g., LSCI) maintain higher SeSS scores than traditional systems, reflecting superior semantic preservation, a distinction detected only by SeSS.
  • Object and relation matching in SeSS outputs are structurally interpretable, enabling visualization of correspondences (see Fig. 14).
  • For images lacking salient objects or scene structure, SeSS is less effective, and the metric’s [0,1] range is not densely covered.

| Metric | Level Evaluated | Human Alignment | Interpretable | Robust to Non-Semantic Changes | Semantic Relationships? |
|---|---|---|---|---|---|
| MSE | Pixel | No | Yes | No | No |
| PSNR | Pixel | No | Yes | No | No |
| MS-SSIM | Structure | Some | Yes | No | No |
| LPIPS | Patch/Semantic | Better | Partially | Partially | No |
| ViTScore | Patch/Global Semantic | Better | Partially | Partially | No |
| ClipScore | Global Semantic | High | No | Yes | Weak (confuses roles) |
| SeSS | Object/Relation Graph | Highest | Yes | Yes | Yes |

5. Alignment with Human Perception and Interpretability

SeSS is explicitly designed to reflect human semantic reasoning by decomposing images into objects and their interrelations, mimicking the cognitive process of scene parsing. Manual hyperparameter tuning ensures congruence with human-annotated similarity, so that SeSS outputs agree with human judgment across diverse scenarios.

A distinguishing feature is interpretability: SeSS’s graph matching process enables end-users to inspect which specific objects or relationships contribute most strongly to similarity or dissimilarity, a property lacking in embedding-based semantic metrics.

6. Applications and Implications

SeSS is tailored for evaluating semantic image communication architectures, supporting both objective measurement and fair system comparison when semantic preservation is paramount. Practical uses encompass:

  • Benchmarking semantic and traditional communication and modification pipelines for images under varied network and bandwidth conditions.
  • Objective function or evaluation benchmark for semantic communication system development, tuning, and validation.
  • Robustness evaluation in generative modeling and image transformation tasks.
  • Semantic-level image retrieval, question answering, and other applications requiring scene-level meaning assessment.

A plausible implication is that SeSS, by providing interpretable and human-aligned scores, may facilitate more transparent and robust pipeline optimization, especially in contexts where meaning preservation overrides pixel-level fidelity.

7. Limitations and Future Prospects

SeSS’s effectiveness diminishes when applied to images with few or no objects or inherently ambiguous semantics (e.g., noise textures), reflecting its dependence on successful scene graph decomposition. The dynamic range of metric scores is sometimes compressed, indicating underutilization of the [0,1] output interval in specific settings. Future research may extend SeSS to handle such edge cases, optimize computational efficiency, or augment graph extraction for scenes with weak structure.

In summary, SeSS is a semantic similarity metric that synthesizes scene graph generation, iterative object-relation matching, and human-centered calibration. It yields scores that closely mirror human perception of semantic similarity, remains robust to non-semantic variation, and stands out as a state-of-the-art tool for the semantic assessment of image similarity in communication and generation systems.
