
SeSS Metric: Semantic Image Similarity

Updated 7 November 2025
  • SeSS is a semantic similarity measure that assesses, by comparing scene graphs, whether the semantic content of an image is preserved after processing.
  • It uses deep learning for scene graph generation and iterative graph matching to align object and relationship similarities with human semantic judgment.
  • The metric outperforms traditional methods by focusing on meaning rather than pixel-level details, making it robust for evaluating semantic communication and imaging pipelines.

The Semantic Similarity Score (SeSS) is a metric designed to quantify the semantic-level similarity between images, specifically to address the shortcomings of traditional low-level similarity metrics in the context of visual semantic communication systems. It evaluates whether the meaning or semantic content of images has been preserved after processing pipelines such as compression, transmission, or generative modeling. SeSS is constructed using a combination of deep learning-based scene graph generation, iterative graph matching, and manual hyperparameter calibration to align with human semantic judgment. The metric is structured, interpretable, and robust to transformations that do not affect semantic content, enabling fair evaluation across both traditional and semantic communication architectures.

1. Motivation and Context

Visual semantic communication systems extract, compress, transmit, and reconstruct images at the semantic level, in contrast to traditional methods that operate at the symbol or pixel level. Existing image similarity metrics—including Mean Squared Error (MSE), Peak Signal-to-Noise Ratio (PSNR), and Multi-Scale Structural Similarity (MS-SSIM)—are sensitive to pixel- and structure-level differences but fail to capture semantic preservation. Modern semantic metrics such as LPIPS, ViTScore, and ClipScore partially bridge this gap but fall short in key areas: LPIPS and ViTScore lack holistic semantic comprehension, ClipScore may confuse semantics due to object role ambiguities, and all are insufficiently robust to transformations that are not semantically meaningful.

SeSS addresses these limitations by shifting the evaluation focus from raw pixel or patch similarity to structured graph-based semantic matching, directly reflecting how humans parse and compare scene content. Its explicit modeling of objects and their relationships enables robust, interpretable, and human-aligned semantic assessment.

2. Methodological Design

Scene Graph Construction

SeSS begins by segmenting each image into object-level masks using the Segment Anything Model (SAM). For an input image $I$:

$$\Pr(M \mid I) = \text{SAM}(I)$$

A PSG-based Scene Graph Generation (SGG) model then predicts semantic relationships among objects to build a structured scene graph $G$, with nodes for objects and edges for relationships:

$$\Pr(G \mid I) = \Pr(M \mid I)\,\Pr(R \mid M, I)$$
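
A minimal sketch of this two-stage pipeline is shown below. The wrapper functions `sam_segment` and `psg_predict_relations`, and the `SceneGraph` container, are hypothetical names introduced here for illustration; the paper specifies only that SAM produces the masks and a PSG-trained SGG model produces the relations.

```python
from dataclasses import dataclass

@dataclass
class SceneGraph:
    nodes: list  # one entry per object, e.g. {"label": "dog", "mask": <bool array>}
    edges: list  # one entry per relation, e.g. (subject_idx, "chasing", object_idx)

def build_scene_graph(image, sam_segment, psg_predict_relations):
    # Pr(G|I) = Pr(M|I) * Pr(R|M, I): segment objects first, then predict relations.
    masks = sam_segment(image)                       # Pr(M | I): SAM object masks
    relations = psg_predict_relations(image, masks)  # Pr(R | M, I): PSG-based SGG model
    return SceneGraph(nodes=masks, edges=relations)
```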

Iterative Graph Matching

Given two images $I_1$ and $I_2$, each is converted into a scene graph ($G_1$, $G_2$). The initial node-pairwise visual similarity matrix $L$ is computed using ClipScore over masked object regions:

$$L_{i,j} = \text{ClipScore}\big(I_1[\text{mask}(u_i)],\, I_2[\text{mask}(v_j)]\big)$$
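
A sketch of this initialization, assuming `clip_score` is a stand-in for a CLIP-based image-to-image similarity and `crop_to_mask` extracts a masked object region (both names are assumptions, not from the paper):

```python
import numpy as np

def initial_similarity_matrix(img1, g1, img2, g2, clip_score, crop_to_mask):
    # L[i, j] = ClipScore between the masked region of object u_i in img1
    # and the masked region of object v_j in img2.
    L = np.zeros((len(g1.nodes), len(g2.nodes)))
    for i, u in enumerate(g1.nodes):
        for j, v in enumerate(g2.nodes):
            L[i, j] = clip_score(crop_to_mask(img1, u["mask"]),
                                 crop_to_mask(img2, v["mask"]))
    return L
```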

An iterative process refines similarities by incorporating neighborhood context and object relations. For each pair of nodes, their similarity update involves a convex combination of previous similarity and a relation-aware neighborhood score using ClipScore between relation labels:

$$\hat{L}_{k,l} = \alpha L_{k,l} + (1 - \alpha)\, R_{r_{1k}, r_{2l}}$$

$$R_{i,j} = \text{ClipScore}(r_i, r_j)$$

The Hungarian algorithm (also known as the Kuhn-Munkres, or KM, algorithm) then computes a maximum-weight bipartite matching over the refined similarity matrix, combining node and relational consistency:

$$L'_{u,v} = (1 - \beta)\, L_{u,v} + \beta\, \text{KM}(\hat{L})$$

Iterating this procedure propagates relational influence through the graph, so that well-matched objects reinforce the similarity of their neighbors. A sketch of one iteration follows.
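
The sketch below follows the update rules exactly as written above; `scipy.optimize.linear_sum_assignment` plays the role of the KM step (it solves the same assignment problem as the Hungarian algorithm), and the construction of the per-pair relation matrix `R_pair` from incident relation labels is an assumption about how the $R$ terms are aggregated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def km_matching_score(S):
    # Maximum-weight bipartite matching; linear_sum_assignment solves the
    # same assignment problem as the Hungarian/KM algorithm.
    rows, cols = linear_sum_assignment(S, maximize=True)
    return S[rows, cols].mean()  # normalized by match count (an assumption)

def refine_once(L, R_pair, alpha, beta):
    # One refinement iteration, following the update rules as written:
    #   L_hat = alpha * L + (1 - alpha) * R_pair   (relation-aware blend)
    #   L'    = (1 - beta) * L + beta * KM(L_hat)  (matching feedback)
    # R_pair[k, l] aggregates ClipScore similarities between the relation
    # labels incident to node k in G1 and node l in G2.
    L_hat = alpha * L + (1 - alpha) * R_pair
    return (1 - beta) * L + beta * km_matching_score(L_hat)
```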

Global Similarity Calculation and Aggregation

The final SeSS metric fuses the maximum bipartite matching score between objects with a global image-level semantic similarity (ClipScore):

$$\text{SeSS} = (1 - \gamma) \cdot \text{KM}(L) + \gamma \cdot \text{image\_score}$$

where

  • $\text{KM}(L)$ is the globally optimized matching score,
  • $\text{image\_score} = \text{ClipScore}(I_1, I_2)$,
  • $\gamma$ is a tunable hyperparameter.
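
Putting the pieces together, a minimal sketch of the final fusion (node-importance weighting, described next, is omitted here for brevity; `clip_score` is again an assumed stand-in):

```python
from scipy.optimize import linear_sum_assignment

def sess_score(L_final, img1, img2, gamma, clip_score):
    # SeSS = (1 - gamma) * KM(L) + gamma * ClipScore(I1, I2)
    rows, cols = linear_sum_assignment(L_final, maximize=True)  # KM step
    matching_score = L_final[rows, cols].mean()
    image_score = clip_score(img1, img2)  # global image-level ClipScore
    return (1 - gamma) * matching_score + gamma * image_score
```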

Node importance is weighted by visual significance, e.g., via saliency prediction or localized pixel variation:

$$\text{obj\_imp}(o_i) = \sum I_{\text{imp}}[\text{mask}(o_i)]$$

The weights are normalized so their sum is 1.
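
A sketch of this weighting, under the assumption that `I_imp` is a per-pixel importance (e.g., saliency) map with the same spatial shape as the image:

```python
import numpy as np

def object_importances(I_imp, masks):
    # obj_imp(o_i) = sum of the importance map over object i's mask,
    # normalized so the weights sum to 1.
    raw = np.array([I_imp[m].sum() for m in masks])  # m: boolean mask array
    return raw / raw.sum()
```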

3. Manual Annotation and Calibration

A dataset of 100,000 image pairs, each comprising an original and two transformed variants, was manually annotated with semantic similarity scores by human raters. These annotations served to fine-tune the SeSS hyperparameters ($\alpha$, $\beta$, $\gamma$, and node importance factors) using random hyperparameter search to maximize alignment with human assessment. Empirical findings indicate that object matching matters more to human raters than relationship matching, so the selected parameters prioritize direct object correspondences.
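
A minimal sketch of this random search, assuming `sess_with_params` evaluates SeSS on an image pair under given hyperparameters and that alignment is measured by Pearson correlation (the specific correlation statistic is an assumption, not stated here):

```python
import numpy as np

def calibrate(pairs, human_scores, sess_with_params, n_trials=1000, seed=0):
    # Random search over (alpha, beta, gamma), keeping the draw whose SeSS
    # outputs correlate best with the human annotations.
    rng = np.random.default_rng(seed)
    best_params, best_corr = None, -np.inf
    for _ in range(n_trials):
        params = dict(alpha=rng.uniform(), beta=rng.uniform(), gamma=rng.uniform())
        preds = [sess_with_params(a, b, **params) for a, b in pairs]
        corr = np.corrcoef(preds, human_scores)[0, 1]
        if corr > best_corr:
            best_params, best_corr = params, corr
    return best_params, best_corr
```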

Manual annotation ensures that output values from SeSS are well-calibrated against human judgment, as reflected in observed cross-validated correlational analyses (e.g., Fig. 6 in the source).

4. Evaluation Protocols and Empirical Performance

Datasets and Scenarios

  • COCO2017: The primary evaluation corpus, including images processed by both traditional and semantic communication systems.
  • Synthetic pairs: Generated via varying compression ratios, transmission through noisy channels, generative model synthesis with controlled noise, and application of non-semantic transformations (e.g., rotations, color shifts).

Baseline Comparisons

  • Low-level: MSE, PSNR, SSIM, MS-SSIM.
  • Semantic-level: LPIPS, ViTScore, ClipScore.

Experimental Configurations

  1. Compression Ratio Variation: JPEG, JPEG2000, and a semantic encoder (LSCI), evaluated across a range of bits per pixel (bpp).
  2. Channel SNR Variation: Image transmission over simulated noisy channels, both traditional (JPEG+LDPC) and semantic (LSCI+JSCC).
  3. Generation Noise: Images synthesized by models such as Unclip, with noise injected to alter semantic similarity.
  4. Non-Semantic Transformations: Translations, rotations, reflections, and color adjustments to probe robustness.

Observed Results

  • SeSS closely tracks human semantic similarity perception, with strong statistical correlation to manual annotations under all tested conditions.
  • Unlike PSNR and MS-SSIM, SeSS remains stable under non-semantic transformations, dropping only when semantics become ambiguous or unrecognizable.
  • In low-bitrate or low-SNR settings, semantic communication systems (e.g., LSCI) maintain higher SeSS scores than traditional systems, reflecting superior semantic preservation, a distinction detected only by SeSS.
  • Object and relation matching in SeSS outputs are structurally interpretable, enabling visualization of correspondences (see Fig. 14).
  • For images lacking salient objects or scene structure, SeSS is less effective, and the metric’s [0,1] range is not densely covered.

| Metric | Level Evaluated | Human Alignment | Interpretable | Robust to Non-Semantic Changes | Semantic Relationships? |
|---|---|---|---|---|---|
| MSE | Pixel | No | Yes | No | No |
| PSNR | Pixel | No | Yes | No | No |
| MS-SSIM | Structure | Some | Yes | No | No |
| LPIPS | Patch/Semantic | Better | Partially | Partially | No |
| ViTScore | Patch/Global Semantic | Better | Partially | Partially | No |
| ClipScore | Global Semantic | High | No | Yes | Weak (confuses roles) |
| SeSS | Object/Relation Graph | Highest | Yes | Yes | Yes |

5. Alignment with Human Perception and Interpretability

SeSS is explicitly designed to reflect human semantic reasoning by decomposing images into objects and their interrelations, mimicking the cognitive process of scene parsing. Manual hyperparameter tuning ensures congruence with human-annotated similarity, so that SeSS outputs agree with human judgment across diverse scenarios.

A distinguishing feature is interpretability: SeSS’s graph matching process enables end-users to inspect which specific objects or relationships contribute most strongly to similarity or dissimilarity, a property lacking in embedding-based semantic metrics.

6. Applications and Implications

SeSS is tailored for evaluating semantic image communication architectures, supporting both objective measurement and fair system comparison when semantic preservation is paramount. Practical uses encompass:

  • Benchmarking semantic and traditional communication and modification pipelines for images under varied network and bandwidth conditions.
  • Objective function or evaluation benchmark for semantic communication system development, tuning, and validation.
  • Robustness evaluation in generative modeling and image transformation tasks.
  • Semantic-level image retrieval, question answering, and other applications requiring scene-level meaning assessment.

A plausible implication is that SeSS, by providing interpretable and human-aligned scores, may facilitate more transparent and robust pipeline optimization, especially in contexts where meaning preservation overrides pixel-level fidelity.

7. Limitations and Future Prospects

SeSS’s effectiveness diminishes when applied to images with few or no objects or inherently ambiguous semantics (e.g., noise textures), reflecting its dependence on successful scene graph decomposition. The dynamic range of metric scores is sometimes compressed, indicating underutilization of the [0,1] output interval in specific settings. Future research may extend SeSS to handle such edge cases, optimize computational efficiency, or augment graph extraction for scenes with weak structure.

In summary, SeSS is a semantic similarity metric that synthesizes scene graph generation, iterative object-relation matching, and human-centered calibration. It yields scores that closely mirror human perception of semantic similarity, remains robust to non-semantic variation, and stands out as a state-of-the-art tool for the semantic assessment of image similarity in communication and generation systems.
