
Semantic Score (SS) Metrics

Updated 22 December 2025
  • Semantic Score (SS) is a family of metrics that quantifies semantic similarity, fidelity, or adequacy across modalities like text, image, and speech.
  • It leverages advanced methods such as fine-tuned transformer regressions, cosine similarity, and graph matching to closely mirror human interpretive judgments.
  • SS has proven effective in diverse applications including text generation, ASR, and semantic communications by enhancing model evaluation and adaptive control.

Semantic Score (SS) refers to a class of metrics that directly quantify the semantic similarity, fidelity, or adequacy between two representations—textual, visual, or latent—by leveraging models or protocols that operate beyond simple surface or symbol-level comparisons. Unlike purely syntactic or token-overlap measures, SS and its variants focus on aligning with human interpretive judgment at the semantic level, and are implemented in several modalities, including language, vision, speech, and latent domains, using approaches ranging from deep metric regression to graph matching.

1. Mathematical Definitions and Modalities

The term "Semantic Score" applies to several modalities, each with rigorous mathematical construction:

  • Textual SS (STSScore): For two text strings $x$ and $y$, the semantic score is produced by a fine-tuned STS-B regression model $f(x, y)$, normalized to $[0, 1]$:

$$SS(x, y) = \text{STSScore}(x, y) = \frac{f(x, y)}{5}$$

where $f(x, y)$ is the RoBERTa-STS model's output in $[0, 5]$ (Herbold, 2023).

  • Sentence Embedding Cosine Similarity (SemScore, SeMaScore): For embeddings $E(\hat{r})$ and $E(r)$ of response and reference, or general representations, the SS is:

$$SS(\hat{r}, r) = \cos(E(\hat{r}), E(r)) = \frac{E(\hat{r}) \cdot E(r)}{\|E(\hat{r})\| \, \|E(r)\|}$$

This is the core of SemScore for LLM evaluation and of the segment-wise module in SeMaScore for ASR (Aynetdinov et al., 30 Jan 2024, Sasindran et al., 15 Jan 2024).

  • Latent Cosine Similarity (Image Latent Denoising): When comparing a transmitted latent $y$ and a denoised estimate $z_t$,

$$s_t = \frac{y^T z_t}{\|y\|_2 \, \|z_t\|_2}$$

This quantity guides iterative denoising in semantic communication (Xu et al., 11 Feb 2025).

  • Graph-based Visual SS (SeSS): For images $I_1, I_2$, the score is a convex combination of a graph-matching score between scene graphs and a CLIP-based global similarity:

$$\text{SeSS}(I_1, I_2) = (1 - \gamma)\, GM(G_1, G_2) + \gamma \, \text{ClipScore}(I_1, I_2)$$

where $GM$ is an importance-weighted bipartite matching score over nodes and relations (Fan et al., 6 Jun 2024).

  • Human-Rated Semantics (Video Captioning): The SS ("Semantic Sensibility") for video-caption pairs combines grammar, object, and action sub-scores:

$$SS = \frac{1}{N} \sum_{n=1}^{N} S_{n,\textrm{grammar}} \cdot \frac{S_{n,\textrm{element}} + S_{n,\textrm{action}}}{2}$$

(Rahman et al., 2020).
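The cosine-based variants above share one core computation once embeddings are in hand. A minimal NumPy sketch collects it together with the STSScore normalization and the human-rated aggregate; the vectors and scores below are toy placeholders, not model outputs:

```python
import numpy as np

def semantic_score_cosine(e_hyp, e_ref):
    """Cosine similarity between two embedding vectors (core of SemScore/SeMaScore)."""
    return float(np.dot(e_hyp, e_ref) / (np.linalg.norm(e_hyp) * np.linalg.norm(e_ref)))

def sts_score(raw):
    """Normalize a [0, 5] STS-B regression output to [0, 1] (STSScore)."""
    return raw / 5.0

def video_caption_ss(grammar, element, action):
    """Human-rated SS: per-sample grammar score gated by the mean of the
    element and action sub-scores, averaged over the N samples."""
    g, e, a = map(np.asarray, (grammar, element, action))
    return float(np.mean(g * (e + a) / 2.0))

# Toy vectors standing in for sentence embeddings.
e1 = np.array([0.2, 0.7, 0.1])
e2 = np.array([0.1, 0.8, 0.1])
print(semantic_score_cosine(e1, e2))                  # high: vectors nearly aligned
print(sts_score(4.5))                                 # 0.9
print(video_caption_ss([1, 1], [1, 0.5], [1, 0.5]))   # 0.75
```

In practice the embeddings would come from a sentence encoder; only the final similarity and normalization steps are shown here.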

2. Model Architectures and Algorithms

A variety of architectures and algorithms realize the semantic score across domains:

  • Fine-tuned Transformer Regression: STSScore and SemScore use RoBERTa-base or MPNet sentence encoders with task-specific heads. Training employs MSE loss against human-annotated similarity (Herbold, 2023, Aynetdinov et al., 30 Jan 2024).
  • Segment-wise Matching with Penalty: SeMaScore uses a Levenshtein-based alignment, computes segment- and context-level embeddings with DeBERTa-Large-MNLI, and penalizes similarity by per-segment error rates. Each segment is weighted for importance, producing a robust, human-aligned, and efficient summary similarity (Sasindran et al., 15 Jan 2024).
  • Latent Denoising with Learned SS Predictor: In semantic communication, an SS predictor network estimates the cosine similarity between ground-truth and current latents, conditioning subsequent residuals and triggering adaptive early stopping (Xu et al., 11 Feb 2025).
  • Scene Graph Generation and Matching: SeSS uses Segment Anything Model (SAM) for object masks, a panoptic scene-graph (PSG) model for object/relation extraction, CLIP for patch and relation similarity, and iterative importance-weighted graph matching, hyperparameter-tuned to maximize alignment with 100k human ratings (Fan et al., 6 Jun 2024).
  • Manual Semantic Decomposition for Evaluation: Video-caption SS decomposes judgment into grammaticality, object precision/recall, and action alignment, annotator-averaged for deterministic scoring (Rahman et al., 2020).
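The segment-wise matching idea behind SeMaScore can be illustrated with a simplified, stdlib-only sketch: align reference and hypothesis words, then average per-segment match rates weighted by segment length. The full metric additionally folds each segment's embedding similarity (DeBERTa) and an importance weight into the score; neither is modeled here:

```python
import difflib

def segment_scores(reference: str, hypothesis: str) -> float:
    """Align reference/hypothesis word sequences and return the
    length-weighted average per-segment match rate. Matched segments
    score 1; substituted, inserted, or deleted spans score 0."""
    ref, hyp = reference.split(), hypothesis.split()
    weighted, total = 0.0, 0
    for tag, i1, i2, j1, j2 in difflib.SequenceMatcher(a=ref, b=hyp).get_opcodes():
        seg_len = max(i2 - i1, j2 - j1)   # words covered by this segment
        weighted += (1.0 if tag == "equal" else 0.0) * seg_len
        total += seg_len
    return weighted / total if total else 1.0

print(segment_scores("the cat sat on the mat", "the cat sat on a mat"))  # ~0.83
```

A single-word substitution only penalizes its own segment, which is the property that keeps segment-level scoring aligned with perceived error impact.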

3. Empirical Performance and Human Alignment

Semantic Score variants consistently correlate more strongly with human judgment than traditional or unsupervised metrics. Key results:

| Benchmark/Task | Metric | Pearson r | Spearman ρ | Notable Feature | arXiv ref |
| --- | --- | --- | --- | --- | --- |
| STS-B | STSScore | 0.90 | 0.89 | Highest among baselines | (Herbold, 2023) |
| Text Gen. (LLM) | SemScore | 0.970 | 0.879 (τ) | Outperforms BERTScore/LLM | (Aynetdinov et al., 30 Jan 2024) |
| Video Captioning | SS | n/a | n/a | Qualitative > BLEU | (Rahman et al., 2020) |
| Visual SC | SeSS | > all | > all | Best human corr. | (Fan et al., 6 Jun 2024) |
| ASR | SeMaScore | ≈ 0.92 (intent) | n/a | n/a | (Sasindran et al., 15 Jan 2024) |

Qualitative analyses confirm that SS penalizes changes in meaning while overlooking benign surface- or noise-level differences, whereas n-gram and patch-based metrics often misalign with human judgment.

4. Adaptive Control and System Optimization

Several frameworks embed SS as a feedback signal to optimize denoising or model selection:

  • Latent Denoising: SS provides a denoising progress meter. Residuals are conditioned on the current SS, with denoising halted if predicted SS drops, preventing semantic drift. At high SNR, the process terminates quickly, reducing inference steps and latency. The monotonic increase of SS imposes practical stability (Xu et al., 11 Feb 2025).
  • Reward Signals and Model Selection: In text generation and RLHF pipelines, STSScore or SemScore can directly guide optimization or validation, obviating the need for brittle heuristics (e.g., BLEU thresholds) (Herbold, 2023, Aynetdinov et al., 30 Jan 2024).
  • Error Sensitivity: In ASR, SeMaScore penalizes segmental substitutions and aligns scoring with error impact, outperforming BERTScore especially in noisy or low-resource regimes (Sasindran et al., 15 Jan 2024).
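The early-stopping control loop described for latent denoising can be sketched as follows; `step_fn` and `ss_predict` are placeholders for the learned denoiser and SS predictor networks, which are not specified here:

```python
def denoise_with_ss_stopping(z0, step_fn, ss_predict, max_steps=50):
    """Iterative denoising that halts once the predicted semantic score
    stops improving, mirroring the adaptive control described above."""
    z, best_ss = z0, ss_predict(z0)
    for t in range(max_steps):
        z_next = step_fn(z, t)
        ss = ss_predict(z_next)
        if ss <= best_ss:      # predicted SS dropped: stop to avoid semantic drift
            return z, t
        z, best_ss = z_next, ss
    return z, max_steps

# Toy demo: each step halves the distance to a target latent, so the SS proxy
# (negative squared distance) rises monotonically and the loop runs to completion.
target = [1.0, 0.0]
step = lambda z, t: [(zi + ti) / 2 for zi, ti in zip(z, target)]
ss = lambda z: -sum((zi - ti) ** 2 for zi, ti in zip(z, target))
z_final, steps = denoise_with_ss_stopping([0.0, 1.0], step, ss)
```

With a noisy or overshooting `step_fn`, the `ss <= best_ss` check is what triggers the early exit and the latency savings described above.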

5. Implementation Protocols and Practical Use

Practical deployment of semantic score metrics leverages standard open-source and pretrained models, often requiring minimal preprocessing:

  • Text: Python APIs are provided for STSScore/SemScore, using HuggingFace Transformers models and mean pooling over token embeddings. Scoring requires only a model checkpoint and tokenizer (Herbold, 2023, Aynetdinov et al., 30 Jan 2024).
  • ASR: SeMaScore uses a Levenshtein mapping followed by mean-pooled contextual embeddings and cosine similarity; the similarity computation has complexity $O(m + n)$ (with $m, n$ the sequence lengths), yielding a ~41× speedup over BERTScore on real datasets (Sasindran et al., 15 Jan 2024).
  • Visual: SeSS relies on efficient segmentation, PSG scene graph generation, and CLIP scoring of masked regions—parameters jointly tuned against large annotated datasets for robust hyperparameter selection (Fan et al., 6 Jun 2024).
  • Human Evaluation: For video captioning, explicit rating rubrics for grammar, objects, and actions yield deterministic, consensus-driven SS (Rahman et al., 2020).
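The mean pooling mentioned for the text metrics can be sketched in NumPy; array shapes follow the usual (batch, sequence, hidden) transformer convention, and the data below is synthetic rather than real model output:

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Mask-aware mean pooling over token embeddings: the step that turns
    transformer token outputs into one sentence vector per input before
    the cosine comparison."""
    mask = attention_mask[:, :, None].astype(float)   # (batch, seq_len, 1)
    summed = (token_embeddings * mask).sum(axis=1)    # padding contributes zero
    counts = np.clip(mask.sum(axis=1), 1e-9, None)    # real tokens per sentence
    return summed / counts

# Two "sentences" of three token slots each; the second has one padded slot.
emb = np.arange(24, dtype=float).reshape(2, 3, 4)
mask = np.array([[1, 1, 1], [1, 1, 0]])
pooled = mean_pool(emb, mask)                         # shape (2, 4)
```

Masking before averaging matters: without it, padding tokens would pull the sentence vector toward zero and distort the cosine score.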

6. Limitations and Current Challenges

Although Semantic Score metrics outperform baseline alternatives, several limitations are documented:

  • Lower Range Compression: SeSS rarely produces scores near zero even for dissimilar images, limiting discrimination in extreme cases (Fan et al., 6 Jun 2024).
  • Dependency on Model Representations: In text and ASR, SS quality is bounded by the encoder's pretraining regime; domain shift degrades reliability (Sasindran et al., 15 Jan 2024).
  • Absence of Formal Guarantees: Desirable properties of the latent-denoising SS, such as its monotonic increase during denoising, are enforced empirically rather than proven theoretically (Xu et al., 11 Feb 2025).
  • Resource and Computational Cost: Visual semantic graph matching incurs higher compute cost than pixel-based metrics due to segmentation and matching steps (Fan et al., 6 Jun 2024).
  • Manual Annotation Demand: Human-centric SS for video captioning and graph-matching hyperparameter tuning in SeSS require extensive expert annotation (Rahman et al., 2020, Fan et al., 6 Jun 2024).

7. Extensions and Domain-Specific Instantiations

Several domain-specific variants and applications of the semantic score have emerged:

  • Speech Recognition: SeMaScore tailors SS to be robust in domains with atypical or low-resource speech, incorporating error rates at segment granularity and outperforming BERTScore on disordered and noisy speech tasks (Sasindran et al., 15 Jan 2024).
  • Instruction-Tuned LLMs: SemScore's embedding-based cosine similarity provides a scalable gold-reference-matching metric with high correlation to human holistic evaluation (Aynetdinov et al., 30 Jan 2024).
  • Semantic Communication: In end-to-end communication systems, SS directly mediates cross-modality mapping (latents ↔ output), enabling efficiency gains and real-time adaptive processing based on semantic convergence (Xu et al., 11 Feb 2025, Fan et al., 6 Jun 2024).
  • Video-to-Text: Human-rated SS decomposes evaluation into interpretable subcomponents (grammar, object, action), overcoming BLEU's insensitivity to semantic content (Rahman et al., 2020).

Overall, Semantic Score (SS) has become an essential metric family for evaluating and optimizing models whose goal is preservation or transfer of meaning, information, or perception rather than mere symbol fidelity, with rigorous foundation and increasing adoption across vision, language, speech, and communication systems.
