Semantic Textual Similarity Gold Ratings
- STS gold ratings are human-annotated scores that quantify semantic similarity across words, phrases, and sentences.
- They serve as critical ground truth for evaluating semantic representation models and have driven advances in methods such as ranking-based annotation and uncertainty modeling.
- Innovative protocols now integrate conditional, embedding-based, and automated techniques to enhance reliability, interpretability, and cross-domain applicability.
Semantic Textual Similarity (STS) gold ratings are reference judgments, often represented as continuous or ordinal scores, assigned by human annotators to quantify the semantic similarity between pairs of linguistic expressions (words, phrases, or sentences). These gold ratings serve as ground truth for evaluating semantic similarity models and are foundational to benchmarking progress in tasks spanning distributional semantics, sentence embedding evaluation, machine translation quality assessment, and meaning-oriented natural language generation. Gold ratings are not merely scalars reflecting human consensus; contemporary methodologies increasingly interrogate their subjectivity, reliability, and interpretability, motivating innovations in annotation design, performance measures, and evaluation protocols across lexical, multimodal, conditional, and uncertainty-aware settings.
1. Historical Evolution and Purpose of STS Gold Ratings
Early STS gold ratings were developed to measure the effectiveness of distributional semantic models in capturing meaning similarity beyond mere topical relatedness or lexical overlap. Datasets such as SimLex-999 (Hill et al., 2014) explicitly focused on quantifying genuine similarity (e.g., “car–bike,” which are similar, vs. “car–petrol,” which are associated but not similar), addressing deficiencies in prior benchmarks like WordSim-353 and MEN where high scores conflated associative relatedness with true semantic resemblance. This drive for more precise gold ratings has been carried into subsequent shared tasks and benchmarks (e.g., SemEval-STS, the STS Benchmark (Cer et al., 2017)), which standardize task formulation, scoring scales (commonly 0–5 or Likert intervals), and annotation protocols to facilitate model comparison and advance semantic representation research.
The purpose of STS gold ratings has broadened from evaluating lexical models to benchmarking sentence-level representation, automatic text generation, machine translation, and even user-preference alignment in recommender systems (Laugier et al., 2023). They now underpin the evaluation of multilingual, cross-lingual, domain-adapted (e.g., clinical (Wang et al., 2018)), and multimodal models (Lacalle et al., 2018).
2. Annotation Methodology, Task Design, and Inter-Rater Agreement
Gold rating construction depends critically on annotation methodology, which directly affects reliability and downstream utility. Traditional approaches employed direct rating scales (often scalar or ordinal, e.g., 0–5) for each pair. However, evidence has shown that such scales can be ill-defined, introducing subjectivity, inconsistent comparative judgments, and low inter-annotator agreement (Avraham et al., 2016).
To address these challenges:
- Ranking-based methods replace numeric scoring with explicit ordering tasks within target groups, reducing ambiguity and increasing agreement. For example, annotators may rank several candidates for a given target word/sentence, obviating the need to compare unrelated pairs.
- Reliability-weighted performance measures further improve the gold standard by penalizing model errors more severely when annotator consensus is high and being more forgiving otherwise. These methods compute metrics of the general form:

$$\mathrm{score} \;=\; \frac{\sum_{(a,b)} p_{(a,b)}\,\delta_{(a,b)}}{\sum_{(a,b)} p_{(a,b)}}$$

where $p_{(a,b)}$ is the annotator preference fraction for the majority ordering of candidate pair $(a,b)$, and $\delta_{(a,b)} \in \{0, 1\}$ encodes whether the model's pairwise prediction agrees with that ordering (Avraham et al., 2016); a computational sketch of this measure follows this list.
- Maximum Difference Scaling (MDS) and drag-and-drop (ordinal) ranking paradigms, as used in word-sentence relatedness evaluation (Glasgow et al., 2016), yield robust, fine-grained gold standards by focusing human annotation on local best/worst or full-sample ranking tasks, showing high cross-task reliability.
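The consensus-weighted measure above can be illustrated with a short Python sketch (a minimal illustration under the notation introduced above, not the exact scoring script of Avraham et al., 2016):

```python
from typing import List, Tuple

def reliability_weighted_accuracy(comparisons: List[Tuple[float, bool]]) -> float:
    """Consensus-weighted pairwise accuracy.

    Each comparison is (p, correct): p is the fraction of annotators preferring
    the majority ordering of two candidate pairs, and `correct` is whether the
    model's pairwise prediction agrees with that ordering.
    """
    numerator = sum(p for p, correct in comparisons if correct)
    denominator = sum(p for p, _ in comparisons)
    return numerator / denominator if denominator else 0.0

# Toy usage: a mistake on a high-consensus comparison (p = 0.95) costs more
# than one on a contentious comparison (p = 0.55).
print(reliability_weighted_accuracy([(0.95, True), (0.55, False), (0.85, True)]))
```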
Crucially, annotation design should mitigate biases such as forced scoring of inherently binary conditions on continuous scales, ambiguous condition interpretations, and the degree of contextual inference allowed. Rigorous guidelines, pilot reannotation, and removal of invalid cases have been shown to boost annotator consistency (Tu et al., 6 Jun 2024).
3. Gold Ratings in Conditional, Uncertainty-Aware, and Structural Paradigms
Recent work has questioned the adequacy of one-size-fits-all scalar gold ratings, highlighting subjectivity and ambiguity in both annotation and interpretation.
- Conditional STS (C-STS): By introducing explicit aspect-oriented conditions (e.g., “the color of the object” or “type of motion”) as free-form prompts, C-STS localizes similarity judgments to defined semantic facets, rather than masking latent weighting over aspects (Deshpande et al., 2023, Tu et al., 6 Jun 2024). This resolves some ambiguity, enabling interpretable decomposition of similarity ratings, but reveals new challenges in condition clarity and annotation reliability, with substantial annotator discrepancy observed in practice.
- Uncertainty-Aware STS: The USTS dataset (Wang et al., 2023) argues that averaging human scores into a single gold label obscures intrinsic variance and, for contentious examples (score standard deviation σ > 0.5), multimodality in the annotation distribution. Analyses show that neither a scalar nor a single Gaussian captures such distributions adequately, motivating gold standards that explicitly encode uncertainty, for example via full probability distributions or Gaussian Mixture Models; a toy illustration follows this list.
- Typed-Feature Structure (TFS): Linguistically motivated approaches advocate structuring gold ratings over decomposed entity attributes, using weighted combinations across feature similarities:

$$\mathrm{sim}(A, B) \;=\; \sum_{i} w_i \cdot \mathrm{sim}\big(f_i^{A}, f_i^{B}\big)$$

Here, $w_i$ carries the importance of each feature, reducing annotator subjectivity and promoting reproducibility, particularly in hierarchical or taxonomically rich domains (Tu et al., 6 Jun 2024); a minimal sketch follows this list.
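As a minimal illustration of the typed-feature-structure combination above (the feature names, weights, and exact-match similarity are illustrative assumptions, not the scheme of Tu et al., 6 Jun 2024), a weighted per-feature similarity can be computed as:

```python
from typing import Callable, Dict

def tfs_similarity(
    feats_a: Dict[str, str],
    feats_b: Dict[str, str],
    weights: Dict[str, float],
    feature_sim: Callable[[str, str], float],
) -> float:
    """Weighted combination of per-feature similarities: sum_i w_i * sim(f_i^A, f_i^B)."""
    return sum(w * feature_sim(feats_a[f], feats_b[f]) for f, w in weights.items())

# Toy example: two decomposed attributes with exact-match feature similarity.
exact_match = lambda x, y: 1.0 if x == y else 0.0
item_a = {"color": "red", "motion": "rolling"}
item_b = {"color": "red", "motion": "sliding"}
print(tfs_similarity(item_a, item_b, {"color": 0.4, "motion": 0.6}, exact_match))  # -> 0.4
```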
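To make the uncertainty-aware point concrete, the sketch below contrasts a single Gaussian with a two-component mixture on hypothetical annotator scores for one contentious item (the scores are invented for illustration, not drawn from the USTS release):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Hypothetical annotator scores (0-5 scale) for a single contentious sentence pair.
scores = np.array([1.0, 1.5, 1.0, 4.0, 4.5, 4.0, 1.5, 4.5]).reshape(-1, 1)

# A single Gaussian summarizes the ratings with one mean and variance ...
single = GaussianMixture(n_components=1, random_state=0).fit(scores)
# ... whereas a two-component mixture can represent a bimodal rating distribution.
mixture = GaussianMixture(n_components=2, random_state=0).fit(scores)

# Lower BIC indicates the better-fitting description of the annotation distribution.
print("BIC, single Gaussian:", single.bic(scores))
print("BIC, 2-component GMM:", mixture.bic(scores))
print("GMM means:", mixture.means_.ravel(), "weights:", mixture.weights_)
```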
4. Statistical and Embedding-Based Evaluation Measures
Practices around the computation and use of gold ratings have increasingly moved toward embedding-based and probabilistic frameworks:
- Cosine Similarity and Sentence Embeddings: Gold ratings are now frequently operationalized as the target for cosine similarity between sentence representations learned via neural models. For sentence vectors $u$ and $v$, similarity is given by:

$$\cos(u, v) \;=\; \frac{u \cdot v}{\lVert u \rVert \, \lVert v \rVert}$$

Embedding-based scoring (e.g., SemScore (Aynetdinov et al., 30 Jan 2024)) correlates more strongly with human judgments than n-gram overlap (BLEU, ROUGE) or edit-based metrics, particularly in paraphrastic or high-variance language settings. This holds both in standard STS evaluation and in reference-free tasks such as machine translation quality estimation (Sun et al., 11 Jun 2024); a minimal evaluation sketch appears after this list.
- Correlation-Based Loss Functions: To move beyond the contrastive-learning plateau (empirically capped near a Spearman correlation of 0.875), Pcc-tuning (Zhang et al., 14 Jun 2024) introduces direct optimization of Pearson's correlation between predicted scores $x_i$ and gold ratings $y_i$:

$$r \;=\; \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\;\sqrt{\sum_i (y_i - \bar{y})^2}}$$

Incorporating explicitly correlation-based objectives aligns the optimization surface with downstream evaluation, allowing models to better approach, or even redefine, gold-standard performance ceilings; a toy loss implementation follows this list.
- Probabilistic and Distribution Modeling: Analysis of multimodality in human rating distributions informs both aggregation strategies (e.g., Bayesian or GMM-based) and calibration of uncertainty in model probabilities (Wang et al., 2023), suggesting that future gold ratings may be distributional rather than scalar objects.
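The first bullet's evaluation loop can be sketched as follows, using randomly generated vectors as stand-ins for real sentence embeddings (any encoder could supply them) and scoring predictions against gold ratings with Pearson and Spearman correlation:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two sentence vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Stand-ins for encoder outputs: one row per sentence in each pair.
rng = np.random.default_rng(0)
emb_a = rng.normal(size=(4, 384))
emb_b = rng.normal(size=(4, 384))
gold = np.array([1.0, 4.2, 2.5, 3.8])  # human gold ratings on a 0-5 scale

pred = np.array([cosine(a, b) for a, b in zip(emb_a, emb_b)])
print("Pearson r:   ", pearsonr(pred, gold)[0])
print("Spearman rho:", spearmanr(pred, gold)[0])
```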
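A correlation-based training objective in the spirit of Pcc-tuning can be sketched in PyTorch as the negative Pearson correlation over a batch of predictions; this is a minimal illustration, not the authors' implementation:

```python
import torch

def pearson_loss(pred: torch.Tensor, gold: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Negative Pearson correlation; minimizing it maximizes agreement with gold ratings."""
    pred_c = pred - pred.mean()
    gold_c = gold - gold.mean()
    r = (pred_c * gold_c).sum() / (pred_c.norm() * gold_c.norm() + eps)
    return -r

# Toy batch: similarity-head outputs vs. 0-5 gold ratings.
pred = torch.tensor([0.20, 0.85, 0.50, 0.90], requires_grad=True)
gold = torch.tensor([1.0, 4.0, 2.5, 5.0])
loss = pearson_loss(pred, gold)
loss.backward()  # gradients flow through the differentiable correlation term
print(float(loss))
```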
5. Domain-Specific, Multimodal, and Application-Centric Gold Ratings
Gold rating methodologies have been adapted for specialized domains and tasks to account for their unique linguistic and semantic characteristics:
- Domain-Specific Gold Ratings: In clinical STS (Wang et al., 2018), gold ratings rely on expert annotations and evaluate high-value entity types (e.g., symptoms, diagnoses, procedures), using 6-point ordinal scales with inter-expert reliability measured via weighted Cohen's Kappa; a minimal agreement-computation sketch follows this list.
- Multimodal STS: Visual STS (vSTS) (Lacalle et al., 2018) introduces gold standards that integrate both textual and visual modalities, pairing human-labeled similarity scores for caption pairs with associated images. Correlation between model predictions and gold ratings is highest when both modalities are fused (Pearson up to 0.78 vs. text-only 0.7).
- Gold Ratings for Generated Text Evaluation: Metrics such as SemScore (Aynetdinov et al., 30 Jan 2024) compare model outputs to gold textual responses using sentence embeddings, achieving strong correlation with human evaluation in both Kendall and Pearson terms. Such embedding-based scoring is now recommended as a gold standard for open-domain generative model assessment.
- Quality Estimation in MT: In QE tasks, textual similarity between source and output serves as a more dependable gold metric than reference-based BLEU or HTER, showing robust correlation with human adequacy judgments in sentence-transformer-based evaluations (Sun et al., 11 Jun 2024).
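As a concrete illustration of the agreement statistic used for clinical STS above, the snippet below computes quadratically weighted Cohen's Kappa between two hypothetical experts rating items on a 6-point ordinal scale (the ratings are invented for illustration):

```python
from sklearn.metrics import cohen_kappa_score

# Two experts rating the same items on a 0-5 ordinal scale.
expert_1 = [0, 3, 5, 2, 4, 1, 3, 5]
expert_2 = [1, 3, 4, 2, 5, 1, 2, 5]

# Quadratic weighting penalizes large ordinal disagreements more than near-misses.
kappa = cohen_kappa_score(expert_1, expert_2, weights="quadratic")
print(f"Weighted Cohen's Kappa: {kappa:.3f}")
```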
6. Automation, Error Identification, and Gold Rating Quality Control
Automation is increasingly applied to ensure the integrity and scalability of gold rating construction:
- LLM-Facilitated Answer Generation and Error Detection: Quality-control pipelines transform conditions into QA-style prompts, use LLMs to generate semantically precise answers, cluster these answers, and flag inconsistencies by comparing cluster-based rankings to the original labels, achieving F1 scores above 80% for error detection in C-STS annotation (Tu et al., 6 Jun 2024); a simplified flagging sketch follows this list.
- Data Generation with LLMs for Model Training: Sim-GPT (Wang et al., 2023) demonstrates that high-quality gold ratings can be semi-automatically generated from GPT-4 outputs, which then serve to fine-tune resource-efficient student models at scale. The method achieves superior downstream STS performance relative to supervised SimCSE and PromCSE baselines.
- Meta-embedding and Ensemble Approaches: Aggregating outputs from diverse pre-trained sentence encoders improves unsupervised alignment with gold ratings, with GCCA-based meta-embeddings surpassing single-source baselines on a range of benchmarks (Poerner et al., 2019).
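The error-flagging step mentioned in the first bullet can be approximated as a rank-discrepancy check: items whose rank under a cluster- or model-derived similarity ranking diverges strongly from the rank implied by the original gold labels are sent back for review (the threshold and data are illustrative assumptions, not the pipeline of Tu et al., 6 Jun 2024):

```python
import numpy as np
from scipy.stats import rankdata

def flag_suspect_items(gold_scores, derived_scores, max_rank_gap=2):
    """Flag items whose derived rank disagrees strongly with the gold rank."""
    gold_ranks = rankdata(gold_scores)
    derived_ranks = rankdata(derived_scores)
    gaps = np.abs(gold_ranks - derived_ranks)
    return [i for i, gap in enumerate(gaps) if gap > max_rank_gap]

# Toy example: item 3 is ranked very differently by the two sources and is flagged.
gold = [1.0, 2.0, 3.5, 5.0, 4.0]
derived = [1.2, 2.1, 3.4, 1.0, 4.2]
print(flag_suspect_items(gold, derived))  # -> [3]
```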
7. Challenges, Limitations, and Research Directions
Several outstanding issues and research opportunities shape ongoing STS gold rating development:
- Subjectivity and Distributionality: Subjectivity in similarity judgments necessitates modeling rating distributions (especially for contentious cases), rather than relying solely on averaged values (Wang et al., 2023).
- Conditionality and Interpretability: While C-STS and TFS-based methods address ambiguity by conditioning, ensuring definition clarity and standardizing condition formulation remains nontrivial (Tu et al., 6 Jun 2024).
- Domain Transferability and Multilinguality: STS gold ratings require careful adaptation for reliability in cross-lingual and highly specialized domains due to variance in linguistic phenomena, annotation practices, and evaluation difficulty (Cer et al., 2017, Wang et al., 2018).
- Alignment of Optimization and Evaluation Objectives: Direct correlation-based loss functions (e.g., Pearson's r in Pcc-tuning (Zhang et al., 14 Jun 2024)) are now favored to maximize model-gold rating agreement, particularly as traditional binary contrastive losses hit empirical plateaus.
- Automation and Scalability: While LLMs and answer generation pipelines scale annotation and labeling, risks remain regarding propagation of LLM biases or format errors (Wang et al., 2023, Tu et al., 6 Jun 2024).
- Future Standardization and Reporting: The field is moving toward explicit reporting of annotation protocols, distributional properties, model selection, and underlying linguistic phenomena relevant to gold rating construction, facilitating robust, interpretable, and generalizable benchmarks.
In sum, STS gold ratings have evolved from simple scalar averages of annotator judgments to richly structured, conditional, uncertainty-aware, and automated artifacts designed to precisely anchor the evaluation of semantic models. Modern best practice emphasizes meticulous annotation design, direct optimization of correlation with human judgment, adaptation to domain and linguistic structure, and proactive error control and distributional modeling—ensuring that gold ratings both reflect and enable the state-of-the-art in semantic representation learning.