- The paper introduces SAGE, a benchmark that rigorously tests semantic understanding through adversarial, noisy, and human-aligned evaluations.
- It compares state-of-the-art embedding models with classical metrics, showing that embeddings align better with human judgment but are far more brittle under text perturbations.
- Results reveal trade-offs between robustness and semantic fidelity, urging the use of defensive architectures and hybrid approaches for real-world deployments.
SAGE: A Comprehensive Benchmark for Semantic Understanding
The SAGE benchmark introduces a rigorous, multi-faceted evaluation protocol for semantic understanding in text similarity models, addressing critical gaps in existing benchmarks. By systematically probing both embedding-based and classical metrics across adversarial, noisy, and human-aligned scenarios, SAGE provides a realistic assessment of model robustness and semantic fidelity, with direct implications for production deployment and future research in semantic representation.
Motivation and Benchmark Design
Traditional benchmarks such as MTEB and BEIR primarily evaluate models under idealized conditions, focusing on retrieval and clustering tasks with clean data. These frameworks fail to capture the nuanced requirements of semantic understanding in real-world applications, where robustness to noise, adversarial perturbations, and alignment with human judgment are paramount. SAGE is designed to address these deficiencies by evaluating models across five challenging categories:
- Human Preference Alignment: Measures correlation and predictive accuracy with human judgments using multi-dimensional ratings and pairwise preferences.
- Transformation Robustness: Assesses resilience to superficial and semantic text perturbations, including character-level noise and meaning-altering transformations.
- Information Sensitivity: Quantifies the ability to detect and proportionally respond to semantic degradation via controlled content insertion and removal.
- Clustering Performance: Evaluates preservation of categorical structure in unsupervised settings using V-measure across diverse domains.
- Retrieval Robustness: Tests effectiveness under adversarially augmented corpora, measuring retention of NDCG@10 across 18 perturbation types.
This holistic approach exposes model limitations that are invisible to conventional benchmarks, providing a more accurate reflection of production readiness.
Experimental Protocol and Implementation
SAGE evaluates nine models and metrics, including five state-of-the-art embedding models (OpenAI text-embedding-3-small/large, Cohere embed-v4.0, Voyage-3-large, Gemini-embedding-001) and four classical metrics (Levenshtein Ratio, ROUGE, Jaccard Similarity, BM25). All embedding models use cosine similarity for downstream tasks. Scores for each category are normalized to [0,1], with the overall SAGE score computed as the unweighted mean.
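The aggregation is simple enough to sketch. Below is a minimal illustration, not the authors' code, of the two details above: cosine similarity over embedding vectors and the overall SAGE score as the unweighted mean of the five normalized category scores (the category key names are placeholders).

```python
# Minimal sketch, not the authors' implementation. Category names are placeholders.
from statistics import mean

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def sage_score(category_scores: dict) -> float:
    """Overall SAGE score: unweighted mean of the five normalized category scores."""
    expected = {
        "human_alignment",
        "transformation_robustness",
        "information_sensitivity",
        "clustering",
        "retrieval_robustness",
    }
    missing = expected - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return mean(category_scores[name] for name in expected)
```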
Human Preference Alignment
- Datasets: OpenAI's summarize_from_feedback, with 193,841 rows of multi-dimensional ratings and 64,832 pairwise preferences.
- Metrics: Pearson correlation for multi-dimensional ratings (normalized to [0,1]), and classification accuracy, precision, recall, and F1 for pairwise preferences (see the sketch after this list).
- Findings: Embedding models outperform classical metrics in aligning with human judgments (e.g., text-embedding-3-large achieves 0.682 vs. BM25 at 0.591).
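A hedged sketch of the alignment metrics above. Mapping Pearson r to [0,1] as (r + 1) / 2 is an assumption of this sketch, as are the function names; the paper only states that correlations are normalized to [0,1].

```python
# Illustrative only; the (r + 1) / 2 normalization is an assumed mapping to [0, 1].
import numpy as np
from scipy.stats import pearsonr


def normalized_pearson(similarity_scores: np.ndarray, human_ratings: np.ndarray) -> float:
    """Pearson correlation between model similarities and human ratings, mapped to [0, 1]."""
    r, _ = pearsonr(similarity_scores, human_ratings)
    return (r + 1) / 2


def pairwise_accuracy(sim_a: np.ndarray, sim_b: np.ndarray,
                      human_prefers_a: np.ndarray) -> float:
    """Fraction of pairs where the higher-scoring candidate matches the human preference."""
    model_prefers_a = sim_a > sim_b
    return float(np.mean(model_prefers_a == human_prefers_a))
```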
Transformation Robustness
- Datasets: BillSum, CNN/DailyMail, PubMed, totaling 48,531 document-summary pairs.
- Perturbations: Three superficial (random capitalization, character deletion, numerization) and three semantic (negation toggling, sentence shuffling, word shuffling).
- Evaluation: Ordinal relationships among similarity scores; the robustness score is the percentage of instances that maintain the expected hierarchy (see the sketch after this list).
- Findings: Classical metrics (Levenshtein Ratio: 0.333) outperform embeddings (max 0.319) in robustness, with embedding models exhibiting extreme brittleness (e.g., text-embedding-3-small: 0.011).
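The sketch below illustrates one superficial perturbation and the ordinal check described above. Reading the "expected hierarchy" as "a superficially perturbed text should score closer to the original than a semantically altered one" is an assumption here, as are all function names.

```python
# Illustrative sketch; the hierarchy check is a simplified reading of the protocol.
import random


def random_capitalization(text: str, p: float = 0.1, seed: int = 0) -> str:
    """Superficial perturbation: flip the case of roughly a fraction p of characters."""
    rng = random.Random(seed)
    return "".join(c.swapcase() if rng.random() < p else c for c in text)


def maintains_hierarchy(sim_superficial: float, sim_semantic: float) -> bool:
    """True if the superficial variant is judged more similar to the original."""
    return sim_superficial > sim_semantic


def robustness_score(pairs: list) -> float:
    """Share of (superficial, semantic) similarity pairs whose expected ordering holds."""
    return sum(maintains_hierarchy(s, m) for s, m in pairs) / len(pairs)
```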
Information Sensitivity
- Datasets: Six domains, 457,420 documents.
- Perturbations: Needle-in-haystack insertion and token-based removal at varying proportions and positions.
- Metric: Sensitivity score based on the mean absolute error from a theoretical degradation curve (see the sketch after this list).
- Findings: Jaccard Similarity achieves 0.905, outperforming all embeddings (max 0.794), indicating superior detection of semantic degradation.
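A sketch of the sensitivity computation under two explicit assumptions: that the theoretical degradation curve is linear in the fraction of content removed, and that the score is one minus the mean absolute error. Neither detail is taken from the paper.

```python
# Assumed linear degradation curve and 1 - MAE scoring; both are illustrative choices.
import numpy as np


def sensitivity_score(removed_fractions: np.ndarray, observed_sims: np.ndarray) -> float:
    """Higher when similarity scores track the expected degradation as content is removed."""
    expected_sims = 1.0 - removed_fractions          # assumed theoretical curve
    mae = np.mean(np.abs(observed_sims - expected_sims))
    return float(1.0 - mae)


# A metric that barely reacts to heavy removal earns a lower score than one that tracks the curve.
fractions = np.array([0.0, 0.25, 0.5, 0.75])
flat_response = np.array([1.0, 0.97, 0.95, 0.93])
print(sensitivity_score(fractions, flat_response))
```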
Clustering Performance
- Datasets: 11 MTEB clustering datasets.
- Metric: V-measure, combining homogeneity and completeness (see the example after this list).
- Findings: Embedding models dominate (text-embedding-3-small: 0.483 vs. BM25: 0.209), confirming their strength in unsupervised semantic structuring.
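V-measure is available directly in scikit-learn, where it is the harmonic mean of homogeneity and completeness. The KMeans-over-random-vectors setup below is only a stand-in for clustering real embeddings.

```python
# Minimal V-measure example; random vectors stand in for real model embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))       # placeholder for document embeddings
true_labels = rng.integers(0, 4, size=200)    # gold category labels

pred_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
print(v_measure_score(true_labels, pred_labels))
```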
Retrieval Robustness
- Datasets: Full BEIR benchmark, 18 datasets.
- Perturbations: 18 adversarial transformations per document.
- Metric: Harmonic mean of NDCG@10 retention ratios (see the sketch after this list).
- Findings: Embedding models outperform classical metrics, but even the best (text-embedding-3-large: 0.457) retains less than half its effectiveness under noise.
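A sketch of the aggregation above: each dataset contributes a retention ratio (NDCG@10 on the adversarially augmented corpus divided by NDCG@10 on the clean corpus), and the category score is the harmonic mean of those ratios. The dataset names and numbers below are purely illustrative.

```python
# Illustrative aggregation; dataset names and NDCG values are made-up placeholders.
from statistics import harmonic_mean


def retrieval_robustness(clean_ndcg: dict, perturbed_ndcg: dict) -> float:
    """Harmonic mean of per-dataset NDCG@10 retention ratios."""
    ratios = [perturbed_ndcg[name] / clean_ndcg[name] for name in clean_ndcg]
    return harmonic_mean(ratios)


clean = {"dataset_a": 0.70, "dataset_b": 0.35, "dataset_c": 0.40}
perturbed = {"dataset_a": 0.40, "dataset_b": 0.15, "dataset_c": 0.18}
print(retrieval_robustness(clean, perturbed))
```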
Overall Results and Trade-offs
SAGE reveals that no single model or metric excels across all dimensions. Embedding models lead in human alignment, clustering, and retrieval, but classical metrics are superior in information sensitivity and transformation robustness. Notably, text-embedding-3-small achieves the highest clustering score (0.483) but is the most brittle under transformation (0.011). The best classical metric (Jaccard: 0.423) trails all embeddings in overall score, yet outperforms them by 14% on sensitivity tasks, directly contradicting MTEB rankings.
These findings highlight the benchmark-production gap: models that excel on clean academic datasets may fail catastrophically in noisy, adversarial environments. SAGE demonstrates that aggregate scores are insufficient for model selection; task-specific trade-offs must be considered, and defensive architectures (e.g., data cleaning, reranking, filtering) are necessary for robust deployment.
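As a purely hypothetical illustration of such a defensive layer (none of this is from the paper; names and thresholds are placeholders): clean inputs before embedding, then filter and rerank retrieved candidates with a cheap lexical check so that a brittle embedding score is never trusted alone.

```python
# Hypothetical defensive wrapper: input cleaning plus lexical filtering/reranking.
import re


def clean_text(text: str) -> str:
    """Normalize whitespace before embedding to blunt superficial perturbations."""
    return re.sub(r"\s+", " ", text).strip()


def lexical_overlap(query: str, doc: str) -> float:
    """Jaccard overlap of token sets, used as a cheap sanity filter."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query: str, candidates: list, min_overlap: float = 0.05,
           alpha: float = 0.7) -> list:
    """Drop candidates with no lexical support, then blend embedding and lexical scores."""
    kept = [(doc, s) for doc, s in candidates if lexical_overlap(query, doc) >= min_overlap]
    blended = [(doc, alpha * s + (1 - alpha) * lexical_overlap(query, doc)) for doc, s in kept]
    return sorted(blended, key=lambda item: item[1], reverse=True)
```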
Implications and Future Directions
SAGE's comprehensive evaluation protocol has significant implications for both research and practice:
- Benchmarking: Future benchmarks must incorporate real-world corruptions, adversarial augmentation, and production constraints (latency, memory) to provide meaningful guidance for deployment.
- Model Selection: Practitioners should interpret published scores as upper bounds, deploy with safeguards, and select models based on application-specific requirements and data characteristics.
- Research: The observed brittleness and trade-offs motivate development of hybrid approaches, ensemble methods, and new architectures that balance semantic fidelity with robustness.
- Evaluation Methodology: SAGE's adversarial and human-aligned tasks set a new standard for semantic evaluation, encouraging the community to move beyond narrow retrieval and clustering metrics.
Conclusion
SAGE establishes a new paradigm for semantic understanding evaluation, exposing critical limitations and trade-offs in current models and metrics. Its multi-dimensional, adversarially robust protocol provides a realistic assessment of production readiness, challenging the field to develop models and benchmarks that reflect the complexity of real-world semantic tasks. Future work should extend SAGE with broader data diversity, more sophisticated corruptions, and integration of operational constraints, fostering a more rigorous and balanced approach to semantic evaluation in AI.