- The paper introduces SAGE, a benchmark that rigorously tests semantic understanding through adversarial, noisy, and human-aligned evaluations.
- It compares state-of-the-art embedding models with classical metrics, showing that embeddings align better with human judgment but are far more brittle under text perturbations.
- Results reveal trade-offs between robustness and semantic fidelity, urging the use of defensive architectures and hybrid approaches for real-world deployments.
SAGE: A Comprehensive Benchmark for Semantic Understanding
The SAGE benchmark introduces a rigorous, multi-faceted evaluation protocol for semantic understanding in text similarity models, addressing critical gaps in existing benchmarks. By systematically probing both embedding-based and classical metrics across adversarial, noisy, and human-aligned scenarios, SAGE provides a realistic assessment of model robustness and semantic fidelity, with direct implications for production deployment and future research in semantic representation.
Motivation and Benchmark Design
Traditional benchmarks such as MTEB and BEIR primarily evaluate models under idealized conditions, focusing on retrieval and clustering tasks with clean data. These frameworks fail to capture the nuanced requirements of semantic understanding in real-world applications, where robustness to noise, adversarial perturbations, and alignment with human judgment are paramount. SAGE is designed to address these deficiencies by evaluating models across five challenging categories:
- Human Preference Alignment: Measures correlation and predictive accuracy with human judgments using multi-dimensional ratings and pairwise preferences.
- Transformation Robustness: Assesses resilience to superficial and semantic text perturbations, including character-level noise and meaning-altering transformations.
- Information Sensitivity: Quantifies the ability to detect and proportionally respond to semantic degradation via controlled content insertion and removal.
- Clustering Performance: Evaluates preservation of categorical structure in unsupervised settings using V-measure across diverse domains.
- Retrieval Robustness: Tests effectiveness under adversarially augmented corpora, measuring retention of NDCG@10 across 18 perturbation types.
This holistic approach exposes model limitations that are invisible to conventional benchmarks, providing a more accurate reflection of production readiness.
Experimental Protocol and Implementation
SAGE evaluates nine models and metrics, including five state-of-the-art embedding models (OpenAI text-embedding-3-small/large, Cohere embed-v4.0, Voyage-3-large, Gemini-embedding-001) and four classical metrics (Levenshtein Ratio, ROUGE, Jaccard Similarity, BM25). All embedding models use cosine similarity for downstream tasks. Scores for each category are normalized to [0,1], with the overall SAGE score computed as the unweighted mean.
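The aggregation is simple enough to sketch. Below is a minimal illustration, not the authors' code, of the two details above: cosine similarity over embedding vectors and the overall SAGE score as the unweighted mean of the five normalized category scores (the category key names are placeholders).

```python
# Minimal sketch, not the authors' implementation. Category names are placeholders.
from statistics import mean

import numpy as np


def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def sage_score(category_scores: dict) -> float:
    """Overall SAGE score: unweighted mean of the five normalized category scores."""
    expected = {
        "human_alignment",
        "transformation_robustness",
        "information_sensitivity",
        "clustering",
        "retrieval_robustness",
    }
    missing = expected - category_scores.keys()
    if missing:
        raise ValueError(f"missing category scores: {sorted(missing)}")
    return mean(category_scores[name] for name in expected)
```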
Human Preference Alignment
- Datasets: OpenAI's summarize_from_feedback, with 193,841 rows of multi-dimensional ratings and 64,832 pairwise preferences.
- Metrics: Pearson correlation for multi-dimensional ratings (normalized to [0,1]), and classification accuracy, precision, recall, and F1 for pairwise preferences (see the sketch after this list).
- Findings: Embedding models outperform classical metrics in aligning with human judgments (e.g., text-embedding-3-large achieves 0.682 vs. BM25 at 0.591).
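A hedged sketch of the alignment metrics above. Mapping Pearson r to [0,1] as (r + 1) / 2 is an assumption of this sketch, as are the function names; the paper only states that correlations are normalized to [0,1].

```python
# Illustrative only; the (r + 1) / 2 normalization is an assumed mapping to [0, 1].
import numpy as np
from scipy.stats import pearsonr


def normalized_pearson(similarity_scores: np.ndarray, human_ratings: np.ndarray) -> float:
    """Pearson correlation between model similarities and human ratings, mapped to [0, 1]."""
    r, _ = pearsonr(similarity_scores, human_ratings)
    return (r + 1) / 2


def pairwise_accuracy(sim_a: np.ndarray, sim_b: np.ndarray,
                      human_prefers_a: np.ndarray) -> float:
    """Fraction of pairs where the higher-scoring candidate matches the human preference."""
    model_prefers_a = sim_a > sim_b
    return float(np.mean(model_prefers_a == human_prefers_a))
```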
Transformation Robustness
- Datasets: BillSum, CNN/DailyMail, PubMed, totaling 48,531 document-summary pairs.
- Perturbations: Three superficial (random capitalization, character deletion, numerization) and three semantic (negation toggling, sentence shuffling, word shuffling).
- Evaluation: Ordinal relationships among similarity scores; the robustness score is the percentage of instances that maintain the expected hierarchy (see the sketch after this list).
- Findings: Classical metrics (Levenshtein Ratio: 0.333) outperform embeddings (max 0.319) in robustness, with embedding models exhibiting extreme brittleness (e.g., text-embedding-3-small: 0.011).
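The sketch below illustrates one superficial perturbation and the ordinal check described above. Reading the "expected hierarchy" as "a superficially perturbed text should score closer to the original than a semantically altered one" is an assumption here, as are all function names.

```python
# Illustrative sketch; the hierarchy check is a simplified reading of the protocol.
import random


def random_capitalization(text: str, p: float = 0.1, seed: int = 0) -> str:
    """Superficial perturbation: flip the case of roughly a fraction p of characters."""
    rng = random.Random(seed)
    return "".join(c.swapcase() if rng.random() < p else c for c in text)


def maintains_hierarchy(sim_superficial: float, sim_semantic: float) -> bool:
    """True if the superficial variant is judged more similar to the original."""
    return sim_superficial > sim_semantic


def robustness_score(pairs: list) -> float:
    """Share of (superficial, semantic) similarity pairs whose expected ordering holds."""
    return sum(maintains_hierarchy(s, m) for s, m in pairs) / len(pairs)
```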
Information Sensitivity
- Datasets: Six domains, 457,420 documents.
- Perturbations: Needle-in-haystack insertion and token-based removal at varying proportions and positions.
- Metric: Sensitivity score based on the mean absolute error from a theoretical degradation curve (see the sketch after this list).
- Findings: Jaccard Similarity achieves 0.905, outperforming all embeddings (max 0.794), indicating superior detection of semantic degradation.
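A sketch of the sensitivity computation under two explicit assumptions: that the theoretical degradation curve is linear in the fraction of content removed, and that the score is one minus the mean absolute error. Neither detail is taken from the paper.

```python
# Assumed linear degradation curve and 1 - MAE scoring; both are illustrative choices.
import numpy as np


def sensitivity_score(removed_fractions: np.ndarray, observed_sims: np.ndarray) -> float:
    """Higher when similarity scores track the expected degradation as content is removed."""
    expected_sims = 1.0 - removed_fractions          # assumed theoretical curve
    mae = np.mean(np.abs(observed_sims - expected_sims))
    return float(1.0 - mae)


# A metric that barely reacts to heavy removal earns a lower score than one that tracks the curve.
fractions = np.array([0.0, 0.25, 0.5, 0.75])
flat_response = np.array([1.0, 0.97, 0.95, 0.93])
print(sensitivity_score(fractions, flat_response))
```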
Clustering Performance
- Datasets: 11 MTEB clustering datasets.
- Metric: V-measure, combining homogeneity and completeness (see the example after this list).
- Findings: Embedding models dominate (text-embedding-3-small: 0.483 vs. BM25: 0.209), confirming their strength in unsupervised semantic structuring.
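V-measure is available directly in scikit-learn, where it is the harmonic mean of homogeneity and completeness. The KMeans-over-random-vectors setup below is only a stand-in for clustering real embeddings.

```python
# Minimal V-measure example; random vectors stand in for real model embeddings.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import v_measure_score

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(200, 32))       # placeholder for document embeddings
true_labels = rng.integers(0, 4, size=200)    # gold category labels

pred_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(embeddings)
print(v_measure_score(true_labels, pred_labels))
```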
Retrieval Robustness
- Datasets: Full BEIR benchmark, 18 datasets.
- Perturbations: 18 adversarial transformations per document.
- Metric: Harmonic mean of NDCG@10 retention ratios (see the sketch after this list).
- Findings: Embedding models outperform classical metrics, but even the best (text-embedding-3-large: 0.457) retains less than half its effectiveness under noise.
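A sketch of the aggregation above: each dataset contributes a retention ratio (NDCG@10 on the adversarially augmented corpus divided by NDCG@10 on the clean corpus), and the category score is the harmonic mean of those ratios. The dataset names and numbers below are purely illustrative.

```python
# Illustrative aggregation; dataset names and NDCG values are made-up placeholders.
from statistics import harmonic_mean


def retrieval_robustness(clean_ndcg: dict, perturbed_ndcg: dict) -> float:
    """Harmonic mean of per-dataset NDCG@10 retention ratios."""
    ratios = [perturbed_ndcg[name] / clean_ndcg[name] for name in clean_ndcg]
    return harmonic_mean(ratios)


clean = {"dataset_a": 0.70, "dataset_b": 0.35, "dataset_c": 0.40}
perturbed = {"dataset_a": 0.40, "dataset_b": 0.15, "dataset_c": 0.18}
print(retrieval_robustness(clean, perturbed))
```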
Overall Results and Trade-offs
SAGE reveals that no single model or metric excels across all dimensions. Embedding models lead in human alignment, clustering, and retrieval, but classical metrics are superior in information sensitivity and transformation robustness. Notably, text-embedding-3-small achieves the highest clustering score (0.483) but is the most brittle under transformation (0.011). The best classical metric (Jaccard: 0.423) trails all embeddings in overall score, yet outperforms them by 14% on sensitivity tasks, directly contradicting MTEB rankings.
These findings highlight the benchmark-production gap: models that excel on clean academic datasets may fail catastrophically in noisy, adversarial environments. SAGE demonstrates that aggregate scores are insufficient for model selection; task-specific trade-offs must be considered, and defensive architectures (e.g., data cleaning, reranking, filtering) are necessary for robust deployment.
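As a purely hypothetical illustration of such a defensive layer (none of this is from the paper; names and thresholds are placeholders): clean inputs before embedding, then filter and rerank retrieved candidates with a cheap lexical check so that a brittle embedding score is never trusted alone.

```python
# Hypothetical defensive wrapper: input cleaning plus lexical filtering/reranking.
import re


def clean_text(text: str) -> str:
    """Normalize whitespace before embedding to blunt superficial perturbations."""
    return re.sub(r"\s+", " ", text).strip()


def lexical_overlap(query: str, doc: str) -> float:
    """Jaccard overlap of token sets, used as a cheap sanity filter."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if q | d else 0.0


def rerank(query: str, candidates: list, min_overlap: float = 0.05,
           alpha: float = 0.7) -> list:
    """Drop candidates with no lexical support, then blend embedding and lexical scores."""
    kept = [(doc, s) for doc, s in candidates if lexical_overlap(query, doc) >= min_overlap]
    blended = [(doc, alpha * s + (1 - alpha) * lexical_overlap(query, doc)) for doc, s in kept]
    return sorted(blended, key=lambda item: item[1], reverse=True)
```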
Implications and Future Directions
SAGE's comprehensive evaluation protocol has significant implications for both research and practice:
- Benchmarking: Future benchmarks must incorporate real-world corruptions, adversarial augmentation, and production constraints (latency, memory) to provide meaningful guidance for deployment.
- Model Selection: Practitioners should interpret published scores as upper bounds, deploy with safeguards, and select models based on application-specific requirements and data characteristics.
- Research: The observed brittleness and trade-offs motivate development of hybrid approaches, ensemble methods, and new architectures that balance semantic fidelity with robustness.
- Evaluation Methodology: SAGE's adversarial and human-aligned tasks set a new standard for semantic evaluation, encouraging the community to move beyond narrow retrieval and clustering metrics.
Conclusion
SAGE establishes a new paradigm for semantic understanding evaluation, exposing critical limitations and trade-offs in current models and metrics. Its multi-dimensional, adversarially robust protocol provides a realistic assessment of production readiness, challenging the field to develop models and benchmarks that reflect the complexity of real-world semantic tasks. Future work should extend SAGE with broader data diversity, more sophisticated corruptions, and integration of operational constraints, fostering a more rigorous and balanced approach to semantic evaluation in AI.