Semantic Grounding Index (SGI)
- Semantic Grounding Index (SGI) is a geometric metric that quantifies context engagement in RAG systems by measuring angular distances between queries, responses, and contexts.
- SGI leverages a ratio of geodesic distances on the unit hypersphere to distinguish valid, context-grounded responses from hallucinated, semantically lazy outputs.
- Empirical benchmarks across multiple embedding models confirm SGI's efficacy, demonstrating robust discrimination and calibration for hallucination detection in production pipelines.
The Semantic Grounding Index (SGI) is a geometric metric for quantifying context engagement in retrieval-augmented generation (RAG) systems. Defined as the ratio of angular distances in embedding space between a generated response, the original question, and the retrieved context, SGI provides both theoretical and empirical guarantees for detecting hallucinated outputs. Its utility is rooted in the phenomenon of "semantic laziness": hallucinated responses remain angularly close to the question rather than moving meaningfully toward the context. SGI offers computationally efficient, model-agnostic assessments of grounding and enables calibrated hallucination risk estimates for production RAG pipelines (Marín, 15 Dec 2025).
1. Formal Definition and Geometric Foundations
SGI operates on the unit hypersphere S^{d−1}, where RAG questions, retrieved contexts, and generated responses are embedded as unit vectors q, c, and r, respectively. The natural distance metric for these normalized embeddings is the geodesic (angular) distance

d_g(x, y) = arccos(x · y).

For any RAG triple (q, c, r), the Semantic Grounding Index is defined as the ratio of the response-question angle to the response-context angle:

SGI(q, c, r) = d_g(r, q) / d_g(r, c).

Interpretation:
- SGI > 1: The response is angularly closer to the context than to the question (evidence of valid grounding).
- SGI < 1: The response is closer to the question, indicating semantic laziness (hallucination signature).
The metric’s geometric properties are governed by the spherical triangle inequality

|d_g(q, c) − d_g(r, c)| ≤ d_g(r, q) ≤ d_g(q, c) + d_g(r, c),

which, writing θ_qc = d_g(q, c) for the question-context angle, yields bounds on SGI:

|θ_qc / d_g(r, c) − 1| ≤ SGI ≤ θ_qc / d_g(r, c) + 1.

As θ_qc increases, the allowed range of SGI widens, providing better potential discrimination between grounded and hallucinated responses.
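The definition can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's reference implementation; the function names, the ε guard, and the toy unit vectors are ours:

```python
import numpy as np

def geodesic_distance(x, y):
    """Angular (geodesic) distance between unit vectors: arccos of their dot product."""
    # Clip protects arccos from floating-point values slightly outside [-1, 1].
    return float(np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)))

def sgi(q, c, r, eps=1e-8):
    """Semantic Grounding Index for unit-normalized question q, context c, response r.

    SGI > 1: the response is angularly closer to the context (grounded).
    SGI < 1: the response stays closer to the question (semantic laziness).
    """
    return geodesic_distance(r, q) / (geodesic_distance(r, c) + eps)

# Toy example on the unit circle: q and c are orthogonal, r hugs the context.
q = np.array([1.0, 0.0])
c = np.array([0.0, 1.0])
r = np.array([0.2, 0.98]); r /= np.linalg.norm(r)
print(sgi(q, c, r))  # > 1: the response engages the context
```

In production the inputs would be normalized embeddings from a sentence-transformer; the geometry itself is model-agnostic.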
2. Empirical Validation and Performance Benchmarks
Empirical analyses were conducted on HaluEval, benchmarking SGI across five sentence-transformer embedding models:
- all-mpnet-base-v2 (768d)
- all-MiniLM-L6-v2 (384d)
- bge-base-en-v1.5 (768d)
- e5-base-v2 (768d)
- gte-base (768d)
Key findings are summarized in the table below:
| Metric | Valid Responses | Hallucinated Responses | Effect Size / Correlation |
|---|---|---|---|
| Mean SGI | 1.188 | 0.913 | Cohen’s d (large) |
| ROC-AUC (per model) | – | – | 0.776–0.824 |
| Cross-model Pearson r | – | – | high off-diagonal mean |
| Cross-model Spearman ρ | – | – | consistent across models |
When results are stratified by the question-context angle θ_qc, as predicted by the theoretical bounds, discriminative power increases monotonically with θ_qc:
- Low θ_qc: AUC = 0.721
- Medium θ_qc: AUC = 0.768
- High θ_qc: AUC = 0.832
These results confirm that SGI’s effectiveness grows with divergent question and context embeddings.
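This stratification is straightforward to reproduce: precompute θ_qc for every triple and split the evaluation set before scoring. The sketch below uses a tertile split, which is our reading of the low/medium/high bins; the helper names are ours:

```python
import numpy as np

def question_context_angle(q, c):
    """theta_qc: geodesic angle between unit-normalized question and context embeddings."""
    return float(np.arccos(np.clip(np.dot(q, c), -1.0, 1.0)))

def tertile_bins(angles):
    """Assign each triple to a low (0), medium (1), or high (2) theta_qc stratum."""
    edges = np.quantile(angles, [1 / 3, 2 / 3])  # inner tertile boundaries
    return np.digitize(angles, edges)

# Example: six triples with increasing question-context angles (radians).
angles = np.array([0.1, 0.5, 1.0, 1.4, 2.0, 2.5])
print(tertile_bins(angles))  # -> [0 0 1 1 2 2]
```

Per-stratum AUC would then be computed within each bin, so that the widening theoretical bounds can be checked against observed discrimination.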
3. Subgroup Analyses and Calibration
Subgroup analyses demonstrate robustness and highlight scenarios where SGI is particularly discriminative:
- Question length:
- Short: AUC = 0.812
- Medium: AUC = 0.781
- Long: AUC = 0.714
- Context length:
- Stable AUC across all context-length bins
- Response length:
- Short: AUC = 0.771
- Medium: AUC = 0.804
- Long: AUC = 0.893
SGI is especially effective for detecting hallucinations on long responses and short, focused questions.
Calibration analysis via reliability diagrams supports using SGI as a probabilistic hallucination-risk estimator (Expected Calibration Error ECE = 0.10): min–max-normalized SGI scores can serve as probability estimates. Hallucination rates decline from ~100% in the lowest SGI decile to ~65% in the highest.
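A reliability-style check of this kind can be reproduced by min–max normalizing SGI on held-out data and measuring the empirical hallucination rate per decile. The sketch below uses synthetic scores and labels; the helper names and normalization bounds are illustrative, not from the paper:

```python
import numpy as np

def minmax_normalize(scores, lo, hi):
    """Map raw SGI scores into [0, 1] using bounds fitted on held-out data."""
    return np.clip((scores - lo) / (hi - lo + 1e-12), 0.0, 1.0)

def decile_hallucination_rates(scores, labels, n_bins=10):
    """Empirical hallucination rate per score decile (labels: 1 = hallucinated)."""
    edges = np.quantile(scores, np.linspace(0, 1, n_bins + 1))
    # digitize against the inner edges; clip keeps the max score in the top bin.
    bins = np.clip(np.digitize(scores, edges[1:-1]), 0, n_bins - 1)
    return np.array([labels[bins == b].mean() for b in range(n_bins)])

# Synthetic demo: low-SGI deciles should show high hallucination rates.
rng = np.random.default_rng(0)
sgi_scores = rng.uniform(0.5, 1.5, size=1000)
labels = (sgi_scores < 1.0).astype(float)  # pretend SGI < 1 is always hallucinated
rates = decile_hallucination_rates(minmax_normalize(sgi_scores, 0.5, 1.5), labels)
print(rates)  # high rates in the low deciles, low rates in the high deciles
```

On real annotations the per-decile rates double as the calibration curve from which ECE is read off.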
4. Distinction from Factual Accuracy
A critical negative result on TruthfulQA demonstrates that SGI measures semantic (topical) engagement, not factual correctness:
- Truthful and false answers to the same question cluster at nearly identical SGI values.
- Cohen’s d is negligible, with AUC = 0.478 (worse than chance).
This outcome indicates that angular embedding geometry is sensitive to topical alignment rather than verifying truth, and cannot discriminate factuality once the topic is held constant.
5. Practical Implementation in RAG Systems
Production integration of SGI is efficient and low-overhead:
- Precompute the question-context angle θ_qc = d_g(q, c) to estimate overall discriminative capacity; a low θ_qc predicts a weaker SGI signal.
- Recommended embedding models include all-MiniLM-L6-v2 for efficiency, with mpnet, bge, e5, and gte as alternatives.
- Standard workflow: for each (q, c, r) tuple, compute embeddings, apply L2 normalization, obtain two dot products, then two arccos evaluations. Add a small ε to the denominator to avoid division by zero.
- Raw SGI < 1 flags likely hallucination; for risk stratification or probabilistic calibration, bin SGI and calibrate on held-out data.
- SGI is most impactful on long responses and short, information-dense questions—contexts where manual verification cost is highest.
- Combining SGI with complementary techniques, such as NLI-based entailment, extends coverage to both semantic disengagement and logical inconsistency.
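The workflow above can be assembled into a single check. The helper below is a hypothetical sketch; in practice the raw vectors would come from one of the listed encoders such as all-MiniLM-L6-v2:

```python
import numpy as np

def sgi_check(q_emb, c_emb, r_emb, eps=1e-8, threshold=1.0):
    """Compute SGI for one (question, context, response) triple and flag likely hallucination.

    Inputs are raw (unnormalized) embedding vectors; in production they would
    come from a sentence-transformer encoder such as all-MiniLM-L6-v2.
    """
    # Step 1: L2-normalize onto the unit hypersphere.
    q, c, r = (v / np.linalg.norm(v) for v in (q_emb, c_emb, r_emb))
    # Step 2: two dot products, then two arccos evaluations (geodesic angles).
    d_rq = np.arccos(np.clip(np.dot(r, q), -1.0, 1.0))
    d_rc = np.arccos(np.clip(np.dot(r, c), -1.0, 1.0))
    # Step 3: ratio with epsilon guarding against a zero denominator.
    score = d_rq / (d_rc + eps)
    return score, bool(score < threshold)  # True = likely hallucination

# Toy check: a response that parrots the question gets flagged.
q_emb = np.array([2.0, 0.0, 0.0])
c_emb = np.array([0.0, 3.0, 0.0])
score, flagged = sgi_check(q_emb, c_emb, np.array([0.9, 0.1, 0.0]))
print(score, flagged)  # SGI < 1, flagged as likely hallucination
```

The threshold of 1.0 mirrors the raw-SGI decision rule; a calibrated deployment would replace the boolean with a binned risk estimate.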
6. The Semantic Laziness Hypothesis
SGI formalizes the "semantic laziness" hypothesis: responses flagged as hallucinated by human raters systematically display SGI < 1, while grounded responses exhibit higher SGI. This is captured by the inequality d_g(r, q) < d_g(r, c) for hallucinated responses: they remain angularly closer to the question than to the retrieved context. This regularity, validated by large effect sizes and consistent cross-model correlations, underpins SGI’s practical and theoretical merit for detecting context disengagement in RAG systems (Marín, 15 Dec 2025).
SGI provides a theoretically justified and empirically validated index for measuring the degree to which generated responses are semantically grounded in the retrieved context of RAG systems. Its geometric formulation, strong association with manual hallucination annotations, and efficient implementation collectively substantiate its relevance for large-scale, production-level LLM pipelines (Marín, 15 Dec 2025).