Sentence-level Hallucination Ratio (SHR)
- Sentence-level Hallucination Ratio (SHR) is a metric that measures the proportion of generated sentences containing fabricated or non-grounded entities.
- The ratio is computed through modality-specific pipelines involving sentence segmentation, object detection, and alignment of hypothesis and reference, tailored to multimodal and ASR outputs.
- Severity-aware variants of SHR weight errors based on their impact, enhancing its utility in benchmarking and optimizing AI models.
The Sentence-level Hallucination Ratio (SHR) is a quantitative metric designed to measure the incidence of hallucinated output, defined as sentences containing fabricated or non-grounded entities, within model-generated text. SHR has emerged as a central diagnostic tool for multimodal LLMs (MLLMs), vision-language models (VLMs), and automatic speech recognition (ASR) systems, allowing researchers to isolate and quantify hallucination phenomena at a fine-grained, interpretable level (Peng et al., 16 Jul 2025, Xiao et al., 2024, Koudounas et al., 18 Oct 2025).
1. Formal Definition and Mathematical Formulation
SHR is formally defined as the fraction of generated sentences flagged as hallucinated out of the total number of generated sentences within a given evaluation set. Let $N$ be the total number of sentences generated by a model, and $N_h$ the count of sentences classified as hallucinated. The metric is expressed as:

$$\mathrm{SHR} = \frac{N_h}{N} = \frac{1}{N}\sum_{i=1}^{N} \mathbb{1}\big[s_i \text{ is hallucinated}\big],$$

where $s_i$ denotes the $i$-th generated sentence and $\mathbb{1}[\cdot]$ is the indicator function returning 1 for hallucinated sentences and 0 otherwise (Peng et al., 16 Jul 2025, Xiao et al., 2024, Koudounas et al., 18 Oct 2025). In the context of ASR outputs, SHR is defined identically, with hallucination determined by the presence of tokens not grounded in the acoustic input (Koudounas et al., 18 Oct 2025).
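The definition above reduces to a few lines of code. The following is a minimal sketch, not an implementation from any of the cited papers; the function name and the boolean-flag representation of per-sentence labels are illustrative assumptions.

```python
from typing import Sequence

def shr(hallucinated: Sequence[bool]) -> float:
    """Sentence-level Hallucination Ratio: N_h / N, where each entry of
    `hallucinated` is the indicator value for one generated sentence."""
    if not hallucinated:
        raise ValueError("SHR is undefined for an empty response")
    return sum(hallucinated) / len(hallucinated)

# Example: 2 of 5 generated sentences were flagged as hallucinated.
print(shr([False, True, False, True, False]))  # 0.4
```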
2. Computation Pipeline and Annotation Schemes
The computation of SHR involves distinct annotation and validation pipelines tailored to the modality of the generated response:
- Multimodal and Vision-LLMs:
- Sentence Segmentation: Responses are partitioned into sentences at boundary punctuation (typically periods).
- Object Extraction: Each sentence is parsed to extract mentioned object entities using scene-graph parsing and standard NLP preprocessing (POS filtering, lemmatization).
- Validation Against Visual Inputs: Each extracted entity is validated using two open-vocabulary detectors (e.g., GroundingDINO and YOLO-World). Entities confirmed present by both detectors are factual; those confirmed absent by both are hallucinated; detector disagreement yields “uncertain,” which is ignored in sentence classification.
- Sentence Labeling: Sentences containing at least one hallucinated object are tagged as hallucinated; sentences mentioning only factual or uncertain entities are considered non-hallucinated (Peng et al., 16 Jul 2025). This decision rule, together with the ASR variant, is sketched in code after this list.
- Sentence-level AI Feedback Annotation:
- Classical Approach: Sentence-level labels are generated by proprietary models (e.g., GPT-4/GPT-4V) given access to image annotations; these models provide binary hallucination judgments and severity scores (0–3).
- Learned Sentence-level Detector (H-DER): Detection models such as InternVL-Chat-Plus fine-tuned via LoRA replicate the annotation protocol, outputting hallucination type (<object>, <attribute>, <relationship>), explanation, and severity (Xiao et al., 2024).
- Automatic Speech Recognition:
- Alignment of Hypothesis & Reference: Each ASR output is aligned with a human-annotated reference via edit distance.
- Hallucination Decision: Insertions, unjustified substitutions, and semantically incoherent tokens trigger a hallucination label for the entire sentence (Koudounas et al., 18 Oct 2025); see the sketch after this list.
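The decision rules in the pipelines above can be made concrete with a short sketch. The verdict-aggregation and alignment logic below are plausible readings of the cited protocols, not reference implementations; the function names and the use of `difflib` as an edit-distance aligner are assumptions.

```python
import difflib
from typing import Iterable, List, Literal

Verdict = Literal["present", "absent", "uncertain"]

def entity_verdict(detector_hits: Iterable[bool]) -> Verdict:
    """Combine per-detector decisions for one entity: both open-vocabulary
    detectors must agree for a definitive verdict; disagreement is 'uncertain'."""
    hits = list(detector_hits)
    if all(hits):
        return "present"
    if not any(hits):
        return "absent"
    return "uncertain"

def sentence_is_hallucinated(verdicts: Iterable[Verdict]) -> bool:
    """A sentence is hallucinated if at least one mentioned entity is
    confirmed absent; 'uncertain' entities are ignored."""
    return any(v == "absent" for v in verdicts)

def asr_sentence_is_hallucinated(reference: List[str], hypothesis: List[str]) -> bool:
    """ASR variant: flag the sentence if alignment against the reference
    reveals inserted or substituted tokens. This is a crude edit-distance
    proxy; it does not judge whether a substitution was 'unjustified'."""
    ops = difflib.SequenceMatcher(a=reference, b=hypothesis).get_opcodes()
    return any(tag in ("insert", "replace") for tag, *_ in ops)

# One confirmed object and one confirmed-absent object in a sentence:
print(sentence_is_hallucinated(
    [entity_verdict([True, True]), entity_verdict([False, False])]))  # True
# An inserted token in the ASR hypothesis:
print(asr_sentence_is_hallucinated(
    "the patient was stable".split(), "the young patient was stable".split()))  # True
```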
3. Severity-aware Extensions and Weighted Variants
To address the binary coarseness of classic SHR, severity-aware variants have been developed (Xiao et al., 2024). Each sentence $s_i$ receives a severity score $w_i \in \{0, 1, 2, 3\}$, and the length-weighted hallucination score for a response is:

$$\mathrm{SHR}_{\mathrm{sev}} = \frac{\sum_{i=1}^{N} w_i\,\ell_i}{\sum_{i=1}^{N} \ell_i},$$

where $\ell_i$ is the token length of sentence $s_i$. Setting $w_i = 1$ for any hallucinated sentence and $0$ otherwise (with uniform segment lengths) recovers classic SHR. Severity-aware SHR emphasizes disproportionately harmful hallucinations by scaling their impact according to severity and segment length, which better separates major from minor errors (Xiao et al., 2024).
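A sketch of the severity-weighted score under the reconstruction above, assuming severity scores in {0, …, 3} and per-sentence token lengths; the exact normalization used by Xiao et al. (2024) may differ, and the function name is illustrative.

```python
from typing import Sequence

def severity_weighted_shr(severities: Sequence[int], lengths: Sequence[int]) -> float:
    """Length-weighted, severity-aware SHR: sum(w_i * l_i) / sum(l_i).
    With w_i in {0, 1} and uniform lengths this reduces to classic SHR."""
    if len(severities) != len(lengths) or not lengths:
        raise ValueError("severities and lengths must be equal-length and non-empty")
    return sum(w * l for w, l in zip(severities, lengths)) / sum(lengths)

# A minor hallucination (severity 1, 12 tokens) and a severe one
# (severity 3, 8 tokens) among three clean sentences:
print(severity_weighted_shr([0, 1, 0, 3, 0], [10, 12, 9, 8, 11]))  # 0.72
```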
4. Empirical Behavior and Benchmarking
Reported SHR values across modalities and architectures illustrate its sensitivity to hallucination mitigation strategies:
| Model | Dataset | Baseline SHR | SHR After Mitigation | Relative Reduction |
|---|---|---|---|---|
| LLaVA-v1.5-7B | Object HalBench | 0.527 | 0.043 | ~91.8% |
| LLaVA-v1.5-13B | Object HalBench | 0.460 | 0.033 | ~92.8% |
| LLaVA | AMBER (CHAIR_s) | 0.463 | 0.053 | ~89% |
| LLaVA-1.5 | MMHal-Bench | 0.57 | 0.48 | ~15.8% |
| ASR example (Koudounas et al., 18 Oct 2025) | Insertion-only toy set | 0.4 | N/A | N/A |
Empirical studies demonstrate that methods such as SENTINEL (Peng et al., 16 Jul 2025) and HSA-DPO (Xiao et al., 2024) consistently reduce SHR by more than 90% in relative terms in vision-captioning settings, confirming the utility of sentence-level early intervention and severity-weighted preference learning.
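The relative-reduction column follows directly from the baseline and post-mitigation values; a one-line check:

```python
def relative_reduction(baseline: float, mitigated: float) -> float:
    """Relative SHR reduction: (baseline - mitigated) / baseline."""
    return (baseline - mitigated) / baseline

print(f"{relative_reduction(0.527, 0.043):.1%}")  # 91.8% (LLaVA-v1.5-7B row)
```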
5. Practical Significance, Limitations, and Domain Nuances
SHR is favored for its interpretability—reporting “the fraction of sentences containing any hallucination”—and its diagnostic utility in high-stakes applications such as medical ASR transcription (Koudounas et al., 18 Oct 2025). Its use encourages strict monitoring policies (“never hallucinate”) and enables post-hoc analysis by examining the underlying hallucination dimensions (lexical, phonetic, morphological, semantic).
However, several nuances and limitations are salient:
- Binary Coarseness: SHR does not differentiate minor from major errors once a sentence is flagged; severity-aware variants partially address this issue (Xiao et al., 2024).
- Detector/Annotation Accuracy: SHR is highly sensitive to the precision and recall of the object detectors or annotation models; detector bias or false positives/negatives directly inflate or deflate the reported ratio (Peng et al., 16 Jul 2025).
- Domain Shift: Detector generalization to rare or specialized domains is limited; domain-specific errors can affect SHR reliability.
- Sentence Segmentation and Thresholds: Both the segmentation protocol and detection confidence thresholds affect the quantitative SHR value, requiring consistency across comparative studies.
- No Error Decomposition: SHR does not disaggregate hallucinations by error type; multi-dimensional scores such as SHALLOW provide complementary insights (Koudounas et al., 18 Oct 2025).
6. Methodological Connections and Evaluation Strategies
SHR integrates with broader methodologies involving preference learning frameworks and automated annotation pipelines:
- Iterative Contextual Bootstrapping (ICB): Utilizing context-coherent non-hallucinated sentences to iteratively build robust preference data and enhance sentence-level discrimination in training (Peng et al., 16 Jul 2025).
- Detect-then-Rewrite: Employing sentence-level hallucination detection as a filtering mechanism in constructing training datasets and guiding model refinements (Xiao et al., 2024); see the sketch after this list.
- Benchmark Alignment: Quantitative SHR evaluations are conducted on held-out datasets such as Object HalBench, AMBER’s generative split, HallusionBench (image-context reasoning), and MHaluBench, enabling fair cross-model comparisons by fixing annotation schemes, detectors, and thresholds (Peng et al., 16 Jul 2025, Xiao et al., 2024).
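As a rough illustration of the detect-then-rewrite pattern referenced above, the sketch below builds preference pairs from a sentence-level detector and a rewriter. Both callables are stand-ins, not APIs from the cited work, and the pairing scheme is an assumption about how such data feeds DPO-style training.

```python
from typing import Callable, Iterable, List, Tuple

def build_preference_pairs(
    sentences: Iterable[str],
    is_hallucinated: Callable[[str], bool],  # stand-in sentence-level detector
    rewrite: Callable[[str], str],           # stand-in grounded rewriter
) -> List[Tuple[str, str]]:
    """Detect-then-rewrite sketch: each flagged sentence yields a
    (chosen, rejected) pair, with the grounded rewrite preferred over
    the original hallucinated sentence."""
    return [(rewrite(s), s) for s in sentences if is_hallucinated(s)]
```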
In summary, the Sentence-level Hallucination Ratio (SHR) is a robust, reproducible metric for quantifying the prevalence of hallucinated sentences in model output, distinguished by its generality across modalities and its role in benchmarking, training optimization, and risk analysis. Severity-graded extensions and automated pipelines further enhance its diagnostic resolution and practical relevance in contemporary multimodal and ASR modeling (Peng et al., 16 Jul 2025, Xiao et al., 2024, Koudounas et al., 18 Oct 2025).