SHROOM-CAP 2025: Multilingual Hallucination Detection

Updated 30 November 2025
  • SHROOM-CAP 2025 Shared Task is a benchmark challenge that evaluates precise span-level hallucination detection in multilingual LLM outputs using binary and probabilistic labels.
  • The task advances LLM evaluation by incorporating diverse datasets, rigorous annotation protocols, and nuanced retrieval-augmented methods across 14 languages.
  • Key methodologies include prompt engineering, ensemble strategies, and expansive data-centric approaches that significantly improve IoU metrics and factual accuracy assessment.

The SHROOM-CAP 2025 Shared Task is a benchmark challenge centered on the detection of hallucinations in the outputs of instruction-tuned LLMs across a variety of languages. Hallucination is defined as fluent but factually incorrect (i.e., ungrounded or fabricated) information generated by LLMs in response to specific questions or prompts. The shared task operationalizes detection at the span level, requiring systems not only to determine whether an output contains hallucination but also to localize the exact character- or token-level spans corresponding to such errors, and, in some tracks, to assign confidence scores reflecting uncertainty or inter-annotator agreement. The 2025 edition, corresponding to Mu-SHROOM at SemEval-2025, has established itself as the central international benchmark for multilingual and fine-grained hallucination detection in generated scientific and general-purpose text (Vázquez et al., 16 Apr 2025).

1. Problem Definition and Motivation

The central challenge addressed by SHROOM-CAP 2025 is the automatic identification of factual errors (hallucinations) in LLM-generated texts, with emphasis on precise localization. Hallucinations in instruction-following generation undermine downstream applications in scientific writing, factual summarization, and knowledge extraction, as these errors “snowball” and propagate if undetected (Vázquez et al., 16 Apr 2025). Prior to SHROOM-CAP and Mu-SHROOM, most research restricted itself to English, to sentence- or document-level binary detection, and to monolithic datasets, limiting meaningful progress on both fine-grained localization and multilingual robustness.

SHROOM-CAP addresses these deficits by:

  • Extending span-labeling to character- and token-levels, inherently supporting nuanced detection of partially hallucinated answers.
  • Benchmarking across 14 typologically diverse languages, including high-resource and low-resource settings (e.g., Basque, Farsi, Hindi, Chinese).
  • Focusing on LLMs in general-purpose settings with real-world, challenging prompts, minimizing domain or model specificity.
  • Using multi-annotator, probabilistic labels to accommodate disagreement and ambiguity intrinsic to the hallucination concept (Vázquez et al., 16 Apr 2025).

2. Dataset Construction and Annotation Protocols

The SHROOM-CAP 2025 dataset construction integrates extensive multilingual Wikipedia curation, meticulous prompt and answer generation, and high-fidelity annotation:

  • Source Material: Approximately 762 Wikipedia articles available in at least three of 14 target languages.
  • Data Generation: For each language, task organizers authored one closed, factual question per page, then generated candidate LLM answers using 38 different open-weight models under diverse decoding configurations.
  • Annotation: After a relevant and fluent answer was selected for each question, each response was annotated independently by at least three annotators (up to 12 for English and Chinese), who marked minimal character-level hallucinated spans: only content that cannot be traced back to the reference Wikipedia page or other verifiable sources was marked. Annotators were instructed to be as conservative as possible and to focus on content words.
  • Probabilistic Gold Standard: Gold labels per span are aggregated as probabilities (fraction of annotators marking each character), supporting both binary (“hard”) and confidence-weighted (“soft”) supervision and evaluation (Vázquez et al., 16 Apr 2025).
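
The probabilistic aggregation step can be illustrated with a short sketch. The snippet below is a simplification, not the organizers' released tooling; the function and variable names are illustrative. It converts per-annotator character spans into the per-character soft labels and majority-vote hard labels described above.

```python
from typing import List, Tuple

def aggregate_soft_labels(text: str, annotations: List[List[Tuple[int, int]]]):
    """Turn per-annotator hallucination spans into per-character labels.

    `annotations` holds, for each annotator, the (start, end) character spans
    (end exclusive) they marked as hallucinated. The soft label of a character
    is the fraction of annotators who marked it; the hard label keeps only
    characters marked by a strict majority.
    """
    n_annotators = len(annotations)
    counts = [0] * len(text)
    for spans in annotations:
        for start, end in spans:
            for i in range(start, min(end, len(text))):
                counts[i] += 1
    soft = [c / n_annotators for c in counts]
    hard = [p > 0.5 for p in soft]
    return soft, hard

# Toy example: two of three annotators mark "in 1789" as hallucinated.
text = "The tower was built in 1789."
soft, hard = aggregate_soft_labels(text, [[(20, 27)], [(20, 27)], []])
# soft[20:27] == [2/3] * 7, hard[20:27] == [True] * 7
```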

The dataset comprises 10 languages with both validation and test partitions (AR, DE, EN, ES, FI, FR, HI, IT, SV, ZH), plus 4 “surprise” test-only languages (CA, CS, EU, FA).

Inter-annotator agreement is measured by span-level intersection-over-union (IoU), with language-specific mean IoU ranging from 0.45 (Spanish) to 0.85 (Italian), reflecting typical boundary uncertainty for hallucination spans.

3. Evaluation Metrics and Ranking Procedures

SHROOM-CAP 2025 formalizes evaluation around two primary metrics, both computed at the character level, and further applies language-wise and average-based ranking schemes:

  • Intersection-over-Union (IoU):

$$\mathrm{IoU}(\hat{C}, C) = \frac{|\hat{C} \cap C|}{|\hat{C} \cup C|}$$

where $C$ is the set of gold-labeled hallucinated character indices and $\hat{C}$ is the predicted set.

  • Correlation (Spearman’s $\rho$):

$$\rho = \mathrm{Spearman}\bigl((\Pr(c_1), \ldots, \Pr(c_n)),\ (p(c_1), \ldots, p(c_n))\bigr)$$

where $\Pr(c_j)$ is the empirical annotation probability for character $j$ and $p(c_j)$ is the system’s predicted probability.

Final system ranking is conducted by best mean IoU, with correlation used as secondary tie-breaker. All systems must output both “hard” (binary) and “soft” (probabilistic) span predictions (Vázquez et al., 16 Apr 2025).
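
Both metrics can be computed directly from system outputs. The sketch below is illustrative only: the handling of the degenerate case where both the gold and predicted span sets are empty is an assumption rather than a detail taken from the official scorer, and the probability values are made up. The rank correlation uses scipy.

```python
from scipy.stats import spearmanr

def char_iou(pred: set, gold: set) -> float:
    """Character-level IoU; treating empty vs. empty as a perfect match (IoU = 1) is assumed."""
    if not pred and not gold:
        return 1.0
    return len(pred & gold) / len(pred | gold)

# "Hard" evaluation: sets of hallucinated character indices.
gold_idx = set(range(20, 27))
pred_idx = set(range(18, 27))
print(f"IoU = {char_iou(pred_idx, gold_idx):.3f}")  # 7 / 9 ≈ 0.778

# "Soft" evaluation: per-character probabilities (illustrative values).
gold_probs = [0.0, 0.0, 0.33, 0.67, 1.0, 1.0, 0.67, 0.0]
pred_probs = [0.1, 0.0, 0.40, 0.55, 0.9, 0.8, 0.50, 0.2]
rho, _ = spearmanr(gold_probs, pred_probs)
print(f"Spearman rho = {rho:.3f}")
```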

4. Participating System Architectures and Methodological Advancements

The SHROOM-CAP 2025 task attracted 43 teams and 2,618 submissions, yielding a representative taxonomy of approach classes:

| Approach Class | Key Methods and Innovations |
| --- | --- |
| Data-Centric | Dataset unification, class balancing (e.g., “AGI” team’s 172x data scale-up, 1:1 label balancing) (Rathva et al., 23 Nov 2025) |
| Retrieval-Augmented | Evidence retrieval (RAG systems), Wikipedia/Google CSE context, fact-claim extraction (Hong et al., 2 Mar 2025; Huang et al., 5 May 2025) |
| Prompt Engineering | Task-specific, multilingual prompts; few-shot demonstration; structured probability tiers (Hikal et al., 27 May 2025) |
| Uncertainty/Ensemble | LLM ensemble adjudication, pseudo-crowd voting, weak labeling functions (Hikal et al., 27 May 2025; Hong et al., 2 Mar 2025) |

Notable systems include:

  • UCSC: Three-stage pipeline integrating retrieval (Perplexity Sonar, KG-based), LLM/fact detection via prompt-optimized extraction, and span mapping via substring or edit alignment. Prompt optimization (MiPROv2) is used for maximizing IoU on validation (Huang et al., 5 May 2025).
  • MSA: Weak labeling with heuristic span detection, strong prompt templates for span extraction, and ensemble adjudication with probabilistic majority voting; post-processing includes fuzzy Levenshtein span alignment (Hikal et al., 27 May 2025).
  • NCL-UoR: Modified RefChecker and SelfCheckGPT using external retrieval, structured prompt-based verification, and iterative consensus probability assignment to generate calibrated soft labels (Hong et al., 2 Mar 2025).
  • keepitsimple: LLM uncertainty quantification via stochastically sampled response divergence; entropy-based span marking without auxiliary detector fine-tuning (Vemula et al., 23 May 2025).
  • “AGI” Team: Data-centric focus; unified and balanced training corpus (124,821 samples; 50% correct/hallucinated); class-weighted cross-entropy XLM-RoBERTa fine-tuning; demonstrated strong zero-shot transfer, notably 2nd in Gujarati (F1=0.5107) (Rathva et al., 23 Nov 2025).
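
For the class-weighted fine-tuning strategy described for the “AGI” team, a minimal sketch follows. It is not the team's released code: the checkpoint, weight values, and label-projection heuristic are assumptions, and it only illustrates how character-level labels can be projected onto tokens and combined with a weighted cross-entropy loss.

```python
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "xlm-roberta-base"  # assumed base checkpoint, not necessarily the one used
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

# Class weights counteract label imbalance (hallucinated tokens are rarer);
# the values here are illustrative, not tuned.
class_weights = torch.tensor([1.0, 3.0])
loss_fn = torch.nn.CrossEntropyLoss(weight=class_weights, ignore_index=-100)

def training_step(text: str, char_labels: list) -> torch.Tensor:
    """Project per-character labels (0/1) onto tokens and compute a
    class-weighted cross-entropy loss for one example."""
    enc = tokenizer(text, return_offsets_mapping=True, return_tensors="pt",
                    truncation=True)
    offsets = enc.pop("offset_mapping")[0]
    token_labels = []
    for start, end in offsets.tolist():
        if start == end:  # special tokens get ignored by the loss
            token_labels.append(-100)
        else:
            # A token counts as hallucinated if any covered character is marked.
            token_labels.append(int(any(char_labels[start:end])))
    labels = torch.tensor([token_labels])
    logits = model(**enc).logits  # shape (1, seq_len, 2)
    return loss_fn(logits.view(-1, 2), labels.view(-1))
```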

5. Empirical Results and Error Analysis

A comprehensive analysis of system performance on SHROOM-CAP 2025 reveals:

  • Best-in-class IoU for top teams reaches 0.736 (Italian, GPT-4o extractor), 0.684 (Hindi, Qwen-2.5), and 0.669 (Arabic, Gemini-2.0-Flash-Exp) (Hikal et al., 27 May 2025).
  • Data-centric strategies decisively outperform naive architectural changes or single-resource setups. For example, the “AGI” team reports a 172-fold data increase and class balance, lifting F1 by 5–15 points over translation-based augmentation (Rathva et al., 23 Nov 2025).
  • Retrieval augmentation (RAG) is strongly associated with higher IoU ($p < 10^{-59}$), and is especially effective in high-resource languages and on challenging inputs (Vázquez et al., 16 Apr 2025).
  • Prompt and span-labeling design: Explicit instructions, few-shot exemplars, and probability tiers increase recall and label calibration (e.g., MSA’s 75% exact-span recall during validation) (Hikal et al., 27 May 2025).
  • Cross-lingual and low-resource transfer: Systems leveraging scale and diversity, as well as careful prompt engineering, demonstrate “zero-shot uplift”—notably, the “AGI” data-centric system achieves second place in Gujarati, a zero-shot language (Rathva et al., 23 Nov 2025).
  • Error drivers: High inter-annotator disagreement, incomplete retrieval in low-resource settings, and fuzzy boundaries lead to mislocalized or missed hallucinated spans. Fuzzy matching post-processing recovers 5 absolute IoU points in high-morphology languages (Hikal et al., 27 May 2025).
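
As a concrete illustration of the fuzzy span-alignment post-processing mentioned above, the sketch below maps an extractor's span string back onto character offsets in the model output. It uses difflib's longest-common-substring matching rather than a true Levenshtein alignment, and the threshold is an arbitrary assumption rather than a value reported by any team.

```python
from difflib import SequenceMatcher

def align_span(output_text: str, span_text: str, min_ratio: float = 0.8):
    """Locate the best fuzzy match of `span_text` inside `output_text` and
    return its (start, end) character offsets, or None if too dissimilar."""
    matcher = SequenceMatcher(None, output_text, span_text, autojunk=False)
    match = matcher.find_longest_match(0, len(output_text), 0, len(span_text))
    if match.size == 0:
        return None
    start, end = match.a, match.a + match.size
    ratio = match.size / max(len(span_text), 1)
    return (start, end) if ratio >= min_ratio else None

# Example: map the extracted span string back to character offsets.
out = "The Eiffel Tower was completed in 1899 by Gustave Eiffel."
print(align_span(out, "completed in 1899"))  # (21, 38)
```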

6. Scientific and Methodological Implications

Key findings and recommendations from SHROOM-CAP 2025:

  • Scale and balance in data are collectively more influential than architecture; class bias elimination is critical for robust boundary learning (Rathva et al., 23 Nov 2025).
  • External knowledge retrieval is necessary for performance parity with human annotators. Purely model-internal methods (e.g., token logit anomaly, uncertainty) are consistently outperformed by systems leveraging Wikipedia or web search (Vázquez et al., 16 Apr 2025, Hong et al., 2 Mar 2025).
  • Prompt engineering and multilingual adaptation: Task- and language-specific templates with explicit annotation definitions substantially increase recall and precision; adaptive tuning per language further reduces error variance (Hikal et al., 27 May 2025, Hong et al., 2 Mar 2025).
  • Ensemble strategies: Simulated multi-annotator confidence voting (via multiple LLM runs or diverse models) mimics the annotation aggregation process and yields more stable soft-label outputs (Hikal et al., 27 May 2025).
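
The pseudo-crowd voting idea can be sketched by averaging the binary span predictions of several independent runs into per-character soft labels, mirroring the human aggregation scheme of Section 2. This is a simplified illustration; the run count and threshold are assumptions.

```python
from typing import List, Set

def ensemble_soft_labels(text_len: int, runs: List[Set[int]], threshold: float = 0.5):
    """Average binary character predictions from multiple LLM runs.

    `runs` holds one set of predicted hallucinated character indices per run.
    Soft labels are the fraction of runs flagging each character; hard labels
    apply a majority threshold, emulating multi-annotator aggregation.
    """
    soft = [sum(i in run for run in runs) / len(runs) for i in range(text_len)]
    hard = [p >= threshold for p in soft]
    return soft, hard

# Three stochastic runs of the same detector over a 30-character output.
runs = [set(range(10, 18)), set(range(10, 20)), set(range(12, 18))]
soft, hard = ensemble_soft_labels(30, runs)
```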

7. Limitations and Prospects for Future Research

Despite methodological advances, SHROOM-CAP 2025 also highlights persistent challenges and future directions:

  • Annotation ambiguity: Even with rigorous protocol, boundary disagreement remains significant, especially at high annotator counts (up to 12 in EN/ZH). Improving guidelines or introducing adjudication phases may help (Vázquez et al., 16 Apr 2025).
  • Domain generalization: The current dataset is restricted to factual QA grounded in Wikipedia. Extension to domains such as open-ended dialog, clinical, or legal text is required for broader impact (Vázquez et al., 16 Apr 2025).
  • Fluency vs. factuality discrimination: Most annotated hallucinations are factual errors rather than fluency errors, but systems must be able to disentangle these phenomena in real-world applications (Rathva et al., 23 Nov 2025).
  • Calibration and robustness: Further developments in probabilistic output calibration (metrics beyond Spearman’s $\rho$) and the integration of metadata, such as model logits or abstract type, are recommended (Rathva et al., 23 Nov 2025).
  • Hybrid and corrective pipelines: The integration of detection with automated hallucination correction (rewriting or retrieval feedback) and real-time intervention in LLM pipelines is a key avenue for practical deployment (Vázquez et al., 16 Apr 2025).

A plausible implication is that systematic data-centric design, cross-lingual prompt engineering, and probabilistic span aggregation together define the methodological frontier for robust, multilingual hallucination detection, as embodied by the diverse, high-performing systems submitted to SHROOM-CAP 2025.
