FACTS Grounding Benchmark Overview

Updated 31 October 2025
  • FACTS Grounding Benchmark is a framework that rigorously evaluates LLMs' ability to generate long-form responses strictly grounded in supplied context documents.
  • The benchmark utilizes diverse, extended context inputs (up to 32k tokens) from domains like finance, legal, and medical to stress-test factual accuracy.
  • It employs a two-stage evaluation with ensemble LLM judges to differentiate between mere instruction following and true, context-supported factuality.

The FACTS Grounding Benchmark is an evaluation framework and leaderboard that measures the ability of LLMs to generate long-form responses that are factually accurate and strictly grounded in a supplied context document. It emphasizes high-fidelity attribution in extended-generation scenarios, addressing the persistent challenge of hallucination and ungrounded outputs in LLMs. Each prompt in the benchmark contains a user request paired with a full-length document (up to 32k tokens) and demands output that not only meets the user’s informational needs but also adheres to stringent constraints: all substantive claims must be directly supported by the provided context, and no extraneous knowledge may be introduced.

1. Benchmark Motivation and Task Framing

The FACTS Grounding Benchmark is motivated by the need to rigorously validate grounded factuality in LLM outputs, especially for responses to long-form, information-intensive prompts. Prior benchmarks largely focus on short-form tasks or summarization, which inadequately stress models’ ability to maintain grounding over extended context. FACTS confronts models with prompts and context documents spanning up to 32,000 tokens across a wide spectrum of domains (finance, legal, medical, technical), demanding that every claim be attributable to the input. The two-stage evaluation distinguishes between mere instruction following and genuine grounding, penalizing both hallucination and vacuous compliance.

2. Dataset Composition and Prompt Structure

Each example in the benchmark comprises three core components:

  • System Instruction: a directive to answer only from the provided context and not to draw on outside knowledge.
  • Context Document: a document of roughly 0.9k–32k tokens drawn from diverse, realistic domains.
  • User Request: a complex task requiring synthesis, extraction, or analysis.

Prompts are constructed to elicit substantive, context-dependent responses, excluding purely creative or open-ended generation tasks. The focus is on extraction, comparative analysis, and reasoning grounded exclusively in the input document. All data is human-authored and domain-heterogeneous, supporting broad generalization assessment.
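
A minimal sketch of how one benchmark example could be represented and assembled into a model prompt is shown below. The dataclass fields mirror the three components above, but the field names and the prompt template are illustrative assumptions rather than the benchmark's actual schema.

```python
from dataclasses import dataclass


@dataclass
class FactsExample:
    """One benchmark example; field names are assumptions for illustration."""
    system_instruction: str  # e.g., "Answer only from the provided context; do not use outside knowledge."
    context_document: str    # ~0.9k-32k tokens of source material (finance, legal, medical, technical, ...)
    user_request: str        # task requiring synthesis, extraction, or analysis


def build_prompt(example: FactsExample) -> str:
    """Assemble the three components into a single grounded-generation prompt."""
    return (
        f"{example.system_instruction}\n\n"
        f"=== CONTEXT DOCUMENT ===\n{example.context_document}\n\n"
        f"=== USER REQUEST ===\n{example.user_request}"
    )
```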

3. Grounding and Factuality Criteria

Stringent grounding requirements underpin the benchmark:

  • Positive (“accurate”) label: Every substantive claim in the response must be exactly attributable to the context, except “no_rad” utterances (e.g., “Thank you”) that lack informational content.
  • Negative (“inaccurate”) label: Any claim that is absent from the context or contradicted by it, or any vague or unsubstantiated content, disqualifies the response.
  • Eligibility: Responses must meaningfully address the user’s request, not merely repeat document content or answer trivially.

The evaluation pipeline is bifurcated:

  1. Eligibility Filtering: Using ensemble LLM judges, responses failing to fulfill the request are masked as ineligible.
  2. Factuality Scoring: Only eligible responses undergo grounding verification; responses must not draw on external information.
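
A minimal sketch of this two-stage pipeline is given below, assuming each judge is wrapped as a simple boolean-returning callable (the LLM calls themselves are not shown, and the consensus rule for eligibility follows the description in Section 4).

```python
from typing import Callable, Dict, List

# Each judge wraps an LLM call and returns a boolean verdict; the wrappers are assumed.
EligibilityJudge = Callable[[str, str], bool]  # (user_request, response) -> addresses the request?
GroundingJudge = Callable[[str, str], bool]    # (context_document, response) -> fully supported?


def evaluate_response(
    user_request: str,
    context_document: str,
    response: str,
    eligibility_judges: List[EligibilityJudge],
    grounding_judges: List[GroundingJudge],
) -> Dict[str, object]:
    """Two-stage evaluation: (1) eligibility filtering, (2) factuality scoring.

    A response is marked ineligible only if every eligibility judge rejects it
    (consensus rule); only eligible responses proceed to grounding verification.
    """
    eligible = any(judge(user_request, response) for judge in eligibility_judges)

    # Stage 2: each grounding judge checks that every substantive claim in the
    # response is supported by the context document.
    per_judge_accurate = (
        [judge(context_document, response) for judge in grounding_judges] if eligible else []
    )
    return {"eligible": eligible, "per_judge_accurate": per_judge_accurate}
```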

4. Evaluation Methodology and Scoring Formulas

FACTS leverages a rigorously validated “LLM-as-a-Judge” procedure, employing an ensemble of leading models (Gemini 1.5 Pro, Claude 3.5 Sonnet, GPT-4o) and purpose-tuned judge prompt templates.

Scoring formulas follow:

  • Let $N$ be the number of test examples.
  • For each judge model $j$, the per-judge score is $S_j = \frac{\text{Number of accurate responses}}{N}$.
  • Unadjusted factuality score:

$$\text{Unadjusted Factuality Score} = \frac{1}{3}\sum_{j=1}^{3} S_j$$

  • Final factuality score:

$$\text{Final Factuality Score} = \frac{\text{Number of accurate and eligible responses}}{N}$$

Ineligibility is resolved by consensus: a response is marked ineligible only if all three judge models agree. The ensemble strategy mitigates self-judging bias (an empirically observed self-preference of roughly +3.2% when a model judges its own outputs) and yields greater robustness than single-model evaluation.
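
The sketch below shows how these scores could be computed from the per-judge verdicts produced by the pipeline above. The inputs are assumed data structures; in particular, treating a response as accurate when a majority of judges agree is an assumption, since the text only specifies that scores are averaged across judges.

```python
from typing import Dict, List


def factuality_scores(
    per_judge_accurate: List[List[bool]],  # shape: [num_judges][num_examples]
    eligible: List[bool],                  # consensus eligibility per example
) -> Dict[str, object]:
    """Compute the per-judge, unadjusted, and final factuality scores.

    per_judge_accurate[j][i] is True if judge j labelled example i's response
    as fully grounded ("accurate"); eligible[i] is False only when all judges
    agreed that the response did not address the request.
    """
    num_judges = len(per_judge_accurate)
    n = len(eligible)

    # Per-judge score S_j: fraction of responses the judge labelled accurate.
    s = [sum(labels) / n for labels in per_judge_accurate]

    # Unadjusted factuality score: mean of the per-judge scores.
    unadjusted = sum(s) / num_judges

    # Final factuality score: count responses that are both eligible and
    # accurate (here, accurate for a majority of judges -- an assumption).
    accurate_and_eligible = sum(
        1
        for i in range(n)
        if eligible[i]
        and sum(per_judge_accurate[j][i] for j in range(num_judges)) > num_judges / 2
    )
    final = accurate_and_eligible / n

    return {"per_judge": s, "unadjusted": unadjusted, "final": final}
```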

Leaderboard rankings use Condorcet rank aggregation, averaging scores across public and private splits and all judges. Confidence intervals are reported.
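
Condorcet-style aggregation can be illustrated with a simple pairwise-preference tally over several ranked lists (for example, one list per judge and per split). The function below is a generic Copeland-style sketch of that idea, not the leaderboard's exact procedure.

```python
from itertools import combinations
from typing import Dict, List


def aggregate_rankings(rankings: List[List[str]]) -> List[str]:
    """Aggregate several ranked lists of model names into one ordering.

    Each input list ranks the same set of models from best to worst. A model
    earns a pairwise win over another model when more input lists rank it
    higher; models are then ordered by total pairwise wins (a Copeland-style
    approximation of Condorcet aggregation).
    """
    models = rankings[0]
    wins: Dict[str, int] = {m: 0 for m in models}

    for a, b in combinations(models, 2):
        a_pref = sum(1 for r in rankings if r.index(a) < r.index(b))
        b_pref = len(rankings) - a_pref
        if a_pref > b_pref:
            wins[a] += 1
        elif b_pref > a_pref:
            wins[b] += 1

    return sorted(models, key=lambda m: wins[m], reverse=True)


# Example: three judges produce slightly different orderings.
print(aggregate_rankings([
    ["model_a", "model_b", "model_c"],
    ["model_a", "model_c", "model_b"],
    ["model_b", "model_a", "model_c"],
]))  # -> ['model_a', 'model_b', 'model_c']
```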

5. Model Performance and Empirical Insights

Empirical evaluation reveals substantial gaps between current state-of-the-art models and ideal factuality:

  • Top-performing models (Gemini 2.0 Flash Experimental, Gemini 1.5 Flash, Gemini 1.5 Pro) achieve the highest leaderboard scores.
  • Eligibility filtering impacts ranking: excluding ineligible responses (those not meaningfully attending to the prompt) reduces factuality scores by 1–5% and may change leaderboard order.
  • Even the strongest LLMs struggle to meet strict grounding requirements on complex, long-context prompts. Models frequently hallucinate or offer partially grounded answers under the stress of large input contexts and multifaceted user requests.
  • The strict criteria make FACTS diagnostic and challenging, offering significantly greater separability between models than benchmarks focused on short-form or single-task factuality.

6. Comparison with Prior Methods and Benchmarks

FACTS Grounding diverges from earlier benchmarks in several respects:

  • Evaluates long-form, context-dependent generation rather than short-form QA or summarization.
  • Enforces attribution at the sentence or span level, invoking consensus of multiple LLM judges for increased reliability.
  • Public (“open”) and private (“blind”) splits safeguard against leaderboard overfitting and gaming.
  • Exposes model weaknesses in contexts that mirror enterprise and real-world deployments.

The methodology counteracts prevalent evaluation deficiencies:

  • Mitigates judge bias by ensemble approaches.
  • Discourages models from submitting vacuous outputs that are technically grounded but uninformative.
  • Applies a two-stage pipeline to separate instruction following (“does it answer the user’s request?”) from strict factuality/grounding.

7. Impact, Applications, and Directions

The FACTS Grounding Benchmark sets a rigorous standard for model evaluation, with several explicit implications:

  • Enables quantifiable comparison of LLMs’ ability to generate context-faithful answers at scale, relevant to domains where hallucination or misattribution has high cost (medical, legal, financial).
  • Provides actionable guidance for model development—highlighting the limits of current approaches and identifying failure modes in real-world scenarios.
  • The ongoing leaderboard (https://www.kaggle.com/facts-leaderboard) supports continuous benchmarking, stimulating community progress and preventing stagnation or overfitting to any single test split.
  • Future directions include expanding domain diversity, refining judge templates, and exploring more granular grounding metrics (e.g., span-level attribution, semantic entailment frameworks).

Summary of key aspects:

  • Motivation: long-form, grounded factuality evaluation for LLMs.
  • Data: ~1,700 examples, 0.9k–32k tokens, multi-domain, expert-curated.
  • Scoring: multi-model LLM-judge ensemble, using the formulas above.
  • Leaderboard: active, with public/private splits and Condorcet aggregation.
  • Impact: reveals strengths and weaknesses of state-of-the-art models; a robust tool for high-fidelity attribution benchmarking.

A plausible implication is that continued advances in prompt engineering, context representation, and evaluation strategies will be necessary to close the gap revealed by FACTS. Strict, context-constrained grounding will remain a challenging and essential metric for evaluating next-generation LLMs (Jacovi et al., 6 Jan 2025).
