The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input (2501.03200v1)

Published 6 Jan 2025 in cs.CL

Abstract: We introduce FACTS Grounding, an online leaderboard and associated benchmark that evaluates LLMs' ability to generate text that is factually accurate with respect to given context in the user prompt. In our benchmark, each prompt includes a user request and a full document, with a maximum length of 32k tokens, requiring long-form responses. The long-form responses are required to be fully grounded in the provided context document while fulfilling the user request. Models are evaluated using automated judge models in two phases: (1) responses are disqualified if they do not fulfill the user request; (2) they are judged as accurate if the response is fully grounded in the provided document. The automated judge models were comprehensively evaluated against a held-out test-set to pick the best prompt template, and the final factuality score is an aggregate of multiple judge models to mitigate evaluation bias. The FACTS Grounding leaderboard will be actively maintained over time, and contains both public and private splits to allow for external participation while guarding the integrity of the leaderboard. It can be found at https://www.kaggle.com/facts-leaderboard.

The paper "The FACTS Grounding Leaderboard: Benchmarking LLMs' Ability to Ground Responses to Long-Form Input" by authors from Google DeepMind, Google Research, and Google Cloud introduces a benchmark and online leaderboard called FACTS Grounding to evaluate the ability of LLMs to generate factually accurate and contextually grounded responses to long-form input.

Key Components of FACTS Grounding

  1. Benchmark Structure: The benchmark tests the ability of LLMs to produce long-form responses that are fully grounded in a provided context document of up to 32,000 tokens. Each task pairs a user request with a full document that the model must address.
  2. Evaluation Process:
    • Phase 1: Responses that do not fulfill the user's request are disqualified.
    • Phase 2: Responses that pass Phase 1 are then evaluated for factual accuracy, confirming that they are fully grounded in the given document context.
  3. Automated Judge Models: The benchmark uses multiple automated judge models to evaluate responses. The judge prompt templates were selected through comprehensive evaluation against a held-out test set, and combining several judges makes the assessment more robust.
  4. Scoring Mechanism: The final factuality score for a model aggregates the factuality judgements made by multiple judge models, which minimizes individual judge biases and yields a more balanced evaluation (a minimal sketch of the two-phase judging and score aggregation follows this list).
  5. Leaderboard Characteristics: FACTS Grounding maintains an active leaderboard with public and private splits, encouraging external participation while protecting the integrity of benchmark results.
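
The following is a minimal sketch of how the two-phase judging and score aggregation described above could work. The judge callables here are hypothetical stand-ins for LLM-judge calls using the paper's prompt templates; this is not the authors' implementation.

```python
from statistics import mean
from typing import Callable, Dict, List

Example = Dict[str, str]     # keys: "user_request", "document", "response"
Judge = Dict[str, Callable]  # keys: "eligible", "grounded" -> bool-returning callables


def factuality_score(examples: List[Example], judges: List[Judge]) -> float:
    """Average per-judge accuracy; a response counts as accurate only if it
    (1) fulfills the user request and (2) is fully grounded in the document."""
    per_judge_scores = []
    for judge in judges:
        accurate = 0
        for ex in examples:
            # Phase 1: disqualify responses that do not fulfill the user request.
            if not judge["eligible"](ex["user_request"], ex["response"]):
                continue
            # Phase 2: count the response only if it is fully grounded.
            if judge["grounded"](ex["document"], ex["response"]):
                accurate += 1
        per_judge_scores.append(accurate / len(examples))
    # Aggregating over multiple judge models mitigates single-judge bias.
    return mean(per_judge_scores)


# Usage with trivial stand-in judges (real judges would be LLM calls):
if __name__ == "__main__":
    examples = [{"user_request": "Summarize the report.",
                 "document": "Revenue grew 10% in 2024.",
                 "response": "The report says revenue grew 10% in 2024."}]
    dummy_judge = {"eligible": lambda req, resp: True,
                   "grounded": lambda doc, resp: "10%" in resp}
    print(factuality_score(examples, [dummy_judge, dummy_judge]))  # -> 1.0
```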

Addressing Challenges in LLM Factuality

  • Factuality Scenarios: The research articulates two primary challenges in factuality:

    1. Context-grounded factuality, where the model must remain faithful to the input context.
    2. Factuality with respect to external sources or general world knowledge, which presents a different set of challenges.
  • Complexity in Modeling and Measurement: Ensuring factuality in LLM responses involves challenges in both modeling (architecture, training, and inference) and evaluation (data and metrics). LLM pretraining optimizes next-token prediction, which does not directly align with factual grounding objectives, and standard training does not by itself prevent non-factual generation, so additional post-training tuning is required.

  • Mitigation through Post-training and Inference Approaches: Techniques like supervised fine-tuning and reinforcement learning have been implemented to improve factuality. Inference-time strategies such as prompting techniques and model state interpretability are additional methods explored to address hallucinations and accuracy.

Construction of Data and Evaluation Methods

  • Diverse Annotation: The FACTS Grounding benchmark includes diverse prompts covering different document lengths and enterprise domains such as finance, technology, medicine, and law. Human annotators created the prompts to ensure diverse and complex inference requirements.
  • Document Sourcing and Validation: The authors describe a comprehensive validation process to ensure task quality and diversity. They also note potential contamination from training corpora and emphasize that the benchmark evaluates grounding in the provided document rather than reliance on pre-trained knowledge.
  • Automated Evaluation Metrics: The judge prompts and scoring methods were rigorously evaluated against a held-out test set to validate their performance, and scores from multiple judge models are aggregated to reduce evaluator bias (see the sketch below).
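
The sketch below illustrates, under assumed names, how a judge prompt template might be validated against a held-out, human-labeled test set by measuring agreement with the human labels; the actual validation process in the paper is more elaborate.

```python
from typing import Callable, List, Tuple

# A labeled example: (user_request, document, response, human grounding label)
LabeledExample = Tuple[str, str, str, bool]


def judge_accuracy(judge: Callable[[str, str, str], bool],
                   labeled: List[LabeledExample]) -> float:
    """Fraction of held-out examples where the judge's grounding verdict
    matches the human label."""
    correct = sum(judge(req, doc, resp) == label
                  for req, doc, resp, label in labeled)
    return correct / len(labeled)


def pick_best_template(templates: List[str],
                       make_judge: Callable[[str], Callable[[str, str, str], bool]],
                       labeled: List[LabeledExample]) -> str:
    """Select the prompt template whose judge agrees most with human labels.
    `make_judge` is a hypothetical factory that builds an LLM judge from a template."""
    return max(templates, key=lambda t: judge_accuracy(make_judge(t), labeled))
```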

Conclusion and Future Directions

The paper suggests that the FACTS Grounding leaderboard fills a gap in evaluating the factual consistency of LLM responses in long-form tasks. It serves as a valuable tool for the research community to advance understanding and development of factual capabilities in LLMs. Through continuous updates and new model inclusions, the leaderboard is poised to contribute significantly to the ongoing challenge of improving and measuring LLM factuality.

Authors (26)
  1. Alon Jacovi (26 papers)
  2. Andrew Wang (42 papers)
  3. Chris Alberti (23 papers)
  4. Connie Tao (4 papers)
  5. Jon Lipovetz (1 paper)
  6. Kate Olszewska (5 papers)
  7. Lukas Haas (4 papers)
  8. Michelle Liu (5 papers)
  9. Nate Keating (2 papers)
  10. Adam Bloniarz (4 papers)
  11. Carl Saroufim (3 papers)
  12. Corey Fry (2 papers)
  13. Dror Marcus (4 papers)
  14. Doron Kukliansky (3 papers)
  15. Gaurav Singh Tomar (14 papers)
  16. James Swirhun (1 paper)
  17. Jinwei Xing (6 papers)
  18. Lily Wang (8 papers)
  19. Madhu Gurumurthy (3 papers)
  20. Michael Aaron (1 paper)