Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation (2407.10817v1)

Published 15 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: As LLMs advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing generative model trained exclusively on permissively licensed data, outperforming both GPT-4-0125 (85.9%) and GPT-4o (84.7%). Additionally, we explore a more computationally efficient approach using a novel tail-patch fine-tuning strategy to optimize our FLAMe multitask mixture for reward modeling evaluation (FLAMe-Opt-RM), offering competitive RewardBench performance while requiring approximately 25x less training datapoints. Overall, our FLAMe variants outperform all popular proprietary LLM-as-a-Judge models we consider across 8 out of 12 autorater evaluation benchmarks, encompassing 53 quality assessment tasks, including RewardBench and LLM-AggreFact. Finally, our analysis reveals that FLAMe is significantly less biased than these LLM-as-a-Judge models on the CoBBLEr autorater bias benchmark, while effectively identifying high-quality responses for code generation.

Citations (16)

Summary

  • The paper introduces FLAMe, a framework that trains LLM autoraters using over 5M human judgments for robust automated evaluation.
  • It employs supervised multitask learning and targeted fine-tuning for reward modeling, with FLAMe-RM-24B reaching 87.8% accuracy on the RewardBench benchmark.
  • The study demonstrates reduced bias and improved performance over proprietary models, paving the way for more accessible AI evaluation.

Foundational Autoraters: Taming LLMs for Better Automatic Evaluation

Overview

The paper "Foundational Autoraters: Taming LLMs for Better Automatic Evaluation" introduces FLAMe, a suite of foundational large autorater models designed to address the challenge of evaluating the outputs of LLMs automatically. Evaluation of LLMs has traditionally relied on human judgment, which is expensive and subject to variability. The authors propose using LLM autoraters as a viable alternative to overcome these limitations. FLAMe is trained on a diverse set of over 100 quality assessment tasks totaling more than 5 million human judgments. This approach leverages publicly available human evaluation datasets, which have been curated and standardized for robustness and generalization.

Methodology

The FLAMe framework encompasses multiple variants:

  1. FLAMe-24B: Trained via supervised multitask learning on a mixture of tasks with examples-proportional mixture weights (see the weighting sketch after this list).
  2. FLAMe-RM-24B: Fine-tuned from FLAMe specifically for reward modeling using a balanced mixture of four datasets.
  3. FLAMe-Opt-RM-24B: Utilizes an optimized multitask mixture derived through a novel tail-patch fine-tuning strategy for enhanced efficiency.
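
The examples-proportional weighting mentioned for FLAMe-24B can be made concrete with a small sketch. The code below computes sampling weights proportional to per-task example counts, with an optional per-task cap; the task names, sizes, and cap value are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch of examples-proportional mixture weighting for multitask
# training. Task names, sizes, and the per-task cap are illustrative
# assumptions, not values taken from the paper.
from __future__ import annotations


def mixture_weights(task_sizes: dict[str, int], cap: int | None = None) -> dict[str, float]:
    """Return sampling weights proportional to (optionally capped) task sizes."""
    effective = {task: min(n, cap) if cap is not None else n for task, n in task_sizes.items()}
    total = sum(effective.values())
    return {task: n / total for task, n in effective.items()}


if __name__ == "__main__":
    # Hypothetical quality-assessment tasks with very different sizes.
    sizes = {"helpfulness_pairs": 500_000, "factuality_labels": 40_000, "safety_ratings": 8_000}
    print(mixture_weights(sizes))               # purely proportional to task size
    print(mixture_weights(sizes, cap=100_000))  # cap keeps large tasks from dominating
```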

The authors elaborate on their thorough data curation process, which involves:

  • Collecting permissively licensed datasets with human annotations.
  • Standardizing these datasets into a unified text-to-text format (illustrated in the sketch after this list).
  • Consulting original dataset authors to resolve ambiguities.
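
As a concrete illustration of the unified text-to-text format, the sketch below renders a single pairwise preference judgment as an input/target text pair. The field names and prompt template are hypothetical stand-ins and do not reproduce the paper's actual task templates.

```python
# Minimal sketch of casting a human preference judgment into a unified
# text-to-text format. Field names and the prompt template are illustrative
# assumptions; the paper's exact task templates are not reproduced here.


def to_text_to_text(record: dict) -> tuple[str, str]:
    """Render a pairwise preference judgment as (input_text, target_text)."""
    input_text = (
        "Task: pairwise response quality\n"
        f"Instruction: {record['instruction']}\n"
        f"Response A: {record['response_a']}\n"
        f"Response B: {record['response_b']}\n"
        "Question: Which response is better? Answer A or B."
    )
    target_text = record["preferred"]  # human label, "A" or "B"
    return input_text, target_text


if __name__ == "__main__":
    example = {
        "instruction": "Summarize the article in one sentence.",
        "response_a": "A concise, accurate summary.",
        "response_b": "An off-topic reply.",
        "preferred": "A",
    }
    inp, tgt = to_text_to_text(example)
    print(inp)
    print("Target:", tgt)
```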

Results

The numerical evaluations demonstrate that FLAMe models outperform proprietary models like GPT-4 and Claude-3 on several tasks. Specifically:

  • RewardBench: FLAMe-RM-24B achieved an overall accuracy of 87.8%, outperforming GPT-4-0125 (85.9%) and GPT-4o (84.7%).
  • General Performance: FLAMe variants surpass popular proprietary LLM-as-a-Judge models on 8 of 12 autorater evaluation benchmarks (spanning 53 quality assessment tasks), including RewardBench and LLM-AggreFact.
  • Autorater Bias: On the CoBBLEr autorater bias benchmark, FLAMe models exhibit significantly less bias than popular LLM-as-a-Judge autoraters such as GPT-4.

Implications and Future Directions

The strong performance of FLAMe across these evaluation benchmarks indicates its potential as a versatile tool for evaluating LLM outputs, and its reliance on permissively licensed rather than proprietary data points toward more accessible AI research and development. Potential applications include:

  1. Automated Evaluation: LLMs can serve as reliable judges across a broad spectrum of quality assessment tasks, reducing the need for human evaluators.
  2. Code Generation: FLAMe’s ability to re-rank model outputs improves the quality of generated code, as shown on the HumanEval benchmark (a re-ranking sketch follows this list).
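
To make the code-generation use case concrete, the sketch below shows best-of-N re-ranking: generate several candidates, score each with an autorater, and keep the top-scoring one. The `score_fn` interface and the toy scoring heuristic are assumptions for illustration only; FLAMe itself would supply the quality score.

```python
# Minimal sketch of best-of-N re-ranking of code candidates with an autorater
# score. `score_fn` stands in for a FLAMe-style autorater call; its interface
# here is an assumption made for illustration, not the paper's API.
from typing import Callable, Sequence


def rerank(prompt: str, candidates: Sequence[str],
           score_fn: Callable[[str, str], float]) -> str:
    """Return the candidate that the autorater scores highest for the prompt."""
    return max(candidates, key=lambda code: score_fn(prompt, code))


if __name__ == "__main__":
    def toy_score(prompt: str, code: str) -> float:
        # Placeholder heuristic standing in for a learned quality score.
        return float("return" in code) - 0.01 * len(code)

    task = "Write a function that doubles a number."
    candidates = [
        "def double(x): print(2 * x)",
        "def double(x): return 2 * x",
    ]
    print(rerank(task, candidates, toy_score))  # picks the candidate that returns a value
```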

Future research can expand FLAMe’s capabilities in several ways:

  • Multilingual Support: Incorporating more multilingual datasets to enhance performance in non-English contexts.
  • Long-Context Tasks: Training on datasets with longer context lengths to improve performance on tasks requiring extensive text analysis.
  • Advanced Training Techniques: Exploring reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO) to further refine autorater capabilities.

Conclusion

FLAMe represents a significant advancement in the field of automatic evaluation for LLMs. By leveraging a diverse and extensive set of human evaluations, FLAMe models demonstrate strong generalization across various benchmarks. The introduced methodologies for task standardization and efficient fine-tuning contribute to the robustness and versatility of these autoraters, setting a new standard for future models in this space. The results and analyses provided in this paper offer valuable insights for researchers and practitioners aiming to develop and utilize effective LLM autoraters.
