Calibrating LLM-Based Evaluator
The paper examines the calibration of LLM-based evaluators used to assess natural language generation (NLG) quality. The authors introduce AutoCalibrate, a multi-stage, gradient-free approach designed to align LLM-based evaluators more closely with human preferences. The work targets known gaps in current LLM-based evaluation practice: sensitivity to prompt format and wording, and weak alignment with human judgment caused by ambiguous or implicit scoring criteria.
Key Methodology
AutoCalibrate employs a novel calibration process comprising several stages:
- Data Labeling as Human Preference: Human preference is encoded indirectly through a set of expert-labeled sample-score pairs. These pairs serve as the reference against which the LLM-based evaluator is aligned, so that it mirrors human judgment more closely.
- Criteria Drafting: Leveraging the instruction-following capability of LLMs, a diverse initial pool of scoring criteria is drafted from few-shot in-context examples of the labeled samples. This lets the LLM infer scoring criteria for NLG tasks without requiring extensive reference outputs.
- Criteria Revisiting and Refinement: Candidate criteria are evaluated against the expert labels, the top performers are retained, and these are further refined by prompting the LLM to revise them using examples where model and human scores diverged. The process relies on the LLM's self-refinement ability to iteratively improve the scoring guidelines; a minimal sketch of the full loop follows this list.
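The sketch below illustrates one way such a drafting, selection, and refinement loop could be wired together. All helper names (draft_criteria, score_with_criterion, select_best_criteria, refine_criterion) and prompt texts are hypothetical simplifications rather than the paper's actual prompts or selection procedure; the only external dependency is scipy, used for the rank correlation that picks the top criteria.

```python
# Minimal sketch of an AutoCalibrate-style calibration loop.
# Helper names and prompts are illustrative assumptions, not the paper's exact method.
from typing import Callable, List, Tuple
from scipy.stats import spearmanr

LLM = Callable[[str], str]  # any text-in / text-out model client


def draft_criteria(llm: LLM, labeled: List[Tuple[str, float]], n_drafts: int = 5) -> List[str]:
    """Ask the LLM to infer candidate scoring criteria from expert-labeled sample-score pairs."""
    shots = "\n\n".join(f"Sample: {text}\nHuman score: {score}" for text, score in labeled)
    prompt = (
        "Given the following expert-scored examples, write a concise scoring "
        "criterion (1-5 scale) that explains how the scores were assigned.\n\n"
        f"{shots}\n\nCriterion:"
    )
    # Sample several drafts (e.g. at non-zero temperature) to build a diverse criteria pool.
    return [llm(prompt) for _ in range(n_drafts)]


def score_with_criterion(llm: LLM, criterion: str, sample: str) -> float:
    """Score one sample under a given criterion; assumes the LLM replies with a bare number."""
    prompt = f"Scoring criterion:\n{criterion}\n\nSample:\n{sample}\n\nScore (1-5), number only:"
    return float(llm(prompt).strip())


def select_best_criteria(llm: LLM, criteria: List[str],
                         labeled: List[Tuple[str, float]], top_k: int = 2) -> List[str]:
    """Keep the criteria whose LLM scores correlate best with the expert labels."""
    human = [score for _, score in labeled]
    ranked = []
    for criterion in criteria:
        model = [score_with_criterion(llm, criterion, text) for text, _ in labeled]
        rho = spearmanr(model, human)[0]
        ranked.append((rho, criterion))
    ranked.sort(key=lambda pair: pair[0], reverse=True)
    return [criterion for _, criterion in ranked[:top_k]]


def refine_criterion(llm: LLM, criterion: str,
                     misaligned: List[Tuple[str, float, float]]) -> str:
    """Revise a criterion using samples where the model score diverged from the human score."""
    cases = "\n\n".join(f"Sample: {t}\nHuman score: {h}\nModel score: {m}" for t, h, m in misaligned)
    prompt = (
        f"Current criterion:\n{criterion}\n\nThese cases were misjudged under it:\n\n{cases}\n\n"
        "Rewrite the criterion so these cases would be scored correctly:"
    )
    return llm(prompt)
```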
Experimental Evaluation
The AutoCalibrate framework was tested on several NLG tasks, including text summarization, data-to-text generation, and hallucination evaluation, across datasets such as NewsRoom, SummEval, SFRES, SFHOT, and QAGS. The meta-evaluation measured the correlation between LLM-generated scores and expert human judgments; a toy illustration of this correlation computation follows the list below.
- On text summarization tasks, AutoCalibrate demonstrated substantial improvements over both traditional metrics such as ROUGE and uncalibrated LLM-based evaluators, indicating that explicit scoring criteria considerably boost evaluator performance.
- In data-to-text generation evaluation, the framework notably surpassed other model-based methods, illustrating its efficacy in aligning LLM evaluations with human judgment.
- It also showed robustness across various datasets when evaluating hallucinations, further suggesting its applicability to different NLG evaluation contexts.
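As context for how such alignment is typically quantified, the toy helper below reports Spearman, Kendall-tau, and Pearson coefficients between evaluator scores and human ratings. The function name and the example numbers are illustrative only; they are not values or code from the paper.

```python
# Toy meta-evaluation helper: correlation between evaluator scores and human ratings.
from scipy.stats import spearmanr, kendalltau, pearsonr


def evaluator_alignment(llm_scores, human_scores):
    """Standard correlation statistics between evaluator scores and human ratings."""
    return {
        "spearman": spearmanr(llm_scores, human_scores)[0],
        "kendall": kendalltau(llm_scores, human_scores)[0],
        "pearson": pearsonr(llm_scores, human_scores)[0],
    }


# Toy numbers for illustration only (not results from the paper).
print(evaluator_alignment([4, 2, 5, 3, 1], [5, 2, 4, 3, 1]))
```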
Implications and Future Directions
The findings underscore the potential of using LLMs as robust, reference-free evaluators once adequately calibrated. The gradient-free nature of AutoCalibrate supports its application in constrained environments where access to model weights or additional training is impractical.
Further research could extend AutoCalibrate to a broader range of language tasks, improve criteria-induction techniques, or develop stronger self-refinement strategies to further tighten alignment. AutoCalibrate represents a foundational step toward more accurate and reliable automatic NLG evaluation that leverages state-of-the-art LLM capabilities while staying aligned with human evaluative standards.