Prometheus 2: An Open Source Language Model Specialized in Evaluating Other Language Models

Published 2 May 2024 in cs.CL (arXiv:2405.01535v2)

Abstract: Proprietary LMs such as GPT-4 are often employed to assess the quality of responses from various LMs. However, concerns including transparency, controllability, and affordability strongly motivate the development of open-source LMs specialized in evaluations. On the other hand, existing open evaluator LMs exhibit critical shortcomings: 1) they issue scores that significantly diverge from those assigned by humans, and 2) they lack the flexibility to perform both direct assessment and pairwise ranking, the two most prevalent forms of assessment. Additionally, they do not possess the ability to evaluate based on custom evaluation criteria, focusing instead on general attributes like helpfulness and harmlessness. To address these issues, we introduce Prometheus 2, a more powerful evaluator LM than its predecessor that closely mirrors human and GPT-4 judgements. Moreover, it is capable of processing both direct assessment and pair-wise ranking formats grouped with a user-defined evaluation criteria. On four direct assessment benchmarks and four pairwise ranking benchmarks, Prometheus 2 scores the highest correlation and agreement with humans and proprietary LM judges among all tested open evaluator LMs. Our models, code, and data are all publicly available at https://github.com/prometheus-eval/prometheus-eval.

Citations (96)

Summary

  • The paper introduces Prometheus 2, an evaluator that combines direct assessment and pairwise ranking to achieve high correlation with human and proprietary evaluations.
  • It leverages a unique weight merging technique to integrate models trained on different evaluation formats, ensuring robust performance in varied testing conditions.
  • Prometheus 2 utilizes over 1,000 custom-defined criteria, enhancing flexibility and transparency compared to previous open-source models.

Exploring Prometheus 2: A Leap in Open Source LLM Evaluators

Introduction to LLM Evaluation

LLM evaluation is a vital area of research that seeks to measure and improve the outputs of AI models designed to understand and generate human-like text. Proprietary models like GPT-4 have traditionally served as judges because they are faster and cheaper than human annotation. However, relying on proprietary models raises critical issues, including limited transparency, lack of controllability, and recurring API costs, which has pushed demand for open-source alternatives.

The Challenges with Existing Open-Source Evaluators

Current open-source evaluator models are typically tailored to a single format: either direct assessment, which assigns a score against set criteria, or pairwise ranking, which picks the preferred of two responses. Moreover, they tend to focus on general attributes like helpfulness and harmlessness, and they often fall short of mirroring the intricate judgment capabilities of proprietary models or human evaluators.

Introducing Prometheus 2: What Sets It Apart?

Prometheus 2 is developed as an advanced evaluator LLM that excels in both direct assessment and pairwise ranking, directly addressing the inflexibilities of prior models. Key features of Prometheus 2 include:

  • High Correlation with Humans and Proprietary Models: Unlike its predecessors, Prometheus 2 demonstrates a significantly higher correlation with both human judgments and proprietary models across various benchmarks.
  • Flexibility Across Evaluation Formats: This model is uniquely capable of handling both main types of evaluation formats seamlessly, which is an improvement over its open-source predecessors that typically handle only one.
  • Custom Evaluation Criteria: Going beyond basic assessment criteria, Prometheus 2 utilizes a rich set of over 1,000 user-defined criteria, making it adaptable for diverse and specific evaluation needs.
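To make the custom-criteria idea concrete, here is a minimal sketch of how a user-defined 1-5 rubric might be assembled into a direct-assessment prompt. The field names, template layout, and `build_direct_assessment_prompt` helper are illustrative assumptions, not the exact Prometheus 2 prompt format (which is available in the project's repository).

```python
# Illustrative only: a custom rubric paired with a simple prompt template.
# The real Prometheus 2 template differs; this just shows the moving parts
# a user-defined criterion gives you (criteria text + per-score descriptions).
RUBRIC = {
    "criteria": "Does the response explain the concept accurately and concisely?",
    "score1": "Inaccurate or irrelevant explanation.",
    "score3": "Mostly accurate but verbose or missing key details.",
    "score5": "Accurate, complete, and concise explanation.",
}

def build_direct_assessment_prompt(instruction, response, reference, rubric):
    """Assemble a direct-assessment prompt with a custom 1-5 rubric."""
    return (
        f"### Instruction:\n{instruction}\n\n"
        f"### Response to evaluate:\n{response}\n\n"
        f"### Reference answer:\n{reference}\n\n"
        f"### Criteria: {rubric['criteria']}\n"
        f"Score 1: {rubric['score1']}\n"
        f"Score 3: {rubric['score3']}\n"
        f"Score 5: {rubric['score5']}\n\n"
        "Write feedback, then output a final score from 1 to 5."
    )
```

Swapping in a different rubric dict is all it takes to evaluate against a different criterion, which is the flexibility the paper emphasizes.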

How Does Prometheus 2 Work?

The development of Prometheus 2 relies on an approach known as "weight merging": two copies of the same base model are fine-tuned separately, one on direct assessment data and one on pairwise ranking data, and their parameters are then combined into a single model. Here's a simplified breakdown of the process:

  1. Direct Assessment Base: In this setup, the model scores responses on a Likert scale based on the alignment with a given reference answer and specified criteria.
  2. Pairwise Ranking Base: This involves choosing the preferred response from a pair, again considering specific criteria and perhaps a reference for guidance.
  3. Merged Model Training: By training separate models on these two formats and then merging their weights, Prometheus 2 retains the strengths of both approaches, leading to a robust evaluator that performs well across different assessment types.
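The merging step above can be sketched as a linear interpolation of the two fine-tuned checkpoints' parameters. This is a minimal sketch, assuming plain-float weights and a single mixing coefficient `alpha`; the function name and the example values are illustrative, and real checkpoints would hold tensors rather than floats, though the same element-wise formula applies.

```python
def merge_state_dicts(direct_sd, pairwise_sd, alpha=0.5):
    """Linearly interpolate two state dicts with matching keys:
    merged[k] = alpha * direct[k] + (1 - alpha) * pairwise[k]."""
    return {
        name: alpha * w_direct + (1 - alpha) * pairwise_sd[name]
        for name, w_direct in direct_sd.items()
    }

# Toy example with scalar "weights" standing in for tensors:
direct = {"layer.weight": 1.0, "layer.bias": 0.0}
pairwise = {"layer.weight": 3.0, "layer.bias": 2.0}
merged = merge_state_dicts(direct, pairwise, alpha=0.5)
# merged == {"layer.weight": 2.0, "layer.bias": 1.0}
```

Because both models start from the same base, their parameters live in compatible spaces, which is what makes this simple interpolation effective rather than destructive.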

Empirical Success and Practical Implications

In testing, Prometheus 2 outperforms existing models in terms of agreement with human and proprietary evaluations, especially in complex benchmarks that involve nuanced judgment calls. This not only proves its efficacy but also highlights the potential to reduce reliance on costly proprietary models for those needing robust evaluation tools in academic, development, or commercial settings.

Looking Towards the Future

The introduction of Prometheus 2 opens up numerous possibilities for the future of AI evaluation:

  • Enhanced Accessibility: By providing an open-source alternative that competes with proprietary models, smaller entities or individual researchers can conduct high-quality evaluations without prohibitive costs.
  • Greater Customizability: The ability to define custom criteria means that users can tailor evaluations much more closely to the specific needs of different applications.
  • Continued Development: The architecture of Prometheus 2 allows for ongoing improvements and adaptations, signaling continuous advancement in how AI models are evaluated.

In conclusion, Prometheus 2 not only sets a new standard for open-source LLM evaluators but also encourages a shift towards more transparent, customizable, and cost-effective AI evaluation methods. As this field grows, the potential for more nuanced and widespread use of such tools is substantial.
