Evaluating the Validity of LLM-as-a-Personalized-Judge
Introduction
The paper critically assesses the LLM-as-a-Judge approach, particularly its application to personalization tasks. The authors argue that while LLMs have shown high agreement with human annotators on many tasks, their effectiveness as judges of personalized outputs remains questionable. The core issue identified is that the personas supplied to the judge are oversimplified, which leads to low agreement with first-person human ground truth. To address this, the paper introduces a verbal uncertainty estimation mechanism that allows the LLM to express its confidence in each judgment. The findings suggest that this enhancement improves the reliability of LLM-as-a-Personalized-Judge, yielding performance comparable to human evaluators, particularly on high-certainty samples.
Key Contributions
The paper's primary contributions are threefold:
- Validation of LLM-as-a-Personalized-Judge: The paper scrutinizes the reliability of using LLMs to judge user preferences based on personas. The authors find that the standard LLM-as-a-Personalized-Judge approach reaches only around 70% agreement with first-person human ground truth on binary choice tasks, contradicting previous assumptions of higher reliability.
- Introduction of Verbal Uncertainty Estimation: To mitigate the identified reliability issues, the authors propose integrating verbal uncertainty estimation into the evaluation process. The LLM flags judgments it is unsure about, and when evaluation is restricted to samples the model reports as high certainty, agreement rises to above 80%.
- Human Evaluation Experiment: The authors conduct a human evaluation experiment to benchmark the LLM-as-a-Personalized-Judge against third-person human judgment. The experiment reveals that LLMs can match or even surpass human performance, particularly on high-certainty samples.
Methodology
The authors employ a robust experimental framework to assess the validity of LLM-as-a-Personalized-Judge. They test different models, including GPT-4, GPT-3.5, Command R+, and Llama 3 70B, across multiple datasets: PRISM, OpinionQA, Public Reddit (PR), and Empathetic Conversation (EC). The key methodological steps are:
- Datasets: The authors use diverse datasets with available ground truth to evaluate the performance of LLM-as-a-Personalized-Judge. These datasets encompass a range of user preferences and demographic variables.
- Experimental Setups: Three experimental setups are used to investigate the reliability of the LLM-as-a-Personalized-Judge (a minimal prompt-level sketch follows this list):
  - Standard LLM-as-a-Personalized-Judge: The model makes a preference judgment based on the given persona.
  - Standard LLM-as-a-Personalized-Judge with Verbal Uncertainty Estimation: Alongside each judgment, the model verbalizes how certain it is of its prediction.
  - Standard LLM-as-a-Personalized-Judge with a Tie Option: The model can declare a tie in addition to choosing between the two options.
- Human Evaluation: The authors conduct a crowdsourcing experiment where human annotators judge user preferences based on provided personas. The LLM's performance is then compared to these human judgments.
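To make these setups concrete, here is a minimal sketch of how such judge prompts could be assembled. The prompt wording, the 1-to-10 certainty scale, and the `call_llm` wrapper are illustrative assumptions, not the authors' actual implementation.

```python
from typing import Callable

def build_judge_prompt(persona: str, response_a: str, response_b: str,
                       with_uncertainty: bool = False,
                       allow_tie: bool = False) -> str:
    """Assemble a personalized-judge prompt for one of the three setups (sketch)."""
    options = "A, B, or TIE" if allow_tie else "A or B"
    prompt = (
        "You are judging on behalf of the user described below.\n"
        f"User persona:\n{persona}\n\n"
        f"Response A:\n{response_a}\n\n"
        f"Response B:\n{response_b}\n\n"
        f"Which response would this user prefer? Answer with {options}."
    )
    if with_uncertainty:
        # Verbal uncertainty estimation: the model states its own certainty.
        # The 1-10 scale is an assumption for illustration.
        prompt += ("\nThen, on a new line, state your certainty in this judgment "
                   "as an integer from 1 (very uncertain) to 10 (very certain).")
    return prompt

def judge(call_llm: Callable[[str], str], persona: str,
          response_a: str, response_b: str, **kwargs) -> str:
    """Run one judgment with a user-supplied LLM wrapper; answer parsing is left to the caller."""
    return call_llm(build_judge_prompt(persona, response_a, response_b, **kwargs))
```

In the verbal-uncertainty setup, the reported certainty score is what later allows judgments to be filtered to a high-certainty subset, as discussed in the Results section.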
Results
The findings of this paper reveal several critical insights:
- Low Agreement with Human Ground Truth: The standard LLM-as-a-Personalized-Judge shows lower-than-anticipated agreement with first-person ground truth, particularly on more challenging tasks.
- Effectiveness of Verbal Uncertainty Estimation: Filtering judgments by the model's verbalized certainty markedly improves performance on the high-certainty subset. For example, GPT-4 achieves around 80% accuracy on high-certainty samples, aligning with human-level performance (see the sketch after this list).
- Human-Level Performance: The paper finds that LLM-as-a-Personalized-Judge can match or even surpass the performance of third-person human judges, particularly in high-certainty samples.
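The certainty-thresholded evaluation described above can be summarized in a short sketch. The record layout, field names, and the example threshold of 7 are assumptions for illustration, not the paper's exact protocol.

```python
def agreement(records, min_certainty=None):
    """Fraction of judge choices matching first-person ground truth,
    optionally restricted to samples at or above a verbalized-certainty threshold."""
    kept = [r for r in records
            if min_certainty is None or r["certainty"] >= min_certainty]
    if not kept:
        return float("nan")
    return sum(r["judge_choice"] == r["ground_truth"] for r in kept) / len(kept)

# Hypothetical judgments: the judge's pick, the user's actual preference,
# and the judge's self-reported certainty (assumed 1-10 scale).
records = [
    {"judge_choice": "A", "ground_truth": "A", "certainty": 9},
    {"judge_choice": "B", "ground_truth": "A", "certainty": 4},
    {"judge_choice": "B", "ground_truth": "B", "certainty": 8},
]
print(agreement(records))                   # overall agreement
print(agreement(records, min_certainty=7))  # agreement on high-certainty samples only
```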
Implications and Future Directions
The paper's findings have significant theoretical and practical implications:
- Theoretical Implications: The paper underscores the limitations of current LLM-as-a-Judge approaches in personalization tasks, highlighting the need for more nuanced persona representations. It also demonstrates the potential of uncertainty estimation as a reliability-enhancing mechanism.
- Practical Implications: The improved LLM-as-a-Personalized-Judge framework can serve as a scalable and effective alternative for evaluating LLM personalization when first-person annotations are unavailable. This can be particularly valuable for applications where collecting first-person data is impractical.
- Future Developments: Future research should focus on developing more sophisticated methods for persona representation to address the issue of persona sparsity. Additionally, exploring other uncertainty quantification techniques could further enhance the reliability of LLM judgments. Extending the evaluation to non-English languages and examining the cultural biases in LLMs are also critical areas for future investigation.
Conclusion
This paper makes a compelling case for re-evaluating the reliability of LLM-as-a-Personalized-Judge approaches. By identifying the limitations of oversimplified personas and proposing verbal uncertainty estimation, the authors provide a pathway to more reliable and scalable methods for evaluating LLM personalization. The paper's findings highlight the potential for LLMs to achieve human-comparable performance in personalization tasks, particularly when leveraging certainty thresholds. This work lays a foundation for future research aimed at developing LLMs that better cater to diverse individual preferences.