
PiCO: Peer Review in LLMs based on the Consistency Optimization (2402.01830v2)

Published 2 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Existing LLM evaluation methods typically test performance on closed-environment, domain-specific benchmarks with human annotations. In this paper, we explore a novel unsupervised evaluation direction that uses peer-review mechanisms to measure LLMs automatically. In this setting, both open-source and closed-source LLMs lie in the same environment, where they answer unlabeled questions and evaluate each other, and each LLM's response score is jointly determined by the other, anonymous LLMs. To obtain the ability hierarchy among these models, we assign each LLM a learnable capability parameter to adjust the final ranking. We formalize this as a constrained optimization problem that aims to maximize the consistency between each LLM's capability and its score. The key underlying assumption is that a higher-level LLM can evaluate others' answers more accurately than a lower-level one, and that a higher-level LLM also achieves higher response scores. Moreover, we propose three metrics, PEN, CIN, and LIS, to measure the gap between the resulting ranking and human rankings. We perform experiments on multiple datasets with these metrics, validating the effectiveness of the proposed approach.
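
The consistency idea in the abstract can be illustrated with a small sketch. This is not the paper's constrained optimization. It is a simple fixed-point heuristic chosen for illustration, under these assumptions: a hypothetical matrix `scores[j][i]` holds the score reviewer `j` gives to model `i`, each model's response score is the capability-weighted sum of the scores it receives from the other models, and each capability weight is repeatedly reset to the normalized response score so that the two quantities agree. The function name `peer_review_ranking` and the example numbers are invented for this sketch.

```python
def peer_review_ranking(scores, iterations=50):
    """Illustrative fixed-point sketch (an assumption, not the paper's method).

    scores[j][i] is the score reviewer j assigns to model i (hypothetical input).
    Returns (weights, ranking), where ranking lists model indices best-first.
    """
    n = len(scores)
    weights = [1.0 / n] * n  # start with equal capability weights
    for _ in range(iterations):
        # Response score of model i: weighted sum of scores from the other reviewers.
        responses = [
            sum(weights[j] * scores[j][i] for j in range(n) if j != i)
            for i in range(n)
        ]
        total = sum(responses)
        # Make capability consistent with response score: reuse scores as weights.
        weights = [r / total for r in responses]
    ranking = sorted(range(n), key=lambda i: weights[i], reverse=True)
    return weights, ranking


# Hypothetical scores for three models; the diagonal (self-review) is ignored.
example_scores = [
    [0.0, 0.6, 0.3],
    [0.9, 0.0, 0.4],
    [0.8, 0.7, 0.0],
]
weights, ranking = peer_review_ranking(example_scores)
print(ranking)
```

In this toy input, model 0 receives the highest scores from both peers and model 2 the lowest, so the weights settle into that order. The sketch only shows how capability and score can be tied together, and it says nothing about the PEN, CIN, or LIS metrics.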

Authors (8)
  1. Kun-Peng Ning (11 papers)
  2. Shuo Yang (244 papers)
  3. Yu-Yang Liu (5 papers)
  4. Jia-Yu Yao (5 papers)
  5. Zhen-Hui Liu (2 papers)
  6. Yu Wang (939 papers)
  7. Ming Pang (8 papers)
  8. Li Yuan (141 papers)
Citations (5)