Wider and Deeper LLM Networks are Fairer LLM Evaluators

Published 3 Aug 2023 in cs.CL (arXiv:2308.01862v1)

Abstract: Measuring the quality of responses generated by LLMs is a challenging task, particularly when it comes to evaluating whether the response is aligned with human preference. A novel approach involves using the LLM itself to perform the evaluation and stabilizing the results through multiple independent evaluations, similar to a single-layer narrow LLM network. This network consists of a fixed number of neurons, with each neuron being the same LLM. In this paper, we draw upon the extensive research on deep neural networks to explore whether deeper and wider networks can lead to fairer evaluations. Specifically, inspired by the observation that different neurons in a neural network are responsible for detecting different concepts, we first adaptively generate as many neuron roles as possible for each evaluation sample. Each perspective corresponds to the role of a specific LLM neuron in the first layer. In subsequent layers, we follow the idea that higher layers in deep networks are responsible for more comprehensive features: each layer receives representations from all neurons in the previous layer, integrating the locally learned evaluation information to obtain a more comprehensive evaluation result. Interestingly, this network design resembles the process of academic paper reviewing. To validate the effectiveness of our method, we construct the largest and most diverse English evaluation benchmark LLMEval² for LLM evaluators, comprising 15 tasks, 8 abilities, and 2,553 samples. Experimental results demonstrate that a wider network (involving many reviewers) with 2 layers (one round of discussion) performs the best, improving the kappa correlation coefficient from 0.28 to 0.34. We also leverage WideDeep to aid in the assessment of Chinese LLMs, which accelerates the evaluation by 4.6 times, resulting in a 60% cost saving. WideDeep achieves a remarkable 93% agreement level among humans.

Citations (69)

Summary

  • The paper demonstrates that wider and deeper LLM networks boost evaluation accuracy, increasing kappa correlation from 0.28 to 0.34.
  • The methodology deploys a multi-layer design where neurons specialize in distinct roles, mirroring the academic peer review process.
  • The approach significantly cuts evaluation time and costs, achieving 4.6× acceleration, 60% cost savings, and 93% agreement with human annotations.

Overview of "Wider and Deeper LLM Networks are Fairer LLM Evaluators"

The paper "Wider and Deeper LLM Networks are Fairer LLM Evaluators" by Xinghua Zhang et al. investigates a significant challenge in the deployment and evaluation of LLMs: how to effectively assess their outputs' alignment with human preference. The authors propose novel architectures inspired by deep neural network designs, suggesting that deeper and wider LLM networks could lead to fairer evaluations, akin to the process of academic peer review.

Methodology and Network Design

The paper introduces a multi-layer wide LLM network in which each neuron is an LLM tasked with a specific evaluation role. The concept is rooted in the observation that different neurons in deep networks specialize in detecting different concepts. The approach begins by generating diverse evaluation perspectives for each sample and assigning these as distinct roles to the LLM neurons in the network's first layer. Each subsequent layer then receives the outputs of all neurons in the previous layer and synthesizes them into a more comprehensive evaluation, akin to layered feature abstraction in classical deep learning models.
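
A minimal sketch of this two-layer, role-based evaluation scheme is given below. It is illustrative only: `call_llm` is a hypothetical wrapper for whatever LLM API is in use, the prompts are simplified, and a single integration call stands in for the paper's wider second layer.

```python
from typing import List


def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around an LLM chat-completion endpoint."""
    raise NotImplementedError("plug in your preferred LLM client here")


def generate_roles(question: str, n_roles: int = 4) -> List[str]:
    # Adaptively propose evaluation perspectives (roles) for this sample.
    prompt = (
        f"List {n_roles} distinct evaluation perspectives (e.g. factuality, "
        f"helpfulness) relevant to judging answers to:\n{question}"
    )
    lines = [ln.strip("-• ").strip() for ln in call_llm(prompt).splitlines()]
    return [ln for ln in lines if ln][:n_roles]


def evaluate_pair(question: str, answer_a: str, answer_b: str) -> str:
    roles = generate_roles(question)

    # Layer 1: each "neuron" (one LLM call) judges the pair from a single role.
    layer1 = [
        call_llm(
            f"You are an evaluator focusing on {role}.\n"
            f"Question: {question}\nAnswer A: {answer_a}\nAnswer B: {answer_b}\n"
            "Say which answer is better from this perspective, and why."
        )
        for role in roles
    ]

    # Layer 2: integrate all local judgments into a final verdict, analogous to
    # a reviewer discussion round before the acceptance decision.
    discussion = "\n\n".join(f"Reviewer {i + 1}: {v}" for i, v in enumerate(layer1))
    return call_llm(
        "Based on the reviewer assessments below, give a final verdict of "
        "'A', 'B', or 'tie'.\n\n" + discussion
    )
```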

The design closely parallels the process of peer reviewing: multiple independent reviews (first layer) followed by interactive discussion (second layer) before making an aggregate decision on the paper's acceptance. This analogy underscores the fair evaluation process facilitated by aggregating diverse neuronal perspectives and outputs.

Experimental Validation

To substantiate their hypothesis, the authors construct LLMEval², claimed to be the largest and most diverse benchmark for LLM evaluators. Experimental findings indicate that a wider network with two layers significantly improves evaluation accuracy, raising the kappa correlation with human judgments from 0.28 to 0.34. The architecture demonstrates robust performance across diverse linguistic tasks, spanning logical reasoning, semantic understanding, dialogue handling, and multilingual capabilities.
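
The kappa values reported here are chance-corrected agreement scores between the evaluator's verdicts and the human annotations. The short sketch below shows how such a score can be computed with scikit-learn; the label lists are invented purely for illustration and are not the paper's data.

```python
# Chance-corrected agreement between LLM-evaluator verdicts and human labels.
from sklearn.metrics import cohen_kappa_score

human_labels     = ["A", "B", "tie", "A", "B", "A", "tie", "B"]    # hypothetical
evaluator_labels = ["A", "B", "A",   "A", "B", "tie", "tie", "B"]  # hypothetical

kappa = cohen_kappa_score(human_labels, evaluator_labels)
print(f"Cohen's kappa: {kappa:.2f}")  # in [-1, 1]; 0 means chance-level agreement
```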

The research further applies the WideDeep approach to the assessment of Chinese LLMs, accelerating evaluation by 4.6 times and cutting costs by 60%, while reaching a 93% agreement level with human annotations.

Implications and Future Considerations

The implications of this study are manifold for the field of artificial intelligence, particularly in improving the reliability of automated evaluators for LLM outputs. The shift toward collaborative LLM networks represents a potential breakthrough in large-scale LLM evaluation. Furthermore, the analogy with academic reviewing introduces a conceptual framework that could be adapted beyond linguistic domains and applied in a variety of AI assessment scenarios.

Looking ahead, the research invites further exploration of deeper architectures with potentially multi-faceted neurons and varying depths, although it notes a decrease in performance with excessive layering, reminiscent of overfitting concerns in deep learning. The integration of nuanced evaluation roles derived dynamically from LLM outputs opens the door to more sophisticated evaluative models across a wider range of applications.

Overall, the paper contributes substantively to establishing a more effective evaluation mechanism for LLMs, promoting fairness and alignment with human values in machine-generated content. The proposed architecture and benchmarking initiative also empower a more comprehensive exploration of linguistic capabilities within LLMs, fostering advancements in model training and deployment strategies in AI.
