Self-Evaluation of Large Language Model based on Glass-box Features (2403.04222v2)
Abstract: The proliferation of open-source LLMs underscores a pressing need for evaluation methods. Existing work relies primarily on external evaluators and focuses on training and prompting strategies, overlooking a crucial aspect: model-aware glass-box features. In this study, we explore the utility of glass-box features in the self-evaluation scenario, i.e., applying an LLM to evaluate its own output. We investigate various groups of glass-box features and find that the softmax distribution serves as a reliable quality indicator for self-evaluation. Experimental results on public benchmarks validate the feasibility of LLM self-evaluation using glass-box features.
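As a rough illustration of what softmax-based glass-box features look like in practice, the sketch below scores a model's own output using two common uncertainty statistics: the mean probability of the chosen tokens and the mean entropy of each step's softmax distribution. This is a minimal sketch, not the paper's exact feature set or pipeline; the model choice (`gpt2`, a lightweight stand-in for the larger open-source LLMs the paper targets) and the specific feature definitions are assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical setup: gpt2 stands in for a larger open-source LLM.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Explain photosynthesis in one sentence."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        do_sample=False,
        return_dict_in_generate=True,
        output_scores=True,  # keep per-step logits (the "glass-box" signal)
        pad_token_id=tokenizer.eos_token_id,
    )

# out.scores: one (batch, vocab) logit tensor per generated step.
probs_per_step = [torch.softmax(s, dim=-1) for s in out.scores]
# Generated token ids, excluding the prompt prefix.
chosen_ids = out.sequences[0, inputs["input_ids"].shape[1]:]

# Feature 1 (assumed definition): probability assigned to each chosen token.
token_probs = torch.stack(
    [p[0, tok] for p, tok in zip(probs_per_step, chosen_ids)]
)
# Feature 2 (assumed definition): entropy of each step's softmax distribution.
entropies = torch.stack(
    [-(p * torch.log(p + 1e-12)).sum() for p in probs_per_step]
)

# Heuristic reading: higher mean probability and lower mean entropy
# suggest the model is more confident in its own output.
print(f"mean token prob: {token_probs.mean().item():.4f}")
print(f"mean entropy:    {entropies.mean().item():.4f}")
```

In a self-evaluation setting, such statistics would be aggregated over an output and correlated with external quality judgments; the aggregation and feature grouping shown here are illustrative choices, not the paper's reported configuration.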