Self-Evaluation of Large Language Model based on Glass-box Features (2403.04222v2)

Published 7 Mar 2024 in cs.CL

Abstract: The proliferation of open-source LLMs underscores the pressing need for evaluation methods. Existing works primarily rely on external evaluators, focusing on training and prompting strategies. However, a crucial aspect, model-aware glass-box features, is overlooked. In this study, we explore the utility of glass-box features under the scenario of self-evaluation, namely applying an LLM to evaluate its own output. We investigate various glass-box feature groups and discover that the softmax distribution serves as a reliable quality indicator for self-evaluation. Experimental results on public benchmarks validate the feasibility of self-evaluation of LLMs using glass-box features.
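The abstract's central claim is that an LLM's own softmax output distribution can serve as a quality signal for its responses. As a rough illustration of what such glass-box features might look like in practice (a minimal sketch, not the paper's exact feature set or pipeline), the snippet below scores a response with the same causal LM that could have produced it, extracting a few softmax-based statistics over the response tokens. The model name, feature choices, and helper function are illustrative assumptions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative model; the paper evaluates open-source LLMs, but any
# causal LM whose logits are accessible works for this sketch.
MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()


def softmax_quality_features(prompt: str, response: str) -> dict:
    """Compute softmax-distribution ("glass-box") features over the
    response tokens: mean log-probability of the response tokens and
    mean entropy of the model's predictive distribution at each step.
    (Hypothetical helper; feature names are illustrative.)"""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids

    with torch.no_grad():
        logits = model(full_ids).logits  # (1, seq_len, vocab)

    # The distribution at position t predicts token t+1.
    probs = torch.softmax(logits[0, :-1], dim=-1)   # (seq_len - 1, vocab)
    targets = full_ids[0, 1:]                       # tokens being predicted
    token_probs = probs.gather(1, targets.unsqueeze(1)).squeeze(1)

    # Restrict to the response span (everything after the prompt).
    # Note: tokenizing prompt and prompt+response separately can shift
    # the boundary by a token in edge cases; acceptable for a sketch.
    start = prompt_ids.shape[1] - 1
    resp_probs = token_probs[start:]
    resp_dists = probs[start:]

    entropy = -(resp_dists * torch.log(resp_dists + 1e-12)).sum(dim=-1)

    return {
        "mean_logprob": torch.log(resp_probs + 1e-12).mean().item(),
        "mean_entropy": entropy.mean().item(),
        "min_token_prob": resp_probs.min().item(),
    }


features = softmax_quality_features(
    "Q: What is the capital of France?\nA:",
    " Paris is the capital of France.",
)
print(features)
```

Under the paper's premise, higher mean log-probability and lower predictive entropy would tend to indicate higher-quality output; which feature groups actually track quality is what the paper's benchmark experiments examine.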
