KIEval: A Knowledge-grounded Interactive Evaluation Framework for Large Language Models (2402.15043v2)

Published 23 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Automatic evaluation methods for LLMs are hindered by data contamination, leading to inflated assessments of their effectiveness. Existing strategies, which aim to detect contaminated texts, focus on quantifying contamination status rather than accurately gauging model performance. In this paper, we introduce KIEval, a Knowledge-grounded Interactive Evaluation framework, which incorporates an LLM-powered "interactor" role for the first time to achieve dynamic, contamination-resilient evaluation. Starting with a question from a conventional LLM benchmark involving domain-specific knowledge, KIEval uses dynamically generated, multi-round, knowledge-focused dialogues to determine whether a model's response merely recalls benchmark answers or demonstrates deep comprehension and the ability to apply knowledge in more complex conversations. Extensive experiments on seven leading LLMs across five datasets validate KIEval's effectiveness and generalization. We also reveal that data contamination contributes nothing to, and can even harm, models' real-world applicability and understanding, and that existing contamination-detection methods for LLMs can only identify contamination introduced during pre-training, not during supervised fine-tuning.
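
The abstract describes a protocol rather than a formula: an LLM "interactor" drives a multi-round, knowledge-grounded dialogue that starts from a standard benchmark question, while a judge scores each turn for genuine understanding rather than recall. As a rough illustration of that control flow, here is a minimal Python sketch; the `candidate`, `interactor`, and `judge` callables, the prompts, and the 0-1 scoring rubric are assumptions for illustration only, not the paper's actual pipeline.

```python
# Minimal sketch of an interactor-driven, multi-round evaluation loop in the
# spirit of KIEval. The model wrappers, prompts, and scoring rubric below are
# hypothetical placeholders, not the authors' implementation.
from dataclasses import dataclass
from typing import Callable, List

# A "model" is any callable mapping a chat history (list of role/content dicts)
# to a text reply, e.g. a thin wrapper around an API or a local model.
Model = Callable[[List[dict]], str]

@dataclass
class DialogueTurn:
    question: str
    answer: str
    score: float  # judge-assigned score for this turn

def kieval_episode(seed_question: str,
                   candidate: Model,
                   interactor: Model,
                   judge: Model,
                   num_rounds: int = 3) -> List[DialogueTurn]:
    """Run one multi-round, knowledge-focused evaluation episode.

    Starts from a benchmark question, then lets the interactor LLM generate
    follow-up questions that probe whether the candidate understands the
    underlying knowledge rather than recalling a memorized benchmark answer.
    """
    history: List[dict] = [{"role": "user", "content": seed_question}]
    turns: List[DialogueTurn] = []
    question = seed_question

    for _ in range(num_rounds):
        # The candidate model answers the current question given the full dialogue.
        answer = candidate(history)
        history.append({"role": "assistant", "content": answer})

        # An LLM judge scores the answer; the rubric prompt is a placeholder.
        verdict = judge(history + [{
            "role": "user",
            "content": "Rate the last answer from 0 to 1 for correctness and depth. Reply with a number only.",
        }])
        try:
            score = float(verdict.strip())
        except ValueError:
            score = 0.0
        turns.append(DialogueTurn(question=question, answer=answer, score=score))

        # The interactor proposes a deeper, knowledge-grounded follow-up question.
        question = interactor(history + [{
            "role": "user",
            "content": "Ask one follow-up question that probes the knowledge behind this topic.",
        }])
        history.append({"role": "user", "content": question})

    return turns
```

In a full evaluation, per-episode scores would be aggregated over all seed questions drawn from a benchmark; the sketch above only exposes the control flow of the interaction.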

Authors (9)
  1. Zhuohao Yu (15 papers)
  2. Chang Gao (54 papers)
  3. Wenjin Yao (3 papers)
  4. Yidong Wang (43 papers)
  5. Wei Ye (110 papers)
  6. Jindong Wang (150 papers)
  7. Xing Xie (220 papers)
  8. Yue Zhang (620 papers)
  9. Shikun Zhang (82 papers)
Citations (11)