Dynamic Evaluation of Large Language Models by Meta Probing Agents (2402.14865v2)

Published 21 Feb 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Evaluation of LLMs has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot be easily extended to diverse scenarios. Moreover, current evaluation benchmarks can only provide the overall benchmark results and cannot support a fine-grained and multifaceted analysis of LLMs' abilities. In this paper, we propose meta probing agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023a). MPA designs the probing and judging agents to automatically transform an original evaluation problem into a new one following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs achieve poorer performance, indicating room for improvement. Our multifaceted analysis demonstrated the strong correlation between the basic abilities and an implicit Matthew effect on model size, i.e., larger models possess stronger correlations of the abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.

References (65)
  1. 01-ai. Yi: A series of large language models. https://github.com/01-ai/Yi, 2024.
  2. Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181, 2023.
  3. BIG bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  4. On the dangers of stochastic parrots: Can language models be too big? FAccT 2021, page 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097.
  5. The reversal curse: Llms trained on “a is b” fail to learn “b is a”. arXiv preprint arXiv:2309.12288, 2023.
  6. Emergent and predictable memorization in large language models. arXiv preprint arXiv:2304.11158, 2023.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  8. Revealing the structure of language model capabilities. arXiv preprint arXiv:2306.10062, 2023.
  9. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
  10. A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
  11. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
  12. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.
  13. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
  14. Alpacafarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
  15. Nphardeval: Dynamic benchmark on reasoning ability of large language models via complexity classes. arXiv preprint arXiv:2312.14890, 2023.
  16. The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. arXiv preprint arXiv:2308.07286, 2023.
  17. Adaptive testing of computer vision models. arXiv preprint arXiv:2212.02774, 2022.
  18. GeminiTeam. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
  19. Data contamination quiz: A tool to detect and estimate contamination in large language models. arXiv preprint arXiv:2311.06233, 2023a.
  20. Time travel in llms: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493, 2023b.
  21. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.
  22. Metagpt: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
  23. HuggingFace. Open-source large language models leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
  24. Maurice G Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
  25. Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, June 2021.
  26. Chatgpt: Jack of all trades, master of none. Information Fusion, page 101861, 2023.
  27. S3eval: A synthetic, scalable, systematic evaluation suite for large language models. arXiv preprint arXiv:2310.15147, 2023.
  28. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023a.
  29. Alpacaeval: An automatic evaluator of instruction-following models, 2023b.
  30. Yucheng Li. An open source data contamination report for llama series models. arXiv preprint arXiv:2310.17589, 2023.
  31. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
  32. Haoxiong Liu and Andrew Chi-Chih Yao. Augmenting math word problems via iterative question composing. arXiv preprint arXiv:2401.09003, 2024.
  33. Brian Lovin. Gpt-4 performs significantly worse on coding problems not in its training data. https://brianlovin.com/hn/35297067, 2023.
  34. Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351–10367, 2021.
  35. Robert K Merton. The matthew effect in science: The reward and communication systems of science are considered. Science, 159(3810):56–63, 1968.
  36. MistralAITeam. Mixtral-8x7b-v0.1. https://huggingface.co/mistralai/Mixtral-8x7B-v0.1, 2023.
  37. OpenAI. ChatGPT. https://chat.openai.com, 2023a.
  38. OpenAI. Gpt-4 technical report, 2023b.
  39. Proving test set contamination in black box language models. arXiv preprint arXiv:2310.17623, 2023.
  40. Karl Pearson. The history and theory of correlation. Biometrika Office, 1920.
  41. Introduction to psychometric theory. Routledge, 2011.
  42. Adaptive testing and debugging of nlp models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3253–3267, 2022.
  43. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics.
  44. Are emergent abilities of large language models a mirage? In NeurIPS, 2023.
  45. Significant-Gravitas. Autogpt. https://github.com/Significant-Gravitas/AutoGPT, 2023.
  46. C Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.
  47. Charles Spearman. "General intelligence" objectively determined and measured. 1961.
  48. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  49. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
  50. Deeptest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th international conference on software engineering, pages 303–314, 2018.
  51. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  52. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023.
  53. Pandalm: An automatic evaluation benchmark for llm instruction tuning optimization. 2024.
  54. Self-instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
  55. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  56. Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341, 2023.
  57. Glue-x: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073, 2022.
  58. Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023.
  59. Metamath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
  60. Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
  61. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
  62. Don’t make your llm an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964, 2023.
  63. Dyval: Graph-informed dynamic evaluation of large language models. arXiv preprint arXiv:2309.17167, 2023a.
  64. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023b.
  65. Fool your (vision and) language model with embarrassingly simple permutations. arXiv preprint arXiv:2310.01651, 2023.
Authors (5)
  1. Kaijie Zhu (19 papers)
  2. Jindong Wang (150 papers)
  3. Qinlin Zhao (5 papers)
  4. Ruochen Xu (35 papers)
  5. Xing Xie (220 papers)
Citations (15)

Summary

  • The paper introduces the MPA framework to dynamically generate evaluation samples, uncovering an approximately 15.7% performance drop for GPT-4-Turbo on MMLU under MPA evaluation.
  • It employs probing and judge agents rooted in psychometrics to assess language understanding, problem solving, and domain knowledge.
  • The framework enables robust data augmentation that enhances LLM performance through fine-tuning on dynamically generated samples.

Dynamic Evaluation with Meta Probing Agents: A Critical Overview

The paper "DyVal 2: Dynamic Evaluation of LLMs by Meta Probing Agents" presents an innovative approach to evaluating LLMs through a novel dynamic evaluation protocol known as Meta Probing Agents (MPA). This framework is chiefly inspired by psychometrics and aims to address two significant challenges in LLM evaluation: the problem of data contamination and the need for a multifaceted analysis of model capabilities.

Core Contributions and Methodology

The primary contribution of the paper is the MPA framework, which distinguishes itself from traditional evaluation methods by dynamically generating evaluation samples. Static benchmarks are susceptible to data contamination, since their test items can leak into training corpora and inflate scores; by regenerating items on the fly, MPA supports a more versatile and comprehensive analysis. This flexibility is crucial for an accurate assessment of LLMs, whose skill sets remain opaque given their scale and the breadth of their training data.

MPA operates through two central components: probing agents and judge agents. The probing agents, based on various psychometrically inspired principles, transform existing evaluation problems into new ones, focusing on core cognitive abilities such as language understanding, problem-solving, and domain knowledge. The judge agents, in turn, validate this transformation to ensure the new evaluations maintain consistency with the original tasks. This agent-based design allows for nuanced benchmarking where the LLMs' multifaceted cognitive capabilities can be assessed and analyzed.
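
The probe-then-judge loop can be illustrated with a minimal sketch. The prompts, retry policy, and the generic `llm(prompt) -> str` callable below are assumptions made for exposition, not the authors' implementation (which lives in the promptbench repository):

```python
# Minimal sketch of the probe-then-judge pattern (illustrative, not the
# paper's code). Assumes Python 3.10+ and a generic llm(prompt) -> str callable.

def mpa_transform(question: str, options: list[str], llm, principle: str,
                  max_retries: int = 3) -> str | None:
    """Probing agent rewrites a benchmark item under one cognitive principle;
    judge agent checks the rewrite stays consistent with the original."""
    probe_prompt = (
        f"Rewrite this multiple-choice question according to the principle "
        f"'{principle}', keeping the correct answer unchanged.\n"
        f"Question: {question}\nOptions: {options}"
    )
    for _ in range(max_retries):
        candidate = llm(probe_prompt)  # probing agent proposes a new item
        verdict = llm(                 # judge agent validates the transformation
            "Does the rewritten item test the same knowledge and keep the same "
            "correct answer as the original? Answer yes or no.\n"
            f"Original: {question} {options}\nRewritten: {candidate}"
        )
        if verdict.strip().lower().startswith("yes"):
            return candidate
    return None  # discard items the judge never accepts
```

In this sketch the judge only returns a yes/no verdict; a fuller pipeline would also track how answer keys move when options are permuted or extended.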

Empirical Findings

The paper reports empirical results obtained from evaluating several prominent LLMs, including GPT-4-Turbo, GPT-3.5-Turbo, and Gemini-Pro, against both traditional benchmarks and their MPA-generated counterparts. Notably, the evaluation highlights a significant drop in performance on the dynamic benchmarks, suggesting that the performance of existing models on static benchmarks may be inflated by data contamination. For instance, GPT-4-Turbo's performance on the MMLU dataset decreases by approximately 15.7% under MPA evaluation, underscoring room for improvement.
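
To make the gap concrete, the toy calculation below shows how a static-versus-dynamic accuracy drop might be quantified; the numbers are placeholders, and whether the paper's 15.7% figure is an absolute or a relative drop is not specified here.

```python
# Toy illustration of quantifying a static-vs-dynamic accuracy gap.
# The numbers below are placeholders, not figures from the paper.
static_acc = 0.86   # accuracy on the original (static) benchmark
dynamic_acc = 0.70  # accuracy on the MPA-transformed benchmark

absolute_drop = static_acc - dynamic_acc    # drop in percentage points
relative_drop = absolute_drop / static_acc  # drop relative to the static score
print(f"absolute: {absolute_drop:.1%}, relative: {relative_drop:.1%}")
```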

The authors further analyzed the individual probing principles and found that language understanding and problem solving play pivotal roles in the models' performance decline. When the principles were applied in various combinations, more complex configurations produced broader performance degradation across models.

Theoretical and Practical Implications

The theoretical implications of the paper are significant: it contributes to our understanding of the inherent structure of LLMs' cognitive abilities, which, according to the findings, exhibit strong internal correlations. The transparency promoted by such evaluations encourages more refined architectures for future LLMs and highlights an implicit 'Matthew effect': larger models exhibit stronger correlations among these basic abilities.
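
The kind of correlation analysis behind this claim can be sketched as follows. The per-task ability scores are made up, and `statistics.correlation` (Pearson r, Python 3.10+) stands in for whichever measure is used; the paper's reference list cites Pearson, Spearman, and Kendall statistics.

```python
# Illustrative sketch (not the paper's analysis code): pairwise correlations
# between ability-specific scores for one model across tasks. All numbers
# are made up for exposition.
from itertools import combinations
from statistics import correlation  # Pearson r (Python 3.10+)

# Per-task scores for one model, keyed by basic cognitive ability.
scores_by_ability = {
    "language_understanding": [0.62, 0.71, 0.55, 0.80, 0.67],
    "problem_solving":        [0.58, 0.69, 0.50, 0.77, 0.63],
    "domain_knowledge":       [0.70, 0.75, 0.61, 0.84, 0.72],
}

for a, b in combinations(scores_by_ability, 2):
    r = correlation(scores_by_ability[a], scores_by_ability[b])
    print(f"Pearson r({a}, {b}) = {r:.2f}")

# Repeating this per model and comparing r against model size is one way to
# surface the Matthew-effect pattern (larger models -> stronger correlations).
```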

Practically, MPA serves not only as an evaluation tool but also as a data augmentation approach, demonstrated in the paper by fine-tuning GPT-3.5-Turbo. Training data derived through MPA led to improved model performance, indicating that MPA can help build more robust training sets for future model iterations.
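
As a rough sketch of how MPA-generated items could feed such a fine-tuning pipeline, the snippet below writes (question, options, answer) triples to a JSONL file in a common chat fine-tuning format; the file layout, field names, and helper function are assumptions, not the paper's actual pipeline.

```python
# Hypothetical helper: serialize MPA-generated items as chat-style JSONL
# for supervised fine-tuning. Format details are assumptions, not the
# paper's pipeline.
import json

def write_finetune_file(mpa_items, path="mpa_augmented.jsonl"):
    with open(path, "w", encoding="utf-8") as f:
        for question, options, answer in mpa_items:
            record = {
                "messages": [
                    {"role": "user",
                     "content": f"{question}\nOptions: {', '.join(options)}"},
                    {"role": "assistant", "content": answer},
                ]
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")

# usage: write_finetune_file([("2+2=?", ["3", "4", "5"], "4")])
```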

Future Directions and Limitations

Future research should explore incorporating a wider array of evaluation tasks to provide an even more comprehensive understanding of LLMs' capabilities. Additionally, while the utilization of sophisticated agents contributes significantly to MPA's robustness, there remains room to improve the alignment between generated and original questions to minimize inconsistencies.

In conclusion, MPA provides a substantial advancement in the evaluation of LLMs, positioning it as a critical tool for both model assessment and development. By aligning the evaluation process more closely with human cognitive theories, this paper paves the way for a deeper, more structured exploration of LLM capabilities, thereby fostering the more nuanced development of future AI systems.