Dynamic Evaluation of Large Language Models by Meta Probing Agents (2402.14865v2)
Abstract: Evaluation of large language models (LLMs) has raised great concerns in the community due to the issue of data contamination. Existing work designed evaluation protocols using well-defined algorithms for specific tasks, which cannot easily be extended to diverse scenarios. Moreover, current evaluation benchmarks only provide overall results and cannot support fine-grained, multifaceted analysis of LLMs' abilities. In this paper, we propose Meta Probing Agents (MPA), a general dynamic evaluation protocol inspired by psychometrics to evaluate LLMs. MPA is the key component of DyVal 2, which naturally extends the previous DyVal (Zhu et al., 2023a). MPA designs probing and judging agents that automatically transform an original evaluation problem into a new one, following psychometric theory on three basic cognitive abilities: language understanding, problem solving, and domain knowledge. These basic abilities are also dynamically configurable, allowing multifaceted analysis. We conducted extensive evaluations using MPA and found that most LLMs perform worse on the transformed problems than on the original benchmarks, indicating room for improvement. Our multifaceted analysis demonstrated strong correlations among the basic abilities and an implicit Matthew effect with respect to model size, i.e., larger models exhibit stronger correlations among these abilities. MPA can also be used as a data augmentation approach to enhance LLMs. Code is available at: https://github.com/microsoft/promptbench.
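The abstract describes MPA's core mechanism: a probing agent rewrites an existing benchmark item along one of the three basic abilities, and a judging agent verifies that the rewrite preserves the original answer and the targeted ability. Below is a minimal sketch of such a probe-then-judge loop, assuming a generic `llm(prompt) -> str` callable; the prompt wording, the `probe_and_judge` helper, and the `ABILITIES` names are illustrative assumptions, not the released promptbench implementation.

```python
from typing import Callable

# Three basic cognitive abilities that MPA configures dynamically.
ABILITIES = ("language understanding", "problem solving", "domain knowledge")

# Hypothetical prompt templates; the paper's actual probing/judging prompts differ.
PROBE_TEMPLATE = (
    "Rewrite the following benchmark question so that it still tests the same "
    "{ability}, but with different surface wording.\n"
    "Question: {question}\nRewritten question:"
)
JUDGE_TEMPLATE = (
    "Original question: {question}\n"
    "Rewritten question: {candidate}\n"
    "Answer 'yes' if the rewrite keeps the original answer '{answer}' and still "
    "tests the same {ability}; otherwise answer 'no'."
)


def probe_and_judge(
    llm: Callable[[str], str],
    question: str,
    answer: str,
    ability: str = ABILITIES[1],  # "problem solving"
    max_retries: int = 3,
) -> str:
    """Return a transformed evaluation item, kept only if the judging agent
    confirms that the answer and the targeted ability are preserved."""
    for _ in range(max_retries):
        # Probing agent: generate a candidate rewrite of the original item.
        candidate = llm(PROBE_TEMPLATE.format(ability=ability, question=question))
        # Judging agent: check that answer and targeted ability are unchanged.
        verdict = llm(JUDGE_TEMPLATE.format(
            question=question, candidate=candidate, answer=answer, ability=ability
        ))
        if verdict.strip().lower().startswith("yes"):
            return candidate
    return question  # fall back to the original item if no rewrite passes the judge


# Hypothetical usage with any llm(prompt) -> str wrapper:
#   new_q = probe_and_judge(my_llm, "What is 12 * 7?", "84")
```

Running this loop over a whole benchmark yields a dynamically regenerated test set, which is how MPA both mitigates contamination and, by switching the targeted ability, supports the multifaceted analysis described above.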
- 01-ai. Yi: A series of large language models. https://github.com/01-ai/Yi, 2024.
- Benchmarking foundation models with language-model-as-an-examiner. arXiv preprint arXiv:2306.04181, 2023.
- BIG-bench authors. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
- On the dangers of stochastic parrots: Can language models be too big? FAccT 2021, pages 610–623, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383097.
- The reversal curse: LLMs trained on "A is B" fail to learn "B is A". arXiv preprint arXiv:2309.12288, 2023.
- Emergent and predictable memorization in large language models. arXiv preprint arXiv:2304.11158, 2023.
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- Revealing the structure of language model capabilities. arXiv preprint arXiv:2306.10062, 2023.
- Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. arXiv preprint arXiv:2312.09390, 2023.
- A survey on evaluation of large language models. arXiv preprint arXiv:2307.03109, 2023.
- Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.
- Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
- AlpacaFarm: A simulation framework for methods that learn from human feedback. arXiv preprint arXiv:2305.14387, 2023.
- NPHardEval: Dynamic benchmark on reasoning ability of large language models via complexity classes. arXiv preprint arXiv:2312.14890, 2023.
- The devil is in the errors: Leveraging large language models for fine-grained machine translation evaluation. arXiv preprint arXiv:2308.07286, 2023.
- Adaptive testing of computer vision models. arXiv preprint arXiv:2212.02774, 2022.
- Gemini Team. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- Data contamination quiz: A tool to detect and estimate contamination in large language models. arXiv preprint arXiv:2311.06233, 2023a.
- Time travel in LLMs: Tracing data contamination in large language models. arXiv preprint arXiv:2308.08493, 2023b.
- Measuring massive multitask language understanding. In International Conference on Learning Representations, 2021.
- MetaGPT: Meta programming for multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
- HuggingFace. Open-source large language models leaderboard. https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard, 2023.
- Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938.
- Dynabench: Rethinking benchmarking in NLP. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4110–4124, June 2021.
- ChatGPT: Jack of all trades, master of none. Information Fusion, page 101861, 2023.
- S3Eval: A synthetic, scalable, systematic evaluation suite for large language models. arXiv preprint arXiv:2310.15147, 2023.
- AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 2023a.
- AlpacaEval: An automatic evaluator of instruction-following models, 2023b.
- Yucheng Li. An open source data contamination report for Llama series models. arXiv preprint arXiv:2310.17589, 2023.
- Holistic evaluation of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856.
- Haoxiong Liu and Andrew Chi-Chih Yao. Augmenting math word problems via iterative question composing. arXiv preprint arXiv:2401.09003, 2024.
- Brian Lovin. GPT-4 performs significantly worse on coding problems not in its training data. https://brianlovin.com/hn/35297067, 2023.
- Dynaboard: An evaluation-as-a-service platform for holistic next-generation benchmarking. Advances in Neural Information Processing Systems, 34:10351–10367, 2021.
- Robert K Merton. The Matthew effect in science: The reward and communication systems of science are considered. Science, 159(3810):56–63, 1968.
- Mistral AI Team. Mixtral-8x7B-v0.1. https://huggingface.co/mistralai/Mixtral-8x7B-v0.1, 2023.
- OpenAI. ChatGPT. https://chat.openai.com/chat, 2023a.
- OpenAI. GPT-4 technical report, 2023b.
- Proving test set contamination in black box language models. arXiv preprint arXiv:2310.17623, 2023.
- Karl Pearson. The history and theory of correlation. Biometrika Office, 1920.
- Introduction to psychometric theory. Routledge, 2011.
- Adaptive testing and debugging of NLP models. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3253–3267, 2022.
- Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, Online, July 2020. Association for Computational Linguistics.
- Are emergent abilities of large language models a mirage? In NeurIPS, 2023.
- Significant-Gravitas. AutoGPT. https://github.com/Significant-Gravitas/AutoGPT, 2023.
- C. Spearman. The proof and measurement of association between two things. The American Journal of Psychology, 15(1):72–101, 1904.
- Charles Spearman. "General intelligence," objectively determined and measured. 1961.
- Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
- Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022.
- DeepTest: Automated testing of deep-neural-network-driven autonomous cars. In Proceedings of the 40th International Conference on Software Engineering, pages 303–314, 2018.
- Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
- DecodingTrust: A comprehensive assessment of trustworthiness in GPT models. arXiv preprint arXiv:2306.11698, 2023.
- PandaLM: An automatic evaluation benchmark for LLM instruction tuning optimization. 2024.
- Self-Instruct: Aligning language model with self generated instructions. arXiv preprint arXiv:2212.10560, 2022.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
- Skywork: A more open bilingual foundation model. arXiv preprint arXiv:2310.19341, 2023.
- GLUE-X: Evaluating natural language understanding models from an out-of-distribution generalization perspective. arXiv preprint arXiv:2211.08073, 2022.
- Rethinking benchmark and contamination for language models with rephrased samples. arXiv preprint arXiv:2311.04850, 2023.
- MetaMath: Bootstrap your own mathematical questions for large language models. arXiv preprint arXiv:2309.12284, 2023.
- Self-rewarding language models. arXiv preprint arXiv:2401.10020, 2024.
- AGIEval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023.
- Don't make your LLM an evaluation benchmark cheater. arXiv preprint arXiv:2311.01964, 2023.
- DyVal: Graph-informed dynamic evaluation of large language models. arXiv preprint arXiv:2309.17167, 2023a.
- PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023b.
- Fool your (vision and) language model with embarrassingly simple permutations. arXiv preprint arXiv:2310.01651, 2023.
Authors: Kaijie Zhu, Jindong Wang, Qinlin Zhao, Ruochen Xu, Xing Xie