Large Language Model Evaluation Via Multi AI Agents: Preliminary results (2404.01023v1)
Abstract: As LLMs have become integral to both research and daily operations, their rigorous evaluation is crucial, not only for individual tasks but also for understanding their societal impact and potential risks. Despite extensive efforts to examine LLMs from various perspectives, there is a noticeable lack of multi-agent AI models designed specifically to evaluate the performance of different LLMs. To address this gap, we introduce a novel multi-agent AI model that assesses and compares the performance of various LLMs. Our model consists of eight distinct AI agents, each responsible for retrieving code for a common high-level description from a different advanced LLM, including GPT-3.5, GPT-3.5 Turbo, GPT-4, GPT-4 Turbo, Google Bard, LLaMA, and Hugging Face; each agent calls the API of its assigned LLM to obtain the code. In addition, we developed a verification agent tasked with the critical role of evaluating the code generated by its counterparts. We integrate the HumanEval benchmark into this verification agent to assess the generated code, providing insight into each model's capabilities and efficiency. Our initial results indicate that GPT-3.5 Turbo outperforms the other models, and this preliminary analysis serves as a side-by-side benchmark of their performance. In future work, we aim to refine the evaluation by incorporating the Mostly Basic Programming Problems (MBPP) benchmark, and to share the developed model with twenty practitioners from various backgrounds to test it and collect their feedback for further improvement.
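To make the described pipeline concrete, the sketch below shows one possible shape of the two agent roles in the abstract: code-retrieval agents that call an LLM API with a common task description, and a verification agent that checks the returned code against a HumanEval-style test. This is a minimal illustration, assuming the `openai` Python client (v1+); the model names, task description, and toy test are hypothetical stand-ins, and the paper's actual prompts and the full HumanEval harness are not reproduced here.

```python
# Minimal sketch of the retrieval/verification loop (illustrative only).
# Assumes openai>=1.0 with OPENAI_API_KEY set; model names are examples.
from openai import OpenAI

client = OpenAI()

# One retrieval agent per model under evaluation (illustrative subset).
MODELS = ["gpt-3.5-turbo", "gpt-4"]

def retrieve_code(model: str, description: str) -> str:
    """Ask one LLM to produce code for a high-level task description."""
    resp = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": "Return only a Python function, no prose."},
            {"role": "user", "content": description},
        ],
    )
    # In practice the reply must be cleaned (e.g., markdown fences stripped)
    # before it can be executed; omitted here for brevity.
    return resp.choices[0].message.content

def verify(code: str, test: str) -> bool:
    """Simplified verification agent: run the candidate against its test.

    A stand-in for the HumanEval harness (openai/human-eval), which executes
    each candidate in a sandbox and reports functional correctness.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # define the candidate function
        exec(test, namespace)  # run assert-based checks against it
        return True
    except Exception:
        return False

if __name__ == "__main__":
    description = "Write a Python function add(a, b) that returns a + b."
    test = "assert add(2, 3) == 5"  # toy HumanEval-style check
    for model in MODELS:
        candidate = retrieve_code(model, description)
        print(model, "passed" if verify(candidate, test) else "failed")
```

In the full setup, the verification agent would iterate over all 164 HumanEval problems and aggregate a pass rate per model, yielding the per-model scores behind the preliminary comparison reported in the abstract.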
Authors: Zeeshan Rasheed, Muhammad Waseem, Kari Systä, Pekka Abrahamsson