Self-Reflection in LLM Agents: Effects on Problem-Solving Performance (2405.06682v3)
Abstract: In this study, we investigated the effects of self-reflection in LLMs on problem-solving performance. We instructed nine popular LLMs to answer a series of multiple-choice questions to provide a performance baseline. For each incorrectly answered question, we instructed eight types of self-reflecting LLM agents to reflect on their mistakes and provide themselves with guidance to improve problem-solving. Then, using this guidance, each self-reflecting agent attempted to re-answer the same questions. Our results indicate that LLM agents are able to significantly improve their problem-solving performance through self-reflection ($p < 0.001$). In addition, we compared the various types of self-reflection to determine their individual contribution to performance. All code and data are available on GitHub at https://github.com/matthewrenze/self-reflection
- T. Kojima, S. S. Gu, M. Reid, Y. Matsuo, and Y. Iwasawa, “Large language models are zero-shot reasoners,” in Advances in Neural Information Processing Systems, vol. 35, 5 2022, pp. 22 199–22 213. [Online]. Available: https://arxiv.org/abs/2205.11916
- J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” arXiv, 1 2022. [Online]. Available: https://arxiv.org/abs/2201.11903
- Y. Zhou, A. I. Muresanu, Z. Han, K. Paster, S. Pitis, H. Chan, and J. Ba, “Large language models are human-level prompt engineers,” The Eleventh International Conference on Learning Representations, 11 2023. [Online]. Available: https://arxiv.org/abs/2211.01910
- Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. Bang, D. Chen, H. S. Chan, W. Dai, A. Madotto, and P. Fung, “Survey of hallucination in natural language generation,” ACM Computing Surveys, vol. 55, 2 2022. [Online]. Available: http://dx.doi.org/10.1145/3571730
- M. U. Hadi, qasem al tashi, R. Qureshi, A. Shah, amgad muneer, M. Irfan, A. Zafar, M. B. Shaikh, N. Akhtar, J. Wu, S. Mirjalili, Q. Al-Tashi, and A. Muneer, “A survey on large language models: Applications, challenges, limitations, and practical usage,” Authorea Preprints, 10 2023. [Online]. Available: https://doi.org/10.36227/techrxiv.23589741.v1
- A. Payandeh, D. Pluth, J. Hosier, X. Xiao, and V. K. Gurbani, “How susceptible are llms to logical fallacies?” arXiv, 8 2023. [Online]. Available: https://arxiv.org/abs/2308.09853v1
- J. Huang, X. Chen, S. Mishra, H. S. Zheng, A. W. Yu, X. Song, and D. Zhou, “Large language models cannot self-correct reasoning yet,” arXiv, 10 2023. [Online]. Available: https://arxiv.org/abs/2310.01798
- Z. Ji, T. Yu, Y. Xu, N. Lee, E. Ishii, and P. Fung, “Towards mitigating hallucination in large language models via self-reflection,” arXiv, 10 2023. [Online]. Available: https://arxiv.org/abs/2310.06271
- S. Minaee, T. Mikolov, N. Nikzad, M. Chenaghlu, R. Socher, X. Amatriain, and J. Gao, “Large language models: A survey,” arXiv, 2 2024. [Online]. Available: https://arxiv.org/abs/2402.06196
- N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” arXiv, 3 2023. [Online]. Available: https://arxiv.org/abs/2303.11366
- L. Pan, M. Saxon, W. Xu, D. Nathani, X. Wang, and W. Y. Wang, “Automatically correcting large language models: Surveying the landscape of diverse self-correction strategies,” arXiv, 8 2023. [Online]. Available: https://arxiv.org/abs/2308.03188
- A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark, “Self-refine: Iterative refinement with self-feedback,” arXiv, 3 2023. [Online]. Available: https://arxiv.org/abs/2303.17651v2
- J. Toy, J. MacAdam, and P. Tabor, “Metacognition is all you need? using introspection in generative agents to improve goal-directed behavior,” arXiv, 1 2024. [Online]. Available: https://arxiv.org/abs/2401.10910
- Y. Wang and Y. Zhao, “Metacognitive prompting improves understanding in large language models,” arXiv, 8 2023. [Online]. Available: https://arxiv.org/abs/2308.05342
- A. Asai, Z. Wu, Y. Wang, A. Sil, and H. Hajishirzi, “Self-rag: Learning to retrieve, generate, and critique through self-reflection,” arXiv, 10 2023. [Online]. Available: https://arxiv.org/abs/2310.11511v1
- G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, A. Anandkumar, U. Austin, and U. Madison, “Voyager: An open-ended embodied agent with large language models,” arXiv, 5 2023. [Online]. Available: https://arxiv.org/abs/2305.16291v2
- Z. Xi, W. Chen, X. Guo, W. He, Y. Ding, B. Hong, M. Zhang, J. Wang, S. Jin, E. Zhou, R. Zheng, X. Fan, X. Wang, L. Xiong, Y. Zhou, W. Wang, C. Jiang, Y. Zou, X. Liu, Z. Yin, S. Dou, R. Weng, W. Cheng, Q. Zhang, W. Qin, Y. Zheng, X. Qiu, X. Huang, and T. Gui, “The rise and potential of large language model based agents: A survey,” arXiv, 9 2023. [Online]. Available: https://arxiv.org/abs/2309.07864v3
- S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “React: Synergizing reasoning and acting in language models,” arXiv, 10 2022. [Online]. Available: https://arxiv.org/abs/2210.03629
- N. Miao, Y. W. Teh, and T. Rainforth, “Selfcheck: Using llms to zero-shot check their own step-by-step reasoning,” arXiv, 8 2023. [Online]. Available: https://arxiv.org/abs/2308.00436v3
- R. Nakano, J. Hilton, S. Balaji, J. Wu, L. Ouyang, C. Kim, C. Hesse, S. Jain, V. Kosaraju, W. Saunders, X. Jiang, K. Cobbe, T. Eloundou, G. Krueger, K. Button, M. Knight, B. Chess, and J. S. Openai, “Webgpt: Browser-assisted question-answering with human feedback,” arXiv, 12 2021. [Online]. Available: https://arxiv.org/abs/2112.09332v3
- T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv, 2 2023. [Online]. Available: https://arxiv.org/abs/2302.04761v1
- P. Lewis and et al., “Retrieval-augmented generation for knowledge-intensive nlp tasks,” arXiv, 5 2020. [Online]. Available: https://arxiv.org/abs/2005.11401
- Y. Gao, Y. Xiong, X. Gao, K. Jia, J. Pan, Y. Bi, Y. Dai, J. Sun, M. Wang, and H. Wang, “Retrieval-augmented generation for large language models: A survey,” arXiv, 12 2023. [Online]. Available: https://arxiv.org/abs/2312.10997v5
- W. Zhong, L. Guo, Q. Gao, H. Ye, and Y. Wang, “Memorybank: Enhancing large language models with long-term memory,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, pp. 19 724–19 731, 5 2023. [Online]. Available: https://arxiv.org/abs/2305.10250v3
- Z. Wang, A. Liu, H. Lin, J. Li, X. Ma, and Y. Liang, “Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation,” arXiv, 3 2024. [Online]. Available: https://arxiv.org/abs/2403.05313
- G. Tyen, H. Mansoor, V. Cărbune, P. Chen, and T. Mak, “Llms cannot find reasoning errors, but can correct them!” arXiv, 11 2023. [Online]. Available: https://arxiv.org/abs/2311.08516v2
- P. Clark, I. Cowhey, O. Etzioni, T. Khot, A. Sabharwal, C. Schoenick, and O. Tafjord, “Think you have solved question answering? try arc, the ai2 reasoning challenge,” ArXiv, 3 2018. [Online]. Available: https://arxiv.org/abs/1803.05457
- R. Zellers, A. Holtzman, Y. Bisk, A. Farhadi, and Y. Choi, “Hellaswag: Can a machine really finish your sentence?” in Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. [Online]. Available: https://arxiv.org/abs/1905.07830
- A. Pal, L. K. Umapathi, and M. Sankarasubbu, “Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering,” in Proceedings of the Conference on Health, Inference, and Learning. PMLR, 2022, pp. 248–260. [Online]. Available: https://proceedings.mlr.press/v174/pal22a.html
- W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan, “Agieval: A human-centric benchmark for evaluating foundation models,” ArXiv, 4 2023. [Online]. Available: https://arxiv.org/abs/2304.06364
- J. Liu, L. Cui, H. Liu, D. Huang, Y. Wang, and Y. Zhang, “Logiqa: A challenge dataset for machine reading comprehension with logical reasoning,” in International Joint Conference on Artificial Intelligence, 2020. [Online]. Available: https://arxiv.org/abs/2007.08124
- S. Wang, Z. Liu, W. Zhong, M. Zhou, Z. Wei, Z. Chen, and N. Duan, “From lsat: The progress and challenges of complex reasoning,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 30, pp. 2201–2216, 8 2021. [Online]. Available: https://doi.org/10.1109/TASLP.2022.3164218
- Anthropic, “Introducing the next generation of claude anthropic,” 2024. [Online]. Available: https://www.anthropic.com/news/claude-3-family
- ——, “The claude 3 model family: Opus, sonnet, haiku,” 2024. [Online]. Available: https://www.anthropic.com/claude-3-model-card
- Cohere, “Command r+,” 2024. [Online]. Available: https://docs.cohere.com/docs/command-r-plus
- ——, “Model card for c4ai command r+,” 2024. [Online]. Available: https://huggingface.co/CohereForAI/c4ai-command-r-plus
- S. Pichai and D. Hassabis, “Introducing gemini: Google’s most capable ai model yet,” 2023. [Online]. Available: https://blog.google/technology/ai/google-gemini-ai/
- Gemini-Team, “Gemini: A family of highly capable multimodal models,” arXiv, 12 2023.
- S. Pichai and D. Hassabis, “Introducing gemini 1.5, google’s next-generation ai model,” 2024. [Online]. Available: https://blog.google/technology/ai/google-gemini-next-generation-model-february-2024/
- Gemini-Team, “Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,” 2024. [Online]. Available: https://arxiv.org/abs/2403.05530
- OpenAI, “Introducing chatgpt,” 11 2022. [Online]. Available: https://openai.com/blog/chatgpt
- ——, “Models - openai api.” [Online]. Available: https://platform.openai.com/docs/models/gpt-3-5-turbo
- ——, “Gpt-4,” 3 2023. [Online]. Available: https://openai.com/research/gpt-4
- ——, “Gpt-4 technical report,” arXiv, 3 2023. [Online]. Available: https://arxiv.org/abs/2303.08774
- Meta, “Meta and microsoft introduce the next generation of llama | meta,” 2023. [Online]. Available: https://about.meta.com/news/2023/07/llama-2/
- H. Touvron, L. Martin, K. Stone, P. Albert, A. Almahairi, Y. Babaei, N. Bashlykov, S. Batra, P. Bhargava, S. Bhosale, D. Bikel, L. Blecher, C. C. Ferrer, M. Chen, G. Cucurull, D. Esiobu, J. Fernandes, J. Fu, W. Fu, B. Fuller, C. Gao, V. Goswami, N. Goyal, A. Hartshorn, S. Hosseini, R. Hou, H. Inan, M. Kardas, V. Kerkez, M. Khabsa, I. Kloumann, A. Korenev, P. S. Koura, M.-A. Lachaux, T. Lavril, J. Lee, D. Liskovich, Y. Lu, Y. Mao, X. Martinet, T. Mihaylov, P. Mishra, I. Molybog, Y. Nie, A. Poulton, J. Reizenstein, R. Rungta, K. Saladi, A. Schelten, R. Silva, E. M. Smith, R. Subramanian, X. E. Tan, B. Tang, R. Taylor, A. Williams, J. X. Kuan, P. Xu, Z. Yan, I. Zarov, Y. Zhang, A. Fan, M. Kambadur, S. Narang, A. Rodriguez, R. Stojnic, S. Edunov, and T. Scialom, “Llama 2: Open foundation and fine-tuned chat models,” arXiv, 7 2023. [Online]. Available: https://arxiv.org/abs/2307.09288
- Mistral-AI-Team, “Au large | mistral ai | frontier ai in your hands,” 2024. [Online]. Available: https://mistral.ai/news/mistral-large/
- S. M. Bsharat, A. Myrzakhan, and Z. Shen, “Principled instructions are all you need for questioning llama-1/2, gpt-3.5/4,” arXiv, 12 2023. [Online]. Available: https://arxiv.org/abs/2312.16171
- M. Renze and E. Guven, “The benefits of a concise chain of thought on problem-solving in large language models,” arXiv, 1 2024. [Online]. Available: https://arxiv.org/abs/2401.05618v1
- ——, “The effect of sampling temperature on problem solving in large language models,” arXiv, 2 2024. [Online]. Available: https://arxiv.org/abs/2402.05201v1
- Q. McNemar, “Note on the sampling error of the difference between correlated proportions or percentages,” Psychometrika, vol. 12, pp. 153–157, 6 1947. [Online]. Available: https://doi.org/10.1007/BF02295996
- Matthew Renze (3 papers)
- Erhan Guven (7 papers)