Evaluating Large Language Models on the GMAT: Implications for the Future of Business Education (2401.02985v1)
Abstract: The rapid evolution of AI, especially LLMs and generative AI, has opened new avenues for application across various fields, yet its role in business education remains underexplored. This study introduces the first benchmark to assess the performance of seven major LLMs on the GMAT, a key exam in the admission process for graduate business programs: OpenAI's models (GPT-3.5 Turbo, GPT-4, and GPT-4 Turbo), Google's models (PaLM 2 and Gemini 1.0 Pro), and Anthropic's models (Claude 2 and Claude 2.1). Our analysis shows that most LLMs outperform human candidates, with GPT-4 Turbo not only outperforming the other models but also surpassing the average scores of graduate students at top business schools. Through a case study, this research examines GPT-4 Turbo's ability to explain answers, evaluate responses, identify errors, tailor instructions, and generate alternative scenarios. The latest LLM versions (GPT-4 Turbo, Claude 2.1, and Gemini 1.0 Pro) show marked improvements in reasoning tasks compared to their predecessors, underscoring their potential for complex problem-solving. While AI's promise in education, assessment, and tutoring is clear, challenges remain. Our study sheds light on LLMs' academic potential and emphasizes the need for careful development and application of AI in education. As AI technology advances, it is imperative to establish frameworks and protocols for AI interaction, verify the accuracy of AI-generated content, ensure worldwide access for diverse learners, and create an educational environment where AI supports human expertise. This research sets the stage for further exploration into the responsible use of AI to enrich educational experiences and improve exam preparation and assessment methods.
Authors: Vahid Ashrafimoghari, Necdet Gürkan, Jordan W. Suchow