Using Large Language Models to Assess Tutors' Performance in Reacting to Students Making Math Errors (2401.03238v1)

Published 6 Jan 2024 in cs.HC, cs.AI, and cs.CY

Abstract: Research suggests that tutors should adopt a strategic approach when addressing math errors made by low-efficacy students. Rather than drawing direct attention to the error, tutors should guide the students to identify and correct their mistakes on their own. While tutor lessons have introduced this pedagogical skill, human evaluation of tutors applying this strategy is arduous and time-consuming. LLMs show promise in providing real-time assessment to tutors during their actual tutoring sessions, yet little is known regarding their accuracy in this context. In this study, we investigate the capacity of generative AI to evaluate real-life tutors' performance in responding to students making math errors. By analyzing 50 real-life tutoring dialogues, we find both GPT-3.5-Turbo and GPT-4 demonstrate proficiency in assessing the criteria related to reacting to students making errors. However, both models exhibit limitations in recognizing instances where the student made an error. Notably, GPT-4 tends to overidentify instances of students making errors, often attributing student uncertainty or inferring potential errors where human evaluators did not. Future work will focus on enhancing generalizability by assessing a larger dataset of dialogues and evaluating learning transfer. Specifically, we will analyze the performance of tutors in real-life scenarios when responding to students' math errors before and after lesson completion on this crucial tutoring skill.
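The paper's own prompts are not reproduced on this page, but the assessment task it describes — asking an LLM to judge (a) whether the student made a math error in an exchange and (b) whether the tutor guided the student to find the mistake rather than pointing it out directly — can be sketched with the OpenAI chat API. The rubric wording, JSON output format, and sample exchange below are hypothetical illustrations, not the authors' materials; only the model names come from the abstract.

```python
# Hypothetical sketch of LLM-based assessment of a tutor's reaction to a
# student math error. The rubric, output format, and dialogue are
# illustrative assumptions, not the paper's actual evaluation prompts.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

RUBRIC = (
    "You are grading a tutor's response to a student in a math tutoring "
    "dialogue. Answer two questions:\n"
    "1. Did the student make an error in this exchange? (yes/no)\n"
    "2. If so, did the tutor guide the student to identify and correct the "
    "mistake on their own, rather than pointing it out directly? (yes/no/n-a)\n"
    'Reply as JSON: {"student_error": "...", "indirect_response": "..."}'
)

def assess_dialogue(dialogue: str, model: str = "gpt-4") -> str:
    """Ask the model to grade one tutor-student exchange against the rubric."""
    response = client.chat.completions.create(
        model=model,        # "gpt-3.5-turbo" or "gpt-4", per the abstract
        temperature=0,      # deterministic grading
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": dialogue},
        ],
    )
    return response.choices[0].message.content

# Hypothetical exchange:
print(assess_dialogue(
    "Student: 3/4 + 1/4 = 4/8\n"
    "Tutor: Walk me through how you added those fractions. What should "
    "happen to the denominator when the denominators already match?"
))
```

Note that the abstract's caveat applies directly to a setup like this: GPT-4 tended to over-identify student errors, so the first judgment (whether an error occurred at all) is the weak link and would benefit from human spot-checking.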

Authors (5)
  1. Sanjit Kakarla (5 papers)
  2. Danielle Thomas (2 papers)
  3. Jionghao Lin (36 papers)
  4. Shivang Gupta (9 papers)
  5. Kenneth R. Koedinger (21 papers)
Citations (4)
