
Automated Evaluation of Classroom Instructional Support with LLMs and BoWs: Connecting Global Predictions to Specific Feedback (2310.01132v4)

Published 2 Oct 2023 in cs.CL and cs.AI

Abstract: With the aim of providing teachers with more specific, frequent, and actionable feedback about their teaching, we explore how LLMs can be used to estimate "Instructional Support" domain scores of the CLassroom Assessment Scoring System (CLASS), a widely used observation protocol. We design a machine learning architecture that uses zero-shot prompting of Meta's Llama2 and/or a classic Bag of Words (BoW) model to classify individual utterances of teachers' speech (transcribed automatically using OpenAI's Whisper) for the presence of Instructional Support. These utterance-level judgments are then aggregated over a 15-min observation session to estimate a global CLASS score. Experiments on two CLASS-coded datasets of toddler and pre-kindergarten classrooms indicate that (1) automatic CLASS Instructional Support estimation accuracy using the proposed method (Pearson $R$ up to $0.48$) approaches human inter-rater reliability (up to $R=0.55$); (2) LLMs generally yield slightly greater accuracy than BoW for this task, though the best models often combined features extracted from both LLM and BoW; and (3) for classifying individual utterances, automated methods still have room for improvement compared to human-level judgments. Finally, (4) we illustrate how the model's outputs can be visualized at the utterance level to provide teachers with explainable feedback on which utterances were most positively or negatively correlated with specific CLASS dimensions.
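The aggregation and evaluation steps described in the abstract can be illustrated with a minimal Python sketch. It assumes that each utterance has already received a binary judgment (1 if Instructional Support is present, 0 otherwise) and that a session-level feature is simply the mean of those judgments. The abstract does not specify the aggregation rule, so the mean is an assumption made here for illustration. The function names `aggregate_session` and `pearson_r` and the toy data are hypothetical and are not taken from the paper.

```python
from math import sqrt
from statistics import mean


def aggregate_session(utterance_judgments):
    """Collapse per-utterance judgments (0/1 or probabilities) into one
    session-level feature. Averaging is an assumed, illustrative rule."""
    if not utterance_judgments:
        return 0.0
    return mean(utterance_judgments)


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


# Toy, made-up data: one list of utterance judgments per 15-min session,
# paired with a made-up human-coded score for that session.
sessions = [
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
]
human_scores = [2.0, 3.5, 5.0]

features = [aggregate_session(s) for s in sessions]
r = pearson_r(features, human_scores)
print(f"session features: {features}, Pearson R: {r:.2f}")
```

In a real setting the session-level feature would typically feed a fitted regression model rather than being correlated with the human scores directly; the sketch only shows how utterance-level outputs connect to a session-level agreement statistic such as Pearson $R$.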
