AI-TA: Towards an Intelligent Question-Answer Teaching Assistant using Open-Source LLMs (2311.02775v3)

Published 5 Nov 2023 in cs.LG, cs.AI, and cs.CL

Abstract: Responding to the thousands of student questions on online QA platforms each semester has a considerable human cost, particularly in computing courses with rapidly growing enrollments. To address the challenges of scalable and intelligent question-answering (QA), we introduce an innovative solution that leverages open-source LLMs from the LLaMA-2 family to ensure data privacy. Our approach combines augmentation techniques such as retrieval-augmented generation (RAG), supervised fine-tuning (SFT), and learning from human preference data using Direct Preference Optimization (DPO). Through extensive experimentation on a Piazza dataset from an introductory CS course, comprising 10,000 QA pairs and 1,500 pairs of preference data, we demonstrate a significant 30% improvement in the quality of answers, with RAG being a particularly impactful addition. Our contributions include the development of a novel architecture for educational QA, extensive evaluations of LLM performance utilizing both human assessments and LLM-based metrics, and insights into the challenges and future directions of educational data processing. This work paves the way for the development of AI-TA, an intelligent QA assistant customizable for courses with an online QA platform.
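The core of the pipeline described in the abstract is retrieval-augmented generation over course materials, with answers produced by an open-source LLaMA-2-family model. The sketch below illustrates that idea only: the encoder and backbone checkpoints, the toy course corpus, the prompt wording, and the `retrieve`/`answer` helpers are assumptions chosen for illustration, not the paper's actual configuration, and the SFT and DPO stages are omitted.

```python
# Minimal RAG sketch (illustrative, not the paper's exact setup):
# retrieve relevant course material, then condition a LLaMA-2-family
# model on it before answering a student question.
import torch
from sentence_transformers import SentenceTransformer, util
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical course corpus (e.g., lecture notes and past Piazza answers).
course_docs = [
    "Assignment 3 is due Friday at 11:59 pm; late days apply automatically.",
    "Use valgrind to detect memory leaks before submitting the C assignment.",
    "Office hours are held in Gates 104 on Tuesdays and Thursdays.",
]

# 1. Embed the corpus once with a sentence encoder (assumed encoder choice).
encoder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = encoder.encode(course_docs, convert_to_tensor=True)

def retrieve(question: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the question (cosine similarity)."""
    q_emb = encoder.encode(question, convert_to_tensor=True)
    scores = util.cos_sim(q_emb, doc_embeddings)[0]
    top = torch.topk(scores, k=min(k, len(course_docs))).indices.tolist()
    return [course_docs[i] for i in top]

# 2. Generate an answer conditioned on the retrieved context
#    (assumed open-source backbone; the paper fine-tunes its models further).
model_name = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

def answer(question: str) -> str:
    context = "\n".join(retrieve(question))
    prompt = (
        "You are a teaching assistant for an introductory CS course.\n"
        f"Course context:\n{context}\n\n"
        f"Student question: {question}\nAnswer:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=200, do_sample=False)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(answer("When is assignment 3 due?"))
```

In the paper's setting, the retrieved context would come from course materials and prior Piazza threads, and the generator would additionally be adapted with SFT on the course's QA pairs and with DPO on the preference data before being deployed as the assistant.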

Authors (4)
  1. Yann Hicke (10 papers)
  2. Anmol Agarwal (10 papers)
  3. Qianou Ma (7 papers)
  4. Paul Denny (67 papers)
Citations (15)