Language Models as Science Tutors (2402.11111v2)
Abstract: NLP has recently made exciting progress toward training language models (LMs) with strong scientific problem-solving skills. However, model development has not focused on real-life use-cases of LMs for science, including applications in education that require processing long scientific documents. To address this, we introduce TutorEval and TutorChat. TutorEval is a diverse question-answering benchmark consisting of questions about long chapters from STEM textbooks, written by experts. TutorEval helps measure the real-life usability of LMs as scientific assistants, and it is the first benchmark combining long contexts, free-form generation, and multi-disciplinary scientific knowledge. Moreover, we show that fine-tuning base models with existing dialogue datasets leads to poor performance on TutorEval. Therefore, we create TutorChat, a dataset of 80,000 long synthetic dialogues about textbooks. We use TutorChat to fine-tune Llemma models with 7B and 34B parameters. These LM tutors specialized in math have a 32K-token context window, and they excel at TutorEval while performing strongly on GSM8K and MATH. Our datasets build on open-source materials, and we release our models, data, and evaluations.
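To illustrate the intended use-case, the sketch below shows how a released long-context tutor model might be queried over a full textbook chapter using the Hugging Face transformers API. The checkpoint name, chapter file, and Student/Tutor prompt format are placeholders for illustration, not the paper's actual release artifacts.

```python
# Minimal sketch of the long-context tutoring setup described in the abstract:
# feed an entire textbook chapter plus a student question to a fine-tuned LM.
# The checkpoint name below is hypothetical -- substitute the released
# TutorChat-fine-tuned Llemma checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "princeton-nlp/llemma-7b-tutor"  # hypothetical checkpoint name
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, device_map="auto")

# A long chapter (up to ~32K tokens fits in the stated context window).
chapter = open("chapter.txt").read()
question = "Why does the proof of Theorem 2.1 require compactness?"

# Assumed dialogue format; the real fine-tuned models may use a chat template.
prompt = f"{chapter}\n\nStudent: {question}\nTutor:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tutor response.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                       skip_special_tokens=True))
```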
- Alexis Chevalier
- Jiayi Geng
- Alexander Wettig
- Howard Chen
- Sebastian Mizera
- Toni Annala
- Max Jameson Aragon
- Arturo Rodríguez Fanlo
- Simon Frieder
- Simon Machado
- Akshara Prabhakar
- Ellie Thieu
- Jiachen T. Wang
- Zirui Wang
- Xindi Wu
- Mengzhou Xia
- Jiatong Yu
- Jun-Jie Zhu
- Zhiyong Jason Ren
- Sanjeev Arora