- The paper proposes an Adaptive Learning Management System (ALMS) integrating general-purpose and domain-specific LLMs to create personalized learning experiences.
- The authors detail a three-phase development process for the ALMS prototype, including integrating LLMs and extensively benchmarking models on standardized tests across various subjects.
- Benchmarking results indicated that the LLMs performed well in reading, writing, and coding but struggled in mathematics, suggesting that a hybrid approach combining expert systems and LLMs is the most practical option for an ALMS.
The paper explores integrating LLMs into Learning Management Systems (LMSs) to create an adaptive learning environment. Traditional LMSs often fail to meet the personalized needs of students, especially when instructor availability is limited. The authors propose an Adaptive Learning Management System (ALMS) that leverages AI to offer customizable learning experiences tailored to individual user requirements. The system integrates general-purpose and domain-specific LLMs to mitigate issues such as factual inaccuracies and outdated information, which are common in general LLMs like ChatGPT. The prototype ALMS addresses privacy concerns and limitations of current educational tools while enhancing engagement through personalized content.
The development of the ALMS was structured into three phases. Phase I involved designing command-line scripts for tasks such as Optical Character Recognition (OCR) and data management, along with building a Django web backend and React frontend. The system was refined and containerized using Docker. Phase II focused on integrating LLMs, developing a command-line interface with ChatGPT's Application Programming Interface (API) and testing configurations with Retrieval Augmented Generation (RAG) and vector embedding. Phase III benchmarked various LLMs against human test scores in subjects like mathematics, reading, writing, reasoning, and coding, analyzing resource utilization and performance data.
The literature review highlights the positive impact of intelligent tutoring systems on student learning. Modern AI learning systems, backed by Natural Language Processing (NLP), demonstrate strengths in reading, writing, and language translation. However, they often lack competency in problem-solving, particularly in mathematics, and are prone to generating inaccurate information (hallucinations). The paper also discusses expert systems, detailing their data pipeline for knowledge retrieval. It describes how raw data is collected, codified, and processed into a digital knowledge base, enabling end-users to access relevant knowledge.
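To make the expert-system pipeline concrete, the sketch below shows a minimal, self-contained Python illustration in which raw question/answer pairs are codified into structured records and loaded into a queryable knowledge base. The record fields, the `difflib`-based matching, and all names are assumptions for illustration, not the paper's implementation.

```python
from __future__ import annotations

from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class KnowledgeEntry:
    question: str  # codified question text
    answer: str    # curated answer distilled from the raw source material


class KnowledgeBase:
    """Toy digital knowledge base: codified entries plus a simple retrieval step."""

    def __init__(self) -> None:
        self.entries: list[KnowledgeEntry] = []

    def ingest(self, raw_pairs: list[tuple[str, str]]) -> None:
        # Codify raw (question, answer) pairs into structured records.
        for question, answer in raw_pairs:
            self.entries.append(KnowledgeEntry(question.strip(), answer.strip()))

    def lookup(self, query: str) -> KnowledgeEntry | None:
        # Return the entry whose question text is most similar to the query.
        if not self.entries:
            return None
        return max(
            self.entries,
            key=lambda e: SequenceMatcher(None, query.lower(), e.question.lower()).ratio(),
        )


kb = KnowledgeBase()
kb.ingest([
    ("What is 2 + 2?", "4"),
    ("Define photosynthesis.", "Conversion of light energy into chemical energy."),
])
match = kb.lookup("what's 2+2")
print(match.answer if match else "no match")
```

The point of the sketch is the pipeline shape (collect, codify, store, retrieve); a real expert system would replace the string-similarity lookup with a curated rule or index structure.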
The paper then examines how LLMs are built, explaining their foundation in the transformer architecture. It reviews deep learning, neural networks, and the Machine Learning (ML) techniques used to train LLMs, including supervised and unsupervised learning. Fine-tuning, vector embedding, RAG, and prompt engineering are explored as methods to optimize LLM behavior. Vector embedding converts words or phrases into vectors whose proximity reflects semantic similarity, while RAG lets the system consult local vectorstores to improve factual validity. The authors note that hallucinations are an innate pitfall of LLMs, arising from misinformation or conflicting data in the training corpus. Hardware requirements for hosting LLMs locally are also considered, with a table summarizing memory demands by parameter count.
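Vector embedding can be illustrated with a short, self-contained sketch. The three-dimensional vectors below are invented for illustration; real embeddings come from a trained model and have hundreds or thousands of dimensions, but the cosine-similarity comparison that underlies semantic search works the same way.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors (1.0 = identical direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy "embeddings" written by hand purely for illustration; a real system would
# request them from an embedding model.
embeddings = {
    "triangle": [0.9, 0.1, 0.0],
    "polygon":  [0.8, 0.2, 0.1],
    "banana":   [0.0, 0.1, 0.9],
}

query = embeddings["triangle"]
for word, vec in embeddings.items():
    print(f"{word}: {cosine_similarity(query, vec):.3f}")
```

Words with related meanings ("triangle", "polygon") score close to 1.0, while unrelated words score near 0, which is what lets a RAG system retrieve the most relevant chunks for a query.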
The implementation of the prototype expert system, built with Python, JavaScript/TypeScript, JSX/HTML, CSS, Shell, and SQL, features functionalities such as question capture, question solver, and document upload. The question solver matches user queries with the closest options in the system's knowledge base. The document upload function converts test bank Portable Document Format (PDF) files into formatted text using OCR and stores the data in a MySQL database. The frontend, built with React, uses a virtual Document Object Model (DOM) and state management system to handle user input and data transmission.
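The document-upload path might look roughly like the following sketch, which renders a PDF to images, OCRs each page, and inserts the extracted text into MySQL. The library choices (pdf2image, pytesseract, mysql-connector-python), the `documents` table schema, and the connection details are assumptions for illustration, not the paper's actual code.

```python
# Assumed dependencies: pdf2image, pytesseract, mysql-connector-python, plus the
# poppler and Tesseract system packages. Schema and credentials are placeholders.
from pdf2image import convert_from_path
import pytesseract
import mysql.connector


def extract_pdf_text(pdf_path: str) -> str:
    """Render each PDF page to an image and OCR it into plain text."""
    pages = convert_from_path(pdf_path)
    return "\n".join(pytesseract.image_to_string(page) for page in pages)


def store_document(pdf_path: str, title: str) -> None:
    """OCR a test-bank PDF and persist the formatted text in MySQL."""
    text = extract_pdf_text(pdf_path)
    conn = mysql.connector.connect(
        host="localhost", user="alms", password="secret", database="alms"
    )
    try:
        cursor = conn.cursor()
        # Hypothetical table: documents(id, title, body)
        cursor.execute(
            "INSERT INTO documents (title, body) VALUES (%s, %s)", (title, text)
        )
        conn.commit()
    finally:
        conn.close()


if __name__ == "__main__":
    store_document("test_bank.pdf", "Grade 6 Mathematics Test Bank")
```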
In Phase II, a command-line interface was developed to interact with OpenAI's API, allowing queries to be sent to ChatGPT. This was expanded using the LangChain Python library to test multiple LLM models, including ChatGPT-3.5 Turbo, GPT-3.5 Turbo Instruct, and GPT-4, as well as self-hosted LLMs like Mistral-7B and Llama2-7B via the Ollama extension. Informal experimentation was conducted on system prompt modification, RAG, vector embedding, and prompt engineering. RAG from text documents and webpage URLs was tested, using PDF transcriptions and BeautifulSoup for web scraping. The authors found RAG to be the superior option because the source material can be curated for the specific use case.
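A Phase II-style RAG pipeline could be sketched with LangChain and a self-hosted model served through Ollama, roughly as below. LangChain's import paths and chain interfaces shift between releases, and the model choice, chunking parameters, and URL here are placeholders rather than the authors' configuration.

```python
# Sketch of a RAG pipeline over a webpage using LangChain with a local Ollama model.
# Imports follow the langchain / langchain_community split; treat as illustrative.
from langchain_community.document_loaders import WebBaseLoader  # uses BeautifulSoup internally
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import FAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# 1. Scrape the source page and split it into overlapping chunks.
docs = WebBaseLoader("https://example.com/lesson").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# 2. Embed the chunks into a local vectorstore.
vectorstore = FAISS.from_documents(chunks, OllamaEmbeddings(model="mistral"))

# 3. Answer questions with retrieved context fed to the self-hosted model.
qa = RetrievalQA.from_chain_type(
    llm=Ollama(model="mistral"),
    retriever=vectorstore.as_retriever(search_kwargs={"k": 4}),
)
print(qa.invoke({"query": "Summarize the key points of this lesson."}))
```

Curating what goes into the vectorstore is what gives RAG its edge for a specific course or use case: the model's answers are grounded in documents the instructor chose rather than whatever it absorbed during training.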
The experimental design in Phase III involved benchmarking LLMs on standardized tests in mathematics, reading, writing, reasoning, and coding. Each subject category included easy, medium, and hard question sets. Ten LLMs were chosen, and each was run through the full battery three times, producing a large dataset. The models tested were GPT3.5, GPT3.5_I, GPT4, LLAMA2_7B, MISTRAL_7B, GEMMA_7B, FALCON_7B, CODELLAMA, WIZARD_MATH_7B, and PHI. The questions were formatted into JSON files, and results were stored locally. A custom Python grading module filtered AI responses using regular expressions, combined with manual grading for many questions. System resource usage was also tracked across a range of data points. The design took a modular approach, encapsulating sections of the project with Docker containers and virtual environments.
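The grading module itself is not reproduced in the paper; the sketch below shows one plausible way a regular expression could pull a multiple-choice identifier out of a free-form model response, with unmatched responses falling through to manual grading. The pattern and helper names are hypothetical.

```python
import re
from typing import Optional

# Hypothetical extraction rule: look for a standalone answer letter such as
# "C", "(b)", or "Answer: D" anywhere in the model's response.
ANSWER_PATTERN = re.compile(r"\b(?:answer\s*(?:is|:)?\s*)?\(?([A-D])\)?\b", re.IGNORECASE)


def grade_response(response: str, correct: str) -> Optional[bool]:
    """Return True/False when an answer letter is found, None when manual grading is needed."""
    match = ANSWER_PATTERN.search(response)
    if match is None:
        return None  # no identifier found; send to manual grading
    return match.group(1).upper() == correct.upper()


examples = [
    ("The answer is (C) because the slope is 2.", "C"),      # -> True
    ("I believe the correct choice is b.", "B"),              # -> True
    ("Let me work through the problem step by step.", "A"),   # -> None
]
for text, key in examples:
    print(grade_response(text, key))
```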
The mathematics test used questions from the EQAO Grade 3 and Grade 6 Mathematics tests, as well as the ACT Mathematics Test. The reading test sourced questions from the Ontario Secondary School Literacy Test (OSSLT), the ACT Reading test, and LSAT Reading Comprehension practice problems. The writing test included general knowledge questions and practice questions from the OSSLT Writing Test. The hard writing test required models to write a short essay based on a passage from the ACT Writing Test, graded according to ACT guidelines. The reasoning test used verbal classification and numerical reasoning problems, along with questions from the LSAT pertaining to logical reasoning. The coding test involved custom problems based on searching algorithms, with solutions to be written in Python.
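The coding problems themselves are not reproduced in the paper; as an illustration of the kind of searching-algorithm task described, a typical item might ask the model to implement binary search in Python along the lines of the following (hypothetical) reference solution.

```python
def binary_search(values: list[int], target: int) -> int:
    """Return the index of target in a sorted list, or -1 if it is absent."""
    low, high = 0, len(values) - 1
    while low <= high:
        mid = (low + high) // 2
        if values[mid] == target:
            return mid
        if values[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1


print(binary_search([2, 5, 8, 12, 16, 23, 38], 23))  # -> 5
```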
The resource utilization test tracked metrics such as average CPU percentage, average memory percentage, system time, user time, and execution time. Celery and Flower were initially used as a task queue, but later abandoned. Control groups were investigated, and the getrusage() function from Python's "resource" library was used to retrieve system time and user time. The psutil library was used to collect CPU and memory percentages, and execution time was found by recording the start and end time of each test. Data representation involved merging data points into a single table and exporting sub-tables into Excel files.
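A minimal sketch of the resource-tracking approach described is given below, combining getrusage() for user/system time, psutil for CPU and memory percentages, and wall-clock timing around the benchmarked call. The wrapper function and sampling strategy are illustrative assumptions, and Python's resource module is Unix-only.

```python
import resource
import time

import psutil


def run_with_metrics(task, *args, **kwargs):
    """Run task(*args, **kwargs) and return its result alongside resource metrics."""
    start_usage = resource.getrusage(resource.RUSAGE_SELF)
    start_time = time.perf_counter()

    result = task(*args, **kwargs)

    end_time = time.perf_counter()
    end_usage = resource.getrusage(resource.RUSAGE_SELF)
    metrics = {
        "user_time_s": end_usage.ru_utime - start_usage.ru_utime,
        "system_time_s": end_usage.ru_stime - start_usage.ru_stime,
        "execution_time_s": end_time - start_time,
        "cpu_percent": psutil.cpu_percent(interval=None),   # utilization since the last call
        "memory_percent": psutil.virtual_memory().percent,  # system-wide memory usage
    }
    return result, metrics


result, metrics = run_with_metrics(sum, range(10_000_000))
print(result, metrics)
```

Per-test dictionaries like `metrics` can then be merged into a single table (for example with pandas) and exported to Excel, mirroring the data-representation step described above.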
The results showed that performance scaled inversely with difficulty level in mathematics, with Mistral 7B being the top performer. The average results in reading were very strong, with GPT-4 achieving the highest ranking score. All models earned perfect scores on the easy writing test and high scores on the medium test. In the hard writing test, GPT-4 had the highest score, with Phi-2 close behind. Scores in reasoning were inconsistent, while the LLMs demonstrated proficiency in coding. Overall, the models showed strengths in reading, writing, and coding, and weaknesses in mathematics. Resource utilization tests indicated that proprietary models had slightly higher memory usage, but the difference was negligible. Across all the tests performed, about 1% of the responses did not include an answer identifier.
The paper concludes that a hybrid approach, combining expert systems and LLMs, may be the most practical option for an ALMS. Future work may involve building a human-curated knowledge base for an LLM-driven system, investigating local vectorstores, and conducting further benchmark testing. The authors also suggest investigating why the models struggled with specific question wordings and why different LLMs produced such consistent responses during the writing tests.