OpenStaxQA: Multilingual Educational QA Benchmark

Updated 9 October 2025
  • OpenStaxQA is a multilingual evaluation benchmark featuring 18,332 unique problem–solution pairs from 43 open-source college textbooks in English, Spanish, and Polish.
  • It employs a robust HTML parsing and MathML-to-LaTeX conversion pipeline to streamline data processing and reduce token overhead for LLM training.
  • Finetuning Llama2-7B-hf and Llemma-7B with QLoRA improves performance on STEM educational tasks and transfers to out-of-domain benchmarks.

OpenStaxQA is a multilingual evaluation benchmark and dataset specifically constructed for LLMs in college-level educational applications. It is based on 43 open-source textbooks—primarily from the OpenStax initiative—spanning English, Spanish, and Polish, and licensed under a permissive Creative Commons license. The data curation and evaluation infrastructure are designed to advance research and application of LLMs in educational support, particularly in science, mathematics, and related disciplines.

1. Dataset Composition and Source Material

OpenStaxQA is derived from 43 open-source college textbooks that cover a broad range of disciplines, including physics, life sciences, mathematics, business, humanities, and social sciences. The dataset focuses on end-of-chapter exercises, and each entry is a problem–solution pair. After deduplication—implemented to remove overlapping questions from different editions or versions of the same textbook, such as in chemistry—the final dataset consists of 18,332 unique problem–solution pairs.

The organizational structure of the dataset mirrors the explicit HTML annotation used on the source OpenStax web pages. Each question is encapsulated within an <os-problem-container> tag and each corresponding solution within an <os-solution-container> tag. Structural HTML elements—such as lists and inline mathematics rendered in MathML—are processed to extract only content-relevant textual and mathematical information.
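A minimal sketch of this extraction step, using Beautiful Soup against the container tags named above; the zip-based pairing and parser choice are illustrative assumptions rather than the authors' exact code:

```python
from bs4 import BeautifulSoup

def extract_pairs(html: str) -> list[dict]:
    """Pair each <os-problem-container> with the matching <os-solution-container>."""
    soup = BeautifulSoup(html, "html.parser")
    problems = soup.find_all("os-problem-container")
    solutions = soup.find_all("os-solution-container")
    # Assumes problems and solutions appear in matching document order.
    return [
        {
            "problem": prob.get_text(" ", strip=True),
            "solution": sol.get_text(" ", strip=True),
        }
        for prob, sol in zip(problems, solutions)
    ]
```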

Mathematical content initially present in MathML is uniformly converted to LaTeX using the texmath Haskell library. This conversion not only preserves semantic mathematical structure but also reduces token overhead for LLM-based training and inference.
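The paper invokes texmath from Haskell; a rough Python equivalent, shown only as a sketch, shells out to pandoc, whose math conversions are backed by the same texmath library:

```python
import subprocess

def mathml_to_latex(mathml: str) -> str:
    """Convert a MathML fragment to LaTeX via pandoc, which performs
    its math conversion through the texmath Haskell library."""
    result = subprocess.run(
        ["pandoc", "--from=html", "--to=latex"],
        input=mathml,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()

# Example: a MathML fraction comes back as inline LaTeX math.
# mathml_to_latex("<math><mfrac><mi>a</mi><mi>b</mi></mfrac></math>")
```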

2. Data Preparation and Finetuning Methodology

Data extraction from OpenStax materials was accomplished using standard web scraping libraries such as Beautiful Soup, followed by a multi-stage preprocessing pipeline:

  • Discovery of textbook and exercise page URLs.
  • Targeted extraction of question and solution text by parsing for <os-problem-container> and <os-solution-container> tags.
  • Conversion of MathML content to simple LaTeX using texmath.
  • Additional HTML preprocessing with html2text to strip non-mathematical markup (a sketch of this step follows the list).
  • Deduplication and cleaning to construct a finalized, research-ready dataset.
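A minimal sketch of the html2text step; the specific converter options are illustrative assumptions, since the source does not list the settings used:

```python
import html2text

converter = html2text.HTML2Text()
converter.ignore_links = True    # drop hyperlink markup
converter.ignore_images = True   # drop image references
converter.body_width = 0         # disable hard line wrapping

def strip_markup(html: str) -> str:
    """Reduce an HTML fragment to plain text while keeping list structure."""
    return converter.handle(html).strip()
```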

For model finetuning, two open-source LLMs with approximately 7 billion parameters were selected:

  • Llama2-7B-hf (general-purpose)
  • Llemma-7B (math-optimized)

Quantized low-rank adapters (QLoRA) were employed as the parameter-efficient finetuning method. Training was conducted for 3 epochs on a single V100 GPU (32 GB VRAM) with batch size 4 and a dropout rate of 0.1 applied to the 32 LlamaDecoder layers. The maximum generation length for answers is dynamically calculated as ⌊T × S⌋, where T is the mean solution-to-problem token-length ratio and S is the problem token length.
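A sketch of such a QLoRA setup with Hugging Face transformers and peft follows. The epoch count, batch size, and dropout match the reported settings; the LoRA rank, alpha, and target modules are assumptions for illustration, and the training loop and dataset plumbing are omitted:

```python
import torch
from transformers import (AutoModelForCausalLM, BitsAndBytesConfig,
                          TrainingArguments)
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "meta-llama/Llama-2-7b-hf"  # or "EleutherAI/llemma_7b"

# 4-bit quantization of the frozen base model (the "Q" in QLoRA);
# float16 compute suits a V100, which lacks bfloat16 support.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=bnb_config, device_map="auto"
)
model = prepare_model_for_kbit_training(model)

# Low-rank adapters: dropout 0.1 matches the reported setting;
# rank, alpha, and target modules are illustrative assumptions.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Reported budget: 3 epochs at batch size 4 on a single 32 GB V100.
training_args = TrainingArguments(
    output_dir="openstaxqa-qlora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
```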

3. Evaluation Protocol and Zero-shot Reasoning Assessment

Model evaluation involved both in-domain and out-of-domain testing. For primary scoring, GPT-4 was used as a proxy oracle, classifying each generated answer into one of five categories: FULLY ACCURATE, MOSTLY ACCURATE, PARTIALLY ACCURATE, MOSTLY INACCURATE, and FULLY INACCURATE. This evaluation was applied to the OpenStaxQA test set, which comprised a 30% split of the full data.
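The grading prompt itself is not given in the source; the following sketch shows how such a GPT-4 proxy oracle with the five-class rubric might be invoked through the OpenAI Python client:

```python
from openai import OpenAI

RUBRIC = ["FULLY ACCURATE", "MOSTLY ACCURATE", "PARTIALLY ACCURATE",
          "MOSTLY INACCURATE", "FULLY INACCURATE"]

client = OpenAI()

def grade_answer(problem: str, reference: str, generated: str) -> str:
    """Ask GPT-4 to place a generated answer in one of the five rubric classes."""
    prompt = (
        f"Problem:\n{problem}\n\n"
        f"Reference solution:\n{reference}\n\n"
        f"Model answer:\n{generated}\n\n"
        f"Classify the model answer as exactly one of: {', '.join(RUBRIC)}."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    label = response.choices[0].message.content.strip().upper()
    return label if label in RUBRIC else "UNPARSED"
```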

To measure model generalization, a zero-shot evaluation was conducted on the AI2 Reasoning Challenge (AI2RC) development set. AI2RC contains English-only, grade-school science questions that are light on LaTeX and structurally distinct from college-level OpenStaxQA problems. Despite this domain shift, finetuned models, especially Llama2-7B-hf, showed consistent improvement over their non-finetuned baselines on this set, suggesting cross-domain transferability.
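For reference, the ARC data is available on the Hugging Face Hub; which configuration the authors evaluated (ARC-Challenge vs. ARC-Easy) is not stated, so the choice below is an assumption:

```python
from datasets import load_dataset

# The ARC development split as hosted on the Hugging Face Hub.
arc_dev = load_dataset("allenai/ai2_arc", "ARC-Challenge", split="validation")

for example in arc_dev.select(range(2)):
    print(example["question"])
    print(example["choices"]["text"], "->", example["answerKey"])
```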

It was observed that models optimized for mathematical content (e.g., Llemma-7B) performed less well on the AI2RC dataset than the general Llama2-7B-hf model, possibly due to the absence of complex LaTeX or advanced math in the grade-school evaluation set.

4. Technical Characteristics and Preprocessing Considerations

A distinctive aspect of OpenStaxQA is its math pre-processing pipeline. All mathematics originally rendered in MathML was converted to LaTeX via texmath. This was undertaken to address the token overhead and parsing inefficiencies of MathML in LLM processing. The conversion yields more compact and LLM-friendly representations of mathematical formulae.

HTML preprocessing using html2text further ensures that only pedagogically meaningful content is retained. Deduplication addresses redundancy across textbooks, particularly in cases where exercises are repeated across versions (e.g., Chemistry, 2e and Chemistry: Atoms First).
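The exact matching criterion behind this deduplication is not specified; one plausible sketch hashes a normalized form of each problem statement:

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Case-fold and collapse whitespace so trivially reformatted
    duplicates hash to the same key."""
    return re.sub(r"\s+", " ", text.strip().lower())

def deduplicate(pairs: list[dict]) -> list[dict]:
    """Keep the first occurrence of each distinct problem statement."""
    seen, unique = set(), []
    for pair in pairs:
        key = hashlib.sha1(normalize(pair["problem"]).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(pair)
    return unique
```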

For generation length control, the mean solution-to-problem token-length ratio T was determined empirically, and each generated answer is truncated at ⌊T × S⌋ tokens, where S is the token length of the input problem.
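Concretely, the truncation rule reduces to the following; whether T is averaged per pair or computed from corpus totals is an assumption here:

```python
import math

def length_budget(problem_tokens: int, ratio: float) -> int:
    """Maximum generation length floor(T * S): T is the mean
    solution-to-problem token-length ratio, S the problem length."""
    return math.floor(ratio * problem_tokens)

# T estimated once from the training split, e.g. as a mean of per-pair ratios:
# T = sum(s / p for p, s in token_lengths) / len(token_lengths)
```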

5. Educational Applications and Broader Impacts

OpenStaxQA enables reliable evaluation of and finetuning for LLMs aimed at college-level educational support. Main benefits include:

  • Higher-quality model responses for college-level math, physics, and life sciences problem-solving.
  • Multilingual coverage (English, Spanish, Polish), addressing a major gap in educational NLP benchmarks and directly supporting the needs of classrooms in multiple linguistic regions.
  • Foundation for cross-discipline and cross-lingual educational resources, with significant implications for self-help educational tools, intelligent tutoring systems, and automated homework support.
  • Direct applicability for transfer learning: models finetuned on OpenStaxQA generalize better to out-of-domain benchmarks, indicating improved underlying reasoning capacity.
  • OpenStaxQA’s technical pipeline serves as a reproducible template for future educational data extraction projects, particularly regarding mathematical content.

A plausible implication is that adding more languages and broader textbook coverage would further reduce educational resource disparities in non-English-speaking regions and improve LLM performance in low-resource educational contexts.

6. Limitations and Prospects for Dataset Expansion

Although the dataset covers three languages and numerous disciplines, it remains concentrated in English: the Spanish and Polish portions are comparatively small. Expanding multilingual coverage is a key challenge for future work, particularly toward European, Asian, and African languages beyond the current three.

The scope is also focused on problem–solution exercise pairs, as opposed to other pedagogical modalities such as worked examples, open-ended essays, or diagram-based problems. Expansion into these modalities and integration with complementary datasets—especially those with richer multi-modal or open-ended question types—could provide further value for LLM development.

7. Summary Table

| Feature | Description | Technical Note |
|---|---|---|
| Source textbooks | 43 OpenStax textbooks in 3 languages | HTML-organized, with dedicated tags for Q/A extraction |
| Problem–solution pairs | 18,332 unique entries (after deduplication) | Focus on end-of-chapter exercises |
| Math representation | MathML converted to LaTeX using texmath | Reduces token overhead in LLM training/testing |
| Finetuned models | Llama2-7B-hf, Llemma-7B (both ~7B parameters) | QLoRA, 3 epochs, V100 GPU (32 GB), batch size 4, dropout 0.1 |
| Evaluation protocol | GPT-4 as proxy oracle; 5-class rubric; zero-shot AI2RC dev set | Length control: ⌊T × S⌋ |
| Educational impact | Multilingual, open, technically rigorous QA benchmark; supports LLM adaptation to STEM | Validated on generalization across education benchmarks |

OpenStaxQA thus stands as a technically robust, multilingual, and openly licensed benchmark that supports the development, finetuning, and evaluation of educational LLMs for advanced college-level applications (Gupta, 3 Oct 2025).
