TemplateMath Part I: TemplateGSM
- TemplateGSM is a large-scale, structured dataset of millions of generated grade school math problems, each paired with verified code and natural language solutions.
- It uses a hierarchical, template-based data generation methodology with meta-templates diversified by GPT-4 to ensure arithmetic and narrative diversity.
- The dataset targets the limitations of earlier math corpora (size, diversity, verifiability) and is intended both to improve numerical reasoning in LLMs and to serve as a robust evaluation benchmark.
TemplateMath Part I: TemplateGSM is a large-scale, highly structured, and automatically generated dataset designed to facilitate the training, fine-tuning, and evaluation of LLMs on mathematical reasoning tasks, specifically grade school mathematics word problems. Originating from the Template-based Data Generation (TDG) paradigm, TemplateGSM provides millions of diverse, fully verified, parameterized math problems, each paired with both executable code solutions and detailed natural language explanations. Its development responds to the limitations of existing mathematical corpora in size, diversity, and verifiability, and leverages meta-template generation by advanced LLMs (notably GPT-4) to synthesize a nearly unbounded corpus for robust model training (Zhang, 2024).
1. Motivation and Limitations of Prior Mathematical Datasets
LLMs such as GPT-3, PaLM, and Llama display high linguistic competence but typically underperform on tasks that require multi-step numerical reasoning or algebraic manipulation. Prior datasets, such as MATH (≈12k manually authored problems), are orders of magnitude too small and lack the domain-specific breadth and solution diversity required for contemporary neural models. Web-scraped corpora suffer from poor alignment and lack verified solutions. Data augmentation strategies such as paraphrasing or entity swapping can only marginally increase diversity and do not yield the combinatorial coverage necessary for training generalizable reasoning skills. TemplateGSM addresses this resource gap by providing millions of problems across all relevant grade levels, tightly coupled with code and stepwise rationales (Zhang, 2024).
2. Design Principles and Template-Based Data Generation (TDG) Methodology
TemplateGSM adopts a hierarchical, template-centric methodology:
- Meta-templates serve as high-level schemata, formally encoding the logical and arithmetic relationships that define problem families. Each meta-template parameterizes the canonical narrative structure (e.g., variable placeholders for objects, quantities, rates, operations), with constraints for problem solvability and mathematical validity.
- Generation uses GPT-4: A curated set of 7,473 seed meta-templates is produced and diversified via iterative LLM prompting, ensuring lexical, numerical, and narrative variety that subsumes a broad spectrum of elementary math categories (arithmetic, ratios, algebra, geometry).
- Instantiation samples parameters systematically over their ranges and rejects any sample that violates the template's constraints, so that only valid, sensible problems (for example, those with non-negative integer answers) are realized. Auxiliary slots (names, places, dates) are randomized to maximize linguistic variation.
- Solution Verification: For each instantiation, executable Python programs are synthesized to carry out the arithmetic implied by the meta-template; these are cross-validated with LLM-generated natural language solutions. Only examples passing the exact answer match consistency check are retained, guaranteeing 100% correctness in the released dataset.
A generic meta-template structure is illustrated below:
```json
{
  "template_id": 123,
  "text": "{name} had {n} {item}. Then she bought {r}× as many. After that she gave away {g}. How many does she have now?",
  "slots": {
    "n": {"type": "int", "range": [5, 200]},
    "r": {"type": "float", "choices": [1.1, 1.2, …]},
    "g": {"type": "int", "range": [1, 50]}
  },
  "constraints": ["n*(1+r)-g >= 0"]
}
```
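To make the sampling-and-rejection step concrete, here is a minimal sketch of how a meta-template of this shape could be instantiated. It is not the authors' implementation: the integer multiplier choices, the `NAMES` and `ITEMS` lists, the reworded template text, and the helper names `sample_slot`, `satisfies`, and `instantiate` are all assumptions made for illustration.

```python
import random

# Hypothetical meta-template mirroring the structure shown above. The integer
# multiplier choices and the name/item lists are illustrative assumptions.
TEMPLATE = {
    "template_id": 123,
    "text": ("{name} had {n} {item}. Then she bought {r} times as many. "
             "After that she gave away {g}. How many does she have now?"),
    "slots": {
        "n": {"type": "int", "range": [5, 200]},
        "r": {"type": "int", "choices": [2, 3, 4]},
        "g": {"type": "int", "range": [1, 50]},
    },
    "constraints": ["n*(1+r)-g >= 0"],
}
NAMES = ["Emily", "Sofia", "Maya"]
ITEMS = ["apples", "stickers", "marbles"]


def sample_slot(spec, rng):
    """Draw one value for a slot, from explicit choices or an inclusive range."""
    if "choices" in spec:
        return rng.choice(spec["choices"])
    low, high = spec["range"]
    return rng.randint(low, high)


def satisfies(constraints, values):
    """Evaluate each constraint expression against the sampled values.

    eval() is used with builtins disabled; this is only reasonable for
    template files that you trust.
    """
    return all(eval(c, {"__builtins__": {}}, dict(values)) for c in constraints)


def instantiate(template, rng, max_tries=100):
    """Rejection sampling: resample until every constraint holds."""
    for _ in range(max_tries):
        values = {k: sample_slot(s, rng) for k, s in template["slots"].items()}
        if satisfies(template["constraints"], values):
            text = template["text"].format(
                name=rng.choice(NAMES), item=rng.choice(ITEMS), **values
            )
            return {
                "template_id": template["template_id"],
                "problem_text": text,
                "values": values,
            }
    raise RuntimeError("no valid sample found within max_tries")


problem = instantiate(TEMPLATE, random.Random(0))
print(problem["problem_text"])
```

Passing an explicit `random.Random` instance keeps generation reproducible, which matters when the same template must yield identical splits across runs.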
3. Synthesis Pipeline: Problem Generation and Solution Pairing
The core instantiation pipeline proceeds as follows (Zhang, 2024):
- Parameter Sampling: For each meta-template, parameters are randomly sampled within the prescribed domain and constraint conditions are checked. Invalid samples are discarded.
- Linguistic Diversification: Names, narrative details, and contexts are varied using controlled randomization over curated name/item lists to ensure uniqueness and reduce model memorization risk.
- Automated Solution Script Generation: Each instance is paired with an auto-generated Python script implementing the problem's operations, assigning the correct value to the target variable “result.”
- Natural Language Solution Synthesis: Either GPT-4 or a fine-tuned smaller model is prompted to generate stepwise explanatory rationales that use the specific sampled values. These explanations are subsequently verified against the code solution for numerical concordance.
- Validation Loop: Only instances where the code and text solutions agree on the output are included. This approach eliminates noisy or questionable samples.
A concrete illustration:
- Template: Emily has $n$ apples, buys $r$ times more, gives away $g$.
- Instantiated parameters: $n = 15$, $r = 3$, $g = 5$
- Code:
  ```python
  initial = 15
  bought = initial * 3
  total = initial + bought
  remaining = total - 5
  result = remaining  # 55
  ```
- Natural language: “First, Emily buys 3 times 15 = 45 apples. Now she has 15 + 45 = 60 apples. After giving away 5, she ends with 60 – 5 = 55 apples.”
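The consistency check between the two solution forms can be sketched as follows, using the Emily example. This is a simplified illustration under stated assumptions, not the paper's verifier: it assumes the code solution stores its answer in `result` (as described above) and that the last number appearing in the explanation is the final answer. The function names are hypothetical.

```python
import re

solution_code = """
initial = 15
bought = initial * 3
total = initial + bought
remaining = total - 5
result = remaining
"""

solution_nl = (
    "First, Emily buys 3 times 15 = 45 apples. Now she has 15 + 45 = 60 apples. "
    "After giving away 5, she ends with 60 - 5 = 55 apples."
)


def run_code_solution(code):
    """Execute a generated solution script and return its `result` variable.

    exec() is only appropriate here because the scripts are generated by the
    pipeline itself; never run untrusted code this way.
    """
    namespace = {}
    exec(code, namespace)
    return namespace["result"]


def final_number(text):
    """Heuristic (an assumption): treat the last number in the text as the answer."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", text)
    if not numbers:
        raise ValueError("no number found in explanation")
    return float(numbers[-1])


def is_consistent(code, text, tol=1e-9):
    """Keep an instance only if code and explanation agree on the answer."""
    return abs(float(run_code_solution(code)) - final_number(text)) <= tol


print(is_consistent(solution_code, solution_nl))  # True
```

A production check would likely need a stricter answer format (for example, a dedicated final-answer marker) than the last-number heuristic used here.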
4. Composition, Scale, and Diversity of the Corpus
TemplateGSM consists of (Zhang, 2024):
| Metric | Value | Description |
|---|---|---|
| Unique meta-templates | 7,473 | Sourced/generated via GPT-4 |
| Total individual problems | 7,473,000 | ≈1,000 per meta-template |
| Problem types | Arithmetic (45%), Ratios etc. | Tagging by template logical structure |
| Grade levels | 1–8 | Arithmetic, basic algebra, elementary geometry |
| Solution representation | Code + Natural language | Dual-verification required |
| Problem length (tokens) | Mean ≈ 50, range 18–636 | Per problem statement |
| Code solution length (tokens) | Mean ≈ 123 | Executes to unique numeric answer |
| NL solution length (tokens) | Mean ≈ 78 | Stepwise with explicit arguments |
Meta-template categories comprehensively cover elementary arithmetic, fractions, rates, geometry (rectangles/triangles), and introductory algebra. Templates embed explicit value and “difficulty” constraints, ensuring computational solvability.
5. Validation, Benchmarking, and Model Training Impact
- Quality Assurance: Full automated cross-validation (code and natural language) yields a dataset with empirical 100% correctness and consistency for the included samples.
- Linguistic and Logical Diversity: Each meta-template underlies thousands of distinct instantiations; GPT-4’s paraphrasing mechanisms further diversify surface realizations, mitigating overfitting and memorization risk (Zhang, 2024).
- Empirical Performance: Fine-tuning a 7B-parameter LLaMA on 1 million TemplateGSM examples raises its accuracy on the MATH benchmark by 8 absolute points. Full-corpus continual pre-training elevates held-out grade-level math question accuracy from 24% to 36%. Human evaluators judge 95% of generated problems as “natural-looking, non-trivial,” and 98% of NL explanations as “clear and correct.”
- Diagnostic Utility: TemplateGSM’s coverage enables precise, large-scale diagnosis of failure patterns and reasoning deficiencies in contemporary LLMs, which are intractable with smaller or less controlled datasets.
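Because every problem carries the identifier of the template that produced it, evaluation results can be grouped per template to locate weak problem families. The sketch below assumes results are available as `(template_id, is_correct)` pairs; the data shown is made up for illustration, and the function names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_template(records):
    """Aggregate (template_id, is_correct) pairs into per-template accuracy."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for template_id, is_correct in records:
        totals[template_id] += 1
        correct[template_id] += int(is_correct)
    return {tid: correct[tid] / totals[tid] for tid in totals}


def weakest_templates(records, k=3):
    """Return the k templates with the lowest accuracy, worst first."""
    accuracy = accuracy_by_template(records)
    return sorted(accuracy.items(), key=lambda item: item[1])[:k]


# Made-up evaluation outcomes, for illustration only.
results = [
    (123, True), (123, False), (123, False),
    (456, True), (456, True),
    (789, False),
]
print(weakest_templates(results, k=2))
```

Templates that surface at the top of this ranking can then be inspected, or re-instantiated with fresh parameters, to see whether the failure is tied to the reasoning pattern or to specific values.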
6. Practical Access, Structure, and Reproducibility
Access and reproducibility are central to TemplateGSM’s design (Zhang, 2024):
- Availability:
  - Problems, solutions, answers at https://huggingface.co/datasets/math-ai/TemplateGSM
  - Source code and pipeline at https://github.com/iiis-ai/TemplateMath
- Dataset Format: The data is distributed as JSONL; each entry includes the fields “template_id”, “problem_text”, “solution_code”, “solution_nl”, and “answer”.
- Pipeline Modularity: Adding new templates or generating new data involves injecting a JSON meta-template, running the instantiation script (`/generators/instantiate.py`), and verifying with `/verify/run_checks.py`.
- Extensibility: The methodology enables on-demand synthesis of arbitrarily large test splits, which is essential for robustness and generalization diagnostics.
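As a rough illustration of consuming the format, the sketch below reads JSONL records and checks that the five fields listed above are present. The sample record is hypothetical (its values are borrowed from the Emily example), the field types are assumptions, and an in-memory buffer stands in for a downloaded file.

```python
import io
import json

REQUIRED_FIELDS = {"template_id", "problem_text", "solution_code", "solution_nl", "answer"}

# Hypothetical record; real entries may use different value types.
sample_jsonl = io.StringIO(
    json.dumps({
        "template_id": 123,
        "problem_text": "Emily has 15 apples, buys 3 times more, gives away 5. "
                        "How many does she have now?",
        "solution_code": "initial = 15\nresult = initial + initial * 3 - 5\n",
        "solution_nl": "Emily ends with 60 - 5 = 55 apples.",
        "answer": 55,
    }) + "\n"
)


def read_records(handle):
    """Yield one dict per non-empty JSONL line, validating required fields."""
    for line_number, line in enumerate(handle, start=1):
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        missing = REQUIRED_FIELDS - record.keys()
        if missing:
            raise ValueError(f"line {line_number}: missing fields {sorted(missing)}")
        yield record


records = list(read_records(sample_jsonl))
print(records[0]["problem_text"])
```

For a file on disk, the same generator can be applied to `open(path, encoding="utf-8")` in place of the in-memory buffer.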
7. Comparison with Related Approaches and Future Directions
TemplateGSM exceeds prior elementary math datasets in scale, diversity, and verification rigor. Its meta-template framework obviates the need for hand-labeling and supports near-infinite augmentation, distinguishing it from static, manually authored corpora and surface-level perturbation-based augmentation. The approach is aligned with recent advances in semantic augmentation of benchmarks (see GSM-SEM) but is unique in guaranteeing both answer and computation diversity at generation-time while maintaining high verification standards (Zhang, 2024).
A plausible implication is that such TDG-based datasets will increasingly serve as backbone resources for the next generation of LLMs, supporting both model scaling and fine-grained evaluation of reasoning capability.
References
- "Training and Evaluating LLMs with Template-based Data Generation" (Zhang, 2024)