HSPMATH Dataset for Guided Math Reasoning

Updated 12 March 2026
  • HSPMATH is a large-scale dataset of 75,000 structured problem-hint-solution triples designed to guide mathematical reasoning in LLMs.
  • It employs a MetaMath rewriting protocol to generate nine paraphrased variants per GSM8K problem and leverages GPT-4 to produce consistent hints.
  • Its application in fine-tuning models like Llemma-7B yields significant accuracy improvements on GSM8K benchmarks, setting new performance baselines.

HSPMATH is a large-scale dataset constructed to support research on guided mathematical reasoning with LLMs using Hint-before-Solving Prompting (HSP). It comprises 75,000 problem–hint–solution triples focused on multi-step arithmetic and pre-algebra word problems, systematically expanding GSM8K through prompted, paraphrased data augmentation combined with GPT-4–generated hints. HSPMATH facilitates supervised fine-tuning for LLMs, demonstrating substantial accuracy gains on reasoning benchmarks relative to prior approaches (Fu et al., 2024).

1. Origin and Construction

HSPMATH is derived from the GSM8K corpus, specifically its 7,500 training problems in the domain of grade-school mathematics word problems. To reach a scale of 75,000 examples, each GSM8K problem was paraphrased ("rewritten") nine times using the MetaMath rewriting protocol; the total is consistent with the 7,500 originals plus 9 × 7,500 = 67,500 rewrites. Rewriting modifies surface linguistic structure while preserving the original numeric content and answer logic.

Hints for each original problem were generated by prompting GPT-4, using in-context exemplars of high-quality hints tailored to the problem’s reasoning pathway. Each set of nine paraphrased variants received a direct copy of the original hint, with no independent re-generation per paraphrase. The process did not involve further human-in-the-loop validation or additional normalization steps such as tokenization or number formatting. No indication of explicit sub-categories (such as arithmetic vs. algebra vs. geometry) or difficulty levels is provided.

2. Data Schema and Representation

Every instance in HSPMATH consists of three text fields, in the following order:

  1. Question: A rewritten GSM8K word problem.
  2. Hint: One or two sentences highlighting the core strategy, formula, or step crucial for solution (e.g., “Consider the circumference formula $C = 2\pi r$,” or “First compute weekly earnings, then multiply by weeks per year.”).
  3. Solution: A step-by-step “Chain of Thought” (CoT) derivation interleaving natural language reasoning with inline LaTeX mathematics, concluding with the final answer. Example solution text may incorporate formulas such as:

Solution: … “The circumference is $2\pi(10) = 20\pi$. Over 150 turns it travels $150 \times 20\pi = 3000\pi$ …”

The dataset is presumably serialized as text or JSONL, containing three fields for each entry, although the exact file format is not detailed in the source.
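
Since the file format is not specified, the following is only a minimal sketch of how one triple could be represented and round-tripped through JSONL. The class name `HintTriple`, the lowercase field names, and the example question are assumptions made for illustration; they are not taken from the released data.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class HintTriple:
    """One problem-hint-solution record (hypothetical field names)."""
    question: str
    hint: str
    solution: str


def to_jsonl(records):
    """Serialize records as one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), ensure_ascii=False) for r in records)


def from_jsonl(text):
    """Parse JSONL text back into HintTriple records, skipping blank lines."""
    return [HintTriple(**json.loads(line))
            for line in text.splitlines() if line.strip()]


# Illustrative record modeled on the circumference example above.
example = HintTriple(
    question="A wheel of radius 10 cm makes 150 turns. How far does it travel?",
    hint="Consider the circumference formula C = 2*pi*r.",
    solution="The circumference is 2*pi*10 = 20*pi. "
             "Over 150 turns it travels 150 * 20*pi = 3000*pi cm.",
)

if __name__ == "__main__":
    print(to_jsonl([example]))
```

Keeping the fields in question, hint, solution order mirrors the schema described above, which makes the hint available before the solution when records are read sequentially.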

3. Dataset Scale and Structure

Subset         | Number of Triples | Description
HSPMATH-1      | 7,500             | GSM8K originals with hints
HSPMATH (full) | 75,000            | Paraphrased expansion (9 rewrites per problem)

HSPMATH is thus an order-of-magnitude expansion over the GSM8K training set, but remains strictly within the arithmetic and pre-algebra reasoning domain. No new or fine-grained difficulty labels are introduced, and the dataset does not include explicit category splits or alternative mathematical topics. No statistics are given regarding average length in words or tokens; however, analyses indicate that including a hint typically reduces the CoT solution length by 5–15%.

4. Data Generation Methodology

The paraphrasing protocol transforms each GSM8K problem via MetaMath’s rewriting process, targeting linguistic variation while constraining answer preservation. The hint generation phase for the 7,500 original problems uses GPT-4 prompted with high-quality, domain-appropriate hint exemplars. These hints act as “soft scaffolding” to orient the model toward relevant knowledge or strategies before producing the solution. Paraphrases inherit their original’s hint verbatim, ensuring internal consistency across augmented variants.

No post-hoc filtering, manual correction, or secondary annotation step is applied. Paraphrased problems aim to yield lexical diversity and expanded data volume but preserve the critical numeric and logical structure of the original GSM8K problem. No details regarding additional preprocessing (such as answer normalization, tokenization, or format enforcement) are reported.
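
The hint-inheritance step can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function name `expand_with_hints`, the dictionary layout, the `source_id` key, and the premise that each rewrite arrives with its own solution text are all hypothetical.

```python
def expand_with_hints(originals, rewrites, hints):
    """Attach each original problem's hint to the original and to its rewrites.

    originals: dict mapping problem id -> (question, solution)
    rewrites:  dict mapping problem id -> list of (question, solution) paraphrases
    hints:     dict mapping problem id -> hint generated once for the original
    """
    triples = []
    for pid, (question, solution) in originals.items():
        hint = hints[pid]  # generated once per original problem
        triples.append({"source_id": pid, "question": question,
                        "hint": hint, "solution": solution})
        for rw_question, rw_solution in rewrites.get(pid, []):
            # Paraphrases reuse the original's hint verbatim (no re-generation).
            triples.append({"source_id": pid, "question": rw_question,
                            "hint": hint, "solution": rw_solution})
    return triples


if __name__ == "__main__":
    originals = {"p1": ("Original question", "Original solution")}
    rewrites = {"p1": [(f"Rewrite {i}", f"Solution {i}") for i in range(9)]}
    hints = {"p1": "First compute weekly earnings, then multiply by weeks per year."}
    print(len(expand_with_hints(originals, rewrites, hints)))  # 1 original + 9 rewrites
```

With 7,500 originals and nine rewrites each, this layout produces 7,500 × (1 + 9) = 75,000 triples, matching the reported dataset size.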

5. Application in Model Training and Evaluation

HSPMATH is designed for supervised fine-tuning (SFT) of LLMs on structured mathematical reasoning tasks. Experiments reported by Fu et al. (2024) involve SFT with Llemma-7B and Llama-2-13B. The GSM8K test set (1,319 problems) is used as the principal benchmark for downstream evaluation.
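
For SFT, each triple has to be flattened into a prompt and a target in which the hint precedes the solution. The exact template used in the experiments is not described here, so the sketch below uses a hypothetical one: the labels "Question:", "Hint:", and "Solution:" and the function name `build_sft_example` are assumptions.

```python
def build_sft_example(question, hint, solution):
    """Format one triple for hint-before-solving fine-tuning.

    The prompt contains only the question; the completion contains the hint
    followed by the solution, so the model learns to state a hint first.
    (Template wording is an assumption, not the published format.)
    """
    prompt = f"Question: {question.strip()}\n"
    completion = f"Hint: {hint.strip()}\nSolution: {solution.strip()}"
    return {"prompt": prompt, "completion": completion}


if __name__ == "__main__":
    ex = build_sft_example(
        "A wheel of radius 10 cm makes 150 turns. How far does it travel?",
        "Consider the circumference formula C = 2*pi*r.",
        "The circumference is 20*pi. Over 150 turns it travels 3000*pi cm.",
    )
    print(ex["prompt"] + ex["completion"])
```

Placing the hint in the completion rather than the prompt means the fine-tuned model must produce its own hint at inference time; moving the hint into the prompt would instead model a setting where hints are supplied externally.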

Performance highlights include:

  • HSP-Llemma-7B (Llemma-7B fine-tuned on HSPMATH) attains 64.3% accuracy on GSM8K, a substantial improvement over its zero-shot baseline (36.4%).
  • This result outperforms GPT-3.5 (57.1%) and WizardMath-13B (63.9%) on the same benchmark.
  • Accuracy is computed as

$$\text{Accuracy} = \frac{\#\{\text{correct answers}\}}{M} \times 100\%,$$

where $M$ denotes the number of test examples.
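
The accuracy formula can be sketched in code as below. How final answers are extracted from generated solutions is not described here, so the "last number in the text" heuristic, the tolerance, and the function names are assumptions chosen for illustration.

```python
import re

# Matches integers and decimals, allowing thousands separators (e.g. "1,250").
_NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def extract_final_number(text):
    """Return the last number mentioned in the text, or None if there is none."""
    matches = _NUMBER.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def accuracy(predictions, references, tol=1e-6):
    """Percentage of predictions whose final number matches the reference."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one test example is required")
    correct = 0
    for pred_text, ref_value in zip(predictions, references):
        pred_value = extract_final_number(pred_text)
        if pred_value is not None and abs(pred_value - ref_value) <= tol:
            correct += 1
    return 100.0 * correct / len(references)  # M = len(references)


if __name__ == "__main__":
    preds = ["She earns 9 * 2 = 18", "The answer is 1,250.", "I am not sure."]
    refs = [18, 1250, 3]
    print(round(accuracy(preds, refs), 1))
```

A stricter evaluator might require an explicit answer marker instead of taking the last number; the heuristic here trades precision for simplicity.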

No formal datasheet or detailed account of training hyperparameters (learning rate, batch size, number of epochs, or optimizer specification) is provided.

6. Limitations and Omitted Details

Several standard dataset descriptors are not reported:

  • No explicit per-category (e.g., arithmetic subdomains) or difficulty splits.
  • The distribution of problem types, answer formats, or syntactic diversity beyond paraphrasing remains unspecified.
  • The dataset’s serialization format and access details, apart from code availability on GitHub, are not elaborated.
  • Quality control is limited to GPT-4 hint generation and constrained paraphrasing, with no documented manual verification or error correction.

A plausible implication is that downstream consumers of HSPMATH will need to engage in their own filtering, normalization, or subsetting as necessary for specialized research objectives.
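
One example of such subsetting is a leakage-aware split. Because paraphrases inherit their original's hint verbatim, the hint text can serve as a proxy for the source problem when no explicit identifier is available. The sketch below groups records by hint and assigns whole groups to either a training or a held-out set; the record layout, function names, and hash-based assignment are assumptions, and two distinct problems that happen to share an identical hint would be merged into one group.

```python
import hashlib
from collections import defaultdict


def normalize(text):
    """Collapse runs of whitespace so trivially different copies compare equal."""
    return " ".join(text.split())


def group_by_hint(records):
    """Group records (dicts with a 'hint' key) by their normalized hint text."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize(record["hint"])].append(record)
    return dict(groups)


def split_by_group(records, holdout_fraction=0.1):
    """Assign each hint group wholly to train or held-out, deterministically."""
    train, heldout = [], []
    for hint, members in group_by_hint(records).items():
        digest = hashlib.sha256(hint.encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
        (heldout if bucket < holdout_fraction else train).extend(members)
    return train, heldout


if __name__ == "__main__":
    data = [{"question": f"q{i}", "hint": f"hint {i % 4}", "solution": "s"}
            for i in range(12)]
    train, heldout = split_by_group(data, holdout_fraction=0.5)
    print(len(train), len(heldout))
```

Hashing the hint rather than sampling randomly keeps the split reproducible across runs without storing a seed or an index file.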

7. Impact and Research Significance

HSPMATH operationalizes the Hint-before-Solving Prompting paradigm at scale, providing a resource for systematically studying the benefits of structured hints in mathematical reasoning with LLMs. Its construction methodology, focus on step-by-step CoT with explicit hinting, and large-scale augmentation from a canonical benchmark (GSM8K) contribute to its practical significance. The dataset supports reproducible SFT for LLMs and establishes new baselines on established benchmarks (Fu et al., 2024).

The dataset reflects a broader trend toward decompositional, structure-aware data resources that facilitate model interpretability and explainable reasoning in machine learning for mathematics. The omission of granular metadata and manual curation is partly offset by the transparent data generation pipeline and reported empirical improvements on competitive models.
