MathFusionQA: Fused Math Problem-Solving

Updated 9 April 2026
  • MathFusionQA is a large-scale, instruction-fused mathematical QA dataset designed to enhance compositional and context-sensitive reasoning in language models.
  • It employs sequential, parallel, and conditional fusion strategies to combine problems, enabling multi-problem integration and chained inference.
  • It boosts data efficiency and performance, achieving significant accuracy gains on standard and complex mathematical benchmarks.

MathFusionQA is a large-scale, instruction-fused mathematical question answering (QA) dataset and training protocol designed to enhance the compositional, relational, and context-sensitive reasoning abilities of LLMs. Developed as part of the MathFusion framework, MathFusionQA systematically generates complex, cross-problem instructions by fusing pairs of mathematical problems. This construction enables the development and fine-tuning of LLMs that can more effectively model human-like mathematical proficiency—particularly in tasks requiring chained inference, parallel problem-solving, and conditional decision-making—while maintaining high data efficiency and broad generalization across both standard and challenging mathematical benchmarks (Pei et al., 20 Mar 2025).

1. Conceptual Foundations and Motivation

Traditional mathematical QA datasets, such as GSM8K and MATH, focus on isolated word problems and single-turn solution formats. While LLMs trained on these data have achieved notable zero- and few-shot performance via chain-of-thought prompting, their abilities are largely restricted to instance-level reasoning. These models typically lack relational compositionality: the capacity to synthesize knowledge across multiple related problems or integrate stepwise solutions into higher-order inference, as is characteristic of human mathematical learning.

MathFusionQA addresses this limitation by leveraging cross-problem instruction fusion, introducing complex problem structures that explicitly require composition of subproblem solutions. This is motivated by the observation that data efficiency gains plateau with standard syntactic augmentation—such as rephrasing or single-problem paraphrasing—since these do not capture dependencies or analogies inherent to advanced problem-solving (Pei et al., 20 Mar 2025).

2. Instruction Fusion Strategies

MathFusionQA is constructed by algorithmically synthesizing problem pairs drawn from a base dataset, e.g., GSM8K (grade-school math word problems) or MATH (competition-level mathematics). Every source problem is embedded with a high-capacity text embedding model (OpenAI text-embedding-3-large), and nearest-neighbor retrieval identifies a semantically similar problem for pairing.
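The pairing step can be illustrated with a minimal sketch. It assumes cosine similarity as the retrieval metric (the source does not name one) and uses made-up 2-D vectors in place of real embeddings; `cosine` and `pair_with_nearest` are hypothetical helper names, not part of any released code.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def pair_with_nearest(embeddings):
    """For each index i, return (i, j) where j != i has the highest cosine similarity."""
    pairs = []
    for i, u in enumerate(embeddings):
        best_j = max(
            (j for j in range(len(embeddings)) if j != i),
            key=lambda j: cosine(u, embeddings[j]),
        )
        pairs.append((i, best_j))
    return pairs

# Toy 2-D vectors standing in for real problem embeddings (hypothetical values).
toy_embeddings = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.0, 1.0],
    [0.1, 0.9],
]
pairs = pair_with_nearest(toy_embeddings)
print(pairs)
```

This brute-force search is quadratic in the number of problems; at the scale of a full training split an approximate nearest-neighbor index would likely be the more practical choice.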

Three primary fusion strategies are employed:

  1. Sequential Fusion: The solution to a first problem ($P_A$) is explicitly required as input for a second problem ($P_B$). Formally, the fused problem $P_F^{\mathrm{seq}}$ is structured as $P_B(P_A(\cdot))$, enforcing a dependency chain that models multi-step real-world reasoning.
  2. Parallel Fusion: Two analogous problems are presented side-by-side ($P_A'$ and $P_B'$), requiring the solver (and thus the model) to concurrently solve both, often with additional constraints or algebraic composition. The prompt construction $\Phi(P_A', P_B')$ results in a problem where the final answer typically aggregates or compares the sub-answers.
  3. Conditional Fusion: Both $P_A$ and $P_B$ must be solved separately, followed by a conditional operation, such as a selection or comparison, that determines the final answer. The merged prompt $\Gamma(P_A, P_B)$ targets advanced instruction-following and discriminative assessment.
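To make the three structures concrete, the sketch below models two toy problems as plain functions and composes them in each of the three ways. The problems, the choice of a sum as the parallel aggregate, and the choice of `max` as the conditional selection are all illustrative assumptions; in the dataset itself the fusion is performed on natural-language problem text by an LLM.

```python
def solve_a(rows, per_row):
    """Toy problem A: number of apples in a box with `rows` rows of `per_row` apples."""
    return rows * per_row

def solve_b(apples, removed):
    """Toy problem B: apples left after `removed` apples are taken away."""
    return apples - removed

def sequential(rows, per_row, removed):
    """Sequential fusion: B consumes A's answer, mirroring P_B(P_A(.))."""
    return solve_b(solve_a(rows, per_row), removed)

def parallel(a_args, b_args):
    """Parallel fusion: solve both independently, then aggregate (here: sum)."""
    return solve_a(*a_args) + solve_b(*b_args)

def conditional(a_args, b_args):
    """Conditional fusion: solve both, then select by comparison (here: the larger)."""
    return max(solve_a(*a_args), solve_b(*b_args))

print(sequential(3, 4, 5))           # A gives 12 apples, B removes 5
print(parallel((3, 4), (10, 5)))     # 12 from A plus 5 from B
print(conditional((3, 4), (10, 5)))  # larger of 12 and 5
```

Only the sequential form creates a data dependency between the two sub-answers; the other two differ in how independently obtained answers are combined.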

Each fusion step is automated with a strong teacher LLM (e.g., GPT-4o-mini), followed by step-by-step solution generation and LLM-based filtering for quality assurance. Approximately 5.6% of fused instructions are filtered out for incompleteness or ambiguity before final compilation (Pei et al., 20 Mar 2025).

3. Dataset Construction and Statistics

The MathFusionQA dataset is built upon both the GSM8K and MATH training splits. The systematic pipeline is as follows:

  • Embedding and Pairing: All base problems are embedded in a shared semantic space. Each problem retrieves its nearest non-identical neighbor to form $(P_A, P_B)$ pairs.
  • Fusion and Solution Generation: Each pair is subjected to three fusion templates—sequential, parallel, conditional—via LLM prompting. Step-by-step solutions are also generated in LaTeX format.
  • Filtering and Union: Fused items judged "False" for completeness or clarity by a verification LLM are excluded. Accepted fusion items are unified with the original training data.
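The filtering-and-union step of this pipeline can be sketched as follows. The judge here is a trivial rule-based stand-in for the verification LLM, and the record layout (a dict with a `question` key) is an assumption made for illustration only.

```python
def filter_and_union(original, fused, judge):
    """Keep fused items the judge accepts, then append them to the original data.

    Returns the merged dataset and the number of rejected fused items.
    """
    accepted = [item for item in fused if judge(item)]
    return original + accepted, len(fused) - len(accepted)

def toy_judge(item):
    """Hypothetical stand-in for the verification LLM: rejects empty or unfinished text."""
    text = item["question"].strip()
    return bool(text) and text.endswith("?")

original = [{"question": "What is 2 + 3?"}]
fused = [
    {"question": "Let x be 2 + 3. What is 4x?"},
    {"question": "Let x be 2 + 3. Then compute"},  # incomplete, so it is rejected
]
dataset, n_rejected = filter_and_union(original, fused, toy_judge)
print(len(dataset), n_rejected)
```

Passing the judge in as a callable keeps the merge logic independent of whichever model or prompt performs the actual verification.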

A summary of dataset composition:

| Source / Fusion Type | Number of Samples |
|---|---|
| GSM8K original | 7,500 |
| MATH original | 7,500 |
| Sequential fusion | 15,000 |
| Parallel fusion | 15,000 |
| Conditional fusion | 15,000 |
| Total (MathFusionQA) | 60,000 |
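As a quick arithmetic check on the composition above, the snippet below sums the listed counts and separates the synthetic (fused) portion from the original data. The dictionary keys simply mirror the row labels.

```python
# Sample counts copied from the composition summary above.
composition = {
    "GSM8K original": 7_500,
    "MATH original": 7_500,
    "Sequential fusion": 15_000,
    "Parallel fusion": 15_000,
    "Conditional fusion": 15_000,
}

total = sum(composition.values())
fused_total = sum(n for name, n in composition.items() if "fusion" in name)
original_total = total - fused_total

print(total, original_total, fused_total)
```

The three fusion types together contribute three quarters of the corpus, with the two original training splits making up the remainder.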

Notably, 83% of fused problem pairs from MATH are drawn from identical problem categories, preserving topical coherence (e.g., Geometry–Geometry). The final dataset thus provides a balanced and diverse aggregated corpus of algebra, geometry, combinatorics, and advanced mathematical domains (Pei et al., 20 Mar 2025).

4. Experimental Setup and Training Regimes

MathFusionQA serves as an instruction-tuning resource for a range of LLMs, including DeepSeekMath-7B (mathematics-specialized), Mistral-7B, and Llama3-8B (general-purpose). The fine-tuning process comprises several regimes:

  • Standard: Training only on original data (15,000 samples).
  • Single Fusion Strategy: Augmenting with 15,000 synthetic samples from a single fusion strategy (30,000 total).
  • Combined Fusion: All three fusion types combined with the original data (60,000 samples).
  • Comparative Baselines: Evaluations against augmentation-focused baselines such as MetaMath, MMIQC, RefAug, and DART-Math at identical (60,000) data scales.

Training uses a batch size of 128, 3 epochs, a 4096-token context, and a learning-rate schedule with linear warm-up to the peak value followed by cosine decay. Prompt templates explicitly demarcate question and answer boundaries and elicit stepwise rationales at evaluation (Pei et al., 20 Mar 2025).
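A linear warm-up followed by cosine decay can be written as a small function of the training step. This is a generic sketch, not the authors' training code: the function name `lr_at`, the warm-up length, the decay to zero, and the normalized peak of 1.0 are all placeholders.

```python
import math

def lr_at(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear warm-up to `peak_lr`, then cosine decay to zero."""
    if step < warmup_steps:
        # Ramp linearly so that the last warm-up step reaches the peak.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))

# Placeholder settings: 100 steps, 10 warm-up steps, peak normalized to 1.0.
for step in (0, 9, 10, 55, 100):
    print(step, lr_at(step, total_steps=100, warmup_steps=10, peak_lr=1.0))
```

Multiplying the normalized curve by the actual peak value gives the schedule in absolute terms.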

For evaluation, both in-domain (GSM8K, MATH) and out-of-domain (CollegeMath, DeepMind-Mathematics, OlympiadBench-Math, TheoremQA) benchmarks are used. The principal metric is strict accuracy,

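$$\mathrm{Accuracy} = \frac{\text{number of problems answered correctly}}{\text{total number of problems}}$$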

computed under 0-shot greedy decoding protocols.
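A minimal sketch of a strict accuracy computation is given below. It assumes that "strict" means exact string match on the final answer after trimming whitespace; real math evaluation harnesses usually also normalize equivalent forms (for example `0.5` and `1/2`), which this sketch deliberately omits.

```python
def strict_accuracy(predictions, references):
    """Fraction of predictions that exactly match their reference after stripping whitespace."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)

# Toy final answers (hypothetical): two of three match.
score = strict_accuracy(["12", " 7", "3/4"], ["12", "8", "3/4"])
print(score)
```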

5. Performance Analysis and Ablation

MathFusionQA achieves marked gains in mathematical reasoning accuracy across all tested models and domains. For example, with only 30,000 total training samples (15,000 original + 15,000 sequential-fusion), Llama3-8B's average accuracy across six benchmarks increases from 22.2% to 35.6% (+13.4 points). Applying all three fusion strategies (60,000 total) results in 39.0% accuracy (+16.8 points). Across three base models, the average increment over "Standard" is approximately +18.0 points, using only 45,000 synthetic instructions.

When compared to augmentation-heavy baselines (e.g., DART-Math with 590,000 data points), MathFusionQA achieves comparable or superior performance at a fraction of the data cost (Llama3-8B: 39.0% vs. 37.6%) (Pei et al., 20 Mar 2025).

Ablation studies show that each fusion strategy improves on the Standard baseline when used alone, and that combining all three gives the largest gain. For Llama3-8B, accuracy (%) on MATH / GSM8K is:

  • Standard: 17.5 / 65.4
  • +Sequential: 38.8 / 77.9
  • +Parallel: 38.1 / 75.4
  • +Conditional: 34.7 / 76.9
  • +All three: 46.5 / 79.2
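The point gains over the Standard baseline implied by these figures can be computed directly. The snippet only restates the numbers listed above; no additional results are introduced.

```python
# Accuracy (%) on MATH / GSM8K for Llama3-8B, copied from the list above.
results = {
    "Standard": (17.5, 65.4),
    "+Sequential": (38.8, 77.9),
    "+Parallel": (38.1, 75.4),
    "+Conditional": (34.7, 76.9),
    "+All three": (46.5, 79.2),
}

base_math, base_gsm = results["Standard"]
gains = {
    name: (round(m - base_math, 1), round(g - base_gsm, 1))
    for name, (m, g) in results.items()
    if name != "Standard"
}
for name, (d_math, d_gsm) in gains.items():
    print(f"{name}: MATH +{d_math}, GSM8K +{d_gsm}")
```

On these figures the combined gain on MATH is smaller than the sum of the three single-strategy gains, so the strategies overlap in what they teach rather than stacking linearly.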

Difficulty analysis with Mathstral-7B shows that fused instructions possess higher instruction-following difficulty (IFD) despite lower unconditional perplexity (PPL), indicating greater contextual challenge for models.
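The two difficulty measures can be related through a short sketch. It assumes IFD is taken as the ratio of the answer's perplexity given the instruction to the answer's perplexity without it, which is one common formulation; the source does not spell out the formula. The token log-probabilities below are made-up values, not measurements from Mathstral-7B.

```python
import math

def perplexity(token_logprobs):
    """Perplexity as exp of the mean negative log-likelihood over answer tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def ifd(cond_logprobs, uncond_logprobs):
    """Assumed IFD: PPL(answer | instruction) divided by PPL(answer)."""
    return perplexity(cond_logprobs) / perplexity(uncond_logprobs)

# Hypothetical per-token natural-log probabilities for the same answer tokens.
with_instruction = [-0.5, -0.5]     # answer scored with the instruction as context
without_instruction = [-1.0, -1.0]  # answer scored on its own

score = ifd(with_instruction, without_instruction)
print(score)
```

Under this definition a value closer to 1 means the instruction helps the model less in producing the answer, which is why a higher IFD is read as a harder instruction even when the answer text alone has low perplexity.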

6. Impact, Generalization, and Extensions

MathFusionQA demonstrates that cross-problem instruction fusion supports more robust mathematical generalization and data efficiency than instance-level augmentation. Its fused instructions fill gaps in the mathematical problem space (as shown via t-SNE visualizations), increasing diversity and representation of compositional tasks. Scaling analysis suggests that accuracy grows logarithmically with the number of fused examples, supporting efficient expansion strategies.

Performance complements rather than replaces existing augmentation schemes: combining MathFusionQA with DART-Math-Hard delivers further improvements, supporting the principle of orthogonal data diversity (Pei et al., 20 Mar 2025).

Practical application guidelines include embedding-based retrieval for pairing, teacher LLMs for fusion and solution synthesis, rigorous LLM-based filtering, and integration with other augmentation protocols. Potential extensions include fusion of triplets or larger subgraphs of problems, more sophisticated pair selection (e.g., graph-based), and human-in-the-loop solution verification.

A plausible implication is that MathFusionQA, by training LLMs on relational and compositional problems, advances machine mathematical reasoning in education, research automation, and knowledge retrieval, where multi-stage problem-solving is common. The framework also has open limitations that remain areas for future study: residual dependence on the consistency of the teacher LLM, fusion restricted to pairs of problems, and the weaknesses of embedding-based retrieval.

7. Relationship to Other Mathematical QA Systems

MathFusionQA draws conceptually from earlier math-aware QA systems such as MathQA (Schubotz et al., 2019), which focused on natural language query parsing, Wikidata formula retrieval, and symbolic reasoning via SymPy integration. However, MathQA restricts itself to single-formula retrieval and computation, whereas MathFusionQA is specialized for multi-problem compositionality and deep learning-based instruction tuning.

MathFusionQA may be considered a complementary layer: base symbolic systems such as MathQA provide precise formula retrieval and computability, while MathFusionQA-style datasets and training protocols imbue LLMs with flexible, context-sensitive mathematical reasoning capabilities. This synergy supports the development of systems capable of both retrieval-based and generative, compositionally robust mathematical QA (Pei et al., 20 Mar 2025, Schubotz et al., 2019).
