Generalization of Math LLMs Beyond the MATH Dataset

Determine whether mathematical large language models, trained primarily with general language modeling and fine-tuned or evaluated on the MATH dataset, can reliably solve problems that exceed the difficulty of the MATH benchmark or that are absent from it, thereby establishing their out-of-distribution generalization capability.

Background

In discussing mathematical tasks beyond math word problems, the authors review recent progress in mathematical LLMs and note that these models are typically trained with general language modeling techniques and may benefit from augmenting the training splits of the MATH dataset.

Given this training regime, the authors explicitly state that it remains unclear whether these mathematical LLMs can handle problems that exceed the difficulty of the MATH benchmark or that do not appear in the dataset, highlighting an unresolved question about their ability to generalize beyond the data used for fine-tuning or evaluation.

References

Their ability to handle problems that exceed the difficulty of MATH or those not present in the dataset remains unclear.

Huang et al., "Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective" (arXiv:2510.21999, 24 Oct 2025), Section 6.3 (Other Mathematical Problems).