Generalization of Math LLMs Beyond the MATH Dataset
Determine whether mathematical large language models that are trained primarily with general language modeling and then fine-tuned or evaluated on the MATH dataset can reliably solve two kinds of problems: those harder than the MATH benchmark, and those not represented in that dataset. Answering this would establish their out-of-distribution generalization capability.
References
Their ability to handle problems that exceed the difficulty of MATH or those not present in the dataset remains unclear.
— Foundation of Intelligence: Review of Math Word Problems from Human Cognition Perspective
(2510.21999 - Huang et al., 24 Oct 2025) in Section 6.3 (Other Mathematical Problems)