Genuine advancement of LLM mathematical reasoning and reliability of GSM8K metrics
Determine whether the observed improvements in large language models' (LLMs') accuracy on the GSM8K benchmark reflect genuine advances in mathematical reasoning, and assess how reliably the performance metrics reported on GSM8K measure that ability.
References
While the performance of LLMs on GSM8K has significantly improved in recent years, it remains unclear whether their mathematical reasoning capabilities have genuinely advanced, raising questions about the reliability of the reported metrics.
— GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
(arXiv:2410.05229, Mirzadeh et al., 7 Oct 2024), Abstract