Attribution of benchmark performance to genuine reasoning

Determine whether the superior performance of deep learning systems on the GLUE and SuperGLUE benchmarks is attributable to genuine reasoning capabilities rather than to the learning of shallow heuristic patterns.

Background

The paper notes that AI systems have surpassed human-level performance on widely used NLP benchmarks such as GLUE and SuperGLUE. Despite this success, it emphasizes uncertainty about whether these achievements reflect true reasoning ability or merely exploitation of dataset-specific heuristics.

This uncertainty motivates the broader inquiry into mathematical reasoning and math word problem solving, where genuine reasoning is essential and more easily scrutinized than in general NLP benchmark tasks.

References

However it is not understood if such performance is attributed to underlying reasoning capabilities.

— Towards Tractable Mathematical Reasoning: Challenges, Strategies, and Opportunities for Solving Math Word Problems (2111.05364 - Faldu et al., 2021) in Section 1: Introduction
