Determine the effect of optimized prompting on the reasoning gap
Determine whether and by how much optimized prompting strategies—such as chain-of-thought, tree-of-thought, and chain-of-code—reduce the reasoning gap for language models evaluated on MATH() once the benchmark is fully functionalized.
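To make "by how much" concrete, the following is a minimal sketch of how the comparison could be computed. It assumes the reasoning gap is the relative drop from static to functional accuracy; the function names and all accuracy values below are hypothetical placeholders for illustration, not results from the paper.

```python
def reasoning_gap(static_acc: float, functional_acc: float) -> float:
    """Relative drop from static to functional accuracy, as a fraction.

    Assumed definition: (static - functional) / static.
    """
    if static_acc <= 0:
        raise ValueError("static accuracy must be positive")
    return (static_acc - functional_acc) / static_acc


def gap_reduction(baseline_gap: float, prompted_gap: float) -> float:
    """Absolute reduction in the gap attributable to a prompting strategy."""
    return baseline_gap - prompted_gap


# Hypothetical accuracies (fractions of problems solved), illustration only.
baseline = reasoning_gap(static_acc=0.40, functional_acc=0.10)
with_cot = reasoning_gap(static_acc=0.50, functional_acc=0.25)

print(f"baseline gap:  {baseline:.2%}")
print(f"gap with CoT:  {with_cot:.2%}")
print(f"gap reduction: {gap_reduction(baseline, with_cot):.2%}")
```

A positive reduction would indicate that the prompting strategy narrows the gap; a value near zero would indicate that it raises static and functional accuracy proportionally without closing it.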
References
The value of the gap when using optimized prompting such as chain-of-thought (CoT), tree-of-thought (ToT), chain-of-code (CoC), amongst others, could be lower and we will resolve that open question when we have built the 100% functionalized MATH().
— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
(2402.19450 - Srivastava et al., 29 Feb 2024) in Introduction (paragraph discussing future 100% functionalization of MATH())