Determine the effect of optimized prompting on the reasoning gap

Determine whether and by how much optimized prompting strategies—such as chain-of-thought, tree-of-thought, and chain-of-code—reduce the reasoning gap for language models evaluated on MATH() once the benchmark is fully functionalized.

Background

The authors note that current reasoning gap measurements use simple few-shot prompting and that more sophisticated prompting could lower the gap. They plan to fully functionalize MATH() to enable a rigorous assessment of how such prompting strategies affect the gap, marking this as an open question to be resolved with complete coverage.
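
To make the measured quantity concrete, here is a minimal sketch of how the gap under two prompting strategies could be compared. It assumes the reasoning gap is the percentage drop from static accuracy to functional accuracy. The source text above does not state this definition, so treat it as an assumption. The function names and the accuracy values are hypothetical placeholders, not results from the paper.

```python
def reasoning_gap(static_acc: float, functional_acc: float) -> float:
    """Percentage drop from static to functional accuracy.

    Assumed definition: 100 * (static - functional) / static.
    """
    if static_acc <= 0:
        raise ValueError("static accuracy must be positive")
    return 100.0 * (static_acc - functional_acc) / static_acc


def gap_reduction(baseline_gap: float, optimized_gap: float) -> float:
    """Percentage points by which an optimized prompt lowers the gap."""
    return baseline_gap - optimized_gap


# Placeholder accuracies for illustration only (not measured values).
few_shot_gap = reasoning_gap(static_acc=0.50, functional_acc=0.25)
optimized_gap = reasoning_gap(static_acc=0.50, functional_acc=0.30)
print(gap_reduction(few_shot_gap, optimized_gap))
```

A positive reduction would indicate that the optimized prompting strategy narrows the gap relative to the few-shot baseline. Whether it does so on a fully functionalized MATH() is the open question stated here.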

References

The value of the gap when using optimized prompting such as chain-of-thought (CoT), tree-of-thought (ToT), chain-of-code (CoC), amongst others, could be lower and we will resolve that open question when we have built the 100% functionalized MATH().

— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap (2402.19450 - Srivastava et al., 29 Feb 2024) in Introduction (paragraph discussing future 100% functionalization of MATH())
