Quantify how post-processing/self-inspection techniques impact the reasoning gap
Quantify the extent to which accuracy improvements from self-inspection and pipeline-level techniques (such as indirect reasoning, self-critique, divide-and-conquer, sampled math or code prompting, planner-guided decoding, self-consistency, recursive code template improver, logic-guide-driven inference, self-debug, least-to-most prompting, and self-discover) translate into reductions of the reasoning gap.
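One way to make the question concrete is sketched below in Python. It assumes the reasoning gap is measured as the relative drop from static-benchmark accuracy to functional-benchmark accuracy; that reading of the definition is an assumption here and should be checked against the cited paper. The names `Accuracies`, `reasoning_gap`, and `gap_reduction` are hypothetical, and every accuracy value is made up for illustration rather than taken from any reported result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Accuracies:
    """Accuracy of one model/technique pair (hypothetical container)."""
    static: float      # accuracy on the fixed (static) benchmark questions
    functional: float  # accuracy on the functional variants of those questions


def reasoning_gap(acc: Accuracies) -> float:
    """Relative drop from static to functional accuracy.

    Assumed definition: (static - functional) / static.
    """
    if not 0.0 < acc.static <= 1.0:
        raise ValueError("static accuracy must be in (0, 1]")
    if not 0.0 <= acc.functional <= 1.0:
        raise ValueError("functional accuracy must be in [0, 1]")
    return (acc.static - acc.functional) / acc.static


def gap_reduction(baseline: Accuracies, with_technique: Accuracies) -> float:
    """Positive when the technique lowers the reasoning gap."""
    return reasoning_gap(baseline) - reasoning_gap(with_technique)


# Illustrative numbers only; none of these are measured results.
baseline = Accuracies(static=0.60, functional=0.24)
helps_both = Accuracies(static=0.70, functional=0.35)   # functional gain keeps pace
helps_static = Accuracies(static=0.70, functional=0.28)  # functional gain lags behind

for name, result in [("helps_both", helps_both), ("helps_static", helps_static)]:
    static_gain = result.static - baseline.static
    print(f"{name}: static gain {static_gain:+.2f}, "
          f"gap reduction {gap_reduction(baseline, result):+.2f}")
```

The two hypothetical cases have the same static-accuracy gain, but only the first shrinks the gap. Under this assumed definition, an accuracy improvement by itself therefore says little about whether the reasoning gap has narrowed, which is why the two quantities need to be measured separately.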
References
It is an open problem how much the accuracy improvements using these techniques translate to lowered reasoning gaps.
— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
(2402.19450 - Srivastava et al., 29 Feb 2024) in Related Work, Techniques to improve reasoning in language models (Self-inspection or related post-processing)