Quantify how post-processing/self-inspection techniques impact the reasoning gap

Quantify the extent to which accuracy improvements from self-inspection and pipeline-level techniques (such as indirect reasoning, self-critique, divide-and-conquer, sampled math or code prompting, planner-guided decoding, self-consistency, recursive code-template improvement, logic-guide-driven inference, self-debug, least-to-most prompting, and self-discover) translate into reductions of the reasoning gap.
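
As a concrete illustration (not taken from the paper), the sketch below wraps a base model with self-consistency, one of the listed pipeline-level techniques. The `generate` callable, its return format, and the sampling setup are assumptions made for illustration only, not the paper's implementation.

```python
from collections import Counter

def self_consistency(generate, question, n_samples=10):
    """Sample several reasoning paths and return the majority-vote answer.

    `generate` is a hypothetical callable that returns (reasoning, answer)
    for one sampled completion of `question` (temperature > 0 assumed).
    """
    answers = [generate(question)[1] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```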

Background

Beyond prompting, the paper reviews methods that wrap the core model in an enhanced inference pipeline. While such techniques often improve accuracy, the authors state it is an open problem to determine how these improvements affect the reasoning gap metric, calling for systematic evaluation on functionalized benchmarks.
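
Making the open problem measurable requires comparing the reasoning gap with and without the pipeline technique on a static benchmark and its functionalized variants. The sketch below is a minimal harness assuming a simplified definition of the gap (static accuracy minus functional accuracy); the paper's exact normalization and evaluation setup may differ, and the `solver` and problem interfaces are hypothetical.

```python
def accuracy(solver, problems):
    """Fraction of problems whose answer passes that problem's checker.
    Each problem is assumed to be a dict with a "question" and a "check" callable."""
    correct = sum(1 for p in problems if p["check"](solver(p["question"])))
    return correct / len(problems)

def reasoning_gap(solver, static_problems, functional_problems):
    """Simplified reasoning gap: accuracy drop from the static benchmark to
    its functionalized variants (see the paper for the exact definition)."""
    return accuracy(solver, static_problems) - accuracy(solver, functional_problems)

def gap_reduction(base_solver, wrapped_solver, static_problems, functional_problems):
    """Compare a base model against the same model wrapped in a pipeline-level
    technique, reporting both the raw accuracy gain and the change in the gap."""
    base_gap = reasoning_gap(base_solver, static_problems, functional_problems)
    wrapped_gap = reasoning_gap(wrapped_solver, static_problems, functional_problems)
    acc_gain = (accuracy(wrapped_solver, static_problems)
                - accuracy(base_solver, static_problems))
    return {
        "accuracy_gain_static": acc_gain,
        "gap_before": base_gap,
        "gap_after": wrapped_gap,
        "gap_reduction": base_gap - wrapped_gap,
    }
```

A technique that raises static accuracy while leaving the gap unchanged (or widening it) would improve the headline number without improving robustness, which is precisely the distinction the open problem asks to quantify.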

References

It is an open problem how much the accuracy improvements using these techniques translate to lowered reasoning gaps.

Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap (arXiv:2402.19450, Srivastava et al., 29 Feb 2024), Related Work: Techniques to improve reasoning in language models (Self-inspection or related post-processing).