Train gap-0 language models
Develop training methodologies and model configurations that yield large language models with a reasoning gap of zero on functionalized benchmarks such as MATH(), i.e., models whose functional accuracy, measured across multiple snapshots, matches their static QA accuracy.
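The target quantity is the reasoning gap: the difference between a model's accuracy on the static benchmark and its accuracy on functionalized snapshots of the same problems. The sketch below illustrates that arithmetic under the assumption that functional accuracy is the mean over per-snapshot accuracies; the paper's exact aggregation rule (e.g., requiring a problem to be solved in every snapshot) may differ.

```python
from statistics import mean

def reasoning_gap(static_accuracy: float, snapshot_accuracies: list[float]) -> float:
    """Percentage-point gap between static QA accuracy and functional accuracy.

    `snapshot_accuracies` holds one accuracy per functionalized snapshot
    (each snapshot re-instantiates every problem with fresh inputs).
    NOTE: averaging over snapshots is an assumption made for this sketch;
    a "gap 0" model scores the same whether problems are static or functionalized.
    """
    functional_accuracy = mean(snapshot_accuracies)
    return static_accuracy - functional_accuracy

# Hypothetical numbers: 72% static accuracy vs. ~58% averaged over three
# functional snapshots gives a 14-point gap; the open problem asks for
# training recipes that drive this value to zero.
print(reasoning_gap(72.0, [59.0, 57.5, 57.5]))  # -> 14.0
```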
References
Here we show that models which anecdotally have good reasoning performance over real-world tasks, have quantifiable lower gaps, motivating the open problem of building "gap 0" models.
— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
(arXiv:2402.19450, Srivastava et al., 29 Feb 2024), Abstract; also reiterated in Contributions