Train gap-0 language models
Develop training methodologies and model configurations that yield large language models with a reasoning gap of zero on functionalized benchmarks such as MATH(), i.e., models whose functional accuracy, measured across multiple snapshots, matches their static QA accuracy.
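The target quantity is the reasoning gap: the difference between a model's accuracy on the static benchmark and its accuracy on functionalized snapshots of the same problems. The sketch below illustrates that arithmetic under the assumption that functional accuracy is the mean over per-snapshot accuracies; the paper's exact aggregation rule (e.g., requiring a problem to be solved in every snapshot) may differ.

```python
from statistics import mean

def reasoning_gap(static_accuracy: float, snapshot_accuracies: list[float]) -> float:
    """Percentage-point gap between static QA accuracy and functional accuracy.

    `snapshot_accuracies` holds one accuracy per functionalized snapshot
    (each snapshot re-instantiates every problem with fresh inputs).
    NOTE: averaging over snapshots is an assumption made for this sketch;
    a "gap 0" model scores the same whether problems are static or functionalized.
    """
    functional_accuracy = mean(snapshot_accuracies)
    return static_accuracy - functional_accuracy

# Hypothetical numbers: 72% static accuracy vs. ~58% averaged over three
# functional snapshots gives a 14-point gap; the open problem asks for
# training recipes that drive this value to zero.
print(reasoning_gap(72.0, [59.0, 57.5, 57.5]))  # -> 14.0
```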
References
Here we show that models which anecdotally have good reasoning performance over real-world tasks, have quantifiable lower gaps, motivating the open problem of building "gap 0" models.
— Functional Benchmarks for Robust Evaluation of Reasoning Performance, and the Reasoning Gap
(arXiv:2402.19450, Srivastava et al., 29 Feb 2024), Abstract; also reiterated in Contributions