Do prior conclusions on LM self-refinement hold for new reasoning models?
Determine whether conclusions from earlier analyses of large language model refinement, including the claim that models cannot self-correct without sophisticated feedback, still hold for newly released reasoning-focused models such as DeepSeek-R1 and related systems.
References
Third, a new class of reasoning LMs has emerged and it is unclear whether the conclusions from prior analyses still hold.
— RefineBench: Evaluating Refinement Capability of Language Models via Checklists
(2511.22173 - Lee et al., 27 Nov 2025) in Section 1 (Introduction)