Are large language models capable of refinement?
Determine whether contemporary large language models can refine their own prior responses across successive interaction turns, that is, produce improved outputs over multiple turns after an initial attempt, in the general setting addressed by RefineBench.
References
This begs the question: are LMs even capable of refinement at all? Despite this back-and-forth, the question remains unresolved for three key reasons.
— RefineBench: Evaluating Refinement Capability of Language Models via Checklists
(2511.22173 - Lee et al., 27 Nov 2025) in Section 1 (Introduction)