Are large language models capable of refinement?

Determine whether contemporary large language models can refine their own prior responses across successive interaction turns, that is, produce improved outputs over multiple turns after an initial attempt, in the general setting addressed by RefineBench.

Background

The paper motivates the need to evaluate LLMs' ability to refine their prior responses, noting conflicting evidence: some early methods suggested refinement works while subsequent analyses concluded it does not. Prior evaluations have largely focused on verifiable tasks such as mathematics or code, leaving open-ended and free-form domains underexplored, and the role of feedback control insufficiently examined.

RefineBench is introduced to test this capability across 11 domains, using checklist-based evaluation under two settings: self-refinement and guided refinement. The authors emphasize that, despite progress, the fundamental question of whether LMs can refine at all in broad settings has not been settled.
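As a rough illustration of what a checklist-scored, multi-turn refinement loop might look like, the sketch below may help. It is not RefineBench's implementation. The names `checklist_score` and `run_refinement`, the model signature, and the representation of a checklist as (description, predicate) pairs are all hypothetical. It also assumes that guided refinement means the model is told which checklist items it failed, while self-refinement means it receives no such feedback.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical representation: each checklist item is a description
# plus a predicate that decides whether a response satisfies it.
ChecklistItem = Tuple[str, Callable[[str], bool]]

# Hypothetical model interface: (prompt, previous response, feedback) -> response.
Model = Callable[[str, Optional[str], List[str]], str]


def checklist_score(response: str, checklist: List[ChecklistItem]) -> float:
    """Fraction of checklist items the response satisfies."""
    passed = sum(1 for _, check in checklist if check(response))
    return passed / len(checklist)


def run_refinement(
    model: Model,
    prompt: str,
    checklist: List[ChecklistItem],
    max_turns: int = 3,
    guided: bool = False,
) -> List[float]:
    """Return the checklist score after each turn.

    Assumption: in guided mode the failed item descriptions are passed
    back as feedback; in self-refinement mode the feedback list is empty.
    """
    response = model(prompt, None, [])
    scores = [checklist_score(response, checklist)]
    for _ in range(max_turns - 1):
        if scores[-1] == 1.0:
            break  # every checklist item already passes
        feedback = (
            [desc for desc, check in checklist if not check(response)]
            if guided
            else []
        )
        response = model(prompt, response, feedback)
        scores.append(checklist_score(response, checklist))
    return scores
```

Comparing the per-turn scores returned with `guided=False` against those with `guided=True` gives one simple way to see whether a model improves on its own or only when told what is missing.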

References

This begs the question: are LMs even capable of refinement at all? Despite this back-and-forth, the question remains unresolved for three key reasons.

— RefineBench: Evaluating Refinement Capability of Language Models via Checklists (2511.22173 - Lee et al., 27 Nov 2025) in Section 1 (Introduction)
