RL post-training for free-form, hard-to-verify tasks

Develop reinforcement learning post-training procedures for large language models that apply to tasks requiring open-ended, free-form outputs whose correctness is difficult to verify perfectly.

Background

Recent progress in post-training has largely focused on tasks with clear success criteria, such as exact-answer correctness or human preference alignment. However, many real-world applications involve open-ended or free-form generation where outputs must satisfy numerous, often implicit, rubrics, making perfect verification challenging and expensive.

Enumerating and combining all relevant rubrics is typically intractable or prompt-specific, and relying on static reward models can lead to reward hacking. This motivates the need for a principled reinforcement learning post-training approach that remains effective when exhaustive verification is impractical.
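To make the difficulty concrete, the sketch below (not from the paper) shows a hypothetical static, rubric-based reward for free-form outputs; the rubric names, `check` helpers, and weights are illustrative assumptions. Any fixed, enumerable rubric set like this is necessarily incomplete and prompt-specific, and a policy optimized against the resulting fixed reward can exploit whatever the rubrics fail to check.

```python
from typing import Callable, List, Tuple

# A rubric is (name, check(prompt, output) -> bool, weight).
# In practice the relevant rubrics are numerous, implicit, and
# prompt-specific, so no fixed list like this can cover them all.
Rubric = Tuple[str, Callable[[str, str], bool], float]


def make_static_reward(rubrics: List[Rubric]) -> Callable[[str, str], float]:
    """Combine a fixed set of rubric checks into a scalar reward.

    Because the reward stays fixed while the policy adapts, RL training
    can drift toward outputs that satisfy the enumerated checks while
    violating unlisted requirements (reward hacking).
    """
    total = sum(weight for _, _, weight in rubrics)

    def reward(prompt: str, output: str) -> float:
        score = sum(weight for _, check, weight in rubrics if check(prompt, output))
        return score / total if total > 0 else 0.0

    return reward


# Toy rubric set: clearly incomplete for real free-form generation.
toy_rubrics: List[Rubric] = [
    ("non_empty", lambda p, o: len(o.strip()) > 0, 1.0),
    ("mentions_prompt_subject",
     lambda p, o: p.split()[0].lower() in o.lower() if p.strip() else True, 2.0),
    ("not_too_short", lambda p, o: len(o.split()) >= 30, 1.0),
]

reward_fn = make_static_reward(toy_rubrics)
print(reward_fn("Ada Lovelace biography", "Ada Lovelace was a mathematician ..."))
```

A short output can satisfy the first two checks and score well without being accurate or complete, which is the kind of gap a static reward model leaves open when exhaustive verification is impractical.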

References

Despite these remarkable results, RL post-training is limited to tasks with clear-cut success criteria (i.e., correctness of an answer or preference of a human user), and it remains unclear how to post-train LLMs with RL on tasks that require producing open-ended or free-form outputs that are hard to verify perfectly.

RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks (Wu et al., arXiv:2511.01758, 3 Nov 2025), Section 1 (Introduction)