RL post-training for free-form, hard-to-verify tasks
Develop reinforcement learning (RL) post-training procedures for large language models (LLMs) that apply to tasks requiring open-ended, free-form outputs whose correctness is difficult to verify perfectly.
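To make the contrast concrete, here is a minimal Python sketch, not the RLAC method itself, of the two reward regimes at issue: an exact-match verifiable reward of the kind current RL post-training relies on, versus a learned-critic reward of the kind free-form tasks require. The function names and the `toy_critic` are hypothetical illustrations; the deliberately exploitable toy critic shows why an imperfect verifier is the crux of the problem.

```python
from typing import Callable


def verifiable_reward(response: str, gold_answer: str) -> float:
    """Clear-cut success criterion: exact-match correctness.

    This is the regime where RL post-training already works well,
    e.g. math or code tasks with a checkable final answer.
    """
    return 1.0 if response.strip() == gold_answer.strip() else 0.0


def critic_reward(prompt: str, response: str,
                  critic: Callable[[str, str], float]) -> float:
    """Free-form tasks have no gold answer to match against.

    Instead, a learned critic scores the response. Because the critic
    is itself imperfect, the policy can learn to exploit its blind
    spots rather than to produce genuinely correct outputs.
    """
    return critic(prompt, response)


# Hypothetical toy critic: rewards longer responses. An obviously
# gameable proxy, illustrating how a fixed learned verifier can be
# reward-hacked during RL on free-form generation.
toy_critic = lambda prompt, response: min(len(response) / 100.0, 1.0)


if __name__ == "__main__":
    # Verifiable regime: unambiguous 0/1 signal.
    print(verifiable_reward("42", "42"))  # 1.0
    # Free-form regime: score depends entirely on the critic's quality.
    print(critic_reward("Summarize X.", "X is a system that ...", toy_critic))
```

The sketch makes the gap explicit: once the reward comes from a learned judge rather than a ground-truth check, the design question becomes how to keep that judge from being exploited, which is the setting RLAC's adversarial critic targets.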
References
Despite these remarkable results, RL post-training is limited to tasks with clear-cut success criteria (i.e., correctness of an answer or preference of a human user), and it remains unclear how to post-train LLMs with RL on tasks that require producing open-ended or free-form outputs that are hard to verify perfectly.
— RLAC: Reinforcement Learning with Adversarial Critic for Free-Form Generation Tasks
(2511.01758 - Wu et al., 3 Nov 2025) in Section 1 (Introduction)