Can preference-based feedback recover correct plans from models that initially cannot plan?
Ascertain whether, and under what conditions, preference-based feedback (such as reinforcement learning from human feedback) can enable a model that initially assigns near-zero probability to correct plans to discover and adopt those plans, despite the limitations of teacher-forcing-based gradients.
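To make the difficulty concrete, below is a minimal toy sketch, not taken from the paper: a factorized policy over an exponentially large space of binary plans is trained with outcome-level reward via REINFORCE, used here as a simple stand-in for preference-based feedback. The plan length, logit initialization, and learning rate are illustrative assumptions. Because the single correct plan has near-zero probability, on-policy rollouts essentially never receive reward, so the reward-weighted gradient stays at zero and the policy never moves toward the correct plan.

```python
import numpy as np

# Toy illustration (not from the paper): a factorized "planner" over binary
# plans of length n, trained only with outcome-level (preference-style)
# reward via REINFORCE. Reward is 1 iff the sampled plan matches the one
# correct plan, 0 otherwise. All names and parameters here are illustrative.

rng = np.random.default_rng(0)

n = 20                                       # plan length -> 2**n candidate plans
correct_plan = rng.integers(0, 2, size=n)    # the single rewarded plan

# Policy: independent per-token logits, initialized so the correct plan
# gets near-zero probability (the "cannot plan" regime).
logits = np.where(correct_plan == 1, -4.0, 4.0)  # prefers the wrong token

def token_probs(logits):
    return 1.0 / (1.0 + np.exp(-logits))         # P(token = 1)

def plan_prob(logits, plan):
    p1 = token_probs(logits)
    return np.prod(np.where(plan == 1, p1, 1.0 - p1))

print(f"initial P(correct plan) ~ {plan_prob(logits, correct_plan):.3e}")

lr, steps, hits = 0.5, 20_000, 0
for _ in range(steps):
    p1 = token_probs(logits)
    sample = (rng.random(n) < p1).astype(int)    # on-policy rollout
    reward = float(np.array_equal(sample, correct_plan))
    hits += int(reward)
    # REINFORCE: grad log pi(sample) * reward; zero whenever reward == 0,
    # so the update never points toward the unseen correct plan.
    grad_logp = sample - p1
    logits += lr * reward * grad_logp

print(f"rewarded rollouts in {steps} steps: {hits}")
print(f"final   P(correct plan) ~ {plan_prob(logits, correct_plan):.3e}")
```

With this initialization each correct token has probability about 0.018, so the correct 20-token plan has probability roughly 1e-35; the printed probability is unchanged after training, which illustrates why discovery through preference feedback alone is unclear without additional exploration or denser supervision.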
References
Furthermore, if we desire that the model be able to generate a solution that can plan ahead of time, it is unclear how a model can go from a complete inability to plan (that may assign near-zero probability to the true plan in an exponential space of solutions), to discovering the correct plan simply through preference-based feedback (see Havrilla et al., 2024 for related empirical evidence).
— The pitfalls of next-token prediction
(2403.06963 - Bachmann et al., 2024) in Section: Related Work, Going beyond next-token prediction