Can preference-based feedback recover correct plans from models that initially cannot plan?

Determine whether, and under what conditions, preference-based feedback (such as reinforcement learning from human feedback) can enable a model that initially assigns near-zero probability to the correct plan to discover and adopt that plan, despite the fact that the gradients of such methods effectively reduce to teacher-forcing on model-generated answers.

Background

The paper discusses reinforcement learning and preference-based methods as potential alternatives to plain teacher-forcing. It notes that, in practice, gradients in such methods still reduce to teacher-forcing on model-generated answers, raising doubts about their ability to overcome planning failures when the model initially lacks the correct plan.
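To make this reduction concrete, the gradient of a sampled-reward objective can be written with the standard score-function (REINFORCE) identity; this is textbook material, not an equation taken from the paper:

$$
\nabla_\theta \, \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\big[r(x, y)\big]
= \mathbb{E}_{y \sim p_\theta(\cdot \mid x)}\Big[\, r(x, y) \sum_{t=1}^{|y|} \nabla_\theta \log p_\theta\big(y_t \mid y_{<t}, x\big) \Big]
$$

The right-hand side is a reward-weighted sum of next-token log-likelihood gradients evaluated on the model's own samples, i.e., teacher-forcing on model-generated answers; if none of those samples contains the correct plan, the update carries no information about it.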

The authors explicitly question whether preference-based feedback alone suffices to move a model from a state of near-zero probability on the correct plan to discovering that plan, especially in exponentially large solution spaces typical of planning problems.
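A back-of-the-envelope calculation (a hypothetical illustration with made-up numbers, not an experiment from the paper) shows how quickly this becomes hopeless: if the model must make the right choice at every step of a plan, the chance that any sampled completion matches the true plan, and can therefore be preferred and reinforced, decays exponentially with plan length.

```python
import math

# Hypothetical toy setup (numbers are illustrative only, not from the paper):
# a plan has T decision points with b candidate actions each, and a model that
# "cannot plan" puts probability p_correct on the right action at each step.
T = 20            # plan length (number of decision points)
b = 10            # branching factor, so the solution space has ~b**T candidates
p_correct = 0.05  # per-step probability of picking the correct action
N = 10_000        # number of sampled completions scored by preference feedback

# Probability that one sampled completion reproduces the true plan exactly.
p_plan = p_correct ** T

# Probability that at least one of N samples hits the true plan,
# computed stably as 1 - (1 - p_plan)**N via log1p/expm1.
p_any_hit = -math.expm1(N * math.log1p(-p_plan))

print(f"solution space size ~ b**T        = {b**T:.3e}")
print(f"P(single sample = true plan)      = {p_plan:.3e}")
print(f"P(any of {N} samples = true plan) = {p_any_hit:.3e}")
```

Unless some sample happens to contain the true plan, every preference comparison is made between wrong plans, so the reward-weighted teacher-forcing update above cannot move probability mass onto a plan the model never generates.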

References

Furthermore, if we desire that the model be able to generate a solution that requires planning ahead of time, it is unclear how a model can go from a complete inability to plan (one that may assign near-zero probability to the true plan in an exponential space of solutions) to discovering the correct plan simply through preference-based feedback (see \citep{havrilla2024teaching} for related empirical evidence).

The pitfalls of next-token prediction (Bachmann et al., 2024, arXiv:2403.06963), in Section: Related Work, Going beyond next-token prediction