Does the teacher-forcing failure generalize to standard text-generation tasks?
Determine whether the training-time failure mechanisms of next-token prediction under teacher-forcing that were observed on the path-star graph path-finding task, specifically Clever Hans cheating and the resulting indecipherable early tokens, generalize to run-of-the-mill text-generation tasks.
References
It is also unclear if it generalizes to run-of-the-mill text-generation tasks.
— The pitfalls of next-token prediction
(2403.06963 - Bachmann et al., 2024) in Section: Limitations