Robustness of optimizer behavior versus alignment training in large language models
For large language models trained both to act as optimizers and to align with human intent, determine which of these two trained properties, optimizer behavior or alignment with human intent, tends to be more robust under deployment, and which is more likely to cause failures in practice.
References
LLMs, prior to reinforcement learning, are dynamical systems, but not optimizers. They have to be trained to act as an optimizer, and trained to align with human intent. It is not clear which of these trained properties will tend to be more robust, and which will be most likely to cause failures.
— The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity?
(arXiv:2601.23045, Hägele et al., 30 Jan 2026), Section 1 (Introduction)