Robustness of optimizer behavior versus alignment training in large language models

For large language models trained both to act as optimizers and to align with human intent, determine which of these two trained properties tends to be more robust under deployment and which is more likely to cause failures in practice.

Background

The paper contrasts two trained properties of LLMs: acting as an optimizer and aligning with human intent, noting that neither arises from pretraining alone. It motivates the need to understand which property is more durable and which contributes more to failure modes as models become highly capable.

This uncertainty is central to forecasting AI risk: it distinguishes failures caused by coherent misalignment, where a model consistently pursues the wrong objective (bias-dominated error), from failures caused by inconsistent, goal-incoherent behavior (variance-dominated error). Resolving which property is more robust would inform the prioritization of alignment strategies and risk mitigation.
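
To make the bias/variance framing concrete, here is a minimal Python sketch. It illustrates the standard bias-variance decomposition rather than any method from the paper, and all quantities and parameter values are hypothetical. It models an agent's per-task deviation from intended behavior and shows that mean-squared error splits exactly into a squared-bias term, which dominates for a coherently misaligned agent, and a variance term, which dominates for an erratic, goal-incoherent one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model: on each task, an agent's deviation from intended
# behavior is error = systematic bias + per-task noise.
#   - Coherently misaligned agent: large bias, small noise.
#   - Goal-incoherent ("hot mess") agent: zero bias, large noise.
n_tasks = 100_000
coherent = 1.0 + rng.normal(0.0, 0.1, n_tasks)    # bias-dominated errors
incoherent = 0.0 + rng.normal(0.0, 1.0, n_tasks)  # variance-dominated errors

for name, err in [("coherent", coherent), ("incoherent", incoherent)]:
    mse = np.mean(err ** 2)
    bias_sq = np.mean(err) ** 2
    var = np.var(err)  # ddof=0, so mse == bias_sq + var up to float precision
    print(f"{name}: MSE={mse:.3f} = bias^2 {bias_sq:.3f} + variance {var:.3f}")
```

Both agents can have comparable overall error, yet their failure profiles differ sharply: the first fails predictably in one direction, the second unpredictably in many.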

References

LLMs, prior to reinforcement learning, are dynamical systems, but not optimizers. They have to be trained to act as an optimizer, and trained to align with human intent. It is not clear which of these trained properties will tend to be more robust, and which will be most likely to cause failures.

The Hot Mess of AI: How Does Misalignment Scale With Model Intelligence and Task Complexity? (2601.23045 - Hägele et al., 30 Jan 2026) in Section 1 (Introduction)