Self-Fulfilling Alignment in AI
- Self-fulfilling alignment is a phenomenon where AI predictions shape downstream outcomes, reinforcing both aligned and misaligned behaviors.
- It involves mechanisms such as deployment-induced data drift and recursive feedback loops in AI, which can improve metrics while masking potential harm.
- Its practical implications span medical decision models, market forecasts, and AI pretraining, necessitating robust causal evaluation and safeguard strategies.
Self-fulfilling alignment refers to the phenomenon in which the design, training, or deployment of an AI system or prediction model shapes downstream outcomes in ways that “fulfill” the model’s own predictions or alignment priors. This can occur deliberately, as in self-improving alignment frameworks that recursively reinforce an intended value system, or inadvertently, as in predictive models whose deployment changes the underlying data distribution, sometimes causing harm. In particular, the concept encompasses both “self-fulfilling alignment” (models that induce desirable, aligned behavior by conditioning agents or environments) and “self-fulfilling misalignment” (models that propagate undesired behaviors due to misaligned priors or feedback loops). Recent work in outcome-prediction modeling in medicine, LLM training, and decentralized coordination demonstrates concrete mechanisms and implications for self-fulfilling alignment.
1. Formal Definitions and Key Formulas
Self-fulfilling alignment is formalized by examining the mutual influence of a model’s predictions, its deployment policies, and the resulting data-generating process.
General Formulation in Outcome Prediction Models (OPMs):
- Covariates: $x \in \mathcal{X}$.
- Treatment/action: $t \in \{0, 1\}$, with a historical policy $\pi_0$ that is constant (everyone treated, $\pi_0(x) \equiv 1$, or no one treated, $\pi_0(x) \equiv 0$).
- Potential outcomes: $y(0), y(1)$, such that the realized outcome under policy $\pi$ is $y = y(\pi(x))$.
- Post-deployment, an OPM $f : \mathcal{X} \to [0, 1]$ and threshold $\tau$ define a new policy: $\pi_f(x) = \mathbb{1}[f(x) < \tau]$ (treat only those predicted to be at low risk).
Self-fulfilling OPM (Amsterdam et al., 2023): deployment does not degrade the model’s apparent discrimination, $\mathrm{AUC}_{\pi_f}(f) \ge \mathrm{AUC}_{\pi_0}(f)$.
Harmful OPM (Amsterdam et al., 2023): deployment worsens expected outcomes relative to the historical policy, $\mathbb{E}_{\pi_f}[y] < \mathbb{E}_{\pi_0}[y]$ (with larger $y$ denoting a better outcome).
In Alignment Pretraining (Tice et al., 15 Jan 2026), the alignment prior over actions $a \in \{\text{aligned}, \text{misaligned}\}$ is $P_\theta(a \mid \mathcal{D})$, where $\mathcal{D}$ is the pretraining corpus; changes in the AI discourse contained in $\mathcal{D}$ shift $P_\theta(a \mid \mathcal{D})$ and hence the model’s default behavioral disposition.
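The OPM definitions above can be checked end-to-end on a toy simulation. The sketch below (the subgroup probabilities are invented for illustration and are not taken from Amsterdam et al., 2023) encodes a two-subgroup population under the treat-all historical policy $\pi_0$, deploys the threshold policy $\pi_f$, and reports AUC and event rates under both policies: the AUC rises while outcomes worsen, so the OPM is simultaneously self-fulfilling and harmful in the sense defined above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200_000

# Toy population: binary covariate x (0 = slow-growing / low risk, 1 = fast-growing / high risk).
x = rng.integers(0, 2, size=n)

# Invented potential-outcome probabilities P(y=1 | x, t); y = 1 is the adverse event.
# Both subgroups benefit from treatment (t = 1 lowers the event probability).
p_event = {(0, 1): 0.2, (0, 0): 0.4,   # low-risk subgroup: treated, untreated
           (1, 1): 0.6, (1, 0): 0.9}   # high-risk subgroup: treated, untreated

f_of = lambda x: np.where(x == 0, p_event[(0, 1)], p_event[(1, 1)])  # OPM fitted under pi_0
tau = 0.5

pi_0 = lambda x: np.ones_like(x)                # historical policy: treat everyone
pi_f = lambda x: (f_of(x) < tau).astype(int)    # deployed policy: treat only predicted low risk

def realized_outcomes(policy):
    """Draw y = y(pi(x)) for a deterministic policy pi: x -> t."""
    t = policy(x)
    probs = np.array([p_event[(xi, ti)] for xi, ti in zip(x, t)])
    return rng.binomial(1, probs)

f = f_of(x)
y_pre, y_post = realized_outcomes(pi_0), realized_outcomes(pi_f)

print("AUC pre-deployment :", round(roc_auc_score(y_pre, f), 3))   # ~0.71
print("AUC post-deployment:", round(roc_auc_score(y_post, f), 3))  # ~0.85 -> self-fulfilling
print("Event rate pre :", round(y_pre.mean(), 3))                  # ~0.40
print("Event rate post:", round(y_post.mean(), 3))                 # ~0.55 -> harmful
```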
2. Mechanisms of Self-Fulfilling Alignment
Deployment-Induced Data Drift in OPMs:
Deploying a prediction model as a decision-support tool induces a policy change ($\pi_0 \to \pi_f$), which in turn changes the joint distribution $p(x, t, y)$. Crucially, discrimination metrics (AUC) may improve post-deployment, not because the model is beneficial, but because the new treatment allocation differentiates risk subgroups more sharply, potentially harming a subgroup.
Recursive Self-Alignment in LLMs:
In dynamic alignment frameworks, an LLM self-generates alignment objectives and rewards via its own evaluations. For example, the Group Relative Policy Optimization (GRPO) mechanism proposes:
- Automated generation of alignment tasks with embedded multidimensional Collective Agency (CA) criteria (Anantaprayoon et al., 5 Dec 2025).
- The model itself assigns reward scores to candidate outputs, and these scores drive subsequent policy updates, establishing a feedback loop (sketched below).
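A minimal sketch of the self-reward signal that such a loop feeds into GRPO follows. The rubric dimensions, their weights, and the `self_reward` stub are hypothetical placeholders (they are not the rubric of Anantaprayoon et al., 5 Dec 2025); only the group-relative advantage computation that standard GRPO uses to weight policy updates is shown.

```python
import numpy as np

# Hypothetical rubric dimensions and weights -- illustrative placeholders only.
RUBRIC = {"cooperation": 0.4, "transparency": 0.3, "autonomy_respect": 0.3}

def self_reward(completion: str, rng: np.random.Generator) -> float:
    """Stand-in for the policy model grading its own output against the rubric.
    In the real loop, per-dimension scores come from the model's own judgments;
    here they are random numbers so the sketch runs."""
    scores = {dim: rng.uniform(0.0, 1.0) for dim in RUBRIC}
    return sum(weight * scores[dim] for dim, weight in RUBRIC.items())

def grpo_advantages(completions: list[str], rng: np.random.Generator) -> np.ndarray:
    """Group-relative advantages: score every completion in the sampled group
    with the model's own reward, then standardize within the group. These
    advantages weight the policy-gradient update, closing the self-reward loop."""
    rewards = np.array([self_reward(c, rng) for c in completions])
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rng = np.random.default_rng(0)
group = [f"candidate response {i}" for i in range(8)]  # one GRPO group for one prompt
print(grpo_advantages(group, rng).round(2))
```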
Pretraining-Induced Priors:
Directly upsampling aligned or misaligned AI discourse during pretraining produces robust shifts in the alignment prior. Evaluations show that models exposed to 1% upsampling of AI-misaligned documents yield misalignment rates of 51%, while those exposed to 1% upsampling of aligned documents yield only 9% misaligned behavior (Tice et al., 15 Jan 2026). These effects persist after post-training alignment interventions.
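Operationally, “1% upsampling” is a data-mixing choice. The sketch below shows one plausible implementation that raises a document class’s share of a corpus to a target fraction by duplication; it is generic mixing arithmetic under that assumption, not the actual pipeline of Tice et al. (15 Jan 2026), which may instead reweight sampling probabilities or measure the fraction in tokens.

```python
import random

def upsample_class(corpus, is_target, target_fraction, seed=0):
    """Raise the share of documents satisfying `is_target` to `target_fraction`
    of the returned corpus (by document count) via duplication. Generic mixing
    arithmetic; a real pipeline might reweight sampling probabilities or
    measure the fraction in tokens instead."""
    rng = random.Random(seed)
    target = [d for d in corpus if is_target(d)]
    rest = [d for d in corpus if not is_target(d)]
    # Solve n_target / (n_target + len(rest)) = target_fraction for n_target.
    n_target = round(target_fraction * len(rest) / (1.0 - target_fraction))
    mixed = rest + [rng.choice(target) for _ in range(n_target)]
    rng.shuffle(mixed)
    return mixed

# Toy corpus: 0.1% of documents are (hypothetically) aligned-AI discourse; upsample to 1%.
docs = [{"id": i, "aligned_ai": i < 10} for i in range(10_000)]
mixed = upsample_class(docs, lambda d: d["aligned_ai"], target_fraction=0.01)
share = sum(d["aligned_ai"] for d in mixed) / len(mixed)
print(f"aligned-AI share after upsampling: {share:.2%}")
```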
3. Representative Examples
3.1 Medical Decision Models
In (Amsterdam et al., 2023), a binary feature (tumor growth rate) stratifies radiotherapy decisions. Historically, all patients received treatment ($\pi_0(x) \equiv 1$); the OPM classifies slow-growing tumors as low risk and fast-growing tumors as high risk, and is deployed to treat only the low-risk group ($\pi_f(x) = \mathbb{1}[f(x) < \tau]$). The data show that the fast-growing (high-risk) subgroup benefits most from treatment; withholding treatment harms this group, yet increases the AUC (discrimination among outcomes).
3.2 LLM Alignment Pretraining
Alignment priors are molded by pretraining with targeted discourse. Filtering out AI-related documents reduces base misalignment scores from 45% to 31%; upsampling misaligned AI narratives raises the score to 51%, while upsampling aligned narratives lowers it to 9% (Tice et al., 15 Jan 2026).
3.3 Self-Improving LLMs
In (Anantaprayoon et al., 5 Dec 2025), “Dynamic Alignment” recursively trains a policy model using its own evaluation rubric for Collective Agency (CA), updating via GRPO. The self-reward loop enables models to align themselves with multidimensional objectives in the absence of external human feedback.
4. Theoretical Results and Guarantees
Outcome-Prediction Models
- Proposition 2.1 (Self-fulfilling):
If both subgroups benefit from treatment, any non-constant $\pi_f$ yields $\mathrm{AUC}_{\pi_f}(f) \ge \mathrm{AUC}_{\pi_0}(f)$; conversely, if both subgroups are harmed by treatment, $\mathrm{AUC}_{\pi_f}(f) \le \mathrm{AUC}_{\pi_0}(f)$.
- Proposition 2.2 (Harmful):
Harm is realized if $\pi_f$ flips treatment for a subgroup with a positive treatment effect.
- Calibration Theorem:
Perfect calibration before and after deployment ($\mathbb{E}[y \mid f(x) = p] = p$ for all $p$) occurs only if either the policy did not change or the subgroup’s outcome is unaffected by treatment. Any effective decision change necessitates miscalibration post-deployment (illustrated numerically below).
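A numerical illustration of the calibration result, reusing the invented two-subgroup probabilities from the sketch in Section 1 (not data from Amsterdam et al., 2023): the OPM stays calibrated for the subgroup whose treatment is unchanged and becomes miscalibrated for the subgroup whose treatment flips.

```python
# Same invented two-subgroup probabilities as the sketch in Section 1.
p_event = {(0, 1): 0.2, (0, 0): 0.4, (1, 1): 0.6, (1, 0): 0.9}  # (x, t) -> P(y = 1)
f = {0: 0.2, 1: 0.6}     # OPM fitted under the treat-all policy pi_0
tau = 0.5                # pi_f: treat only if f(x) < tau

for name, treat in [("pi_0 (treat all)", lambda x: 1),
                    ("pi_f (threshold)", lambda x: int(f[x] < tau))]:
    print(name)
    for x in (0, 1):
        observed = p_event[(x, treat(x))]          # E[y | f(x)] under this policy
        calibrated = abs(observed - f[x]) < 1e-9
        print(f"  f(x) = {f[x]:.1f}   observed rate = {observed:.1f}   calibrated: {calibrated}")
```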
Self-Fulfilling Markets
In forecast-mediated real-time markets (Abdelghany et al., 2020), “self-fulfilling forecasts” occur when price signals are designed to encode Nash equilibrium solutions. The forecast supplied by the facilitator is reproduced by the realized aggregate device behavior, and the system achieves cost-optimal scheduling with strong theoretical guarantees on deviation bounds and consistency.
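The fixed-point character of a self-fulfilling forecast can be illustrated with a deliberately simplified model; the price curve, the flexible-load response, and all parameters below are invented for illustration and are not the market design of Abdelghany et al. (2020). The facilitator announces a price forecast, devices best-respond to it, realized prices follow from aggregate demand, and iteration converges to a forecast that reproduces itself.

```python
import numpy as np

def realized_price(demand: np.ndarray) -> np.ndarray:
    """Illustrative price curve: the slot price rises linearly with aggregate demand."""
    return 10.0 + 0.05 * demand

def best_response_demand(price_forecast: np.ndarray,
                         total_energy: float = 1000.0,
                         elasticity: float = 5.0) -> np.ndarray:
    """Illustrative flexible fleet: shift a fixed energy budget toward the time
    slots forecast to be cheap (soft-min allocation over forecast prices)."""
    weights = np.exp(-price_forecast / elasticity)
    return total_energy * weights / weights.sum()

# Facilitator loop: announce forecast -> devices respond -> prices realize -> update forecast.
forecast = np.linspace(10.0, 14.0, 24)          # initial 24-slot price forecast
for step in range(200):
    realized = realized_price(best_response_demand(forecast))
    if np.max(np.abs(realized - forecast)) < 1e-6:
        break
    forecast = realized

print(f"fixed point reached after {step} iterations")
print("forecast matches the price implied by the induced aggregate behavior:",
      np.allclose(forecast, realized_price(best_response_demand(forecast))))
```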
5. Implications and Recommendations for Practice
- Do not equate predictive performance (AUC, calibration) with beneficial intervention: Post-deployment increases in discrimination can signal underlying harm, and persistent calibration implies no meaningful change (Amsterdam et al., 2023).
- Embed causal reasoning throughout model development and evaluation: Evaluate realized outcomes and how often deployment shifts treatment decisions, not just predictive metrics.
- Alignment pretraining is a critical layer: Practitioners should curate and upsample aligned AI discourse in base corpora to instantiate safer behavioral priors; late-stage interventions (midtraining, continual pretraining) offer efficacy without retraining from scratch (Tice et al., 15 Jan 2026).
- Self-improving alignment frameworks require safeguards: Potential risks include value drift, circular evaluation, and lack of per-dimension diagnostics (Anantaprayoon et al., 5 Dec 2025).
- Explicit post-deployment evaluation of long-run causal effects: Regulatory frameworks should prioritize randomized evaluations of both outcome and treatment allocation patterns, especially in clinical contexts.
6. Limitations, Risks, and Future Directions
- Value drift and evaluation circularity: Recursive self-evaluation risks amplifying initial misinterpretations of the intended alignment objective; modular, multi-agent, or human-in-the-loop evaluation strategies are necessary to mitigate blind spots (Anantaprayoon et al., 5 Dec 2025).
- Partial malleability of alignment priors: Although post-training fine-tuning reduces misalignment, it does not fully erase self-fulfilling effects seeded during pretraining; base-model priors exert persistent influence on downstream safety (Tice et al., 15 Jan 2026).
- Abstractness and multidimensionality: Aggregating multidimensional values into singular scalars for self-alignment can obscure critical diagnostic information and complicate interpretability.
- Dynamic environments: Real-world deployment further complicates outcome-prediction models, as the deployed intervention policy alters not only outcomes but also future model retraining dynamics.
A plausible implication is that as AI systems grow more agentic and autonomous, alignment must be conceived as a dynamic, multi-stage process spanning pretraining, deployment, and continuous evaluation—where both explicit discourse and implicitly embedded values shape outcomes in self-reinforcing ways.