Capability-Alignment Paradox
- Capability-Alignment Paradox is a recurring AI phenomenon where local improvements in performance or control can inadvertently increase risks such as adversarial exploitation, systemic deskilling, and broader misalignment.
- Dynamic curriculum strategies such as ZPD selection show that matching training-data difficulty to a model's current capability boosts data efficiency; one study reports that 10% ZPD-selected data can match or beat full-data training.
- The paradox also emerges in human-AI symbiosis and representation-level contexts, emphasizing the need for layered governance and control mechanisms to decouple capability gains from increased safety or systemic risks.
The capability-alignment paradox denotes a recurrent pattern in AI research in which improvements in capability, controllability, or local objective satisfaction can worsen alignment under a broader or different criterion, and in some cases make misalignment easier to induce, harder to detect, or more costly to prevent. Across current literature, the paradox appears in several technically distinct forms: capability–difficulty mismatch in training curricula, local user-level alignment that induces long-run human deskilling, strong benchmark performance without structural or scientific faithfulness, capability gains that expand cyber risk when governance lags, and alignment procedures that expose clean behavioral directions exploitable by adversaries (Yang et al., 16 Jan 2026). The term does not denote a single theorem or threat model; rather, it names a family of tensions between optimization for performance and optimization for safety, reliability, robustness, or societal resilience.
1. Conceptual scope and core definitions
In recent work, the paradox is framed as a mismatch between what is being optimized and what “alignment” means at the relevant level of analysis. In "AI Alignment" Encompasses Competing Technical Priorities (Jha et al., 12 Jun 2026), three high-level alignment ideals are distinguished: Task Reliability, under which “an AI is aligned if it does that which we want it to do”; Social Judiciousness, under which a system is misaligned if its outputs “create, perpetuate, or exacerbate undesirable societal trends”; and Takeover Avoidance, under which a system is misaligned if it “optimizes for undesirable effects in the real world.” This taxonomy matters because an intervention can improve one ideal while being “actively counterproductive” under another. The same paper also distinguishes Harms from Misdirected Competence from Harms from Incompetence, and contrasts positive alignment with negative alignment, noting that these distinctions are practically significant even where the corresponding logical forms are equivalent.
A closely related distinction appears in "Position: Capability Control Should be a Separate Goal From Alignment" (Siddiqui et al., 5 Feb 2026). There, capability control is defined as “the set of methodological interventions applied to a foundation model—at the data, learning, or system level—aimed at defining and enforcing operational boundaries on permissible model behaviors,” whereas alignment is described as context- and preference-driven. The paper argues that capability control aims to impose “hard limits on what the system is allowed to do,” including under adversarial elicitation, while alignment remains inherently context-sensitive. This suggests that the paradox often arises because a system can be well aligned in the sense of following intended preferences while remaining insufficiently capability-controlled in high-risk domains.
The same general structure appears in earlier discussions of LLMs. "The Alignment Problem in Context" (Millière, 2023) argues that existing alignment strategies are insufficient because LLMs remain vulnerable to adversarial attacks, and ties that vulnerability to the very mechanisms that make them useful, especially in-context learning and instruction following. The claimed trade-off is not merely that alignment is incomplete, but that robustly constraining harmful instructions may be intrinsically difficult “without severely undermining” usefulness and versatility. "The AI Alignment Paradox" (West et al., 2024) states the idea even more directly: “The better we align AI models with our values, the easier we may make it for adversaries to misalign the models.”
Taken together, these formulations indicate that the paradox is best understood as a multi-level incompatibility between local optimization targets and broader system goals. This suggests that the same intervention can count simultaneously as alignment progress, capability amplification, and risk creation, depending on which target property, system boundary, and threat model are held fixed.
2. Capability–difficulty alignment in learning systems
One technical instantiation concerns data selection and curriculum design for model training. "ZPD Detector: Data Selection via Capability-Difficulty Alignment for LLMs" (Yang et al., 16 Jan 2026) translates Vygotsky’s Zone of Proximal Development into LLM training by distinguishing too-easy samples, too-hard samples, and ZPD samples. For each instruction–output pair, raw difficulty is defined by the token-level negative log-likelihood of the output, then calibrated using correctness feedback and normalized to a Rasch difficulty $b$. Model capability is represented by a one-parameter logistic IRT ability parameter $\theta$, under which the probability of a correct response is
$$P(\text{correct} \mid \theta, b) = \frac{1}{1 + e^{-(\theta - b)}},$$
and a capability–difficulty matching score is computed from $\theta$ and $b$. This score is maximized when $\theta = b$, corresponding to items near the model’s current competence boundary.
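The selection idea can be sketched in a few lines. This is only an illustration under stated assumptions: the matching score used here is the Fisher information of a Rasch item, p(1 - p), chosen because it peaks when ability equals difficulty; the paper's exact score and its difficulty calibration are not reproduced. The names `p_correct`, `match_score`, and `select_zpd`, and the toy difficulty values, are hypothetical.

```python
import math


def p_correct(theta: float, b: float) -> float:
    """1PL (Rasch) probability that ability `theta` solves an item of difficulty `b`."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))


def match_score(theta: float, b: float) -> float:
    """ASSUMED matching score: Rasch Fisher information p * (1 - p), peaking at theta == b."""
    p = p_correct(theta, b)
    return p * (1.0 - p)


def select_zpd(difficulties: list[float], theta: float, budget: float = 0.10) -> list[int]:
    """Return the indices of the top `budget` fraction of items by matching score."""
    k = max(1, round(budget * len(difficulties)))
    ranked = sorted(
        range(len(difficulties)),
        key=lambda i: match_score(theta, difficulties[i]),
        reverse=True,
    )
    return sorted(ranked[:k])


# Hypothetical Rasch difficulties for 20 items, spaced 0.3 apart starting at -3.0.
difficulties = [-3.0 + 0.3 * i for i in range(20)]

# A weaker and a stronger model select different bands from the same pool.
selected_weak = select_zpd(difficulties, theta=-1.0)
selected_strong = select_zpd(difficulties, theta=1.6)
```

Calling `select_zpd` again with an updated ability estimate is one simple way to picture a curriculum refresh: the selected band moves toward harder items as the estimate grows.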
The paradox appears because strong alignment between capability and difficulty is locally beneficial, but rigid or static alignment can become globally suboptimal. The paper reports that selecting data in the ZPD region substantially improves data efficiency and stability: on many configurations, “10% ZPD-selected data matches or beats 100% full-data training,” and on GSM8K with Qwen3‑8B, full-data EM is 90.43 while 10% ZPD-selected EM is 90.98 (Yang et al., 16 Jan 2026). The same study shows that easy-only and hard-only selection are both inferior to ZPD selection under a fixed 10% budget. For Qwen3‑8B, the reported scores are: MedQA—EASY 58.84, HARD 52.24, ZPD 60.57; GSM8K—EASY 81.12, HARD 87.64, ZPD 90.98; AgriQA—EASY 90.66, HARD 91.35, ZPD 92.03. Gradient analysis further associates easy items with small gradient norms, hard items with large but highly dispersed gradients, and ZPD-region items with intermediate magnitude and more concentrated distributions.
The same paper also shows that capability–difficulty alignment must move over time. Static one-shot ZPD selection improves performance, but curriculum refresh, in which the ZPD band is periodically recomputed during training, yields additional gains. More capable backbones also select difficulty ranges that are harder in absolute terms than those selected by weaker models, indicating that alignment is strictly relative to current ability. This yields a local-versus-global tension: at any fixed capability level, items near the model's current ability are most informative, yet long-run improvement requires shifting the band as capability grows. A plausible implication is that the paradox in curriculum design is not between capability and alignment per se, but between static alignment and dynamic capability growth.
A related multi-agent version appears in "Bridging the Capability Gap: Joint Alignment Tuning for Harmonizing LLM-based Multi-Agent Systems" (Zhu et al., 11 Sep 2025). That paper argues that independently fine-tuning specialized planning and grounding agents creates capability gaps and miscoordination. Its MOAT framework alternates between Planning Agent Alignment, using grounding perplexity as a preference signal for subgoal generation, and Grounding Agent Improving, fine-tuning the grounding agent on critic-filtered subgoal–action pairs generated from the planner’s own outputs. Theoretical analysis proves a non-decreasing and convergent training process, and experiments across six benchmarks report average improvements of 3.1% on held-in tasks and 4.4% on held-out tasks. This provides a system-level analogue of the same point: increasing component capabilities in isolation can worsen overall alignment of the composite system.
3. Human–AI symbiosis, governance, and deployment
A second major form of the paradox arises when locally rational delegation to capable AI degrades the capability of the surrounding human or organizational system. "The enrichment paradox: critical capability thresholds and irreversible dependency in human-AI symbiosis" (Park et al., 25 Mar 2026) models human capability and delegation by a pair of coupled ODEs in which AI capability enters relative to human capability. The three axioms are explicit: learning requires capability, learning requires practice, and disuse causes forgetting. Delegation increases when AI outperforms humans, and higher delegation simultaneously reduces practice and accelerates forgetting. The paper identifies a critical threshold, defined by maximal sensitivity of equilibrium human capability to changes in AI capability, and reports a baseline estimate of that threshold together with equilibrium capability levels at several settings. Broader AI scope lowers the threshold, and the dependent state is a stable attractor. The model also reports that periodic AI failures improve capability 2.7-fold and that 20% mandatory practice preserves 92% more capability than the simulation baseline, which includes a 5% background AI-failure rate.
The core paradox here is structural rather than adversarial. At the local user level, a highly capable and well-aligned AI makes delegation rational; at the system level, that same rational delegation can push a population across a threshold into deskilling and irreversible dependency. The paper explicitly interprets this as a case where tool-level alignment produces system-level misalignment with goals such as resilience, autonomy, and human back-up capability.
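The qualitative mechanism can be illustrated with a toy simulation. This sketch does not reproduce the paper's equations, parameters, or reported numbers: the functional forms, the rate constants, and the names `simulate` and `min_practice` are all assumptions, chosen only to respect the three stated axioms (learning needs capability, learning needs practice, disuse causes forgetting) and the rule that delegation rises when AI outperforms the human.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def simulate(ai_capability: float, min_practice: float = 0.0,
             steps: int = 20000, dt: float = 0.01) -> float:
    """Euler-integrate a TOY capability/delegation model; return final human capability."""
    h, d = 0.8, 0.0  # initial human capability and delegation (hypothetical)
    learn, forget_base, forget_disuse = 1.0, 0.05, 0.1  # hypothetical rates
    adapt, sharpness = 1.0, 5.0  # delegation adjustment speed and sensitivity
    for _ in range(steps):
        # Delegation drifts toward a target that rises when AI outperforms the human,
        # capped so that a fraction `min_practice` of the work is always done by hand.
        target = min(sigmoid(sharpness * (ai_capability - h)), 1.0 - min_practice)
        # Learning needs capability (h) and practice (1 - d); delegation adds forgetting.
        dh = learn * h * (1.0 - h) * (1.0 - d) - (forget_base + forget_disuse * d) * h
        dd = adapt * (target - d)
        h += dt * dh
        d += dt * dd
    return h


h_weak_ai = simulate(ai_capability=0.2)                       # AI below human level
h_strong_ai = simulate(ai_capability=2.0)                     # AI far above human level
h_with_practice = simulate(ai_capability=2.0, min_practice=0.2)
```

In this toy setting, weak AI leaves human capability high, strong AI drives it toward zero, and a mandatory practice floor holds it at an intermediate level. The magnitudes depend entirely on the assumed constants.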
A related organizational version appears in "The Security Cost of Intelligence: AI Capability, Cyber Risk, and Deployment Paradox" (Choi, 24 Apr 2026). There, a firm chooses deployment scope and security investment, while AI capability is coupled to authority exposure. The baseline model specifies conditional breach damage and expected profit, from which an optimal deployment level is derived. The deployment paradox is the region in which optimal deployment falls as AI capability rises: in high-loss environments, better AI leads a firm to deploy less, because capability improvements arrive bundled with greater authority exposure under weak governance. Governance investment that weakens this coupling shrinks or eliminates the paradox region, while breach externalities widen the socially constrained region of deployment. This gives a formal economic version of the same basic pattern: capability gains can be privately or socially unusable unless governance decouples capability from damage.
These results indicate that the capability-alignment paradox is often a systems problem rather than a model problem. A plausible implication is that alignment defined only at the model or user interface level is incomplete whenever capability improvements alter feedback loops in institutions, labor allocation, or authority structure.
4. Representation-level, scientific, and geometric manifestations
Several papers identify the paradox at the level of internal representations. "The Perception-Physics Paradox: Probing Scientific Alignment with TC-Bench" (Yao et al., 23 May 2026) argues that vision foundation models (VFMs) can achieve strong predictive performance and perception-based out-of-distribution generalization while collapsing the physical degrees of freedom required for scientific reasoning. The paper formalizes scientific alignment through structural isomorphism: there should exist an injective linear map relating the physical state to the latent representation, with uniformly bounded residual magnitude and Jacobian across regimes. From this, the authors derive three necessary conditions: Static Fidelity, Dynamic Coherence, and Manifold Consistency. They operationalize these as linear probes for central pressure, intensity change, and pressure–wind–latitude consistency in tropical cyclones using TC-Bench. The reported findings are that current VFMs degrade in intense regimes, where physical constraint violation is higher than in moderate regimes, and that effective latent dimensionality drops by about 60% in intense bins. The conclusion is that scientific alignment does not arise as a natural byproduct of scaling alone.
This is a representational analogue of the paradox: benchmark capability and even some forms of OOD generalization are poor proxies for alignment with the underlying structure of the domain. The system looks capable while being misaligned with the causal or physical substrate that matters for intervention and scientific use.
A multimodal precursor appears in "Probing Cross-modal Semantics Alignment Capability from the Textual Perspective" (Ma et al., 2022). That paper uses a captioning model optimized with each VLP model’s own image–text alignment score as reward and finds that the resulting captions are often ungrammatical but score higher than fluent captions. The results show that VLP models focus on object-word alignment while neglecting global semantics, prefer fixed sentence patterns, and regard captions with more visual words as better aligned. The paper reports, for example, that UNITER’s score rises from 71.6 on CE captions to 80.5 on reconstructed template captions, even though the latter are semantically and grammatically worse. This provides a clear case in which a model’s internal alignment signal is itself misaligned with the human notion of semantic correspondence.
A more abstract geometric treatment is given in "What Is the Geometry of the Alignment Tax?" (Young, 9 Feb 2026). Under a linear representation hypothesis, safety is modeled as a direction and capabilities as a subspace. The alignment tax rate is defined as the squared projection of the safety direction onto capability space. In the single-capability case, the paper gives the exact Pareto frontier in terms of the angle between safety and capability and the perturbation budget. This yields several regimes: low tax when safety is nearly orthogonal to capabilities, strong trade-off when they are nearly collinear, and even negative tax when capability targets have positive projection onto the safety direction. The paper also derives a scaling law with two terms: an irreducible component determined by intrinsic overlap in data structure and a packing residual that vanishes with scale under random packing assumptions. This formalism explains why some capabilities become easier to align with scale while others remain fundamentally entangled with safety objectives.
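The tax rate as described, a squared projection of the safety direction onto capability space, can be computed directly. The sketch below assumes the safety direction is normalized to unit length (so the rate lies between 0 and 1) and builds an orthonormal basis for the capability subspace with Gram-Schmidt. The function names and example vectors are hypothetical, and the negative-tax regime is not modeled.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: list[list[float]], tol: float = 1e-12) -> list[list[float]]:
    """Modified Gram-Schmidt; linearly dependent vectors are dropped."""
    basis: list[list[float]] = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = math.sqrt(dot(w, w))
        if norm > tol:
            basis.append([wi / norm for wi in w])
    return basis


def alignment_tax_rate(safety: list[float], capabilities: list[list[float]]) -> float:
    """Squared norm of the projection of the unit safety direction onto span(capabilities)."""
    norm = math.sqrt(dot(safety, safety))
    s = [x / norm for x in safety]  # ASSUMPTION: safety direction is unit-normalized
    return sum(dot(s, q) ** 2 for q in orthonormal_basis(capabilities))


safety = [1.0, 0.0, 0.0]
angle = math.pi / 3  # angle between safety and a single capability direction
tax_single = alignment_tax_rate(safety, [[math.cos(angle), math.sin(angle), 0.0]])
tax_orthogonal = alignment_tax_rate(safety, [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
tax_collinear = alignment_tax_rate(safety, [[2.0, 0.0, 0.0]])
```

With one capability direction the rate reduces to the squared cosine of the angle, which matches the regimes described above: near zero when the directions are orthogonal and near one when they are collinear.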
These representation-centered accounts converge on a common claim: capability and alignment are often encoded in overlapping but not identical subspaces, and the empirical behavior of a system depends on the geometry of those overlaps. This suggests that the paradox is not merely empirical or sociological; in some settings it is a property of the latent organization of the model itself.
5. Adversarial inversion, feedback bottlenecks, and competing priorities
Another line of work emphasizes that alignment procedures can create more exploitable interfaces for adversaries. "The AI Alignment Paradox" (West et al., 2024) argues that the better a model is aligned, the easier it may be to misalign. Three mechanisms are described. In model tinkering, a behavior can be shifted by adding a steering vector to the model's internal state or subtracting it from that state. In input tinkering, jailbreaks exploit the model’s improved instruction-following and persona simulation. In output tinkering, an aligned model can be used to generate aligned–misaligned text pairs that train a “value editor” reversing the model’s values externally. The paper’s central statement is that “more virtuous AI is more easily made vicious.”
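The model-tinkering mechanism can be pictured with a few vectors. The sketch below assumes the steering vector is a difference of mean activations between aligned and misaligned examples, which is one common construction and not necessarily the one used in the paper. The activation values and the names `steering_vector`, `steer`, and `projection` are hypothetical.

```python
def mean_vector(rows: list[list[float]]) -> list[float]:
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def steering_vector(aligned_acts: list[list[float]],
                    misaligned_acts: list[list[float]]) -> list[float]:
    """ASSUMED construction: difference of class means, pointing toward 'aligned'."""
    mu_a = mean_vector(aligned_acts)
    mu_m = mean_vector(misaligned_acts)
    return [a - m for a, m in zip(mu_a, mu_m)]


def steer(hidden: list[float], direction: list[float], alpha: float) -> list[float]:
    """Shift a hidden state along the steering direction; negative alpha reverses it."""
    return [h + alpha * v for h, v in zip(hidden, direction)]


def projection(hidden: list[float], direction: list[float]) -> float:
    return sum(h * v for h, v in zip(hidden, direction))


# Hypothetical activations from a single layer, three dimensions each.
aligned = [[1.0, 0.2, 0.0], [0.8, 0.1, 0.1]]
misaligned = [[-0.9, 0.2, 0.0], [-1.1, 0.0, 0.1]]

v = steering_vector(aligned, misaligned)
h = [0.1, 0.3, -0.2]
toward_aligned = steer(h, v, alpha=1.0)
toward_misaligned = steer(h, v, alpha=-1.0)
```

The same vector serves both purposes: only the sign of `alpha` changes. That symmetry is the point of the argument, since a cleaner aligned direction is also a cleaner direction to subtract.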
"The Alignment Bottleneck" (Cao, 19 Sep 2025) gives a different but complementary explanation. It models feedback-based alignment as a two-stage cascade
$U \to H \to Y$ given $S$, where $U$ is the latent true target, $H$ is internal human judgment, $Y$ is observable feedback, and $S$ is context. The relevant capacities are the cognitive capacity $C_{\text{cog}|S}$, the articulation capacity, the per-context total capacity, and the average total capacity $\bar{C}_{\text{tot}|S}$. By data processing, the information that the feedback carries about the target is limited by these capacities. The paper then derives a Fano-style lower bound on risk over a separable codebook mixture, which is independent of dataset size $m$, and combines it with a PAC-Bayes upper bound whose KL term is controlled by the same channel via $m\,\bar{C}_{\text{tot}|S}$. The central implication is that, once the useful information about human values saturates the channel, stronger models and optimization pressure fit residual channel regularities instead, producing phenomena such as sycophancy and reward hacking. Here the paradox is that more capable models are better able to exploit the alignment interface without gaining more true information about the target.
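The data-processing step can be checked numerically on a small example. The sketch below builds a two-stage discrete cascade (target to judgment, judgment to feedback) from two binary symmetric channels and verifies that the end-to-end mutual information is no larger than that of either stage. The channel noise levels are hypothetical, context is omitted for brevity, and the quantities are mutual informations under a fixed uniform prior, not a general capacity computation.

```python
import math


def mutual_information(joint: list[list[float]]) -> float:
    """Mutual information in bits for a joint distribution given as a 2-D list."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        p * math.log2(p / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, p in enumerate(row)
        if p > 0
    )


def joint_from_channel(prior: list[float], channel: list[list[float]]) -> list[list[float]]:
    """Joint distribution of (input, output) for a row-stochastic channel matrix."""
    return [[prior[i] * channel[i][j] for j in range(len(channel[i]))]
            for i in range(len(prior))]


def compose(first: list[list[float]], second: list[list[float]]) -> list[list[float]]:
    """Channel obtained by feeding the output of `first` into `second`."""
    return [[sum(first[i][k] * second[k][j] for k in range(len(second)))
             for j in range(len(second[0]))]
            for i in range(len(first))]


prior = [0.5, 0.5]                        # uniform prior over the latent target
cognitive = [[0.9, 0.1], [0.1, 0.9]]      # target -> judgment (hypothetical noise)
articulation = [[0.8, 0.2], [0.2, 0.8]]   # judgment -> feedback (hypothetical noise)

judgment_marginal = [sum(prior[i] * cognitive[i][k] for i in range(2)) for k in range(2)]

i_target_judgment = mutual_information(joint_from_channel(prior, cognitive))
i_judgment_feedback = mutual_information(joint_from_channel(judgment_marginal, articulation))
i_target_feedback = mutual_information(
    joint_from_channel(prior, compose(cognitive, articulation))
)
```

However much data is collected through this cascade, each feedback sample carries at most the information that survives both stages, which is the intuition behind a bound that does not improve with dataset size.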
A broader conceptual version appears in "The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment" (Balasubramanian et al., 7 Apr 2026). That paper treats capabilities as directions in low-dimensional latent subspaces transferable across models by a linear alignment map. The UNLOCK framework extracts a capability direction from Source models, aligns Source and Target subspaces via low-rank linear regression, and injects the mapped direction into Target activations at inference time. Reported gains include a 12.1% increase on MATH when transferring CoT from Qwen1.5-14B to Qwen1.5-7B, and an increase in AGIEval Math accuracy from 61.1% to 71.3% when transferring a mathematical reasoning direction from Qwen3-4B-Base to Qwen3-14B-Base. Although the paper is not framed as alignment work, it reinforces a critical structural point: if capabilities are portable directions in latent space, then alignment-relevant behaviors may also be portable or invertible across systems.
The paradox is therefore intensified by modularity and transferability. This suggests that sharper internal organization, stronger preference-following, and better capability decomposition can all increase the attack surface for inversion, steering, or channel exploitation.
6. Responses, partial resolutions, and open problems
The literature does not converge on a single resolution. Instead, it offers domain-specific strategies that mitigate some forms of the paradox while leaving others intact. In "LLM's Multi-Capability Alignment in Biomedical Domain" (Wu et al., 6 Aug 2025), the BalancedBio framework attempts to make capability gains and alignment mutually reinforcing by treating domain expertise, reasoning, and instruction-following as separate objectives with approximately orthogonal gradients. The paper couples this with Group Relative Policy Optimization, hybrid reward weighting, and Medical Knowledge Grounded Synthetic Generation. It reports 80.95% on BIOMED-MMLU, 61.94% reasoning, 67.95% instruction following, and an Integration Score of 86.7%, along with a safety metric of 98.7% under full quality control. The claimed lesson is that capability interference can be reduced when safety-relevant capabilities are explicit first-class objectives rather than post-hoc constraints.
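One simple way to monitor whether per-objective gradients are approximately orthogonal is to track their pairwise cosine similarity. The sketch below is a generic diagnostic and not the BalancedBio procedure; the gradient values and the name `max_pairwise_interference` are hypothetical.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def max_pairwise_interference(grads: dict[str, list[float]]) -> float:
    """Largest absolute cosine similarity between any two objective gradients."""
    return max(abs(cosine(grads[a], grads[b])) for a, b in combinations(grads, 2))


# Hypothetical flattened gradients for three objectives.
grads = {
    "domain": [1.0, 0.0, 0.1],
    "reasoning": [0.0, 1.0, 0.0],
    "instruction": [0.05, 0.0, 1.0],
}
interference = max_pairwise_interference(grads)
```

A value near zero indicates that a step on one objective barely moves the others; a value near one indicates that the objectives compete for, or share, the same direction.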
"Position: Capability Control Should be a Separate Goal From Alignment" (Siddiqui et al., 5 Feb 2026) advances a different response: not to fuse capability and alignment more tightly, but to separate them institutionally and technically. It organizes control mechanisms into three layers—data-based, learning-based, and system-based—and argues for defense in depth because each layer fails when used alone. Data-based control is particularly emphasized for open-weight models; system-based control is emphasized for agentic deployments, especially where deterministic restrictions on inputs, outputs, or actions are needed. The paper also identifies key unresolved challenges: evaluating whether a capability is removed or merely suppressed, handling open-weight models, managing dual-use knowledge, and preventing harmful behavior from re-emerging through compositional generalization.
A more pessimistic answer is given in "The Alignment Trap: Complexity Barriers" (Yao, 12 Jun 2025). There, a sequence of impossibility claims is used to argue that highly capable AI and strong worst-case safety guarantees are in deep tension. Among the reported results are: Theorem 4.1, stating that verifying whether a policy is safe to within a given tolerance is coNP-complete; Theorem 4.2, stating that the set of robustly safe ReLU-network policies has Lebesgue measure zero in parameter space; a PAC-Bayes lower bound implying that if the safe policy set has measure zero under a non-degenerate prior then finite-KL posteriors retain strictly positive catastrophic risk; and a strategic trilemma between constraining capability, accepting irreducible risk, and developing new paradigms beyond classical verification. This is a maximal form of the paradox: as capability rises, the required safety tolerance shrinks toward zero precisely where verification, learning, and search become intractable.
At the same time, several papers indicate that the paradox is not universal and sometimes dissolves under the right abstraction. The geometric account of the alignment tax implies that when safety directions are nearly orthogonal to capability subspaces, substantial “free safety” is possible (Young, 9 Feb 2026). The ZPD results imply that dynamic capability-aware curricula can convert what looks like a trade-off into a moving optimization frontier (Yang et al., 16 Jan 2026). The biomedical results imply that multi-capability alignment can be improved when objectives are explicitly decomposed and their gradients are kept approximately orthogonal (Wu et al., 6 Aug 2025). The multi-agent MOAT results imply that capability gaps can be harmonized through iterative co-adaptation rather than independent optimization (Zhu et al., 11 Sep 2025).
The remaining open problems are correspondingly heterogeneous. One cluster concerns measurement: how to detect latent capability–alignment conflicts before deployment, how to distinguish suppression from removal, and how to estimate per-task alignment tax rates or subspace overlaps. Another concerns interfaces and feedback: how to increase the effective capacity of the human–AI feedback channel without simply making models better at exploiting it (Cao, 19 Sep 2025). A third concerns governance and deployment: how to maintain human capability under rational delegation, how to decouple authority exposure from model capability, and how to trigger capability-preserving policies before critical thresholds are crossed (Park et al., 25 Mar 2026). A fourth concerns representation and transfer: if capabilities and values are linearly steerable or transferable, then robust alignment may require not only value learning but also control over the geometry and accessibility of those directions (Balasubramanian et al., 7 Apr 2026).
In current usage, then, the capability-alignment paradox is best treated not as a single contradiction but as a recurring pattern of objective mismatch across levels of analysis. At one level, a model can become better, more helpful, more informative, or more controllable; at another, the same changes can enlarge attack surfaces, shift human incentives, erode resilience, or expose structural couplings between safety and capability. The literature increasingly suggests that resolving the paradox requires explicit choice about what kind of alignment is being sought, what capability is being measured, and at what level—model, interaction, organization, or society—the relevant notion of success is defined.