Capability Trap: Non‑Monotone Welfare in Agent Capability
Determine whether, assuming the Goodhart–Campbell transition holds, the principal’s welfare W(B) is non‑monotone in agent capability B. Specifically, determine whether W(B) is strictly increasing for B < B* but may decrease for B in a neighborhood above B*. Such a decrease would constitute a capability trap: increased capability reduces welfare because evaluation degrades and effort is reallocated to manipulation.
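The shape in question can be illustrated with a toy welfare function. This is a sketch for intuition only, not the model from the cited paper. Every functional form and parameter is an assumption: the threshold `B_STAR`, the exponential `manipulation_share`, the `harm` weight on manipulative effort, and the `sharpness` parameter that stands in for how abrupt the transition is.

```python
import math

B_STAR = 1.0  # hypothetical threshold capability B* (assumed value)


def manipulation_share(b, sharpness):
    """Assumed fraction of effort spent on manipulation.

    Zero up to B*, then rising toward 1 at a rate set by `sharpness`.
    """
    if b <= B_STAR:
        return 0.0
    return 1.0 - math.exp(-sharpness * (b - B_STAR))


def welfare(b, sharpness, harm=0.5):
    """Toy W(B): productive effort minus harm-weighted manipulative effort."""
    m = manipulation_share(b, sharpness)
    return b * (1.0 - m) - harm * b * m


def is_monotone(sharpness, grid):
    """True if the toy W(B) is strictly increasing along the grid."""
    values = [welfare(b, sharpness) for b in grid]
    return all(later > earlier for earlier, later in zip(values, values[1:]))


# Capability grid from 0.05 to 1.5, straddling the assumed B* = 1.0.
grid = [0.05 * i for i in range(1, 31)]

for k in (0.2, 5.0):
    print(f"sharpness={k}: strictly increasing on grid? {is_monotone(k, grid)}")
```

In this toy form, W(B) = B below B*, so welfare is strictly increasing there. Just above B*, the slope is 1 − (1 + harm)·B*·sharpness, which is negative once the transition is sharp enough. With the assumed values, the gentle transition (0.2) stays increasing across the grid, while the sharp one (5.0) dips immediately above B*. Whether the actual model behaves this way is exactly what the problem asks.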
References
Conjecture 2 (Non-Monotone Welfare). If Conjecture 1 holds and the Goodhart-Campbell transition is sufficiently sharp, then the principal's welfare W(B) may be non-monotone in agent capability B:
- For B < B*: W(B) is strictly increasing in B. Capability growth translates directly into welfare improvement.
- For B in a neighborhood above B*: W(B) may be decreasing in B. The welfare loss from evaluation degradation and effort reallocation to manipulation may exceed the welfare gain from increased total capability.
— Reward Hacking as Equilibrium under Finite Evaluation
(2603.28063 - Wang et al., 30 Mar 2026) in Conjecture 2, Section 6.4