Capability Trap: Non‑Monotone Welfare in Agent Capability

Determine whether, assuming the Goodhart–Campbell transition holds, the principal’s welfare W(B) is non‑monotone in agent capability B. Specifically, determine whether W(B) is strictly increasing for B < B* but may decrease for B in a neighborhood above B*. Such behavior would constitute a capability trap: increased capability reduces welfare because evaluation degrades and effort is reallocated to manipulation.

Background

Building on the Goodhart–Campbell transition conjecture, the authors hypothesize that the principal’s welfare may not increase monotonically with capability. In the neighborhood above the threshold B*, the negative effects of endogenous evaluation degradation and manipulation may outweigh gains from increased production capability.

This conjecture formalizes a potential ‘capability trap,’ providing an economic mechanism for scenarios where more capable agents worsen outcomes unless evaluation systems are sufficiently robust.
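To make the conjectured shape concrete, here is a minimal numerical sketch in Python. It is not the authors' model: the functional forms, the function names `manipulation_share` and `welfare`, and the parameters `B_star`, `sharpness`, and `harm` are all hypothetical, chosen only so that welfare rises below the threshold and can fall just above it.

```python
import math


def manipulation_share(B, B_star=1.0, sharpness=2.0):
    """Hypothetical share of effort spent on manipulation (an assumption).

    Zero at or below the threshold B_star; rises toward 1 above it,
    faster when `sharpness` is larger.
    """
    if B <= B_star:
        return 0.0
    return 1.0 - math.exp(-sharpness * (B - B_star))


def welfare(B, B_star=1.0, sharpness=2.0, harm=0.5):
    """Toy welfare W(B): productive output minus harm from manipulation.

    Assumed form: effort B is split into a productive part B * (1 - s)
    and a manipulative part B * s, which costs the principal `harm`
    per unit. Below B_star, s = 0 and W(B) = B.
    """
    s = manipulation_share(B, B_star, sharpness)
    return B * (1.0 - s) - harm * B * s


if __name__ == "__main__":
    # Tabulate W(B) on a small grid around the threshold B_star = 1.0.
    for B in [0.5, 0.8, 1.0, 1.1, 1.2, 1.5]:
        print(f"B = {B:.1f}  W(B) = {welfare(B):.4f}")
```

In this toy form, welfare falls immediately above B_star whenever (1 + harm) * B_star * sharpness > 1, which loosely mirrors the requirement that the transition be sufficiently sharp. That condition is a property of the sketch only and says nothing about whether the conjecture itself holds.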

References

Conjecture 2 (Non-Monotone Welfare). If Conjecture 1 holds and the Goodhart-Campbell transition is sufficiently sharp, then the principal's welfare W(B) may be non-monotone in agent capability B:

For B < B^*: W(B) is strictly increasing in B. Capability growth translates directly into welfare improvement.
For B in a neighborhood above B^*: W(B) may be decreasing in B. The welfare loss from evaluation degradation and effort reallocation to manipulation may exceed the welfare gain from increased total capability.

— Reward Hacking as Equilibrium under Finite Evaluation (2603.28063 - Wang et al., 30 Mar 2026) in Conjecture 2, Section 6.4
