Goodhart–Campbell Transition Threshold

Establish the existence of a critical capability threshold B* under Assumption C1 (Passive Evaluation Degradation, namely that the effective evaluation coverage K_eff is non-increasing in capability B) and the strategic manipulation extension (the agent may allocate manipulation resources m that reduce evaluation coverage via K(m) = K_0 − h(m)), such that: (i) for B < B*, the agent allocates zero resources to manipulation (m* = 0) and remains in the Goodhart regime where the evaluation system is taken as fixed; (ii) for B > B*, the agent allocates positive resources to manipulation (m* > 0) and enters the Campbell regime in which effective evaluation coverage declines endogenously; and characterize B* by the condition that the marginal benefit of manipulation equals the marginal cost from the reduced production budget.

Background

Sections 3–4 establish that, with finite-dimensional evaluation, an optimized AI reallocates effort toward evaluated dimensions (Goodhart regime). Section 6 considers a more capable setting where the agent may reduce effective evaluation coverage by producing outputs that are harder to assess, or by strategically investing in manipulation.

Assumption C1 posits that effective evaluation coverage K_eff weakly decreases with capability B. Under a strategic manipulation extension where the agent can allocate resources m to degrade evaluation (reducing K via a function h(m)), the authors posit a capability threshold that separates purely within-metric gaming from active degradation of the metric. This conjecture aims to formalize the transition mechanism often described informally as a “treacherous turn.”

References

Conjecture 1 (Goodhart-Campbell Transition). Under Assumption C1 and the strategic manipulation extension, there exists a critical capability level B^* such that:

For B < B^*: the agent devotes all resources to production (m^* = 0). The Goodhart regime obtains, and Propositions 1--2 fully characterize agent behavior.
For B > B^*: the agent devotes positive resources to evaluation degradation (m^* > 0). The Campbell regime obtains, and effective evaluation coverage declines endogenously.
The threshold B^* is determined by the condition that the marginal benefit of manipulation (from relaxing the evaluation constraint) equals the marginal cost (from reduced production budget).

— Reward Hacking as Equilibrium under Finite Evaluation (2603.28063 - Wang et al., 30 Mar 2026) in Conjecture 1, Section 6.3

Goodhart–Campbell Transition Threshold

Background

References

Related Problems