Reference-Dependent CEG Measurement

Updated 30 November 2025

Reference-dependent CEG measurement is a metric that compares algorithm efficiency by relating compute budgets to a chosen baseline, highlighting inherent scale dependence.
It demonstrates that observed gains, such as those from LSTM to Transformer, vary with compute scale due to differences in scaling exponents.
Empirical methods and theoretical insights show that while individual innovations offer modest improvements, scale-dependent effects largely drive overall performance gains.

Reference-dependent CEG (Compute-Equivalent Gain) measurement quantifies the efficiency improvements of algorithms as a function of compute and baseline algorithm, making explicit the dependence of observed "algorithmic progress" on the reference choice. Originally developed to clarify the nature, scale, and relativity of algorithmic gains—particularly in the context of AI training—reference-dependent CEG highlights that measures of efficiency can be inherently tied to the baselines and operational regimes used, as opposed to being objective, scale-invariant quantities (Gundlach et al., 26 Nov 2025). The concept is applicable in both empirical performance measurement and theoretical accounts of algorithmic progress, and draws analogies to "reference-dependent" preferences in behavioral economics as formalized in market contexts (Xu, 27 May 2025).

1. Mathematical Definition and Fundamental Properties

Let $A$ and $B$ denote two training algorithms. The compute-equivalent gain (CEG) function $f_{A\to B}(C)$ is defined such that for every compute budget $C$ , running $A$ on $C$ FLOPs and $B$ on $C/f_{A\to B}(C)$ FLOPs yields the same evaluation performance: $\text{CEG}(A\to B; P) = \frac{\text{FLOPs needed by }A\text{ to reach }P}{\text{FLOPs needed by }B\text{ to reach }P}$ If $A$ and $B$ have scaling laws of the form

$P_A(C)=\alpha_A C^{-\beta_A},\quad P_B(C)=\alpha_B C^{-\beta_B}$

then for fixed performance $P$ and compute $C$ , the compute required by $B$ to match $A$ at $C$ is

$C_B(C) = \left(\frac{\alpha_B}{\alpha_A}\right)^{1/\beta_B} C^{\beta_A/\beta_B}$

yielding a reference-dependent multiplier

$f_{A\to B}(C) = \frac{C}{C_B(C)} = K\, C^{(\beta_B-\beta_A)/\beta_B}$

where $K$ is a constant depending on $\alpha_A, \alpha_B, \beta_B$ .

A key property is that $f_{A\to B}(C)$ is only constant (scale-invariant) if the scaling exponents are equal ( $\beta_A=\beta_B$ ); otherwise, the gain grows (or shrinks) polynomially with $C$ (Gundlach et al., 26 Nov 2025).

2. Reference Dependence and Baseline Sensitivity

Reference-dependent CEG exposes the non-invariance in measures of progress: the apparent efficiency gains attributed to a new algorithm depend crucially on the reference baseline. For instance, when comparing LSTMs ( $\beta_{LSTM}\approx0.058$ ) to Transformers ( $\beta_T\approx0.094$ ), the difference in exponents produces a scale-dependent gain $f_{LSTM\to T}(C)\sim C^{0.38}$ —meaning that at larger compute, the perceived progress is substantially amplified. In contrast, comparing Mixture-of-Experts to dense Transformers (with similar $\beta$ ) produces a near-constant gain, demonstrating no scale-dependent progress (Gundlach et al., 26 Nov 2025).

This structure means that claims about "overall algorithmic progress" (e.g., $6930\times$ improvement from 2012–2023) are operationally meaningless unless both the baseline and regime are clearly specified.

3. Empirical Methodologies for CEG Measurement

Small-scale ablation studies quantify "scale-invariant" gains by systematically removing individual innovations (e.g., activation functions or optimizers) and measuring the FLOPs required to reach a fixed performance threshold. Results consistently find that no element yields more than $2\times$ – $4\times$ gain alone, and even when stacking multiple changes, sub-multiplicative interactions limit total improvement to $<10\times$ except in rare cases. In contrast, scale-dependent effects (changing scaling exponents via architectural shifts) dominate observed frontier progress.

A typical empirical workflow is:

Define reference and target algorithms $A,B$ .
For each, measure $C$ required to reach a loss $L^*$ , controlling for all confounds.
For scale-invariant variants, calculate per-innovation $f_i=C_{base}/C_{ablated}$ .
Assess sub-multiplicative aggregation of multiple changes (Gundlach et al., 26 Nov 2025).

For scale-dependent analysis, fit scaling exponents and evaluate $f_{A\to B}(C)$ over compute growth trajectories.

4. Extrapolation, Frontier Accounting, and Aggregate Gains

Comprehensive accounts of algorithmic progress combine scale-invariant ablations, scaling-law transitions, and regime-specific optimizations. For instance, the transition from LSTM to Transformer, combined with Chinchilla-optimal data-parameter rebalancing, provides nearly all historical progress when extrapolated to high compute, with a combined gain of $f_{\mathrm{total}} \approx 2.6 \times 3.7 \times 725 \approx 6930\times$ over eleven years (2012–2023). Importantly, the overwhelming majority of this multiplier is attributable to scale-dependent effects, not to incremental algorithmic additions (Gundlach et al., 26 Nov 2025).

A summary table of the components:

Change	Estimated Gain	Scale Dependence
Scale-invariant stack	$2.6\times$	No
Kaplan $\to$ Chinchilla rebalancing	$3.7\times$	Yes ( $C^{0.508}$ )
LSTM $\to$ Transformer exponent	$725\times$	Yes ( $C^{0.38}$ )

5. Theoretical and Interpretive Implications

Reference-dependent CEG measurement reveals that algorithmic efficiency is not an absolute metric but a contingent, context-sensitive function. This parallels the concept of reference-dependent preferences in behavioral economics, where investor behavior is defined by unrealized gains or losses relative to a reference point, as quantified by Capital Gains Overhang (CGO) (Xu, 27 May 2025). Just as high-CGO firms exhibit altered risk-return trade-offs depending on investor reference points, the perceived progress in AI algorithms depends on the "reference" algorithm and the compute regime.

A plausible implication is that classic summaries of "algorithmic progress" (e.g., "X doublings per year") require explicit specification of both baseline algorithm and scale, to avoid misleading or inconsistent comparisons.

6. Contextualization across Domains

While reference-dependent CEG emerged in the analysis of compute scaling and algorithmic progress in AI, its logic generalizes to any domain where progress or efficiency is measured relative to a baseline and subject to scale effects. In market finance, CGO provides a reference-dependent market-level measure of unrealized gains and losses, directly informing behavioral patterns such as risk aversion or risk seeking relative to aggregate reference points (Xu, 27 May 2025). In quantum measurement theory, contextual and reference-dependence shape the interpretation of measurement records in different reference frames (Allam et al., 2023). The unifying insight is the non-invariance of empirical or theoretical gains: all depend fundamentally on reference systems, whether these are algorithms, price anchors, or frames of measurement.

7. Robustness, Practical Recommendations, and Open Problems

Reference-dependent CEG measurement poses fundamental challenges for benchmarking and reporting progress. Best practices require:

Explicit reporting of the reference algorithm and compute regime.
Disaggregation of scale-invariant and scale-dependent contributions.
Avoidance of aggregating gains across transitions with different scaling exponents without clear baseline specification.

Robustness checks include testing alternative baselines, adjusting for sub-multiplicative effects, and examining behavior under different scaling regimes (Gundlach et al., 26 Nov 2025).

Open problems involve formalizing cross-domain analogies between reference-dependent phenomena—whether in human behavior, algorithmic scaling, or quantum measurement—and developing standardized practices for reference specification to ensure comparability of empirical claims. A plausible direction is the development of meta-benchmarks that encode both baseline and scaling law information.