Do Kondo-gate gains survive at large scale?

Determine whether the compute savings and learning-quality retention achieved by the Kondo gate persist when training modern large-scale models such as contemporary large language models. The gate uses delight (the product of advantage and surprisal) to selectively skip backward passes while preserving the performance of the Delightful Policy Gradient.

Background

The paper introduces the Kondo gate, which uses delight (advantage multiplied by surprisal) as a forward-pass screening signal to decide whether a sample warrants an expensive backward pass. Across MNIST, tabular bandits, and transformer token-reversal tasks, the method skips most backward passes while retaining nearly all of the Delightful Policy Gradient's learning quality, yielding substantial savings in backward-pass compute.
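To make the mechanism concrete, here is a minimal PyTorch-style sketch of delight-gated updating. The function name kondo_gate_step, the threshold tau, and the |delight| > tau skip rule are illustrative assumptions, not the paper's implementation:

```python
import torch

def kondo_gate_step(policy, optimizer, obs, action, advantage, tau=0.1):
    """Sketch of a delight-gated policy-gradient update (assumed form).

    A cheap forward pass computes delight = advantage * surprisal; the
    expensive backward pass runs only if |delight| clears the threshold
    tau (the skip criterion here is an illustrative assumption).
    """
    logits = policy(obs)                                 # forward pass only
    log_prob = torch.log_softmax(logits, dim=-1)[action]
    surprisal = -log_prob.detach()                       # forward-pass signal, no grad
    delight = advantage * surprisal                      # advantage x surprisal

    if delight.abs() < tau:                              # Kondo gate
        return False                                     # skip the backward pass

    loss = -advantage * log_prob                         # standard policy-gradient loss
    optimizer.zero_grad()
    loss.backward()                                      # expensive backward pass
    optimizer.step()
    return True
```

Note that for a policy-gradient loss of the form -advantage * log_prob, delight is just the detached loss value, so the gate amounts to "compute the loss cheaply, backprop only when it is large."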

While the authors analyze limitations such as a gambling regime with high-variance rewards, the broader question is whether these benefits extend beyond small-to-medium-scale settings. Training modern large models, especially LLMs, involves substantially higher backward/forward cost ratios and different data/optimization regimes. The paper therefore poses an explicit open question about the scalability of these gains in modern large-model training, suggesting directions such as distilled delight predictors, adaptive gate schedules, and transfer to RLHF as natural next steps.
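A back-of-the-envelope accounting sketch makes the scale question concrete. The function name, pass rate, and backward/forward ratios below are illustrative assumptions, not figures from the paper:

```python
def gated_cost_ratio(pass_rate: float, bf_ratio: float) -> float:
    """Relative cost of gated training versus always running the backward
    pass, with forward cost normalized to 1 and backward cost bf_ratio.
    Illustrative accounting only."""
    full = 1.0 + bf_ratio                  # every sample: forward + backward
    gated = 1.0 + pass_rate * bf_ratio     # forward always; backward only if gated in
    return gated / full

# Example: with 10% of samples passing the gate, savings grow with bf_ratio.
for bf_ratio in (1.0, 2.0, 4.0):
    print(bf_ratio, round(gated_cost_ratio(0.1, bf_ratio), 3))
# -> 1.0 0.55, 2.0 0.4, 4.0 0.28
```

Under this accounting, the larger the backward/forward cost ratio, the more a given skip rate is worth, which is precisely why the open question matters most at LLM scale.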

References

"Beyond that, the central open question is scale: do the same gains survive in modern large-model training?"

Does This Gradient Spark Joy? (2603.20526, Osband, 20 Mar 2026) in Conclusion (Section 7)