Cause of stability advantage for baseline-free self-supervised learning vs. REINFORCE with baseline
Establish whether the improved training stability of the baseline-free self-supervised learning method, which retains only executable and correct CUDA code and applies positive-only updates (reward 1 for success, 0 otherwise), is caused by the large proportion of unsuccessful samples generated during this stage. Under a REINFORCE variant with a baseline, those unsuccessful samples would receive negative updates, which could introduce instability in the CUDA optimization setting.
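To make the contrast concrete, the sketch below compares the two update rules on a toy batch in which most samples fail, mirroring the regime described above. This is a minimal PyTorch illustration under assumed binary rewards and precomputed log-probabilities, not the paper's training code; the function names and toy data are illustrative.

```python
# Minimal sketch (not the paper's implementation) contrasting positive-only
# updates with REINFORCE using a mean-reward baseline. Assumes binary rewards
# r in {0, 1} and per-sample log-probabilities already computed by the policy.

import torch

def positive_only_loss(logprobs, rewards):
    """Positive-only update: only successful samples (r = 1) contribute gradient;
    unsuccessful samples (r = 0) are effectively discarded, so no negative updates occur."""
    rewards = rewards.detach()
    return -(rewards * logprobs).sum() / rewards.sum().clamp(min=1)

def reinforce_with_baseline_loss(logprobs, rewards):
    """REINFORCE with a mean-reward baseline: samples below the baseline receive a
    negative advantage, pushing their probability down; when most samples fail,
    most of the gradient comes from these negative updates."""
    baseline = rewards.mean()
    advantages = (rewards - baseline).detach()
    return -(advantages * logprobs).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy batch: one success out of eight, mirroring a stage where most samples fail.
    logprobs = torch.randn(8, requires_grad=True)
    rewards = torch.tensor([1., 0., 0., 0., 0., 0., 0., 0.])

    for name, loss_fn in [("positive-only", positive_only_loss),
                          ("REINFORCE+baseline", reinforce_with_baseline_loss)]:
        loss = loss_fn(logprobs, rewards)
        grad, = torch.autograd.grad(loss, logprobs)
        print(name, "gradient:", grad)
    # With the baseline, all seven failed samples receive nonzero (negative-advantage)
    # gradients; the positive-only rule updates on the single successful sample alone.
```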
References
Interestingly, we find this adopted training strategy to be more stable than the REINFORCE variant with a baseline. We conjecture that this stability arises because, during the self-supervised learning stage, a significant proportion of generated instances remain unsuccessful; since only successful samples receive updates, this approach avoids the potential instability caused by applying negative updates to the many unsuccessful samples when a baseline is used.