Cause of stability advantage for baseline-free self-supervised learning vs. REINFORCE with baseline
Establish whether the improved training stability of the baseline-free self-supervised learning method, which retains only executable and correct CUDA code and applies positive-only updates (reward 1 for success, 0 otherwise), is caused by the large proportion of unsuccessful samples generated during this stage. Under a REINFORCE variant with a baseline, those unsuccessful samples would receive negative updates, which could introduce instability in the CUDA optimization setting.
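To make the contrast concrete, the sketch below compares the two update rules on a toy batch in which most samples fail, mirroring the regime described above. This is a minimal PyTorch illustration under assumed binary rewards and precomputed log-probabilities, not the paper's training code; the function names and toy data are illustrative.

```python
# Minimal sketch (not the paper's implementation) contrasting positive-only
# updates with REINFORCE using a mean-reward baseline. Assumes binary rewards
# r in {0, 1} and per-sample log-probabilities already computed by the policy.

import torch

def positive_only_loss(logprobs, rewards):
    """Positive-only update: only successful samples (r = 1) contribute gradient;
    unsuccessful samples (r = 0) are effectively discarded, so no negative updates occur."""
    rewards = rewards.detach()
    return -(rewards * logprobs).sum() / rewards.sum().clamp(min=1)

def reinforce_with_baseline_loss(logprobs, rewards):
    """REINFORCE with a mean-reward baseline: samples below the baseline receive a
    negative advantage, pushing their probability down; when most samples fail,
    most of the gradient comes from these negative updates."""
    baseline = rewards.mean()
    advantages = (rewards - baseline).detach()
    return -(advantages * logprobs).mean()

if __name__ == "__main__":
    torch.manual_seed(0)
    # Toy batch: one success out of eight, mirroring a stage where most samples fail.
    logprobs = torch.randn(8, requires_grad=True)
    rewards = torch.tensor([1., 0., 0., 0., 0., 0., 0., 0.])

    for name, loss_fn in [("positive-only", positive_only_loss),
                          ("REINFORCE+baseline", reinforce_with_baseline_loss)]:
        loss = loss_fn(logprobs, rewards)
        grad, = torch.autograd.grad(loss, logprobs)
        print(name, "gradient:", grad)
    # With the baseline, all seven failed samples receive nonzero (negative-advantage)
    # gradients; the positive-only rule updates on the single successful sample alone.
```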
References
Interestingly, we find this adopted training strategy to be more stable than the REINFORCE variant with a baseline. We conjecture that this stability arises because, during the self-supervised learning stage, a significant proportion of generated instances remain unsuccessful; since only successful samples receive updates, this approach avoids the potential instability caused by applying negative updates to the many unsuccessful samples when a baseline is used.