Causes of the performance gap between CodeGen and Codex under PoT prompting
Determine whether insufficient pre-training data and smaller model size are the primary causes of the large performance gap between the open-source CodeGen models (codegen-16B-mono and codegen-16B-multi) and OpenAI Codex (code-davinci-002) when Program-of-Thoughts (PoT) prompting is applied to mathematical reasoning benchmarks such as GSM8K and SVAMP.
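For context, PoT prompting has the model emit an executable Python program instead of a natural-language chain of thought; the final answer is whatever the program computes when run. Below is a minimal sketch of such an evaluation loop against a Hugging Face CodeGen checkpoint. The prompt wording, the `ans` variable convention, and the GSM8K-style example are illustrative assumptions, not the paper's actual few-shot exemplars; code-davinci-002 would instead be queried through the OpenAI API.

```python
# Minimal PoT-style evaluation sketch (assumptions: prompt format and `ans`
# convention are illustrative, not the paper's exact few-shot prompts).
# codegen-16B-mono needs substantial GPU memory; substitute
# "Salesforce/codegen-350M-mono" for a quick local test.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-16B-mono"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, device_map="auto")

PROMPT_TEMPLATE = (
    "# Question: {question}\n"
    "# Write a Python program that computes the answer and stores it in `ans`.\n"
)

def pot_answer(question: str):
    """Generate a program via PoT-style prompting, execute it, return `ans`."""
    prompt = PROMPT_TEMPLATE.format(question=question)
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=256, do_sample=False)
    # Keep only the newly generated tokens (the program), not the prompt,
    # and cut off any spillover into a self-generated next question.
    program = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    program = program.split("# Question:")[0]
    namespace = {}
    try:
        # A real harness should sandbox execution instead of calling exec().
        exec(program, namespace)
        return float(namespace["ans"])
    except Exception:
        return None  # scored as incorrect

# Example GSM8K-style item:
print(pot_answer(
    "Natalia sold clips to 48 of her friends in April, and then she sold "
    "half as many clips in May. How many clips did Natalia sell altogether?"
))
```

Delegating the arithmetic to the Python interpreter is the point of PoT: the model only has to produce correct code, which is exactly where the gap between CodeGen and Codex would surface.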
References
A concerning fact we found is that open-source models like CodeGen are significantly behind across different benchmarks. We conjecture that such a huge gap could be attributed to insufficient pre-training and model size.
— Chen et al. (2022), "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks," arXiv:2211.12588, Section 3.3, Ablation Studies (Backend Ablation)