Causes of the performance gap between CodeGen and Codex under PoT prompting

Determine whether insufficient pre-training data and smaller model size are the primary causes of the large performance gap between the open-source CodeGen models (CodeGen-16B-mono and CodeGen-16B-multi) and OpenAI Codex (code-davinci-002) when Program-of-Thoughts (PoT) prompting is applied to mathematical reasoning benchmarks such as GSM8K and SVAMP.

Background

In the ablation study evaluating different backbone models for PoT prompting, the authors compare Codex (code-davinci-002), GPT-3.5-turbo, text-davinci-002, and several open-source code models, including CodeGen-16B-mono and CodeGen-16B-multi.

They report that the open-source CodeGen models perform significantly worse than Codex on benchmarks such as GSM8K and SVAMP. The authors speculate that the deficit may stem from insufficient pre-training and smaller model size, but this conjecture is untested, leaving the true causes of the gap unresolved.
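For context, PoT prompting asks the backbone model to emit an executable Python program whose result is bound to a variable (conventionally `ans`), and delegates the arithmetic to an interpreter. Below is a minimal sketch of the scheme, assuming a hypothetical `generate()` wrapper around whichever backbone is being ablated; the exemplar is illustrative and is not the authors' actual few-shot prompt.

```python
# Minimal Program-of-Thoughts sketch. `generate` is a hypothetical stand-in
# for a completion call to any backbone model (Codex, CodeGen, ...).

def generate(prompt: str) -> str:
    """Placeholder: send `prompt` to the backbone LLM, return its completion."""
    raise NotImplementedError("wire this to code-davinci-002, CodeGen-16B, etc.")

# In PoT, the in-context "thought" is executable Python rather than
# natural-language chain-of-thought; the final answer is stored in `ans`.
EXEMPLAR = '''# Question: Olivia has $23. She bought five bagels for $3 each.
# How much money does she have left?
money_initial = 23
num_bagels = 5
bagel_cost = 3
ans = money_initial - num_bagels * bagel_cost
'''

def pot_answer(question: str):
    prompt = EXEMPLAR + f"# Question: {question}\n"
    program = generate(prompt)  # the model does the reasoning as code...
    scope: dict = {}
    exec(program, scope)        # ...and the interpreter does the computation
    return scope["ans"]
```

Because the numeric computation is offloaded to the interpreter, any accuracy gap between backbones under this scheme reflects differences in program-synthesis ability, which is precisely where limited code pre-training or smaller scale would be expected to show up.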

References

"A concerning fact we found is that the open source model like codegen is significantly behind across different benchmarks. We conjecture that such a huge gap could be attributed to non-sufficient pre-training and model size."

— Chen et al., 2022, "Program of Thoughts Prompting: Disentangling Computation from Reasoning for Numerical Reasoning Tasks" (arXiv:2211.12588), Section 3.3, Ablation Studies – Backend Ablation