Cause of few-shot degradation with APE-generated instructions

On the Rhymes, Large Animal, and Second Letters tasks, adding in-context examples after prepending instructions generated by Automatic Prompt Engineer (APE) degrades performance. Determine whether this degradation is caused by the APE-selected instructions overfitting to the zero-shot setting, which would explain why they perform poorly in the few-shot case.

Background

The authors evaluate APE-generated instructions in a few-shot in-context setting by prepending each instruction to the in-context demonstrations. Performance improves or matches baseline performance on 21 of 24 tasks, but three tasks (Rhymes, Large Animal, and Second Letters) show performance drops.
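
As a concrete illustration of this setup, the sketch below shows one way an instruction could be prepended to in-context demonstrations before a test query. The instruction text, demonstration pairs, and prompt formatting are hypothetical placeholders, not the templates used in the paper.

```python
# Minimal sketch of the few-shot evaluation setup described above:
# an APE-selected instruction is prepended to in-context demonstrations
# before the test query. The instruction, demonstrations, and formatting
# are hypothetical placeholders, not taken from the paper.

def build_fewshot_prompt(instruction, demonstrations, query):
    """Prepend an instruction to in-context demonstrations and a test query."""
    demo_block = "\n\n".join(
        f"Input: {x}\nOutput: {y}" for x, y in demonstrations
    )
    return f"{instruction}\n\n{demo_block}\n\nInput: {query}\nOutput:"

if __name__ == "__main__":
    instruction = "Write a word that rhymes with the input word."  # hypothetical APE output
    demos = [("cat", "hat"), ("light", "night")]
    print(build_fewshot_prompt(instruction, demos, "goal"))
```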

They explicitly conjecture that this degradation may be due to the selected instructions being overfit to the zero-shot learning scenario, rendering them less effective when in-context examples are added. This causal explanation remains unverified and motivates a targeted investigation.
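
One way to probe this conjecture, sketched below under stated assumptions, is to score each candidate instruction under both zero-shot and few-shot prompt formats and check whether the instruction selected in the zero-shot setting drops in the few-shot ranking. The `score_prompt` callable and the toy scores are hypothetical stand-ins for an actual execution-accuracy evaluation.

```python
# Illustrative sketch of one way to probe the overfitting conjecture:
# rank candidate instructions under zero-shot and few-shot prompt formats
# and compare the rankings. `score_prompt` is a hypothetical stand-in for
# whatever accuracy metric the evaluation uses.

def rank_instructions(candidates, score_prompt, demonstrations=None):
    """Return (score, candidate) pairs sorted best-first; None means zero-shot."""
    scored = [(score_prompt(c, demonstrations), c) for c in candidates]
    return sorted(scored, reverse=True)

def compare_settings(candidates, score_prompt, demonstrations):
    zero_shot = rank_instructions(candidates, score_prompt, None)
    few_shot = rank_instructions(candidates, score_prompt, demonstrations)
    # If the top zero-shot instruction falls far down the few-shot ranking,
    # that is consistent with it being overfit to the zero-shot setting.
    top_zero = zero_shot[0][1]
    few_shot_rank = [c for _, c in few_shot].index(top_zero)
    return top_zero, few_shot_rank

if __name__ == "__main__":
    # Toy scores standing in for real evaluation results, for illustration only.
    toy_scores = {
        ("inst_A", None): 0.9, ("inst_A", "demos"): 0.4,
        ("inst_B", None): 0.7, ("inst_B", "demos"): 0.8,
    }
    score = lambda inst, demos: toy_scores[(inst, "demos" if demos else None)]
    print(compare_settings(["inst_A", "inst_B"], score, demonstrations="demos"))
```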

References

Counter-intuitively, adding in-context examples for Rhymes, Large Animal, and Second Letters hurts model performance. We conjecture that it may be because the selected instructions overfit the zero-shot learning scenario and thus do not perform well on the few-shot case.

Large Language Models Are Human-Level Prompt Engineers (2211.01910 - Zhou et al., 2022) in Instruction Induction, Few-shot In-context Learning