
Cause of degraded vector-retrieval performance for smaller code LLMs

Determine whether the significantly worse vector-retrieval baseline performance observed with StarCoder2-15B, relative to the larger GPT-4, is caused by a heightened sensitivity to erroneous syntax introduced into the prompt by chunk truncation, and establish to what extent such truncation-induced syntax errors drive the performance gap.


Background

In the Hazel experiments, the authors compared GPT-4 and StarCoder2-15B under multiple contextualization configurations, including a vector-retrieval baseline that chunks repository code and retrieves semantically similar snippets. They observed that the vector-retrieval baseline performs substantially worse with the smaller StarCoder2-15B model than with GPT-4.
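The paper does not specify the chunking scheme, but fixed-size chunking that ignores syntactic boundaries is the standard mechanism behind truncation errors in such baselines. The following minimal Python sketch is illustrative only; the names (`chunk_source`, the `tree_insert` sample) are our assumptions, not the paper's implementation:

```python
# Illustrative sketch of a fixed-size chunker (an assumption, not the
# paper's pipeline): chunk boundaries ignore syntax, so retrieved chunks
# can begin or end in the middle of a construct.
import ast

def chunk_source(source: str, chunk_size: int = 80) -> list[str]:
    """Split source into fixed-size character chunks."""
    return [source[i:i + chunk_size] for i in range(0, len(source), chunk_size)]

source = """\
def tree_insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = tree_insert(node.left, key)
    else:
        node.right = tree_insert(node.right, key)
    return node
"""

for chunk in chunk_source(source):
    try:
        ast.parse(chunk)
        status = "parses"
    except SyntaxError:  # truncation cut through a construct
        status = "syntax error"
    print(f"[{status}] {chunk!r}")
```

Chunks that begin or end mid-construct fail to parse, which is exactly the kind of erroneous prompt syntax the conjecture blames for degrading the smaller model.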

They explicitly conjecture a cause: that smaller completion models are more sensitive to erroneous syntax introduced by chunk truncation in the prompt. Validating this conjecture would clarify whether improving chunking or sanitizing retrieved snippets can reduce the performance gap between small and large models.
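One way to probe the conjecture is to sanitize retrieved chunks before prompting and measure whether StarCoder2-15B's gap narrows. As a hedged sketch (this heuristic is our assumption, not a method from the paper), a chunk could be trimmed back to its longest syntactically valid prefix:

```python
# Hypothetical sanitization heuristic (not from the paper): trim trailing
# lines from a retrieved chunk until the remainder parses, discarding the
# chunk entirely if no prefix is valid (e.g., it begins mid-block).
import ast

def sanitize_chunk(chunk: str) -> str | None:
    """Return the longest syntactically valid line-prefix of chunk, or None."""
    lines = chunk.splitlines()
    while lines:
        candidate = "\n".join(lines)
        try:
            ast.parse(candidate)
            return candidate
        except SyntaxError:
            lines.pop()  # drop the (possibly truncated) last line and retry
    return None

# A chunk whose tail was cut mid-expression is repaired; one that starts
# mid-block cannot be and is dropped.
print(sanitize_chunk("def f(x):\n    return x + 1\nif x >"))     # 'def f(x):\n    return x + 1'
print(sanitize_chunk("    node.left = insert(node.left, key)"))  # None
```

If truncation-induced syntax errors are indeed the cause, prompting with sanitized chunks should recover much of the gap; if the gap persists, the cause likely lies elsewhere.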

References

Vector retrieval baseline performance was significantly worse (in absolute and relative terms) than with the larger model. We conjecture that this is due to a heightened sensitivity to erroneous syntax in the prompt created by chunk truncation.

Statically Contextualizing Large Language Models with Typed Holes (Blinn et al., arXiv:2409.00921, 2 Sep 2024), in Subsection "Hazel StarCoder2-15B Results"