
Evaluating Long Range Dependency Handling in Code Generation Models using Multi-Step Key Retrieval (2407.21049v1)

Published 23 Jul 2024 in cs.CL, cs.AI, and cs.LG

Abstract: As LLMs support larger and larger context sizes, evaluating their ability to make effective use of that context becomes increasingly important. We analyze the ability of several code generation models to handle long range dependencies using a suite of multi-step key retrieval tasks in context windows up to 8k tokens in length. The tasks progressively increase in difficulty and allow more nuanced evaluation of model capabilities than tests like the popular needle-in-the-haystack test. We find that performance degrades significantly (up to 2x) when a function references another function that is defined later in the prompt. We also observe that models that use sliding window attention mechanisms have difficulty handling references further than the size of a single window. We perform simple prompt modifications using call graph information to improve multi-step retrieval performance up to 3x. Our analysis highlights different facets of long-context performance and is suggestive of prompt construction strategies for code completion tools.
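
The abstract only summarizes the task design; as a rough, hypothetical sketch (not the authors' actual benchmark harness), a multi-step key retrieval prompt can be assembled so that the model must follow a chain of function calls, including a forward reference to a function defined later in the context, the case where the paper reports up to a 2x performance drop. All names below (`get_key_step_1`, `entry_point`, `build_prompt`) are illustrative assumptions.

```python
# Hypothetical sketch of a two-step key retrieval prompt in the spirit of the paper
# (not the authors' exact benchmark code). The model is asked to complete
# `entry_point` so that it returns the key; doing so requires following
# get_key_step_1 -> get_key_step_2, where the second function is defined *later*
# in the prompt (a forward reference).

# Distractor functions used to pad the context window to the desired length.
FILLER = "\n\n".join(
    f"def distractor_{i}():\n    return {i}" for i in range(200)
)

PROMPT_TEMPLATE = '''\
def get_key_step_1():
    # Step 1: defer to a helper defined further down in the file.
    return get_key_step_2()

{filler}

def get_key_step_2():
    # Step 2: the actual key value the model must retrieve.
    return "{key}"

def entry_point():
    # Complete this function so that it returns the value of get_key_step_1().
    return
'''


def build_prompt(key: str = "b7f3c9") -> str:
    """Assemble one long-context prompt with distractor code between the two steps."""
    return PROMPT_TEMPLATE.format(filler=FILLER, key=key)


if __name__ == "__main__":
    prompt = build_prompt()
    print(f"prompt length: {len(prompt)} characters")
    print(prompt[:300])  # inspect the opening of the prompt
```

A grader would then check whether the model's completion of `entry_point` returns the planted key; the paper's call-graph-based prompt modifications presumably surface or reorder the referenced definitions relative to their callers, though the abstract does not spell out the exact strategy.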

Authors (2)
  1. Yannick Assogba (7 papers)
  2. Donghao Ren (9 papers)