ConCodeEval: Evaluating Large Language Models for Code Constraints in Domain-Specific Languages (2407.03387v2)

Published 3 Jul 2024 in cs.SE, cs.AI, and cs.CL

Abstract: Recent work shows that LLMs struggle to understand natural language constraints for various text generation tasks in zero- and few-shot settings. In the code domain, however, constraints expressed in code format are widely used to maintain the integrity of code written in Domain-Specific Languages (DSLs) such as JSON and YAML, which are common in enterprise system-level programming. Given that LLMs are increasingly used for system-level code tasks, evaluating whether they can comprehend these code constraints is crucial, yet no prior work has assessed their controllability over code constraints. Hence, we introduce ConCodeEval, a first-of-its-kind benchmark with two novel tasks covering code constraints across five representations. Our findings suggest that LLMs struggle with code constraints: languages that perform excellently on normal code tasks do not perform well when they are used to represent fine-grained constraints.
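To make the notion of "constraints in code format" concrete, the sketch below shows one such fine-grained constraint expressed as a JSON Schema and checked with Python's jsonschema library. This is an illustrative assumption only; the schema, field names, and candidate payload are hypothetical and are not drawn from the ConCodeEval benchmark or the paper's task formats.

```python
# Illustrative sketch only: the schema and payload below are hypothetical and
# not taken from ConCodeEval; they show the kind of code-format constraint
# (here, JSON Schema) that the abstract refers to.
from jsonschema import validate, ValidationError

# A fine-grained constraint expressed in code format (JSON Schema):
# "replicas" must be an integer between 1 and 10, and "name" is required.
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "replicas": {"type": "integer", "minimum": 1, "maximum": 10},
    },
    "required": ["name"],
}

# A hypothetical candidate configuration, e.g. as an LLM might generate it.
candidate = {"name": "web-frontend", "replicas": 25}

try:
    validate(instance=candidate, schema=schema)
    print("configuration satisfies the constraints")
except ValidationError as err:
    # 25 violates the "maximum: 10" constraint, so validation fails here.
    print(f"constraint violated: {err.message}")
```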

