CodeIF-Bench: Evaluating Instruction-Following Capabilities of Large Language Models in Interactive Code Generation (2503.22688v2)
Abstract: Large language models (LLMs) have demonstrated exceptional performance in code generation tasks and have become indispensable programming assistants for developers. However, existing code generation benchmarks primarily assess the functional correctness of code generated by LLMs in single-turn interactions, offering limited insight into their ability to generate code that strictly follows user instructions, especially in multi-turn interaction scenarios. In this paper, we introduce CodeIF-Bench, a benchmark for evaluating LLMs' instruction-following capabilities in interactive code generation. Specifically, CodeIF-Bench incorporates nine types of verifiable instructions aligned with real-world software development requirements, each of which can be independently and objectively validated through specified test cases, enabling the evaluation of instruction-following capability in multi-turn interactions. We evaluate nine prominent LLMs using CodeIF-Bench, and the experimental results reveal a significant disparity between their basic programming capability and their instruction-following capability, particularly as task complexity, context length, and the number of dialogue rounds increase.
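To make the evaluation idea concrete, the sketch below illustrates how a "verifiable instruction" could be checked by its own dedicated test case, separately from the functional tests. This is a minimal, hypothetical example and not taken from CodeIF-Bench: the entry-point name `solve`, the specific instruction ("raise `ValueError` on negative input"), and the checker function are all assumptions made for illustration.

```python
# Hypothetical checker for one verifiable instruction:
# "the generated function must raise ValueError on negative input".
# All names and the instruction itself are illustrative, not from the paper.

def check_instruction_followed(generated_code: str) -> bool:
    """Return True iff the model-generated code follows the instruction."""
    namespace = {}
    exec(generated_code, namespace)   # load the model-generated function
    solve = namespace["solve"]        # assumed entry-point name

    try:
        solve(-1)
    except ValueError:
        return True                   # instruction followed
    except Exception:
        return False                  # wrong exception type
    return False                      # no exception raised at all


if __name__ == "__main__":
    candidate = (
        "def solve(n):\n"
        "    if n < 0:\n"
        "        raise ValueError('negative input')\n"
        "    return n * n\n"
    )
    print(check_instruction_followed(candidate))  # True
```

Because each instruction is paired with its own test, whether it was followed can be judged independently of whether the code passes the original functional test suite, which is what allows instruction adherence to be tracked across multiple dialogue rounds.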