Determine LLM performance on NoCode-bench with scaffolds beyond Agentless and OpenHands
Determine the performance of the models evaluated in this study—DeepSeek-v3, DeepSeek-R1, Qwen3-235B, Claude-4-Sonnet, GPT-4o, and Gemini-2.5-Pro—on NoCode-bench when using software engineering scaffolds other than Agentless and OpenHands, to assess how alternative scaffolds affect success on natural language-driven feature addition tasks.
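One possible shape for this experiment, given only as a minimal sketch: iterate over the study's models and a set of alternative scaffolds, generate a patch for each NoCode-bench instance, and compute a resolution rate per (model, scaffold) pair, mirroring the benchmark's %Resolved metric. The Instance fields, the scaffold entries ("swe-agent-like", "aider-like"), and the helper functions below are illustrative assumptions, not the actual NoCode-bench, Agentless, or OpenHands interfaces.

    # Hypothetical sketch only; none of these names are real benchmark or scaffold APIs.
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Instance:
        instance_id: str
        doc_change: str   # natural-language feature description (illustrative field)
        repo: str

    MODELS = ["DeepSeek-v3", "DeepSeek-R1", "Qwen3-235B",
              "Claude-4-Sonnet", "GPT-4o", "Gemini-2.5-Pro"]

    # Placeholder patch generators standing in for alternative scaffolds.
    SCAFFOLDS: dict[str, Callable[[str, Instance], str]] = {
        "swe-agent-like": lambda model, inst: "",
        "aider-like":     lambda model, inst: "",
    }

    def check_resolved(instance: Instance, patch: str) -> bool:
        """Placeholder: apply the patch and run the instance's test suite."""
        return False

    def resolution_rates(instances: list[Instance]) -> dict[tuple[str, str], float]:
        """Resolution rate for every (model, scaffold) pair."""
        rates = {}
        for model in MODELS:
            for name, scaffold in SCAFFOLDS.items():
                resolved = sum(
                    check_resolved(inst, scaffold(model, inst)) for inst in instances
                )
                rates[(model, name)] = resolved / max(len(instances), 1)
        return rates

Comparing these rates against the Agentless and OpenHands numbers reported in the paper would show how much of the observed success is attributable to the scaffold rather than the model.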
References
We evaluate the performance of the selected LLMs using only the Agentless and OpenHands scaffolds. Their performance on other scaffolds remains unknown.
— NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition
(2507.18130 - Deng et al., 24 Jul 2025) in Threats to Validity — External Validity: Used Scaffolds