Determine LLM performance on NoCode-bench with scaffolds beyond Agentless and OpenHands

Determine the performance of the models evaluated in this study—DeepSeek-v3, DeepSeek-R1, Qwen3-235B, Claude-4-Sonnet, GPT-4o, and Gemini-2.5-Pro—on NoCode-bench when using software engineering scaffolds other than Agentless and OpenHands, to assess how alternative scaffolds affect success on natural language-driven feature addition tasks.

Background

The paper evaluates six state-of-the-art LLMs for natural language-driven feature addition using two scaffolds: Agentless (pipeline-based) and OpenHands (agent-based). These scaffolds represent distinct paradigms for automating software engineering tasks and have shown strong results on related benchmarks such as SWE-bench.

However, the authors explicitly note that their experiments used only Agentless and OpenHands, and that how the same models would perform under other scaffolds remains unknown. This leaves open the question of how alternative tooling and orchestration frameworks affect performance in the NoCode-bench setting.

References

We evaluate the performance of the selected LLMs using only the Agentless and OpenHands scaffolds. Their performance on other scaffolds remains unknown.

NoCode-bench: A Benchmark for Evaluating Natural Language-Driven Feature Addition (2507.18130 - Deng et al., 24 Jul 2025) in Threats to Validity — External Validity: Used Scaffolds