Impact of advanced elicitation strategies on developer productivity

Determine how much employing greater test-time compute and advanced elicitation strategies, such as sampling multiple agent trajectories with LLM judging or self-consistency, affects experienced open-source developers’ time-to-completion on large, mature repositories relative to typical Cursor and web-LLM usage.

Background

In the study, developers primarily used Cursor and web-based LLM interfaces, which typically sample relatively few tokens at inference time. Recent research suggests that test-time compute scaling and alternative elicitation strategies (e.g., multiple samples with voting or LLM judges) can materially improve model reliability and usefulness.
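For concreteness, below is a minimal sketch of two of the elicitation strategies mentioned above: self-consistency via majority voting, and best-of-N sampling with an LLM judge. The `sample_fn`, `judge_fn`, and stub model are hypothetical placeholders standing in for real LLM calls; nothing here reflects an implementation used or evaluated in the study.

```python
from collections import Counter
from typing import Callable, Sequence


def self_consistency(
    sample_fn: Callable[[str], str],
    prompt: str,
    n_samples: int = 5,
) -> str:
    """Sample several candidate answers and return the most common one.

    `sample_fn` is assumed to call an LLM at nonzero temperature, so that
    repeated calls produce diverse candidates (self-consistency voting).
    """
    candidates = [sample_fn(prompt) for _ in range(n_samples)]
    tally = Counter(c.strip() for c in candidates)
    answer, _count = tally.most_common(1)[0]
    return answer


def best_of_n_with_judge(
    sample_fn: Callable[[str], str],
    judge_fn: Callable[[str, Sequence[str]], int],
    prompt: str,
    n_samples: int = 5,
) -> str:
    """Sample several candidates and let a judge model pick the best index."""
    candidates = [sample_fn(prompt) for _ in range(n_samples)]
    best_index = judge_fn(prompt, candidates)
    return candidates[best_index]


if __name__ == "__main__":
    import random

    # Stub model and judge for illustration only; a real setup would call
    # an LLM API for both sampling and judging.
    def fake_model(prompt: str) -> str:
        return random.choice(["patch A", "patch A", "patch B"])

    def fake_judge(prompt: str, candidates: Sequence[str]) -> int:
        return max(range(len(candidates)), key=lambda i: len(candidates[i]))

    task = "Fix the failing test in repo X"
    print(self_consistency(fake_model, task))
    print(best_of_n_with_judge(fake_model, fake_judge, task))
```

Both strategies trade additional inference-time compute (N samples plus, in the second case, a judging pass) for potentially higher answer quality; the open question above is whether that trade-off translates into faster task completion for experienced developers in large codebases.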

The authors did not evaluate such strategies in this field setting, leaving it uncertain whether these approaches would turn the measured slowdown into a speedup for experienced developers working in large codebases.

References

We do not provide evidence about these elicitation strategies, as developers in our study typically use Cursor and web LLMs like ChatGPT, so it remains unclear how much effect these strategies would have on developer productivity in the wild.

Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity (2507.09089 - Becker et al., 12 Jul 2025) in Subsubsection “Suboptimal elicitation,” Section “Factor Analysis” (Factors with unclear effect on slowdown)