Do agentic coding workflows mitigate known LLM library-related failures or reproduce them at scale?

Determine whether large-language-model-based agentic coding workflows, such as those used by Claude Code, Cursor, Devin, Copilot, and OpenAI Codex when authoring pull requests, mitigate known LLM library-related failure modes (hallucinated package names, deprecated API usage, and omitted version constraints) or reproduce these behaviors at scale.

Background

The paper motivates its study by noting prior non-agentic LLM issues with libraries, such as hallucinating package names, calling deprecated APIs, and neglecting version constraints. It highlights that agentic systems have autonomy to act within projects and may bypass additional retrieval infrastructure.
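For illustration only (not part of the paper), a minimal Python sketch of how two of these failure modes could be checked mechanically in an agent-authored pull request, assuming the project declares dependencies in a requirements.txt: unpinned entries indicate omitted version constraints, and names absent from the public PyPI JSON API are candidates for hallucinated packages. Deprecated API usage would require source-level analysis and is not covered here.

    import re
    import urllib.error
    import urllib.request

    # Hypothetical checker (assumption, not from the paper): flag requirements
    # that carry no version constraint and package names unknown to PyPI.

    VERSION_SPEC = re.compile(r"(===|==|>=|<=|~=|!=|<|>)")

    def is_unpinned(requirement: str) -> bool:
        """True if the requirement line has no version specifier at all."""
        return VERSION_SPEC.search(requirement) is None

    def exists_on_pypi(name: str, timeout: float = 5.0) -> bool:
        """Query the public PyPI JSON API; a 404 means the name is unregistered."""
        url = f"https://pypi.org/pypi/{name}/json"
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except urllib.error.HTTPError:
            return False  # unregistered name: possible hallucination

    if __name__ == "__main__":
        # Example manifest lines; the last name is deliberately implausible.
        requirements = ["requests", "numpy>=1.26", "definitely-not-a-real-pkg-xyz"]
        for req in requirements:
            name = re.split(r"[\s<>=!~\[;]", req, maxsplit=1)[0]
            flags = []
            if is_unpinned(req):
                flags.append("no version constraint")
            if not exists_on_pypi(name):
                flags.append("not found on PyPI (possible hallucination)")
            print(f"{req}: {', '.join(flags) or 'ok'}")

Such checks only surface symptoms; whether agentic workflows trigger them less often than non-agentic LLM code generation is exactly the open question the paper raises.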

The authors explicitly state uncertainty about whether agentic workflows will actually use external libraries and, if they do, whether this autonomy mitigates known problems or reproduces them at scale. While their results address usage frequency and versioning practices, they do not fully resolve whether agentic setups avoid hallucinations or deprecated API usage, leaving this question open.

References

What remains unclear is whether agentic workflows will actually use external libraries—and if they do, whether this autonomy helps mitigate existing problems or simply reproduces them at scale.

A Study of Library Usage in Agent-Authored Pull Requests (2512.11589 - Twist, 12 Dec 2025) in Section 1 (Introduction)