From Helpful to Trustworthy: LLM Agents for Pair Programming

Published 11 Apr 2026 in cs.SE and cs.AI | (2604.10300v1)

Abstract: LLM-based coding agents are increasingly used to generate code, tests, and documentation. Still, their outputs can be plausible yet misaligned with developer intent and provide limited evidence for review in evolving projects. This limits our understanding of how to structure LLM pair-programming workflows so that artifacts remain reliable, auditable, and maintainable over time. To address this gap, this doctoral research proposes a systematic study of multi-agent LLM pair programming that externalizes intent and uses development tools for iterative validation. The plan includes three studies: translating informal problem statements into standards-aligned requirements and formal specifications; refining tests and implementations using automated feedback, such as solver-backed counterexamples; and supporting maintenance tasks, including refactoring, API migrations, and documentation updates, while preserving validated behavior. The expected outcome is a clearer understanding of when multi-agent workflows increase trust, along with practical guidance for building reliable programming assistants for real-world development.


Authors (1)

Ragib Shahariar Ayon

Summary

The paper proposes a driver-navigator multi-agent architecture that integrates machine-verifiable feedback to systematically enhance code artifact correctness.
It uses automated verifiers, such as SMT solvers, to validate generated formal specifications; the earlier AutoReSpec system is reported to reduce evaluation time by more than 25% compared to prior workflows.
Reported results from the earlier AutoReSpec and AutoJML systems show gains in specification success and completeness, which the paper uses as the evidence base for trustworthy LLM-assisted development.

LLM Agents for Trustworthy Pair Programming: A Technical Analysis

Introduction

The proliferation of LLM-based coding agents has transformed the software engineering landscape, enabling impressive performance in code generation, repair, and interactive assistance. However, persistent challenges remain regarding the trustworthiness and auditability of these agents’ outputs, particularly as they evolve into more autonomous roles in real-world development workflows. "From Helpful to Trustworthy: LLM Agents for Pair Programming" (2604.10300) addresses this critical issue by proposing a systematic, evidence-driven approach to multi-agent LLM pair programming, emphasizing intent externalization, machine-verifiable feedback, and preservation of validated behavior throughout software evolution.

Multi-Agent LLM Design: Driver-Navigator Architecture

The central proposal of the work is a multi-agent system (MAS) structured around the classic driver-navigator paradigm, repurposed for LLM agents. In this architecture, the driver agent synthesizes artifacts—including code, tests, and specifications—while the navigator agent critiques these proposals. Interaction histories remain role-specific, but both agents share a persistent project context. This design is motivated by prior empirical evidence demonstrating quality improvements from such division of labor [zhang2024pair].

A critical component is that the navigator is constrained to produce machine-checkable contracts and formal specifications rather than unconstrained, free-form assessments. Automated verifiers (for example, SMT solvers or program analyzers) validate the navigator's outputs and return proofs or counterexamples. This approach shifts the locus of trust from subjective model agreement to deterministic, auditable validation, with human developers required only to confirm that the formal specifications faithfully capture intent.
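The propose, check, and repair cycle described above can be illustrated with a minimal sketch. It is not the paper's implementation: the two agents are hard-coded stand-ins for LLM calls, and `bounded_verify` is a hypothetical exhaustive checker over a small integer range that plays the role an SMT solver or program analyzer would play. All names (`Verdict`, `navigator_contract`, `driver_propose`, `pair_loop`) are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

Impl = Callable[[int], int]
Contract = Callable[[int, int], bool]  # predicate over (input, output)


@dataclass(frozen=True)
class Verdict:
    """Result of one verification attempt."""
    verified: bool
    counterexample: Optional[int] = None  # an input that violates the contract


def bounded_verify(impl: Impl, contract: Contract, domain: Iterable[int]) -> Verdict:
    """Stand-in for a real verifier: checks the contract on every input in a finite domain."""
    for x in domain:
        if not contract(x, impl(x)):
            return Verdict(False, x)
    return Verdict(True)


def navigator_contract(x: int, y: int) -> bool:
    """Hypothetical navigator output: a machine-checkable contract for absolute value."""
    return y >= 0 and (y == x or y == -x)


def driver_propose(feedback: Optional[int]) -> Impl:
    """Hypothetical driver: a buggy first draft, then a repair once a counterexample arrives."""
    if feedback is None:
        return lambda x: x  # wrong for negative inputs
    return lambda x: -x if x < 0 else x


def pair_loop(max_rounds: int = 3) -> Tuple[Optional[Impl], List[Verdict]]:
    """Iterate driver proposals until the verifier accepts or the round budget runs out."""
    feedback: Optional[int] = None
    history: List[Verdict] = []  # the audit trail a reviewer could inspect
    for _ in range(max_rounds):
        impl = driver_propose(feedback)
        verdict = bounded_verify(impl, navigator_contract, range(-50, 51))
        history.append(verdict)
        if verdict.verified:
            return impl, history
        feedback = verdict.counterexample
    return None, history


if __name__ == "__main__":
    impl, history = pair_loop()
    print(history)
```

In this sketch, acceptance depends only on the verifier's verdict, never on the two agents agreeing with each other, and the recorded `history` is the kind of evidence a human reviewer could audit.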

Empirical Results on Specification Synthesis

The research provides strong numerical evidence that verifier-guided LLM-based specification generation significantly enhances artifact correctness and coverage. The previously introduced AutoReSpec system, operating as a collaborative LLM ensemble guided by verifier error feedback, achieves 58.2% success and 69.2% completeness on a set of 72 verification tasks, outperforming single-agent LLM baselines and reducing evaluation time by more than 25% compared to earlier workflows.

AutoJML extends these results by integrating ReAct-based planning and context retrieval. On a 120-program benchmark, AutoJML is reported to verify 109 programs with 79.3% completeness, and it performs well on programs with complex control flow, with 81.48% success on multi-path programs and 85.71% on nested loops. These results support the efficacy of multi-agent, feedback-driven LLM workflows for both synthesis and refinement of structured artifacts.

Workflow Extensions: Test Generation, Maintenance, and Evolution

Building on the evidence base, the research agenda expands the driver-navigator workflow to end-to-end software lifecycle tasks:

Requirements and Specification Elicitation: Translating informal problem statements into standards-aligned requirements and executable formal specifications, bridging gaps found in previous systems that handle these steps in isolation [han-etal-2024-archcode, specgen].
Test and Code Refinement with Automated Feedback: Leveraging solver-backed counterexamples, driver-navigator LLM pairs iteratively refine artifacts. Exposure of such counterexamples as first-class artifacts aims to enhance auditability, trust, and reproducibility of reported failures.
Maintenance Tasks Anchored in Formal Constraints: Automated refactoring, API migration, and documentation updates are validated against existing specifications and tests, preventing behavioral regressions—a dimension largely unexplored by prior approaches to LLM-assisted maintenance [releasedeploymentAbreu, codereviewtufano].
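The second and third items above can be sketched together: counterexamples are stored as first-class artifacts and replayed whenever a maintenance change is proposed, and the change is accepted only if it satisfies the existing contract and matches previously validated behavior. This is a minimal sketch under stated assumptions, not the paper's tooling. `ValidatedBaseline`, `find_regressions`, and the clamp functions are hypothetical names, and a bounded input range stands in for a real verifier.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class ValidatedBaseline:
    """Validated behavior to preserve across maintenance edits."""
    contract: Callable[[int, int], bool]   # machine-checkable spec over (input, output)
    reference: Callable[[int], int]        # implementation that already passed validation
    counterexamples: List[int] = field(default_factory=list)  # recorded failing inputs


def find_regressions(baseline: ValidatedBaseline,
                     candidate: Callable[[int], int],
                     domain: Iterable[int]) -> List[int]:
    """Return inputs on which the candidate breaks the contract or changes behavior."""
    replay = list(baseline.counterexamples)
    # Replay recorded counterexamples first, then the rest of the bounded domain.
    inputs = replay + [x for x in domain if x not in replay]
    regressions = []
    for x in inputs:
        y = candidate(x)
        if not baseline.contract(x, y) or y != baseline.reference(x):
            regressions.append(x)
    return regressions


def clamp_contract(x: int, y: int) -> bool:
    """Spec: the result lies in [0, 10] and equals x whenever x is already in range."""
    in_range = 0 <= y <= 10
    identity_inside = (y == x) if 0 <= x <= 10 else True
    return in_range and identity_inside


def clamp_v1(x: int) -> int:
    return min(max(x, 0), 10)


def clamp_refactored(x: int) -> int:
    return 0 if x < 0 else 10 if x > 10 else x


def clamp_broken(x: int) -> int:
    return max(x, 0)  # a refactor that drops the upper bound


# The input 11 is assumed to have been recorded from an earlier failed proposal.
baseline = ValidatedBaseline(clamp_contract, clamp_v1, counterexamples=[11])

if __name__ == "__main__":
    print(find_regressions(baseline, clamp_refactored, range(-3, 14)))
    print(find_regressions(baseline, clamp_broken, range(-3, 14)))
```

An empty result means the refactor preserved the validated behavior on the checked inputs; a non-empty result gives the reviewer concrete, reproducible failing inputs instead of a free-form critique.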

The plan explicitly seeks to answer foundational questions about the efficacy of these workflows, especially in producing artifacts that are not merely helpful but reliably consistent with verified requirements and resilient to software evolution. Metrics considered include pass rates, reproducibility, maintenance regression prevention, and docstring factuality and bias.

Implications and Future Directions

This research has practical and theoretical implications. It argues that trust in LLM-based programming is a function not solely of raw generative accuracy, but of tightly integrated agentic workflows that expose intent, use machine-verifiable evidence, and enforce behavioral constraints throughout the artifact lifecycle. The explicit separation of proposal and critique roles, operationalized through deterministic feedback, offers a candidate blueprint for trustworthy LLM deployment in real software engineering settings.

Notably, the emphasis on externalizing intent and auditing agent proposals aligns with emergent best practices in agentic software engineering [roychoudhury2025agentic], where interpretability and evidence-driven validation are increasingly recognized as paramount. The methodology provides a foundation for further research on optimal division of agent roles, tuning of feedback signals, transferability across programming languages, and extension to other agent-enforced quality standards beyond correctness (e.g., security or resource utilization).

The future trajectory of this line of work will likely include integration with broader software ecosystems (IDEs, CI/CD pipelines), increased heterogeneity of agent architectures, and deeper interleaving of formal methods and language modeling—potentially diminishing the residual gap between LLM output plausibility and practical trust.

Conclusion

"From Helpful to Trustworthy: LLM Agents for Pair Programming" (2604.10300) advances a principled blueprint for developing reliable, audit-ready LLM coding agents through structured multi-agent workflows anchored in formal specification, automated feedback, and maintenance-aware iteration. Results from the earlier AutoReSpec and AutoJML systems show empirical improvements in specification success and completeness, and the paper builds on them with a concrete agenda for extending LLM trustworthiness to both artifact generation and longitudinal software evolution. This work sets a technical foundation for automated development frameworks where trust is built on externalized evidence and persistent behavioral constraints, with LLM agents positioned as credible collaborators in software engineering.
