Analysis of Strong-Weak Model Collaboration for Repo-Level Code Generation
In this paper, the authors conduct an empirical study of collaboration between strong and weak LLMs for repository-level code generation, with a focus on cost efficiency. The collaborative system is designed so that weak models handle simpler tasks at lower cost, while more complex tasks are assigned to stronger models. The study evaluates context-based, pipeline-based, and dynamic collaboration strategies applied to GitHub issue resolution.
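The division of labor described above can be sketched as a simple router. This is an illustrative sketch only, not the authors' implementation: the `Model` class, the `difficulty` scorer, the 0.5 threshold, and the flat per-call costs are all hypothetical names and values introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Model:
    """Hypothetical wrapper around an LLM endpoint (not from the paper)."""
    name: str
    cost_per_call: float  # assumed flat cost per request, in arbitrary units
    generate: Callable[[str], str]


def route(
    issue: str,
    weak: Model,
    strong: Model,
    difficulty: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off; would need tuning in practice
) -> Tuple[str, str, float]:
    """Send easy issues to the weak model and hard ones to the strong model.

    Returns (model name, generated patch, cost incurred).
    """
    chosen = strong if difficulty(issue) >= threshold else weak
    return chosen.name, chosen.generate(issue), chosen.cost_per_call


# Toy stand-ins so the sketch runs without any API access.
weak_model = Model("weak", 1.0, lambda issue: f"weak patch for: {issue}")
strong_model = Model("strong", 10.0, lambda issue: f"strong patch for: {issue}")


def toy_difficulty(issue: str) -> float:
    """Assumption: longer issue text is treated as harder (capped at 1.0)."""
    return min(len(issue.split()) / 20.0, 1.0)


easy = route("Fix typo in README", weak_model, strong_model, toy_difficulty)
hard = route(" ".join(["word"] * 30), weak_model, strong_model, toy_difficulty)
```

In a real system the difficulty estimate would likely come from a model or a learned classifier rather than from issue length; the length heuristic is used here only to keep the example self-contained.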
Key Findings
The authors report that their most effective collaborative strategy performs comparably to the strong model alone at 40% lower cost. This result points to significant potential for deploying cost-effective yet capable code generation systems in practice. While prior research has explored strong-weak collaboration in forms such as LLM cascades and context augmentation, this study systematically analyzes these methods specifically for repository-level code generation.
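To make the cost figure concrete, the relative reduction can be computed as the cost saved divided by the strong-only cost. The sketch below uses made-up per-issue dollar amounts, chosen only so that the arithmetic yields 40%; they are not figures from the paper, and `relative_cost_reduction` is a hypothetical helper.

```python
def relative_cost_reduction(strong_cost: float, collaborative_cost: float) -> float:
    """Fraction of the strong-only cost saved by the collaborative strategy."""
    if strong_cost <= 0:
        raise ValueError("strong_cost must be positive")
    return (strong_cost - collaborative_cost) / strong_cost


# Hypothetical per-issue costs in dollars (illustrative, NOT from the paper).
strong_only = 1.00
collaborative = 0.60

reduction = relative_cost_reduction(strong_only, collaborative)
print(f"{reduction:.0%}")  # prints 40%
```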
Methodological Approach
The study examines a diverse range of models and configurations, including both proprietary and open-source LLMs. The experiments are performed on the SWE-Bench Lite dataset, which consists of 300 GitHub issues drawn from Python repositories. Agentless Lite serves as the framework; it performs retrieval-augmented code generation in two steps: retrieving relevant documents and then iteratively generating code patches.
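The two-step flow (retrieve, then iteratively generate) can be sketched as follows. This is a minimal illustration under stated assumptions and does not reproduce the Agentless Lite code: the word-overlap ranking is a toy stand-in for a real retriever, and `retrieve`, `generate_patch`, `generate`, `is_valid`, and `max_attempts` are hypothetical names.

```python
from typing import Callable, Dict, List, Optional


def retrieve(issue: str, documents: Dict[str, str], k: int = 2) -> List[str]:
    """Step 1: return the k document paths sharing the most words with the issue.

    Word overlap is a toy stand-in for a real retrieval method.
    """
    issue_words = set(issue.lower().split())
    ranked = sorted(
        documents,
        key=lambda path: len(issue_words & set(documents[path].lower().split())),
        reverse=True,
    )
    return ranked[:k]


def generate_patch(
    issue: str,
    context: List[str],
    generate: Callable[[str, List[str], int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,  # assumed retry limit
) -> Optional[str]:
    """Step 2: keep asking for a patch until one passes validation."""
    for attempt in range(max_attempts):
        candidate = generate(issue, context, attempt)
        if is_valid(candidate):
            return candidate
    return None  # no acceptable patch within the attempt limit


# Toy corpus and generator so the sketch runs offline.
docs = {
    "a.py": "parse date string",
    "b.py": "render html template",
    "c.py": "date parse timezone",
}
top = retrieve("parse date fails with timezone", docs, k=2)

# The fake generator returns an empty patch until its third attempt.
fake_generate = lambda issue, ctx, attempt: "" if attempt < 2 else "diff --git"
patch = generate_patch("parse date fails with timezone", top, fake_generate, bool)
```

Separating retrieval from generation keeps the retry loop cheap: the context is computed once and reused across attempts.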
Through this systematic analysis, the authors develop a taxonomy of collaborative strategies. It covers cost-equated weak-only baselines such as self-consistency, as well as planning and pipeline methods like Strong LM First and Weak LM First, among others. The performance and cost of each method are evaluated, revealing trends across different strong-weak model pairs.
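One plausible reading of the pipeline method names is a cascade in which models are tried in a fixed order until a patch is accepted. The sketch below illustrates that general idea only; it is an assumption about what "Strong LM First" and "Weak LM First" mean, not the paper's definition, and the `cascade` function, the stage costs, and the `is_valid` check are hypothetical.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A stage is (model name, assumed cost per attempt, function producing a patch).
Stage = Tuple[str, float, Callable[[str], str]]


def cascade(
    issue: str,
    stages: Sequence[Stage],
    is_valid: Callable[[str], bool],
) -> Tuple[Optional[str], float, List[str]]:
    """Try each stage in order; stop at the first patch that validates.

    Returns (patch or None, total cost spent, names of the models tried).
    """
    total_cost = 0.0
    tried: List[str] = []
    for name, cost, attempt in stages:
        total_cost += cost
        tried.append(name)
        patch = attempt(issue)
        if is_valid(patch):
            return patch, total_cost, tried
    return None, total_cost, tried


# Toy models: the weak one fails (empty patch), the strong one succeeds.
weak: Stage = ("weak", 1.0, lambda issue: "")
strong: Stage = ("strong", 10.0, lambda issue: "diff --git a/x b/x")

weak_first = cascade("some issue", [weak, strong], bool)
strong_first = cascade("some issue", [strong, weak], bool)
```

With these toy models, the weak-first ordering pays for both attempts when the weak model fails, while the strong-first ordering stops after one call; which ordering is cheaper on average depends on how often the weak model succeeds.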
Implications and Future Directions
The empirical results show that pipeline-based and context-based collaboration strategies are, on average, the more efficient choices under budget constraints. For practical use, the paper suggests selecting a strategy according to the required performance and the available budget. Notably, methods like Strong LM First and Weak Router show substantial promise under different budgetary conditions.
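Choosing a strategy under a budget can be framed as picking the best-performing option whose cost does not exceed the limit. The following is a sketch under that assumption; `best_within_budget` is a hypothetical helper, and the strategy names, costs, and resolve rates are placeholder values, not results from the paper.

```python
from typing import Dict, Optional, Tuple


def best_within_budget(
    strategies: Dict[str, Tuple[float, float]],  # name -> (cost, resolve rate)
    budget: float,
) -> Optional[str]:
    """Return the affordable strategy with the highest resolve rate.

    Ties on resolve rate are broken in favor of the cheaper strategy.
    """
    affordable = {
        name: (cost, rate)
        for name, (cost, rate) in strategies.items()
        if cost <= budget
    }
    if not affordable:
        return None
    return max(
        affordable,
        key=lambda name: (affordable[name][1], -affordable[name][0]),
    )


# Placeholder (cost, resolve rate) pairs, purely illustrative.
candidates = {
    "strategy_a": (1.0, 0.10),
    "strategy_b": (4.0, 0.25),
    "strategy_c": (6.0, 0.30),
    "strategy_d": (10.0, 0.32),
}

choice_small = best_within_budget(candidates, budget=5.0)
choice_large = best_within_budget(candidates, budget=100.0)
```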
The findings have practical implications for software engineering and for deploying LLMs in production environments where cost efficiency is vital. Moreover, the insights drawn from the study could inform the development of more effective LLM systems capable of handling complex code generation tasks. These principles may also apply beyond software engineering, potentially extending to other fields that require nuanced code or text generation.
Conclusion
This study advances the field by providing actionable insights into the cost-performance trade-offs of strong-weak model collaboration in code generation. Future work may extend these techniques to other domains and explore more complex collaboration architectures. The results pave the way for further research on balancing model performance against economic constraints in LLM deployment.