An Empirical Study on Strong-Weak Model Collaboration for Repo-level Code Generation

Published 26 May 2025 in cs.AI and cs.SE | (2505.20182v1)

Abstract: We study cost-efficient collaboration between strong and weak LLMs for repository-level code generation, where the weak model handles simpler tasks at lower cost, and the most challenging tasks are delegated to the strong model. While many works propose architectures for this task, few analyze performance relative to cost. We evaluate a broad spectrum of collaboration strategies: context-based, pipeline-based, and dynamic, on GitHub issue resolution. Our most effective collaborative strategy achieves equivalent performance to the strong model while reducing the cost by 40%. Based on our findings, we offer actionable guidelines for choosing collaboration strategies under varying budget and performance constraints. Our results show that strong-weak collaboration substantially boosts the weak model's performance at a fraction of the cost, pipeline and context-based methods being most efficient. We release the code for our work at https://github.com/shubhamrgandhi/codegen-strong-weak-collab.


Authors (4)

Summary

Analysis of Strong-Weak Model Collaboration for Repo-Level Code Generation

The authors present an empirical study of collaboration between strong and weak LLMs for repository-level code generation, with a focus on cost efficiency. In the collaborative setup, the weak model handles simpler tasks at lower cost, while the most challenging tasks are delegated to the strong model. The study evaluates three families of collaboration strategies (context-based, pipeline-based, and dynamic) on GitHub issue resolution.

Key Findings

The authors assert that their most effective collaborative strategy performs comparably to the strong model alone but does so with a 40% reduction in cost. This result highlights significant potential for deploying a cost-effective yet powerful code generation system in practical applications. While prior research has explored strong-weak collaboration in various forms such as LLM cascades and context augmentation, this study systematically analyzes these methods specifically in the context of repository-level code generation.
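To make the cost comparison concrete, the following is a minimal sketch of how a relative cost reduction and a cost per resolved issue could be computed. The `RunSummary` type, the function names, and all numbers are illustrative assumptions, not the paper's code or results; the placeholder values are chosen only so that the example reproduces a 40% reduction at equal resolve rate.

```python
from dataclasses import dataclass


@dataclass
class RunSummary:
    """Hypothetical summary of one evaluation run (not the paper's schema)."""
    name: str
    resolved: int     # number of issues resolved
    total: int        # number of issues attempted
    cost_usd: float   # total spend for the run

    @property
    def resolve_rate(self) -> float:
        return self.resolved / self.total


def cost_reduction(baseline: RunSummary, candidate: RunSummary) -> float:
    """Fraction of the baseline's cost that the candidate saves."""
    return 1.0 - candidate.cost_usd / baseline.cost_usd


def cost_per_resolved(run: RunSummary) -> float:
    """Spend per resolved issue; infinite if nothing was resolved."""
    if run.resolved == 0:
        return float("inf")
    return run.cost_usd / run.resolved


# Placeholder numbers for illustration only, NOT results from the paper.
strong_only = RunSummary("strong-only", resolved=100, total=300, cost_usd=50.0)
collab = RunSummary("collaboration", resolved=100, total=300, cost_usd=30.0)

if __name__ == "__main__":
    print(f"cost reduction: {cost_reduction(strong_only, collab):.0%}")
    print(f"cost per resolved issue: {cost_per_resolved(collab):.2f} USD")
```

Reporting both numbers is useful because a strategy can look cheap overall while still paying more per resolved issue if its resolve rate drops.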

Methodological Approach

The study examines a diverse range of models and configurations, including both proprietary and open-source LLMs. The experiments are performed on the SWE-Bench Lite dataset, which consists of 300 GitHub issues drawn from Python repositories. Agentless Lite serves as the framework; it performs retrieval-augmented code generation in two steps: retrieving relevant documents, then iteratively generating code patches.
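The two-step retrieve-then-generate structure can be sketched as follows. This is a toy illustration under stated assumptions, not Agentless Lite's actual API: the token-overlap retriever, the `generate_patch` loop, the stub generator, and the validity check are all hypothetical stand-ins for an embedding-based retriever and an LLM call.

```python
import re
from typing import Callable, Optional


def tokenize(text: str) -> set[str]:
    """Lowercased word tokens; a crude stand-in for a real retriever's features."""
    return set(re.findall(r"[a-z_]+", text.lower()))


def retrieve(issue: str, files: dict[str, str], k: int = 2) -> list[str]:
    """Step 1: rank files by token overlap with the issue text (toy retriever)."""
    query = tokenize(issue)
    ranked = sorted(files, key=lambda p: (-len(query & tokenize(files[p])), p))
    return ranked[:k]


def generate_patch(
    issue: str,
    context: list[str],
    generator: Callable[[str, list[str], int], str],
    is_valid: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Step 2: call the generator repeatedly until a patch passes validation."""
    for attempt in range(max_attempts):
        patch = generator(issue, context, attempt)
        if is_valid(patch):
            return patch
    return None


# Hypothetical repository contents and issue text, for illustration only.
files = {
    "calc/ops.py": "def divide(a, b): return a / b",
    "calc/io.py": "def read_numbers(path): ...",
    "README.md": "calculator project",
}
issue = "divide raises ZeroDivisionError when b is zero"


def stub_generator(issue: str, context: list[str], attempt: int) -> str:
    # Stand-in for an LLM call: fails once, then emits a diff header.
    return "" if attempt == 0 else "--- a/calc/ops.py\n+++ b/calc/ops.py\n"


top = retrieve(issue, files, k=1)
patch = generate_patch(issue, top, stub_generator, lambda p: p.startswith("--- "))
```

In a real setup the validity check would typically be something like "the patch parses and applies cleanly", and each failed attempt adds to the generation cost.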

From this analysis, the authors develop a taxonomy of collaboration strategies. It covers cost-equated weak-only baselines such as self-consistency, as well as planning and pipeline methods such as Strong LM First and Weak LM First, among others. Each method is evaluated on both performance and cost, revealing trends across different strong-weak model pairs.
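As an illustration of a pipeline-style strategy, the sketch below shows one plausible reading of the name "Weak LM First": the weak model attempts the issue, and the strong model is called only if the weak model's output fails validation. This summary does not specify the method's details, so the fallback rule, the per-call costs, and the stub models are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Optional

Model = Callable[[str], Optional[str]]  # maps an issue to a patch, or None


@dataclass
class Result:
    patch: Optional[str]
    last_model: str   # which model was called last
    cost: float       # accumulated cost for this issue


def weak_first(
    issue: str,
    weak: Model,
    strong: Model,
    is_valid: Callable[[str], bool],
    weak_cost: float,
    strong_cost: float,
) -> Result:
    """Try the weak model; escalate to the strong model only on failure."""
    patch = weak(issue)
    cost = weak_cost
    if patch is not None and is_valid(patch):
        return Result(patch, "weak", cost)
    patch = strong(issue)
    cost += strong_cost
    if patch is not None and not is_valid(patch):
        patch = None
    return Result(patch, "strong", cost)


# Stub models and costs, for illustration only.
def weak_stub(issue: str) -> Optional[str]:
    return "patch: fix typo" if "typo" in issue else None


def strong_stub(issue: str) -> Optional[str]:
    return f"patch: {issue}"


issues = ["fix typo in docs", "race condition in scheduler"]
results = [
    weak_first(i, weak_stub, strong_stub, lambda p: p.startswith("patch:"), 1.0, 10.0)
    for i in issues
]
total_cost = sum(r.cost for r in results)   # 1.0 + 11.0
strong_only_cost = 10.0 * len(issues)       # 20.0
```

The design trade-off is visible in the accounting: escalated issues cost more than a strong-only call, so the approach pays off only when the weak model resolves enough issues on its own.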

Implications and Future Directions

The empirical results show that pipeline and context-based collaboration strategies are, on average, the most efficient under budget constraints. For practical use, the paper recommends choosing a strategy according to the specific performance requirements and budget at hand. Notably, methods such as Strong LM First and Weak Router showed substantial promise under different budgetary conditions.
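Choosing a strategy under a budget can be framed as a small selection problem: keep only the strategies that fit the budget, then take the best performer. The sketch below also computes the cost-performance Pareto frontier. The strategy names and numbers are generic placeholders, not the paper's measurements, and the helper functions are hypothetical.

```python
from typing import NamedTuple, Optional


class Strategy(NamedTuple):
    name: str
    cost: float          # cost in arbitrary units
    resolve_rate: float  # fraction of issues resolved


def best_under_budget(strategies: list[Strategy], budget: float) -> Optional[Strategy]:
    """Highest resolve rate among affordable strategies; cheaper wins ties."""
    affordable = [s for s in strategies if s.cost <= budget]
    if not affordable:
        return None
    return max(affordable, key=lambda s: (s.resolve_rate, -s.cost))


def pareto_frontier(strategies: list[Strategy]) -> list[Strategy]:
    """Strategies not beaten on resolve rate by any cheaper-or-equal option."""
    frontier: list[Strategy] = []
    best_rate = float("-inf")
    for s in sorted(strategies, key=lambda s: (s.cost, -s.resolve_rate)):
        if s.resolve_rate > best_rate:
            frontier.append(s)
            best_rate = s.resolve_rate
    return frontier


# Placeholder values for illustration only, NOT results from the paper.
candidates = [
    Strategy("strategy_a", cost=1.0, resolve_rate=0.10),
    Strategy("strategy_b", cost=4.0, resolve_rate=0.25),
    Strategy("strategy_c", cost=6.0, resolve_rate=0.20),
    Strategy("strategy_d", cost=10.0, resolve_rate=0.30),
]

choice = best_under_budget(candidates, budget=5.0)
frontier_names = [s.name for s in pareto_frontier(candidates)]
```

Here `strategy_c` is dominated because `strategy_b` is both cheaper and better, so it never appears on the frontier regardless of budget.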

The findings have practical implications for software engineering and for deploying LLMs in production environments where cost efficiency is vital. The insights from the study could also inform the development of more effective LLM systems for complex code generation tasks. The same principles may apply beyond software engineering, to other fields that rely on code or text generation.

Conclusion

This study advances the field by providing actionable insights into the cost-performance trade-offs of strong-weak model collaboration in code generation. Future work may extend these techniques to other domains and explore more complex architectures. The results lay groundwork for research on balancing model performance against economic constraints in LLM deployment.
