Beyond Single LLMs: Enhanced Code Generation via Multi-Stage Performance-Guided LLM Orchestration

Published 1 Oct 2025 in cs.SE | (2510.01379v1)

Abstract: While LLMs have become the predominant paradigm for automated code generation, current single-model approaches fundamentally ignore the heterogeneous computational strengths that different models exhibit across programming languages, algorithmic domains, and development stages. This paper challenges the single-model convention by introducing a multi-stage, performance-guided orchestration framework that dynamically routes coding tasks to the most suitable LLMs within a structured generate-fix-refine workflow. Our approach is grounded in a comprehensive empirical study of 17 state-of-the-art LLMs across five programming languages (Python, Java, C++, Go, and Rust) using the HumanEval-X benchmark. The study, which evaluates both functional correctness and runtime performance metrics (execution time, mean/max memory utilization, and CPU efficiency), reveals pronounced performance heterogeneity by language, development stage, and problem category. Guided by these empirical insights, we present PerfOrch, an LLM agent that orchestrates top-performing LLMs for each task context through stage-wise validation and rollback mechanisms. Without requiring model fine-tuning, PerfOrch achieves substantial improvements over strong single-model baselines: average correctness rates of 96.22% and 91.37% on HumanEval-X and EffiBench-X respectively, surpassing GPT-4o's 78.66% and 49.11%. Beyond correctness gains, the framework delivers consistent performance optimizations, improving execution time for 58.76% of problems with median speedups ranging from 17.67% to 27.66% across languages on two benchmarks. The framework's plug-and-play architecture ensures practical scalability, allowing new LLMs to be profiled and integrated seamlessly, thereby offering a paradigm for production-grade automated software engineering that adapts to the rapidly evolving generative AI landscape.

Summary

  • The paper introduces PerfOrch, a performance-guided orchestration framework that dynamically selects specialized LLMs to generate, fix, and refine code.
  • The methodology uses empirical performance metrics to assign model roles across the Generate, Fix, and Refine stages, reaching up to 98.78% correctness on HumanEval-X.
  • Performance gains include 17.67%-27.66% speedups, demonstrating practical improvements in automated code generation and resource optimization.

The paper "Beyond Single LLMs: Enhanced Code Generation via Multi-Stage Performance-Guided LLM Orchestration" presents a novel framework for automated code generation using multiple LLMs. Unlike traditional approaches that rely on a single model, this research introduces a multi-stage orchestration strategy that dynamically selects the most suitable LLMs to perform specific coding tasks based on empirical performance metrics.

Framework for Performance-Guided Orchestration

The core of this research is the PerfOrch framework, which systematically orchestrates multiple LLMs through a structured sequence: Generate, Fix, and Refine. This process leverages the heterogeneous strengths of different models across programming languages and code categories, optimizing both correctness and runtime performance.

Generate Stage: Utilizes models that excel in initial code generation. The task is routed based on empirical rankings of LLMs' performance across languages and problem categories.

Fix Stage: Engages specialized models capable of identifying and resolving coding errors, ensuring robustness beyond initial correctness.

Refine Stage: Focuses on runtime performance optimization, leveraging models that excel at improving execution efficiency and resource utilization without compromising correctness.

Figure 1: Flowchart of multi-stage performance-guided LLM orchestration framework.
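The three stages above can be sketched as a simple routing loop. This is a minimal illustration, not the paper's actual implementation: the model names, the `RANKINGS` table, and the `call_model`/`passes_tests` stubs are all assumptions standing in for real LLM calls and test harnesses.

```python
# Hypothetical sketch of the generate-fix-refine workflow with
# performance-guided routing. All names here are illustrative.

# Empirical rankings per (stage, language): best-performing model first.
RANKINGS = {
    ("generate", "python"): ["model-a", "model-b"],
    ("fix", "python"): ["model-c", "model-a"],
    ("refine", "python"): ["model-b", "model-c"],
}

def call_model(model, stage, task, code=None):
    # Placeholder for an actual LLM API call.
    return code or f"# code for {task!r} from {model}"

def passes_tests(code, task):
    # Placeholder: run the task's unit tests against the candidate code.
    return True

def orchestrate(task, language):
    """Route each stage to its top-ranked model, keeping only validated code."""
    code = None
    for stage in ("generate", "fix", "refine"):
        model = RANKINGS[(stage, language)][0]
        candidate = call_model(model, stage, task, code)
        # Stage-wise validation with rollback: accept a candidate only if
        # it passes the tests; otherwise keep the previous stage's code.
        if passes_tests(candidate, task):
            code = candidate
    return code

print(orchestrate("reverse a string", "python"))
```

The key design point this illustrates is that each stage's output is validated before it replaces the previous stage's result, so a failed fix or refinement can never regress a working solution.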

Implementation of PerfOrch

PerfOrch is implemented as an LLM agent that integrates these stages into a cohesive framework. The system dynamically profiles LLMs for each task context, ensuring tasks are delegated to models most likely to succeed based on historical data.

Architecture: The agent pairs stage-specific Executors with a Memory that stores empirical performance metrics and model rankings, enabling seamless adaptation to rapidly evolving LLM capabilities.

Figure 2: The design of PerfOrch, an LLM agent for automated performant code generation.
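One way to picture the Memory component is as a lookup from task context to ranked model profiles. The sketch below is an assumption about how such a ranking could be computed from stored metrics; the `Profile` fields, the scoring formula, and the `runtime_weight` parameter are illustrative, not taken from the paper.

```python
# Illustrative Memory component: rank models for a (language, category)
# context from stored empirical metrics. Names and weights are assumptions.

from dataclasses import dataclass

@dataclass
class Profile:
    model: str
    pass_rate: float       # fraction of problems solved correctly
    mean_runtime_ms: float # average execution time of generated code

def rank_models(profiles, runtime_weight=0.0001):
    """Order models by correctness first, breaking ties toward faster code."""
    return sorted(
        profiles,
        key=lambda p: p.pass_rate - runtime_weight * p.mean_runtime_ms,
        reverse=True,
    )

memory = {
    ("python", "dynamic-programming"): [
        Profile("model-a", 0.92, 140.0),
        Profile("model-b", 0.88, 90.0),
    ],
}

best = rank_models(memory[("python", "dynamic-programming")])[0]
print(best.model)  # → model-a (higher pass rate outweighs the slower runtime)
```

Because the rankings are data, profiling a newly released LLM and appending its `Profile` is enough to integrate it, which matches the plug-and-play scalability the paper emphasizes.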

Empirical Evaluation

The implementation of PerfOrch demonstrates significant improvements over single-model baselines:

  • Correctness Improvements: On the HumanEval-X and EffiBench-X benchmarks, PerfOrch achieved correctness rates of 98.78% and 88.10%, respectively, outperforming single top-tier models such as GPT-4o (78.66% and 49.11%).
  • Performance Gains: PerfOrch delivered consistent runtime optimizations, improving execution time for 58.76% of problems with median speedups of 17.67% to 27.66% across languages.

Figure 3: Pass@1 Pass Rate Comparison

Discussion and Implications

Key Insights

PerfOrch capitalizes on the diverse specializations of LLMs: the performance gains arise primarily from collaborative orchestration that compensates for the weaknesses of individual models, improving both correctness and execution efficiency and thereby optimizing code quality and resource usage together.

Design Choice Implications

The structured choice of models for each stage significantly reduces the required computational resources while maximizing the output quality. The sequential acceptance strategy for the Refine stage ensures practical resource management, avoiding the exponential cost of exhaustive model evaluations.
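The sequential acceptance idea can be made concrete with a short sketch: refinement candidates are tried one at a time, in ranked order, and the first candidate that both stays correct and measurably improves runtime is accepted, avoiding an exhaustive sweep over all models. All helpers below are illustrative stubs, not the paper's implementation.

```python
# Hedged sketch of sequential acceptance for the Refine stage.

def refine_sequentially(code, candidates, run_tests, measure_runtime):
    """Accept the first refinement that is both correct and faster."""
    baseline = measure_runtime(code)
    for propose in candidates:            # ordered by empirical ranking
        refined = propose(code)
        if not run_tests(refined):        # rollback: never trade correctness
            continue
        if measure_runtime(refined) < baseline:
            return refined                # accept the first real improvement
    return code                           # fall back to the validated code

# Toy usage: candidate 1 breaks the tests, candidate 2 is a faster rewrite.
timings = {"slow": 10.0, "broken": 1.0, "fast": 5.0}
result = refine_sequentially(
    "slow",
    [lambda c: "broken", lambda c: "fast"],
    run_tests=lambda c: c != "broken",
    measure_runtime=lambda c: timings[c],
)
print(result)  # → fast
```

Stopping at the first accepted candidate keeps the cost linear in the number of candidates in the worst case and constant in the common case, rather than paying for a full evaluation of every model at every refinement step.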

Conclusion

PerfOrch exemplifies the potential of strategically orchestrating multiple LLMs in real-world coding environments. As software applications demand ever higher performance and correctness, multi-stage orchestration frameworks like PerfOrch can significantly advance automated software engineering. Future work will extend PerfOrch to larger, more complex codebases and integrate interactive developer feedback loops to refine the orchestration logic.

Figure 4: Three solutions of HumanEval-X C++/16, including canonical solution, Claude, and PerfOrch.
