This paper introduces CodeVisionary, an agent-based framework designed to evaluate the code generation capabilities of LLMs (Wang et al., 18 Apr 2025). It addresses the limitations of existing evaluation methods: human evaluation is costly and time-consuming, metric-based methods often require hard-to-obtain reference code or tests, and current LLM-based methods lack access to diverse knowledge sources (like up-to-date documentation, runtime information, or visual feedback) and struggle to comprehend complex code.
CodeVisionary operates in two main stages:
- Multisource Knowledge Analysis Stage: This stage aims to gather comprehensive information needed for evaluation. An LLM agent orchestrates this process through a four-phase cycle:
- Construct: Sets up an isolated, executable environment (using Docker) based on the code generation task and the LLM's response. This includes installing necessary language interpreters, dependencies, and configuration files.
- Comprehend: The agent breaks down the initial code generation task into smaller, specific requirements to better understand the evaluation scope.
- Plan: The agent formulates a step-by-step evaluation plan. Each step has a goal (e.g., "Static Linter Analysis", "Dynamic Execution Analysis") and guidance on the action to take. Possible actions include:
- Dynamic Execution Analysis: Running the code (e.g., `python test.py`, `gcc test.c && ./output`).
- Static Linter Analysis: Checking syntax, style, and potential issues using appropriate linters (`linter_analysis -f 'path'`).
- Unit Tests Analysis: Writing and executing unit tests to check functionality and reliability.
- Screenshot Analysis: Rendering front-end code (e.g., HTML/CSS) into an image and using a multimodal LLM for visual analysis (`screenshot_analysis -f 'path' -q 'query'`).
- Interaction Analysis: Simulating user interactions (clicks, hovers, input) on front-end code before screenshotting (`screenshot_analysis -a 'actions'`).
- Web Browsing Analysis: Searching the web for information, such as documentation for new technologies (`web_browse -q 'query'`).
- General Semantic Analysis: Leveraging the LLM's own understanding to evaluate code logic, complexity, security, etc.
- Bash Command: Performing file system operations (writing files, reading files, etc.).
- Analyze: The agent executes the plan step by step, interacting with the environment. It alternates between an Execute State (performing the planned action) and an Analyze State (analyzing the resulting observation from the environment and generating a report for that step using predefined templates). Hints guide the agent based on its current state, and the reports from each step are collected; a minimal sketch of this loop appears after the two-stage list below.
- Negotiation-based Scoring Stage: To address potential biases and improve the assessment of complex code, this stage employs multiple LLM agents (e.g., 3 judges) who discuss and debate the evaluation.
- Each judge initially provides a score and reasoning based on the information gathered in the first stage and predefined criteria (correctness, functionality, clarity).
- Scores and reasons are shared among judges.
- Judges engage in multiple rounds of discussion (e.g., up to 5 rounds). In each round, a judge can maintain their score, change their score with justification, or query another judge.
- The process terminates when consensus is reached or the maximum number of rounds is exceeded.
- The final evaluation score is the average of the judges' final scores; a minimal sketch of this negotiation loop also follows the list.
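The paper describes the first stage's execute-and-analyze loop only at the level of phases and actions, so the following Python sketch is one plausible wiring of it under stated assumptions: `EvaluationStep`, `run_in_sandbox`, `analyze_observation`, and `run_plan` are hypothetical names, and only the step goals and example commands are taken from the action list above.

```python
import subprocess
from dataclasses import dataclass

# Hypothetical sketch of the Execute/Analyze alternation described above.
# The step goals and example commands mirror the paper's action list; the
# helper names, sandbox wrapper, and report format are assumptions.

@dataclass
class EvaluationStep:
    goal: str      # e.g. "Static Linter Analysis"
    command: str   # shell command realizing the planned action

def run_in_sandbox(command: str) -> str:
    """Execute State: run one planned action inside the isolated environment."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=300)
    return result.stdout + result.stderr

def analyze_observation(step: EvaluationStep, observation: str) -> str:
    """Analyze State: in CodeVisionary this is an LLM call filling a
    predefined report template; here it is stubbed as plain Markdown."""
    return f"## {step.goal}\n\nObservation:\n{observation[:2000]}\n"

def run_plan(plan: list[EvaluationStep]) -> list[str]:
    """Alternate between executing each planned step and analyzing its output."""
    reports = []
    for step in plan:
        observation = run_in_sandbox(step.command)              # Execute State
        reports.append(analyze_observation(step, observation))  # Analyze State
    return reports

# Example plan echoing the action types listed above; `linter_analysis` and
# `screenshot_analysis` are framework-provided tools, not standard shell commands.
plan = [
    EvaluationStep("Dynamic Execution Analysis", "python test.py"),
    EvaluationStep("Static Linter Analysis", "linter_analysis -f 'test.py'"),
    EvaluationStep("Screenshot Analysis",
                   "screenshot_analysis -f 'index.html' -q 'Does the layout match the task?'"),
]
step_reports = run_plan(plan)
```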
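As a rough illustration of the negotiation-based scoring stage, the sketch below assumes a `call_llm` stub for the judge model, a reply format with the score on the first line, and an exact-agreement consensus check; those details are assumptions, while the initial judgments, the round limit, the keep-or-revise options, and the final averaging follow the description above.

```python
# Hypothetical sketch of the negotiation-based scoring stage.  `call_llm`
# stands in for a call to the judge model; the prompt wording, reply format,
# and consensus tolerance are assumptions, not the paper's implementation.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("replace with a real LLM client call")

def initial_judgment(judge_id: int, evidence: str) -> tuple[float, str]:
    """Each judge scores the response against correctness, functionality, clarity."""
    reply = call_llm(f"[Judge {judge_id}] Score 0-10 on the first line, then reasoning:\n{evidence}")
    score_line, _, reason = reply.partition("\n")
    return float(score_line.strip()), reason.strip()

def negotiate(evidence: str, n_judges: int = 3, max_rounds: int = 5) -> float:
    judgments = [initial_judgment(i, evidence) for i in range(n_judges)]
    scores = [s for s, _ in judgments]
    reasons = [r for _, r in judgments]
    for _ in range(max_rounds):
        if max(scores) - min(scores) < 1e-6:          # consensus reached
            break
        for i in range(n_judges):                     # each judge sees the others' views
            peer_view = "\n".join(f"Judge {j}: {scores[j]} ({reasons[j]})"
                                  for j in range(n_judges) if j != i)
            reply = call_llm(
                f"[Judge {i}] Your score: {scores[i]}. Peer judgments:\n{peer_view}\n"
                "Keep your score, revise it with justification, or ask another judge a question."
            )
            score_line, _, reason = reply.partition("\n")
            scores[i], reasons[i] = float(score_line.strip()), reason.strip()
    return sum(scores) / n_judges                     # final score = average of judges
```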
Evaluation Report Generation:
CodeVisionary generates a detailed Markdown report summarizing the entire evaluation process. This report includes:
- The original code task and the LLM's response.
- The final evaluation score.
- Details of the environment setup.
- The decomposed task requirements.
- Step-by-step results from the analysis stage (including execution outputs, linter messages, screenshots, test results, etc.).
- The final evaluation reasoning derived from the negotiation stage.
- Optimization suggestions for the evaluated code.

The report is automatically formatted using Prettier and can be converted to PDF using Pandoc, as sketched below.
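Assuming the per-step reports and the negotiation output are already available as strings, a minimal sketch of the report assembly and the Prettier/Pandoc post-processing might look as follows; the section layout, function name, and file name are illustrative, and the sketch assumes the `prettier` and `pandoc` CLIs are installed.

```python
import subprocess
from pathlib import Path

def write_report(task: str, response: str, score: float,
                 step_reports: list[str], reasoning: str,
                 suggestions: str, path: str = "evaluation_report.md") -> None:
    """Assemble the Markdown evaluation report, format it with Prettier,
    and convert it to PDF with Pandoc (both CLIs assumed to be on PATH)."""
    sections = [
        "# CodeVisionary Evaluation Report",
        "## Task\n\n" + task,
        "## LLM Response\n\n" + response,
        f"## Final Score\n\n{score:.2f}",
        "## Step-by-Step Analysis\n\n" + "\n\n".join(step_reports),
        "## Evaluation Reasoning\n\n" + reasoning,
        "## Optimization Suggestions\n\n" + suggestions,
    ]
    Path(path).write_text("\n\n".join(sections), encoding="utf-8")
    subprocess.run(["prettier", "--write", path], check=True)                        # auto-format Markdown
    subprocess.run(["pandoc", path, "-o", path.replace(".md", ".pdf")], check=True)  # convert to PDF
```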
Implementation & Experiments:
- The framework uses an LLM (GPT-4o in experiments) as the controlling agent.
- Interactions involve the agent outputting "thought" (reasoning) and "action" (command to execute).
- Experiments were conducted on a benchmark derived from CodeArena (hard tasks), with responses generated by GPT-3.5-turbo, Claude-3.5-Sonnet, and GPT-4o, and manually scored by experts.
- CodeVisionary significantly outperformed baseline LLM-based evaluators (VANILLA, ICE-Score, CODEJUDGE) on correlation metrics (Pearson, Spearman, Kendall-Tau) against human judgments; a snippet showing how these metrics are computed follows this list.
- Ablation studies confirmed the positive impact of both the Multisource Knowledge Analysis and Negotiation-based Scoring stages.
- The framework showed strong performance across various programming languages and coding scenarios, particularly excelling in evaluating UI-related tasks (leveraging Screenshot and Interaction Analysis) and tasks involving newer technologies (leveraging Web Browsing Analysis).
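For reference, the three correlation metrics used in the experiments can be computed with SciPy as sketched below; the score lists are made up solely to illustrate the call signatures and are not results from the paper.

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Made-up example scores, purely to show the call signatures.
framework_scores = [7.3, 4.0, 8.5, 6.1, 2.7]   # e.g. scores from the automated evaluator
human_scores     = [7.0, 3.5, 9.0, 6.5, 3.0]   # e.g. expert annotations

r,   _ = pearsonr(framework_scores, human_scores)
rho, _ = spearmanr(framework_scores, human_scores)
tau, _ = kendalltau(framework_scores, human_scores)
print(f"Pearson={r:.3f}  Spearman={rho:.3f}  Kendall-Tau={tau:.3f}")
```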
In summary, CodeVisionary provides a structured, automated, and comprehensive approach to evaluating LLM-generated code. By integrating external tools, multi-source knowledge gathering, and a multi-agent negotiation process, it aims to produce more accurate, reliable, and interpretable evaluations compared to existing methods, complete with detailed reports useful for developers.