CodeAlignBench: Developer-Aligned Code Benchmark

Updated 3 November 2025
  • CodeAlignBench is a multi-language benchmark that evaluates LLMs' ability to follow nuanced, developer-authored instructions in code generation.
  • It employs a systematic pipeline using developer-sourced instructions to assess structural, semantic, and cosmetic code refinements across Python, Java, and JavaScript.
  • Automated verification, combining rule-based and LLM-driven methods, reveals model strengths in modularization alongside persistent challenges in semantic and stylistic alignment.

CodeAlignBench is a multi-language, developer-aligned benchmark for evaluating instruction-following capabilities of LLMs in code generation. Unlike traditional functional correctness metrics, CodeAlignBench directly assesses a model’s ability to implement nuanced, real-world software refinements based on developer preferences and iterative instructions, spanning multiple programming languages and a diverse taxonomy of coding adjustments.

1. Motivation for Developer-Aligned Evaluation

The proliferation of high-performance LLMs for code generation has outpaced the sophistication of existing benchmarks, which predominantly measure functional correctness—whether the generated code solves a given problem as specified. However, functional equivalence alone is insufficient for replicating practical development workflows, where code readability, style, maintainability, and adherence to user-specified constraints are crucial. To address this deficit, CodeAlignBench was designed to fill the methodological gap, providing a systematic means to evaluate instruction-following (IF) performance in code generation, particularly as it relates to developer-authored preferences and post hoc adjustments.

2. Benchmark Architecture and Instruction Taxonomy

CodeAlignBench evaluates adherence to both initial constraints (predefined instructions) and iterative, developer-authored refinements (follow-up instructions). The benchmark construction follows these key steps:

  • Task Set: Originates from LiveBench (functionally correct code variants derived from LeetCode and AtCoder).
  • Language Coverage: Encompasses Python (original), Java, and JavaScript using an automated translation and evaluation pipeline.
  • Developer-Sourced Instruction Collection: Thirty professional developers compared functionally equivalent code solutions and authored natural language instructions to modify less-preferred code into more desirable forms. This resulted in a validated catalog of 228 instructions, covering:
    • Cosmetic: comments, naming conventions, code style.
    • Structural: modularization, elimination of duplication, organization.
    • Semantic: algorithmic changes, performance optimizations, correctness fixes.

Instructions were rigorously categorized via human open coding and further refined with LLM assistance to ensure language-agnostic applicability.
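
The released catalog schema is not reproduced here; the following minimal Python sketch shows one plausible way to represent a developer-authored instruction under this taxonomy. All class and field names (Instruction, InstructionType, subcategory) and the example entry are illustrative assumptions, not the benchmark's actual data model.

```python
from dataclasses import dataclass
from enum import Enum


class InstructionType(Enum):
    """Top-level categories from the developer study."""
    COSMETIC = "cosmetic"        # comments, naming conventions, code style
    STRUCTURAL = "structural"    # modularization, deduplication, organization
    SEMANTIC = "semantic"        # algorithmic, performance, correctness changes


@dataclass
class Instruction:
    """One developer-authored, language-agnostic refinement instruction."""
    instruction_id: str
    text: str                    # the natural-language instruction as written
    category: InstructionType
    subcategory: str             # e.g. "naming", "modularization", "performance"


# Hypothetical entry, not copied from the released catalog of 228 instructions.
example = Instruction(
    instruction_id="struct-012",
    text="Extract the repeated parsing logic into a single helper function.",
    category=InstructionType.STRUCTURAL,
    subcategory="deduplication",
)
```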

3. Automated Benchmarking Pipeline

The pipeline is designed for modularity, extensibility, and rapid experimentation:

  • Task Construction: For each code sample, relevant instruction categories are determined via an applicability checker, yielding only those tasks where specific instructions are pertinent.
  • Evaluation Scenarios: IF is assessed in two modes—
    • Predefined: Instructions provided upfront in the initial prompt.
    • Follow-up: Instructions introduced after code generation for iterative adjustment.
  • Verification Module: Each instruction is paired with a verification function (verify), which may be rule-based or LLM-driven (LLM-as-judge). This module determines whether the instruction has been correctly incorporated into the code output.

This architecture supports automated, language-agnostic expansion, enabling the benchmark to scale to new tasks and languages with minimal overhead. It is lightweight and executable on standard computational infrastructure, and intentionally minimizes evaluation contamination by leveraging LiveBench’s data isolation protocols.
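
The paper describes these components at a high level; the sketch below illustrates, under stated assumptions, how an applicability-filtered task, the two prompting modes, and a pluggable verify function could fit together. Every name here (build_prompt, rule_based_verify, llm_judge_verify, evaluate_task) is hypothetical, and the rule-based check is a toy example rather than the benchmark's real verifier.

```python
from typing import Callable, Optional

# A verifier takes generated code plus the instruction text and returns True
# if the instruction was incorporated. Verifiers may be rule-based or
# LLM-driven (LLM-as-judge); both share this interface.
Verifier = Callable[[str, str], bool]


def rule_based_verify(code: str, instruction: str) -> bool:
    """Toy rule for an 'add comments' instruction: pass if any comment exists."""
    return any(line.strip().startswith("#") for line in code.splitlines())


def llm_judge_verify(code: str, instruction: str) -> bool:
    """Placeholder for an LLM-as-judge call: prompt a judge model with the
    code and instruction, then parse a yes/no verdict."""
    raise NotImplementedError("call a judge model here")


def build_prompt(problem: str, instruction: str,
                 prior_code: Optional[str], mode: str) -> str:
    """The two evaluation scenarios: instructions upfront vs. as a follow-up."""
    if mode == "predefined":
        return f"{problem}\n\nAdditionally, follow this instruction:\n{instruction}"
    # Follow-up: the model already produced a correct solution and now refines it.
    return f"Here is your previous solution:\n{prior_code}\n\nNow apply:\n{instruction}"


def evaluate_task(generate: Callable[[str], str], problem: str,
                  instruction: str, verifier: Verifier,
                  mode: str = "predefined") -> bool:
    """Run one instruction-following task and report whether it was satisfied."""
    prior = generate(problem) if mode == "follow-up" else None
    output = generate(build_prompt(problem, instruction, prior, mode))
    return verifier(output, instruction)
```

Keeping the verifier behind a single callable interface is what lets rule-based and LLM-as-judge checks be mixed per instruction category, consistent with the modular design described above.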

4. Empirical Evaluation Protocol

CodeAlignBench was used to benchmark ten LLMs from OpenAI (GPT family), Anthropic (Claude Sonnet), and Google (Gemini). Evaluations consisted of:

  • Model Input: For each sample, models received either the problem and instruction (predefined) or were asked to modify existing correct code via a new instruction (follow-up).
  • Metrics:
    • Success Rate: Fraction of tasks in which the model’s output satisfied the instruction, as judged by the applicable verification strategy.
    • Statistical Analysis: Wilcoxon signed-rank tests for within-model setting comparisons (e.g., predefined vs. follow-up) and Friedman tests (with Kendall's W) for between-instruction-type performance; a computational sketch follows this list.
    • LLM-as-Judge Reliability: Human–LLM agreement quantified via Cohen's kappa and accuracy; empirical findings indicated 87% agreement on whether instructions were accomplished.
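
The statistical machinery is standard; a minimal sketch of how these metrics could be computed from per-model success rates is shown below, assuming SciPy and scikit-learn as the toolchain (an assumption, not stated in the source) and using placeholder numbers rather than the paper's actual results.

```python
import numpy as np
from scipy.stats import wilcoxon, friedmanchisquare
from sklearn.metrics import cohen_kappa_score

# Per-model success rates (share of tasks whose verifier passed).
# These are placeholder values, not the paper's measurements.
predefined = np.array([0.52, 0.48, 0.61, 0.55, 0.50, 0.58, 0.47, 0.63, 0.56, 0.51])
follow_up  = np.array([0.70, 0.66, 0.78, 0.71, 0.69, 0.74, 0.64, 0.80, 0.73, 0.68])

# Wilcoxon signed-rank test: within-model comparison of the two settings.
w_stat, w_p = wilcoxon(predefined, follow_up)
print(f"Wilcoxon: statistic={w_stat:.2f}, p={w_p:.4f}")

# Friedman test across instruction types (one success rate per model and type).
structural = np.array([0.75, 0.71, 0.82, 0.77, 0.73, 0.79, 0.70, 0.84, 0.78, 0.72])
semantic   = np.array([0.58, 0.55, 0.66, 0.60, 0.57, 0.62, 0.53, 0.69, 0.61, 0.56])
cosmetic   = np.array([0.61, 0.57, 0.69, 0.63, 0.59, 0.65, 0.56, 0.71, 0.64, 0.58])
chi2, f_p = friedmanchisquare(structural, semantic, cosmetic)
kendalls_w = chi2 / (len(structural) * (3 - 1))   # W = chi^2 / (N * (k - 1))
print(f"Friedman: chi2={chi2:.2f}, p={f_p:.4f}, Kendall's W={kendalls_w:.2f}")

# LLM-as-judge reliability against human raters (1 = instruction followed).
human = [1, 0, 1, 1, 0, 1, 1, 0, 1, 1]
judge = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1]
print(f"Cohen's kappa: {cohen_kappa_score(human, judge):.2f}")
```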

A summary table consolidates benchmark characteristics:

| Feature | Detail |
| --- | --- |
| Task Source | LiveBench (LeetCode/AtCoder) + developer input |
| Languages | Python, Java, JavaScript |
| Instruction Catalog | 228 verified, developer-written instructions |
| Instruction Types | Cosmetic, Structural, Semantic (algorithmic, performance, correctness) |
| Evaluation Scenarios | Predefined and Follow-up |
| Verification | Rule-based / LLM-as-judge |
| Metrics | Success rate, Wilcoxon/Friedman, Cohen's kappa |
| Key Findings | Follow-up ≫ Predefined; Structural > Semantic > Cosmetic; all models leave headroom |
| Unique Aspects | Real, developer-authored instructions; multi-language support; modular/extensible pipeline |

5. Key Results and Instruction-Following Analysis

  • Follow-up Instructions: Models demonstrated substantially higher IF success when instructions were provided after initial code generation. The median success-rate difference in Python was approximately 0.18 (p < 0.01), with similarly elevated gains in Java and JavaScript. This suggests LLMs process incremental, focused changes more effectively than composite instructions presented upfront.
  • Instruction Type: Structural adjustments (e.g., modularization, code deduplication) yielded the highest compliance across all models and languages. Semantic or cosmetic changes (algorithm switching, performance tuning, stylistic refinements) posed greater challenges, with statistically significant deficits.
  • Language and Model Variation: Superior performance was seen in the most recent model releases (GPT-5/mini, Claude Sonnet 4), but improvements were incremental, and no model fully captured all nuanced developer-aligned objectives. Failures in semantic and cosmetic IF were consistent across language boundaries.
  • LLM-as-Judge: Automated verification using state-of-the-art LLMs aligned well with human rater judgments, facilitating scalable, auto-validated evaluation.

Statistical testing includes:

  • Wilcoxon signed-rank test: e.g., median difference = 0.181, p < 0.01
  • Friedman test: e.g., across Python, χ²(2) = 12.20, p = 0.0022, Kendall's W = 0.61 (see the note on W below)
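
As a sanity check on the reported effect size (an inference from the numbers above, not an explicit statement in the source): Kendall's W is obtained from the Friedman statistic as W = χ²_F / (N(k − 1)). With k = 3 instruction types and the 10 evaluated models taken as the blocks, this gives 12.20 / (10 × 2) = 0.61, matching the reported value.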

6. Implications for Model Development and Evaluation

The benchmark exposes clear limitations in current code generation approaches:

  • Structural instructions (Editor's term: "refactor-class") are tractable, but even state-of-the-art LLMs fall short of fulfilling compound, algorithmic, or stylistic developer instructions.
  • Contextual, Interactive Prompting is materially more effective for model alignment than initial, composite instruction prompts.
  • Cross-Language Generalizability: Model weaknesses are broadly consistent across supported programming languages, underscoring fundamental gaps in IF capability that are independent of the syntactic or semantic particularities of individual languages.
  • Evaluation Framework: CodeAlignBench provides a modular, extensible template for future benchmarking and LLM-oriented research, supporting rapid iteration and expansion to new languages and developer preference types.

A plausible implication is that as LLMs improve, benchmarks like CodeAlignBench will remain critical for tracking progress in practical instruction alignment rather than mere functional correctness.

Relative to prior work, CodeAlignBench is most closely related to LiveBench and methodologies involving developer-aligned code adjustments. Distinctively:

  • Real Developer Instructions: Instructions are empirically sourced from active software developers, not artificially synthesized.
  • Multi-Language Breadth: The benchmark supports Python, Java, and JavaScript and is extensible to further languages via its translation-driven pipeline.
  • Instruction-Specific Verification: Adoption of rule-based and LLM-driven verification functions for diverse adjustment categories.
  • Iterative, Interactive Evaluation: Direct measurement of stepwise, real-world developer alignment, exposing LLM IF strengths and weaknesses not observed in traditional benchmarks.

CodeAlignBench advances the scope of code LLM benchmarking by incorporating real-world, developer-preferred refinements and presents a rigorous platform for quantifying and improving nuanced instruction adherence across evolving model generations.


In conclusion, CodeAlignBench systematically benchmarks the capacity of code generation models to apply developer-authored instructions across languages and adjustment categories, establishing robust empirical methodologies, highlighting model strengths in modularization and iterative refinement, and exposing persistent challenges in higher-order reasoning and stylistic alignment. Its extensible framework and empirical data-driven approach provide an authoritative foundation for aligning LLM outputs with diverse, real-world developer expectations (Mehralian et al., 31 Oct 2025).
