StyleMBPP: Code Style & Functionality Benchmark
- StyleMBPP is a benchmark designed to assess LLMs’ ability to generate functionally correct code while meeting several Python style constraints.
- It extends the MBPP dataset by incorporating up to six automatically verifiable style instructions per problem for multi-instruction evaluation.
- Experimental results show a marked drop in prompt-level accuracy with increasing style constraints, highlighting challenges for multi-constraint code generation.
Style-aware Mostly Basic Programming Problems (StyleMBPP) is a benchmark designed to rigorously evaluate the ability of large language models (LLMs) to generate functionally correct code while adhering to multiple simultaneous Python style constraints. Developed to extend the original MBPP dataset’s focus on functional correctness, StyleMBPP introduces up to six concurrent, automatically verifiable style instructions per task, allowing systematic assessment of LLMs’ multi-instruction following under realistic software engineering conditions (Harada et al., 25 Sep 2025).
1. Motivation and Distinction from Existing Benchmarks
The primary motivation for StyleMBPP is the practical need in modern software development to enforce not only semantic correctness but also style- and policy-driven requirements such as license notices, indentation norms, and docstring formats. While benchmarks like MBPP (Austin et al., 2021) assess only whether models can synthesize code that passes specified unit tests, StyleMBPP systematically probes LLMs’ ability to comply with a configurable set of code style constraints simultaneously. This enables multi-instruction evaluation, a capability not addressed by prior benchmarks.
2. Task Formulation and Benchmark Construction
Each StyleMBPP item consists of a standard MBPP task (i.e., a function description accompanied by assert-based unit tests) augmented with between 1 and 6 style instructions. These instructions are selected from a fixed set of six Python style rules, ensuring that all constraints are independently verifiable. The benchmark construction procedure is as follows:
- For each of the 500 MBPP problems, six prompt variants are generated, each adding a non-conflicting subset of instructions (n = 1…6).
- Style instruction subsets that are conflicting (e.g., mutually exclusive indentation rules) or extremely difficult to satisfy in isolation are excluded.
- All style constraints are defined so as to admit programmatic verification via Pylint or deterministic text-pattern matching.
- The full benchmark totals 3,000 samples (500 tasks × 6 instruction-count strata).
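The construction procedure above can be sketched roughly as follows. This is a minimal illustration, not the authors' generation script: the rule labels in `STYLE_RULES`, the `build_variants` helper, and the seed are hypothetical, and the conflict-filtering step is omitted because the six rules listed in this article do not conflict with one another.

```python
import random

# Hypothetical short labels for the six style rules (names are assumptions).
STYLE_RULES = [
    "mit_license",
    "indent_2_spaces",
    "function_docstring",
    "conditional_comparison",
    "max_line_length_79",
    "min_variable_name_length_3",
]


def build_variants(task_ids, rules=STYLE_RULES, seed=0):
    """Return one prompt variant per (task, instruction count n) pair."""
    rng = random.Random(seed)
    variants = []
    for task_id in task_ids:
        for n in range(1, len(rules) + 1):
            # random.sample draws n distinct rules without replacement.
            subset = sorted(rng.sample(rules, n))
            variants.append({"task_id": task_id, "n": n, "rules": subset})
    return variants


# 500 tasks x 6 instruction counts gives the 3,000 samples stated above.
variants = build_variants(range(500))
print(len(variants))
```

Stratifying by instruction count in this way gives every task exactly one prompt at each value of n, so accuracy can be compared across strata on the same underlying problems.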
3. Inventory of Style Instructions
The six style instructions within StyleMBPP were selected for prevalence in standard Python style guides and for their ease of automated validation:
| Style Rule | Description | Verification Method |
|---|---|---|
| MIT License notice | File must begin with a standard MIT license header | Header text pattern matching |
| Indentation (2 spaces) | Exactly two spaces per indentation level, no tabs | Indentation analysis |
| Function docstring | Every function requires a triple-quoted docstring | AST parsing and pattern search |
| Conditional comparison | Avoid == True/== False/== None, use idiomatic checks | Abstract syntax analysis |
| Characters per line | No line may exceed 79 characters | Line length scan |
| Variable name length | All variable names at least three characters long | Source token parsing |
Across the benchmark, randomly sampled, non-conflicting instruction subsets are assigned to each prompt variant, with the instruction count n ranging from 1 to 6.
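To make the idea of automatically verifiable style rules concrete, the following sketch implements one simple checker per rule using only the Python standard library. It is an approximation under stated assumptions, not the benchmark's released verifiers: the marker string `MIT_MARKER`, all function names, and the rule labels are hypothetical, the variable-name check uses the AST rather than token parsing, and the docstring check cannot confirm that triple quotes were used.

```python
import ast
import io
import tokenize

# Assumed marker text; the benchmark's exact header pattern is not given here.
MIT_MARKER = "MIT License"


def has_mit_header(source):
    """The first non-blank line must be a comment containing the marker."""
    for line in source.splitlines():
        if line.strip():
            return line.lstrip().startswith("#") and MIT_MARKER in line
    return False


def uses_two_space_indent(source):
    """Each INDENT token must add exactly two spaces and contain no tabs."""
    widths = [0]
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            leading = tok.line[: len(tok.line) - len(tok.line.lstrip())]
            if "\t" in leading or len(leading) - widths[-1] != 2:
                return False
            widths.append(len(leading))
        elif tok.type == tokenize.DEDENT:
            widths.pop()
    return True


def functions_have_docstrings(source):
    """Every (async) function definition must carry a docstring."""
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    return all(
        ast.get_docstring(node) is not None
        for node in ast.walk(ast.parse(source))
        if isinstance(node, func_types)
    )


def avoids_literal_comparisons(source):
    """Reject == / != against the literals True, False, or None."""
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        for op, left, right in zip(node.ops, operands, operands[1:]):
            if not isinstance(op, (ast.Eq, ast.NotEq)):
                continue
            for side in (left, right):
                # Identity checks avoid treating 1 or 0 as True or False.
                if isinstance(side, ast.Constant) and any(
                    side.value is literal for literal in (True, False, None)
                ):
                    return False
    return True


def lines_within_limit(source, limit=79):
    """No physical line may exceed the character limit."""
    return all(len(line) <= limit for line in source.splitlines())


def variable_names_long_enough(source, minimum=3):
    """Check assigned names and function arguments (AST-based stand-in)."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            if len(node.id) < minimum:
                return False
        elif isinstance(node, ast.arg) and len(node.arg) < minimum:
            return False
    return True


VERIFIERS = {
    "mit_license": has_mit_header,
    "indent_2_spaces": uses_two_space_indent,
    "function_docstring": functions_have_docstrings,
    "conditional_comparison": avoids_literal_comparisons,
    "max_line_length_79": lines_within_limit,
    "min_variable_name_length_3": variable_names_long_enough,
}


def check_style(source, rules):
    """Map each requested rule label to True (satisfied) or False."""
    return {rule: VERIFIERS[rule](source) for rule in rules}


SAMPLE = (
    "# MIT License\n"
    "def add_numbers(first, second):\n"
    '  """Return the sum of two numbers."""\n'
    "  if first is None:\n"
    "    return second\n"
    "  return first + second\n"
)
print(check_style(SAMPLE, VERIFIERS))
```

Because each checker is a pure function of the source text, results are deterministic, which is the property the benchmark relies on for reproducible scoring.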
4. Evaluation Metrics and Protocol
StyleMBPP employs two principal metrics, mirroring the nomenclature introduced in FollowBench, and additionally reports functional correctness on its own:
- Instruction-level Accuracy (Soft Satisfaction Rate): The fraction of individual style instructions satisfied, averaged across all prompts and instructions. This quantifies partial compliance.
- Prompt-level Accuracy (Hard Satisfaction Rate): The fraction of prompts where all style instructions are simultaneously satisfied and all functional unit tests pass. This strict metric reflects real-world “acceptance gate” conditions.
- Functional correctness rate is also reported independently, measuring unit-test pass rate irrespective of style compliance.
All metrics are computed using deterministic checks on code outputs, facilitating scalable and reproducible assessment.
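The three reported quantities can be computed from per-prompt results along these lines. This is a sketch under assumptions: the `PromptResult` container and function names are hypothetical, and instruction-level accuracy is computed here by pooling all instructions across prompts, which is one reading of "averaged across all prompts and instructions".

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PromptResult:
    """Outcome for one generated solution (hypothetical container)."""
    style_checks: List[bool]  # one entry per style instruction in the prompt
    tests_passed: bool        # True if all unit tests passed


def instruction_level_accuracy(results):
    """Fraction of individual style instructions satisfied (pooled)."""
    total = sum(len(r.style_checks) for r in results)
    satisfied = sum(sum(r.style_checks) for r in results)
    return satisfied / total


def prompt_level_accuracy(results):
    """Fraction of prompts with every style check and all tests passing."""
    hits = sum(all(r.style_checks) and r.tests_passed for r in results)
    return hits / len(results)


def functional_correctness(results):
    """Unit-test pass rate, ignoring style compliance."""
    return sum(r.tests_passed for r in results) / len(results)


# Toy data for illustration only; these are not results from the benchmark.
results = [
    PromptResult([True, True], True),
    PromptResult([True, False, True], True),
    PromptResult([True], False),
]
print(instruction_level_accuracy(results))
print(prompt_level_accuracy(results))
print(functional_correctness(results))
```

In the toy data, the second prompt misses one of three instructions and the third fails its tests, so only one prompt of three counts toward prompt-level accuracy even though five of six instructions are satisfied. This gap between the soft and hard rates is what the two metrics are designed to expose.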
5. Model Evaluation and Experimental Findings
Evaluations were conducted on ten LLMs, covering both closed-source (GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, o3-mini) and open-source (Gemma 2 (2B/9B), Llama 3.1-8B, Qwen 2.5-72B, DeepSeek-V3/R1) models. Critical quantitative findings include:
- Functional test pass rate consistently exceeded 90% across all models, even when six style rules were imposed.
- Prompt-level Accuracy, however, declined markedly with increasing instruction count. For example, GPT-4o dropped from approximately 0.93 to 0.68 as instructions were added, while Gemini 1.5 Pro and Claude 3.5 Sonnet degraded more sharply, falling to roughly 0.13 and 0.01, respectively.
- Instruction-level Accuracy exhibited a more gradual decline, indicating that LLMs tend to miss isolated instructions rather than fail entirely at multi-constraint prompts.
This systematic performance drop with increasing instruction count highlights a significant challenge for LLM deployment in production software pipelines that enforce multifaceted coding standards.
6. Regression Modeling and Practical Estimation
Given the combinatorial explosion of possible instruction subsets, StyleMBPP incorporates a regression-based estimation framework. A logistic regression model was found most effective in predicting prompt-level accuracy as a function of instruction count.
The model was trained on 1,458 held-out samples (distinct tasks and three-instruction prompts) and assessed by mean absolute error and Pearson correlation on out-of-distribution instruction combinations. Training exclusively on a restricted subset of prompts sufficed to predict accuracies within ~0.09 absolute error, demonstrating that modest calibration sets can efficiently characterize model limits under multi-instruction regimes.
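As an illustration of this kind of estimation, the sketch below fits a single-feature logistic curve, assumed here to take the standard form p(n) = 1 / (1 + exp(-(b0 + b1·n))), to pass/fail counts at a few instruction counts and then extrapolates to larger ones. The exact parameterization and features used by the authors are not reproduced in this text, so the functional form, the Newton-step fitting routine, the choice of n ≤ 3 as the calibration range, and the synthetic data are all assumptions made for demonstration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(counts, iterations=25):
    """Fit p(n) = sigmoid(b0 + b1 * n) by Newton's method.

    counts: iterable of (n, successes, trials) tuples.
    """
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for n, successes, trials in counts:
            p = sigmoid(b0 + b1 * n)
            resid = successes - trials * p      # gradient contribution
            weight = trials * p * (1.0 - p)     # curvature contribution
            g0 += resid
            g1 += resid * n
            h00 += weight
            h01 += weight * n
            h11 += weight * n * n
        # Solve the 2x2 Newton system explicitly.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Synthetic data generated from a known curve; NOT results from the paper.
TRUE_B0, TRUE_B1, TRIALS = 3.0, -0.5, 1000
data = [
    (n, round(TRIALS * sigmoid(TRUE_B0 + TRUE_B1 * n)), TRIALS)
    for n in range(1, 7)
]

# Calibrate on small instruction counts only, then predict the larger ones.
b0, b1 = fit_logistic([row for row in data if row[0] <= 3])
held_out = [row for row in data if row[0] > 3]
mae = sum(
    abs(sigmoid(b0 + b1 * n) - successes / trials)
    for n, successes, trials in held_out
) / len(held_out)
print(b0, b1, mae)
```

Because the synthetic data follow a logistic curve by construction, the extrapolation error here is small. On real model outputs the fit quality would depend on how closely prompt-level accuracy actually tracks a logistic trend in n.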
7. Significance, Best Practices, and Recommendations
StyleMBPP reveals that contemporary LLMs, while maintaining high functional reliability, face substantial difficulty scaling to simultaneous satisfaction of syntactic and stylistic requirements. Certain style combinations, such as indentation with line-length, are especially prone to induce errors. Experimental results also indicated that models supporting explicit reasoning prompts or chain-of-thought (“CoT”) perform more robustly under increasing instruction load.
The dataset, verifiers, and evaluation scripts are released under CC BY 4.0, enabling reproducibility and extension. Recommended best practices include:
- Using multi-instruction evaluation frameworks like StyleMBPP in addition to functional-only benchmarks to assess readiness for enterprise and regulated settings.
- Applying simple logistic modeling to estimate multi-constraint performance from limited calibration data, reducing the need for exhaustive and costly evaluations.
- Incorporating explicit reasoning or CoT prompting strategies in system design, given observed improvements among reasoning-capable models (Harada et al., 25 Sep 2025).
A plausible implication is that further progress in LLM code generation will require targeted training or specialized architectures to jointly reason over functional and style constraints as instruction count increases.