MultiOOP Benchmark for Multilingual OOP Evaluation
- The paper introduces a multilingual benchmark that evaluates LLMs on complex OOP tasks such as class design, inheritance, and encapsulation.
- It employs a bespoke translation mechanism to generate semantically equivalent tasks and tests across Python, PHP, C++, C#, Java, and JavaScript.
- The evaluation uses novel metrics like pass@o to quantify both functional and semantic fidelity, revealing significant cross-language performance gaps.
The MultiOOP benchmark is a multilingual, object-oriented programming (OOP) benchmark specifically constructed to rigorously evaluate LLMs on generative code tasks that require comprehension and implementation of OOP constructs across six major programming languages: Python, PHP, C++, C#, Java, and JavaScript. Uniquely, MultiOOP targets class-based, method-driven problem formulations, addressing substantial gaps in prior benchmarks—chiefly their single-language bias, predominance of function- or statement-level tasks, and insufficient test coverage. MultiOOP systematically extends previous benchmarks, introducing advanced translation, metric extensions, and a robust evaluation pipeline, thereby enabling fair, diagnostic, and cross-lingual assessments of LLMs' ability to generate semantically precise, structurally faithful object-oriented software.
1. Benchmark Construction and Scope
MultiOOP was developed to enable nuanced, cross-language assessment of LLMs' OOP proficiency. The dataset originates from a Python-based OOP benchmark comprising 431 tasks, from which 267 were selected to ensure that each could be semantically and syntactically preserved across all six targeted languages. Tasks selected for inclusion rigorously test OOP paradigms—class creation, inheritance hierarchies, encapsulation, method overriding, and other canonical patterns.
The selection process filtered out constructs that map poorly or ambiguously between languages (such as Pythonic list comprehensions or multiple inheritance), thereby enforcing maximum cross-lingual comparability and semantic equivalence. This careful curation ensures that generated solutions can be evaluated according to identical specifications, regardless of the surface-level syntax or language-specific constraints.
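To make the task format concrete, the following hypothetical example (not drawn from the dataset; the Shape/Rectangle names and the tests are illustrative assumptions) shows the shape of a MultiOOP-style item: a class-design problem exercising inheritance, encapsulation, and method overriding, paired with the kind of unit tests used for evaluation.

```python
# Hypothetical MultiOOP-style task (illustrative only, not from the dataset).
# Prompt (paraphrased): "Define a base class Shape with an area() method, and a
# Rectangle subclass that encapsulates width/height and overrides area()."

# --- Canonical (reference) implementation ---
class Shape:
    """Base class whose area() is meant to be overridden."""

    def area(self) -> float:
        return 0.0


class Rectangle(Shape):
    """Subclass that encapsulates its dimensions and overrides area()."""

    def __init__(self, width: float, height: float) -> None:
        self._width = width      # encapsulated state
        self._height = height

    def area(self) -> float:
        return self._width * self._height


# --- Paired unit tests in the style MultiOOP executes ---
def test_rectangle_area() -> None:
    assert Rectangle(3.0, 4.0).area() == 12.0


def test_inheritance_and_override() -> None:
    assert issubclass(Rectangle, Shape)   # inheritance requirement
    assert "area" in Rectangle.__dict__   # area() is overridden, not merely inherited


if __name__ == "__main__":
    test_rectangle_area()
    test_inheritance_and_override()
    print("all checks passed")
```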
2. Cross-Language Task Translation
A bespoke automatic translator forms the backbone of MultiOOP's multilingual infrastructure. This tool traverses the canonical Python implementation and its associated unit tests, then emits semantically equivalent tasks and corresponding tests in PHP, C++, C#, Java, and JavaScript. Crucial translation challenges addressed include:
- Function and method declaration mapping (e.g., Python `def` to C++/Java method signatures).
- Class and member accessibility (translation of `public`, `private`, etc.).
- Object instantiation and method invocation semantics.
- Inheritance and polymorphism expressions across static and dynamic languages.
The translation mechanism guarantees that class/interface names, method names, and required object interactions are faithfully mirrored across languages. The outcome is a set of isomorphic OOP tasks, enabling the disambiguation of genuine reasoning or generalization failures from mere linguistic or syntactic mismatches.
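The paper's translator is not reproduced here; the sketch below merely illustrates the kind of structural mapping involved. It parses a canonical Python class with the standard ast module and emits a Java-style skeleton whose class and method names are mirrored. Type mapping and body translation, which a real translator must handle, are omitted, and the helper names are assumptions.

```python
import ast
import textwrap

# Minimal sketch (not the MultiOOP translator): mirror class and method names
# from a canonical Python task into a Java-style class skeleton.

PY_TASK = textwrap.dedent("""
    class Account:
        def __init__(self, owner):
            self._owner = owner

        def deposit(self, amount):
            pass
""")


def python_to_java_skeleton(source: str) -> str:
    """Emit Java class stubs whose class/method names mirror the Python source."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            lines.append(f"public class {node.name} {{")
            for item in node.body:
                if isinstance(item, ast.FunctionDef) and item.name != "__init__":
                    # A real translator needs a type-mapping pass; here every
                    # parameter and return type is stubbed as Object.
                    params = ", ".join(
                        f"Object {arg.arg}"
                        for arg in item.args.args
                        if arg.arg != "self"
                    )
                    lines.append(
                        f"    public Object {item.name}({params}) {{ /* ... */ }}"
                    )
            lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(python_to_java_skeleton(PY_TASK))
```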
3. Evaluation Metrics and Their Formalization
Traditional code generation benchmarks predominantly use the pass@k metric, which counts the success rate when at least one out of k generated samples passes all test cases. MultiOOP generalizes this with the introduction of the pass@o metric, which mandates not only passing all test cases but also satisfying explicit matching of OOP-specific requirements extracted from the natural language prompt (e.g., presence of specified class and method names).
The respective formalizations, as provided in the canonical paper, are:
- pass@k: $\text{pass@}k = \mathbb{E}_{\text{problems}}\left[1 - \binom{n-c}{k}\big/\binom{n}{k}\right]$, where $n$ is the number of generated samples, $c$ is the count passing all tests, and $\mathbb{E}$ denotes the mean across problems.
- pass@o: $\text{pass@}o = \mathbb{E}_{\text{problems}}\left[1 - \binom{n-o}{k}\big/\binom{n}{k}\right]$, where $o$ denotes the number of samples passing both the test suite and the semantic matching requirements.
The additional matching requirement in pass@o ensures adherence to prescribed OOP design, thereby penalizing code that is functionally correct but structurally or semantically deficient.
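For reference, pass@k can be computed with the standard unbiased estimator; the pass@o variant below is an assumption derived from the textual definition (the count of test-passing samples is replaced by the count that also satisfies the matching requirements), not code taken from the paper.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: probability that at least one of k samples
    drawn from n generations falls among the c that pass all tests."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_at_o(n: int, o: int, k: int) -> float:
    """Same estimator, with o = samples that pass the tests AND match the
    required class/method structure (assumed analogue of the paper's pass@o)."""
    return pass_at_k(n, o, k)

# Example: per problem, n=20 samples; c pass the tests; o also satisfy matching.
problems = [(20, 5, 3), (20, 12, 12), (20, 0, 0)]
k = 1
print(sum(pass_at_k(n, c, k) for n, c, _ in problems) / len(problems))  # mean pass@1
print(sum(pass_at_o(n, o, k) for n, _, o in problems) / len(problems))  # mean pass@o at k=1
```

At k = 1 the estimator reduces to c/n (or o/n), so the gap between the two averages directly reflects samples that run correctly but miss the prescribed OOP structure.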
4. Automated Test Case Augmentation
Addressing a major shortcoming of previous benchmarks, which typically offered limited test coverage (often fewer than ten test cases per problem), MultiOOP incorporates an automated test augmentation pipeline to bolster evaluation rigor. The augmentation proceeds in three phases:
- Test Generation: The LLM (e.g., GPT-4o) is prompted with the reference implementation and example tests, producing candidate test cases.
- Test Validation: Each candidate is executed on the ground-truth code; only passing and valid cases are retained, with approximately 18 unique positive assertions per task being typical.
- Cross-Language Translation: Validated tests are programmatically rendered in each of the six target languages using the main translation engine.
Branch coverage analysis confirms that increasing the test count (to roughly 15–20 per task) substantially reduces accidental passes and enhances discriminative power, leading to more accurate performance characterization.
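A minimal sketch of the first two phases is shown below; the LLM call is abstracted behind a placeholder (generate_candidate_tests is hypothetical), validation simply executes each candidate assertion against the reference implementation, and phase three is handled by the translation engine described above and omitted here.

```python
import subprocess
import sys
import tempfile
import textwrap
from pathlib import Path

def generate_candidate_tests(reference_code: str, example_tests: str) -> list[str]:
    """Placeholder for the LLM call (e.g., GPT-4o prompted with the reference
    implementation and example tests); returns candidate assertions."""
    return [
        "assert Rectangle(2, 5).area() == 10",
        "assert Rectangle(0, 7).area() == 0",
        "assert Rectangle(1, 1).area() == 2",  # wrong on purpose; should be filtered
    ]

def validate_tests(reference_code: str, candidates: list[str]) -> list[str]:
    """Keep only candidates that pass when run against the ground-truth code."""
    kept = []
    for test in candidates:
        with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
            f.write(reference_code + "\n" + test + "\n")
            path = Path(f.name)
        result = subprocess.run([sys.executable, str(path)], capture_output=True)
        path.unlink()
        if result.returncode == 0:
            kept.append(test)
    return kept

if __name__ == "__main__":
    reference = textwrap.dedent("""
        class Rectangle:
            def __init__(self, w, h):
                self._w, self._h = w, h

            def area(self):
                return self._w * self._h
    """)
    candidates = generate_candidate_tests(reference, example_tests="")
    print(validate_tests(reference, candidates))  # the incorrect assertion is dropped
```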
5. Experimental Evaluation and Key Findings
Evaluation of 14 contemporary LLMs under zero-shot prompting on MultiOOP revealed three salient patterns:
- Performance Degradation on OOP Tasks: Pass@1 scores for MultiOOP are up to 65.6 percentage points lower than on function-level benchmarks like HumanEval, indicating that object-oriented code generation is significantly more challenging.
- Cross-Language Variability: A marked disparity in model generalization exists across languages. For example, GPT-4o mini achieved 48.06% pass@1 in Python but only 0.12%–15.26% in PHP, C++, C#, Java, and JavaScript under identical conditions. This demonstrates limited robustness or transfer for OOP concepts beyond dominant training languages.
- Conceptual Fidelity Gap: The pass@o metric scores are consistently 1.1–19.2 points lower than pass@k, evidencing that LLMs, while often producing runnable code, do not reliably instantiate the intended OOP abstractions as specified in the natural language requirements.
The reported pass@1 figures for GPT-4o mini are summarized below:
| Language | pass@1 (GPT-4o mini) |
|---|---|
| Python | 48.06% |
| PHP, C++, C#, Java, JavaScript (range) | 0.12%–15.26% |
This suggests that further advances in multilingual and OOP-aware code synthesis are needed to close the observed gaps.
6. Dataset Availability and Community Impact
The MultiOOP benchmark, including all curated tasks, pass@o metric implementation, translation infrastructure, and evaluation scripts, is publicly released to foster reproducible, community-driven advancement in LLM-based code synthesis. The benchmark and related tools are accessible at:
- GitHub: https://github.com/alphadl/OOP-eval
- HuggingFace Datasets: https://huggingface.co/datasets/codeai-dteam/MultiOOP
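As a usage sketch, the dataset can presumably be loaded with the Hugging Face datasets library; the split and column names are not specified here and should be checked against the dataset card linked above.

```python
from datasets import load_dataset

# Sketch: load MultiOOP from the Hugging Face Hub. Split and column names are
# assumptions; consult the dataset card for the actual schema.
ds = load_dataset("codeai-dteam/MultiOOP")
print(ds)                         # available splits and columns

first_split = next(iter(ds))      # e.g., "train" or "test" (not verified)
print(ds[first_split][0])         # one task record
```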
By combining cross-language parity, OOP structural scrutiny, and comprehensive test coverage, MultiOOP provides an essential platform for diagnosing and improving the OOP abstraction ability and multilingual versatility of next-generation code LLMs.
7. Significance and Research Outlook
MultiOOP directly addresses the deficits of previous evaluation regimes characterized by single-language focus, granular code completion (function-level or below), and meager test coverage. It establishes a robust, scalable standard for measuring LLMs’ comprehension and instantiation of critical OOP paradigms in real-world programming environments. The observed gaps, namely the substantial multilingual performance drop and the persistent OOP concept mismatches, highlight target areas for future model development and benchmark refinement. A plausible implication is that progress in LLM architecture, multilingual training, and explicit OOP grounding will be measurable via improved, discriminative performance on MultiOOP, thereby guiding the field toward models exhibiting practical and conceptual mastery of diverse software engineering tasks (Wang et al., 30 Sep 2025).