IndustryCode: A Benchmark for Industry Code Generation

Published 3 Apr 2026 in cs.SE, cs.AI, and cs.CL | (2604.02729v1)

Abstract: Code generation and comprehension by LLMs have emerged as core drivers of industrial intelligence and decision optimization, finding widespread application in fields such as finance, automation, and aerospace. Although recent advancements have demonstrated the remarkable potential of LLMs in general code generation, existing benchmarks are mainly confined to single domains and languages. Consequently, they fail to effectively evaluate the generalization capabilities required for real-world industrial applications or to reflect the coding proficiency demanded by complex industrial scenarios. To bridge this gap, we introduce IndustryCode, the first comprehensive benchmark designed to span multiple industrial domains and programming languages. IndustryCode comprises 579 sub-problems derived from 125 primary industrial challenges, accompanied by rigorous problem descriptions and test cases. It covers a wide range of fields, including finance, automation, aerospace, and remote sensing, and incorporates diverse programming languages such as MATLAB, Python, C++, and Stata. In our evaluation, the top-performing model, Claude 4.5 Opus, achieved an overall accuracy of 68.1% on sub-problems and 42.5% on main problems. The benchmark dataset and automated evaluation code will be made publicly available upon acceptance.


Authors (10)

Summary

The paper presents IndustryCode, a benchmark that decomposes 125 industrial problems into 579 sub-problems for robust code generation evaluation.
It details a rigorous annotation pipeline combining manual revisions, hierarchical decomposition, and both automated and LLM-based validation.
Performance analysis reveals a significant execution gap between isolated sub-problem accuracy and integrated industrial task synthesis.

IndustryCode: A Comprehensive Benchmark for Industrial-Scale Code Generation

Motivation and Benchmark Design Principles

IndustryCode introduces a multi-domain, multi-language benchmark explicitly designed to assess code generation capabilities in authentic industrial settings. The benchmark is motivated by the observed limitations of existing code generation benchmarks, which predominantly focus on single domains, mainstream programming languages (e.g., Python), and narrowly scoped software tasks. These benchmarks fail to capture the multifaceted requirements of industrial contexts—including domain specificity, cross-lingual generalization, and high-complexity engineering tasks that feature tightly coupled sub-components and stringent numerical precision.

IndustryCode addresses this gap by aggregating 125 main industrial problems, each hierarchically decomposed into 579 granular sub-problems. The dataset spans 20 subfields across four languages: Python, C++, MATLAB, and Stata. Main problems correspond to holistic industrial engineering tasks that require end-to-end synthesis, while sub-problems capture modular, functionally cohesive subcomponents amenable to focused evaluation.

Figure 1: Hierarchical decomposition of an IndustryCode task. A complex Main Problem is factorized into modular Sub-problems, each with explicit specifications and dependencies.
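The hierarchy described above can be pictured as a small data model. The sketch below is only an illustration: the class names, field names, and the example task are hypothetical and are not taken from the benchmark's actual schema. It shows one way a main problem could hold sub-problems with explicit dependencies, and how those dependencies yield an order in which the sub-problems can be attempted.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class SubProblem:
    """One modular unit of a main problem (hypothetical schema)."""
    sub_id: str
    description: str
    depends_on: list[str] = field(default_factory=list)


@dataclass
class MainProblem:
    """An end-to-end industrial task and its sub-problems (hypothetical schema)."""
    problem_id: str
    domain: str
    language: str
    description: str
    sub_problems: list[SubProblem] = field(default_factory=list)

    def solve_order(self) -> list[str]:
        """Return sub-problem ids so that every dependency comes first."""
        graph = {sp.sub_id: set(sp.depends_on) for sp in self.sub_problems}
        return list(TopologicalSorter(graph).static_order())


# Invented example task, used only to exercise the data model.
example = MainProblem(
    problem_id="demo-001",
    domain="finance",
    language="Python",
    description="Price a portfolio and report risk.",
    sub_problems=[
        SubProblem("report", "Format the risk report.", ["risk"]),
        SubProblem("load", "Load position data."),
        SubProblem("risk", "Compute portfolio risk.", ["load"]),
    ],
)
print(example.solve_order())  # ['load', 'risk', 'report']
```

Because the example dependencies form a single chain, the order is unique; with branching dependencies any valid topological order would do.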

A distinguishing attribute is that the data are sourced exclusively from proprietary, high-fidelity industrial codebases, then manually decontaminated and increased in complexity to prevent artifact leakage and to retain authentic, production-grade idiosyncrasies. The curation includes numerical test cases for execution-based validation and a semantic LLM-Judge system for evaluating code quality beyond simple I/O matching.
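Execution-based validation against numerical test cases usually needs a tolerance rather than exact equality, since floating-point results can differ in the last digits. The following is a minimal sketch of such a check, not the benchmark's evaluation code: the function name and the tolerance values are assumptions chosen for illustration.

```python
import math


def outputs_match(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """Compare two sequences of floats element-wise within a tolerance.

    The tolerances are illustrative defaults, not values from the paper.
    """
    if len(expected) != len(actual):
        return False
    return all(
        math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol)
        for e, a in zip(expected, actual)
    )


print(outputs_match([1.0, 2.0], [1.0 + 1e-9, 2.0]))  # True
print(outputs_match([1.0], [1.1]))                   # False
```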

Dataset Construction and Annotation Pipeline

The construction pipeline starts with the collection of domain-representative industrial problems, emphasizing physical sciences, quantitative finance, engineering optimization, and scientific computing verticals. The annotation process proceeds through the following phases:

Manual Revision & Difficulty Enhancement: Each problem is manually re-stated to prevent code memorization, injected with complex mathematical, architectural, and algorithmic constructs, and subjected to rigorous adversarial filtering.
Hierarchical Decomposition: Main problems are systematically partitioned into sharply-defined sub-problems, matching real-world module boundaries.
Validation Protocol: Automated type-checking and execution are coupled with iterative human and LLM-based review to maintain both mathematical solvability and industrial realism.
Figure 2: Data Annotation flowchart. The annotation pipeline combines expert revision, LLM assistance, and automated verification for maximum data fidelity.

The resulting benchmark exhibits broad coverage: Python dominates in engineering, AI, and finance, while C++ is targeted toward manufacturing and IT, MATLAB covers optimization and scientific modeling, and Stata supports statistical computing and domain-specific analytics.

Evaluation Methodology and Experimental Analysis

To assess model competency, IndustryCode adopts a unified zero-shot prompting strategy augmented with cumulative context windows simulating longitudinal development workflows. Each prompt includes a global description, current sub-problem specifics, and all prior code—probing models' abilities in incremental reasoning, long-range state maintenance, and adherence to complex interface contracts.
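A cumulative prompt of this kind could be assembled as in the sketch below. The section labels, the ordering of the parts, and the function name `build_prompt` are assumptions made for illustration; the source states only that each prompt contains the global description, the current sub-problem, and all prior code.

```python
def build_prompt(global_description, sub_problem, prior_code):
    """Assemble one zero-shot prompt with cumulative context.

    prior_code is a list of code strings from earlier sub-problems,
    oldest first. The layout here is hypothetical.
    """
    parts = ["Overall task:", global_description.strip(), ""]
    if prior_code:
        parts.append("Code written so far:")
        parts.extend(block.strip() for block in prior_code)
        parts.append("")
    parts += ["Current sub-problem:", sub_problem.strip()]
    return "\n".join(parts)


prompt = build_prompt(
    "GLOBAL_DESC: simulate a control loop.",
    "SUB_TWO: implement the update step.",
    ["def init_state():\n    return {'x': 0.0}"],
)
print(prompt)
```

Each later sub-problem receives a longer prompt than the one before it, which is what makes the setup a test of long-range state maintenance.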

Pass@1 accuracy is measured both at the atomic sub-problem level and for holistic main problems. The LLM-Judge framework is deployed for cases where strict functional equivalence is not guaranteed by test cases alone, particularly addressing code style, control flow, and compliance with industrial design patterns.
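To make the two reporting levels concrete, here is a small sketch of Pass@1 scoring over made-up results. One assumption is stated loudly because the source does not specify it: the sketch counts a main problem as solved only when all of its sub-problems pass. The actual benchmark may score main problems differently, for example with their own test cases.

```python
def pass_at_1(outcomes):
    """Percentage of items whose single generated solution passed."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no outcomes to score")
    return 100.0 * sum(outcomes) / len(outcomes)


# Hypothetical results: main problem id -> pass/fail for each sub-problem.
results = {
    "main-A": [True, True, True],
    "main-B": [True, False, True, True],
}

# Sub-problem level: pool every sub-problem across all main problems.
sub_score = pass_at_1(ok for subs in results.values() for ok in subs)

# Main-problem level. ASSUMPTION: solved only if all sub-problems pass.
main_score = pass_at_1(all(subs) for subs in results.values())

print(round(sub_score, 1), main_score)  # 85.7 50.0
```

Under this assumption a single failing sub-problem fails the whole main problem, so the main-problem score can never exceed the share of fully solved tasks.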

Empirical evaluation comprises proprietary and open-weight models. Closed models include the Claude 4.5 suite, GPT-5 variants, and Google's Gemini, while open models encompass Qwen3, DeepSeek, and code-specialized variants. Both standard and reasoning-enhanced ("thinking mode") variants are extensively benchmarked.

Numerical Results and Failure Mode Stratification

On sub-problems, the highest overall Pass@1 accuracy is 68.1% (Claude 4.5 Opus), with subsequent tiered performance among Gemini, GPT-5, and leading open models. Main problems, representing integrated system tasks, see a clear performance drop, with Claude 4.5 Opus at 42.5%. The execution gap—defined as the accuracy differential between sub- and main-problems—remains significant for all models, suggesting persistent limitations in multi-step orchestration, error recovery, and long-horizon context retention.
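For the reported Claude 4.5 Opus figures, the execution gap works out as follows. The absolute gap follows the definition given above; the relative drop is an extra derived quantity, not one reported in the source, and is included only to show the size of the gap relative to sub-problem accuracy.

```python
def execution_gap(sub_acc: float, main_acc: float) -> float:
    """Accuracy differential in percentage points (sub minus main)."""
    return round(sub_acc - main_acc, 1)


def relative_drop(sub_acc: float, main_acc: float) -> float:
    """Gap as a percentage of sub-problem accuracy (derived, not from the source)."""
    return round(100.0 * (sub_acc - main_acc) / sub_acc, 1)


# Reported Pass@1 for Claude 4.5 Opus: 68.1% (sub) and 42.5% (main).
gap = execution_gap(68.1, 42.5)
drop = relative_drop(68.1, 42.5)
print(gap, drop)  # 25.6 37.6
```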

Figure 3: Performance comparison on main problems and sub-problems. Strong performance in sub-tasks correlates with improvements at the system level, but an execution gap persists.

Figure 4: Distribution of failure cases in IndustryCode. Syntax errors and prompt misinterpretation dominate, while pure logical errors are a minority.

Detailed error analysis reveals that syntax errors (32.8%) and misunderstanding of the problem statement (30.2%) together account for 63.0% of failures, particularly in domain-specific and less-documented languages (e.g., MATLAB, Stata). Hallucinations, often manifesting as fabricated APIs, account for a further 19.6%. Notably, logical reasoning errors are rare, indicating LLMs' default propensity for pattern-matching and code synthesis over deep semantic abstraction.
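The arithmetic behind these shares can be checked with a few lines. The sketch assumes the percentages are shares of all failure cases and sum to 100%; the category labels are paraphrased, and the remainder covers every category the text does not itemize, including logical errors.

```python
# Failure shares as reported in the text (percent of all failure cases).
reported = {
    "syntax error": 32.8,
    "misunderstood problem statement": 30.2,
    "hallucination": 19.6,
}

top_two = round(
    reported["syntax error"] + reported["misunderstood problem statement"], 1
)
itemized = round(sum(reported.values()), 1)
# ASSUMPTION: shares sum to 100%, so the rest is everything not itemized.
remaining = round(100.0 - itemized, 1)

print(top_two, itemized, remaining)  # 63.0 82.6 17.4
```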

Figure 5: Performance distribution of sub-problems in IndustryCode. Domain-level stratification highlights clear disparities, with finance and IT outperforming hardware-constrained sectors.

Figure 6: Performance distribution of main problems in IndustryCode. Pass@1 distributions by domain highlight critical bottlenecks in complex integration tasks.

Theoretical and Practical Implications

These findings strongly indicate bottlenecks rooted in architecture and pretraining data distribution:

Data Scarcity Bottleneck: Models exhibit a wide gap between high-resource (Python, C++) and low-resource (MATLAB, Stata) languages, attributable to the availability of data during pre-training. Even SOTA models with vast parameter counts cannot compensate for the lack of proprietary training data with specialized syntax and semantics.
Domain Specialization: Domains with well-established algorithmic paradigms (IT, finance) permit efficient pattern matching, whereas engineering disciplines requiring concurrent logic (e.g., HDL, FEA, optimization control) expose the mismatch between sequential generation and inherently parallel semantics.
Reasoning Mode Dynamics: Activation of explicit reasoning (chain-of-thought) reduces logical errors but exacerbates context confusion and misinterpretation, as the models overfit to generalized engineering heuristics at the expense of local domain fidelity.
Statefulness and KV Cache Decay: All models exhibit state tracking failures over extended completions, manifesting as variable redefinitions and partial logic truncation, especially in long-horizon synthesis tasks.

Figure 7: Impact of Thinking Mode on the distribution of failure modes. Reasoning mitigates logical errors but sharply increases context confusion.

Model Comparison and Architecture-Level Insights

The empirical landscape underscores architectural trade-offs:

Mixture-of-Experts Advantage: Open models (Qwen3-Max) utilizing Mixture-of-Experts architectures excel in domains with standardized code, optimizing parameter utility and cross-domain adaptation.
Long-context Fidelity: Claude's lead is attributed not solely to model scale but also to superior segregation of reasoning and syntax layers and to maintenance of extended KV cache integrity, which minimizes cognitive over-correction and symbol table decay.
Prompt Engineering Limits: Prompt modifications and explicit reasoning heuristics partially compensate for structural limitations but amplify risk of context leakage, hallucination, and overengineering.

Cross-Industry Generalization and Application Gap

Success rates for high-level integration tasks remain low: even when models achieve >60% accuracy on modular code generation, main-problem accuracy peaks at 42.5%. This quantifies the execution gap and validates the need for specialized agentic architectures capable of robust multi-step planning, error correction, and symbol tracking.

Figures 19 and 20 further illustrate sector-specific adaptation and consistent accuracy gaps between atomic and holistic tasks, supporting a stratified approach to future research and deployment strategies for industrial AI.

Conclusion

IndustryCode represents a crucial advance in industrial code generation benchmarking, integrating hierarchical decomposition, cross-domain coverage, multi-lingual support, and real-world data curation. The observed performance plateau and failure mode taxonomy expose fundamental architectural, data, and reasoning limitations in current SOTA LLMs. While progress is manifest—especially in open MoE models and closed long-context architectures—genuine industrial AI deployment will require advances in data curation pipelines, domain-adaptive pretraining, and architectures engineered explicitly for stateful, multi-agent, and multi-language workflows.

Ongoing research must address the execution gap between atomic code proficiency and holistic system synthesis, with future benchmarks extending assessment to even more diverse, partially proprietary industrial environments, and closed-loop agentic workflows. IndustryCode provides the foundational scaffolding and analytic rigor necessary for the coming generation of industrial AI systems and serves as a template for verticalized, high-fidelity evaluation of code generation models.

Reference: "IndustryCode: A Benchmark for Industry Code Generation" (2604.02729)
