FEA-Bench: LLMs in Simulation & Code Generation
- FEA-Bench is a dual benchmark that evaluates LLMs on multiphysics simulation via finite element analysis and repository-level code generation for automated feature implementation.
- It employs rigorous metrics such as executability, model tree score, and numerical error thresholds to quantify simulation fidelity and code patch effectiveness.
- The framework integrates multi-agent LLM architectures for iterative refinement, targeting challenges such as API misuse, geometry errors, and incomplete multi-file edits.
FEA-Bench denotes two distinct, high-impact benchmarks targeting the evaluation of LLMs in complex reasoning and code-generation domains: (1) a multiphysics reasoning benchmark utilizing finite element analysis (FEA) through industrial-grade simulation software, and (2) a repository-level code generation benchmark for automated feature implementation in real-world software projects. Both frameworks are designed to expose fundamental challenges in LLM capabilities well beyond simple code completion, requiring multi-step reasoning, precise adherence to specification, and integration across disparate artifacts or APIs.
1. Multiphysics Reasoning with FEA: Motivation, Scope, and Problem Classes
The first instantiation, described in "FEABench: Evaluating LLMs on Multiphysics Reasoning Ability" (Mudur et al., 8 Apr 2025), assesses LLMs and LLM-driven agents on their capacity to solve engineering, mathematical, and scientific problems by leveraging FEA. FEA is foundational to modern engineering analysis, enabling simulation of physical systems governed by partial differential equations (PDEs) involving, inter alia, thermal, structural, and fluid physics. Effective simulation demands proficiency in:
- Parsing and formalizing complex, natural-language specifications into rigorous mathematical or computational objects.
- Managing geometry, mesh generation, boundary/initial conditions, and solver tuning within high-fidelity FEA platforms—specifically, COMSOL Multiphysics®.
- Coupling multiple physics domains and accurately running numerical solvers to generate quantitative results.
The benchmark encompasses a spectrum of canonical FEA problem classes:
| Problem Domain | Example Problem Types |
|---|---|
| Heat Conduction | Steady/transient heat transfer in solids and shells |
| Structural Mechanics | Plane stress/strain, beam bending, membrane analysis |
| Fluid Dynamics | Laminar flow, creeping flow, porous media |
| Multipurpose PDEs | Black–Scholes, eigenfrequency in quantum dots/beams |
Each instance is articulated as both an expressive natural-language specification (model specs) and as a procedural "Plan" describing stepwise construction of the target model in the COMSOL GUI.
2. Integration with COMSOL Multiphysics®: API Translation and Workflow
FEABench tasks require models to translate input descriptions into executable COMSOL API calls, primarily in Java (or in Python via the MPh interface). The process includes:
- Model definition (`model.component().create(...)`), geometry instantiation (`geom().create(...)`), and meshing.
- Physics interface and boundary/initial condition setup (e.g., `physics("ht").create("temp1", ...)` for heat transfer).
- Study configuration and solver invocation (`study().create(...); study().run()`).
- Postprocessing: automated result extraction (e.g., via `.result().numerical().create(...).set("expr", ...)`).
Key mathematical underpinnings include the strong and weak forms of steady conduction, their general discretizations, and strict L²-norm error metrics. This structure enforces a mapping from abstract physics to concrete, numerically robust simulation constructs.
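For orientation, the strong form, weak form, discretization, and error norm named above can be written out in standard textbook notation. This is a sketch under stated assumptions rather than the benchmark's own formulation: the symbols $k$ (conductivity), $Q$ (heat source), $T_0$ (prescribed temperature on $\Gamma_D$), $g$ (prescribed flux on $\Gamma_N$), and the basis functions $\phi_j$ are introduced here for illustration, and Dirichlet data are assumed to be handled by the usual lifting.

```latex
% Strong form of steady heat conduction on a domain \Omega
\[
-\nabla \cdot \left( k \nabla T \right) = Q \ \text{in } \Omega, \qquad
T = T_0 \ \text{on } \Gamma_D, \qquad
k \nabla T \cdot \mathbf{n} = g \ \text{on } \Gamma_N
\]

% Weak form: holds for every test function v with v = 0 on \Gamma_D
\[
\int_\Omega k \, \nabla T \cdot \nabla v \, d\Omega
  = \int_\Omega Q \, v \, d\Omega + \int_{\Gamma_N} g \, v \, d\Gamma
\]

% Galerkin discretization with T_h = \sum_j T_j \phi_j
\[
K \mathbf{T} = \mathbf{f}, \qquad
K_{ij} = \int_\Omega k \, \nabla \phi_j \cdot \nabla \phi_i \, d\Omega, \qquad
f_i = \int_\Omega Q \, \phi_i \, d\Omega + \int_{\Gamma_N} g \, \phi_i \, d\Gamma
\]

% Relative L^2 error of the discrete solution against a reference solution
\[
e_{L^2} = \frac{\lVert T_h - T_{\mathrm{ref}} \rVert_{L^2(\Omega)}}
               {\lVert T_{\mathrm{ref}} \rVert_{L^2(\Omega)}}
\]
```

The weak form follows from multiplying the strong form by $v$ and integrating by parts; the boundary term survives only on $\Gamma_N$ because $v$ vanishes on $\Gamma_D$.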
3. Evaluation Protocols and Quantitative Metrics
FEABench employs a multi-layered evaluation methodology to capture execution, structural, and domain precision:
- Executability: Fraction of generated API calls that can be syntactically and semantically executed without error.
- Model Tree Score: Structural similarity between generated and ground-truth COMSOL model trees (normalized, $1.0$ for exact match).
- Physics Metrics: Interface Factuality (valid interface assignment), Interface Recall, Feature & Property Recall, Feature Dimension Consistency.
- Valid Target and Numerical Accuracy: Automated verification that the exported quantities are the intended ones, with solutions whose relative error falls within the benchmark's threshold classified as "Strict" successes.
Performance of single-turn LLMs is currently limited: on baseline problems, leading models such as Claude-3.5-Sonnet, GPT-4o, and Gemini-1.5-Pro achieve Executability scores of $0.79$, $0.78$, and $0.60$, respectively, but none solves any Gold task under strict numeric criteria in one shot. Multi-agent systems that use iterative refinement (ControllerAgent, Evaluator, CorrectorSubAgent, ToolLookupAgent) raise Executability to $0.88$ but attain strict numeric correctness on only a minority of problems (Mudur et al., 8 Apr 2025).
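A minimal sketch of how three of these metrics might be computed, under stated assumptions. The function names are hypothetical, the tolerance `tol` is a caller-supplied parameter rather than the benchmark's actual threshold, and `difflib.SequenceMatcher` is used only as a stand-in for the benchmark's own model-tree similarity measure, which is not reproduced here.

```python
from difflib import SequenceMatcher


def executability(line_ok):
    """Fraction of generated API lines that executed without error."""
    return sum(line_ok) / len(line_ok) if line_ok else 0.0


def model_tree_score(generated_tree, reference_tree):
    """Similarity in [0, 1] between two model trees given as text.

    Assumption: a line-level SequenceMatcher ratio stands in for the
    benchmark's own tree-similarity measure. Identical trees score 1.0.
    """
    gen = generated_tree.splitlines()
    ref = reference_tree.splitlines()
    return SequenceMatcher(None, gen, ref).ratio()


def relative_error(predicted, target):
    """Relative error of a computed value against a nonzero target."""
    if target == 0:
        raise ValueError("relative error is undefined for a zero target")
    return abs(predicted - target) / abs(target)


def is_strict_success(predicted, target, tol):
    """True when the relative error is below the (assumed) tolerance."""
    return relative_error(predicted, target) < tol


if __name__ == "__main__":
    print(executability([True, True, False, True]))
    print(model_tree_score("geom\nphysics\nstudy", "geom\nphysics\nresults"))
    print(is_strict_success(105.0, 100.0, tol=0.1))
```

Keeping the tolerance as an explicit argument makes it easy to report success rates at several strictness levels from the same set of runs.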
4. LLM Agent Architecture and Iterative Refinement
The benchmark introduces a hybrid, multi-agent LLM architecture to exploit tool-use and feedback:
- ControllerAgent orchestrates solution search/selection and stopping criteria.
- Evaluator executes generated code, compiles line-level success, and triggers external LLM verification.
- CorrectorSubAgent ingests feedback and proposes refined code blocks.
- ToolLookupAgent retrieves interface/feature listings, inspects model tree properties, and conducts code snippet retrieval (from a database of 768 annotated examples).
Agents iterate through population-based solution sampling and correction, halting upon convergence or budget exhaustion. Tools support dynamic troubleshooting, including resolution of error classes such as geometry dimension mismatches, hallucinated feature/interface names, and study/solver misconfiguration.
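The control flow described above can be sketched as a toy loop. Everything here is an illustrative assumption: the "code lines" are placeholder strings rather than real COMSOL calls, the evaluator checks membership in a set instead of executing anything, and the lookup table stands in for the tool-assisted retrieval that the real agents perform with an LLM.

```python
from dataclasses import dataclass


@dataclass
class Evaluation:
    line_ok: list   # per-line execution success flags
    score: float    # fraction of lines that "executed"


def evaluate(lines, valid_calls):
    """Toy Evaluator: a line 'executes' if it is a known-valid call."""
    line_ok = [line in valid_calls for line in lines]
    score = sum(line_ok) / len(line_ok) if line_ok else 0.0
    return Evaluation(line_ok, score)


def tool_lookup(line, corrections):
    """Toy ToolLookupAgent: map a hallucinated name to a valid one if known."""
    return corrections.get(line, line)


def correct(lines, evaluation, corrections):
    """Toy CorrectorSubAgent: rewrite only the lines that failed."""
    return [line if ok else tool_lookup(line, corrections)
            for line, ok in zip(lines, evaluation.line_ok)]


def controller(population, valid_calls, corrections, max_rounds=3):
    """Toy ControllerAgent: refine all candidates, track the best, stop early.

    Returns (best_candidate, best_score, rounds_used).
    """
    if not population or max_rounds < 1:
        raise ValueError("need at least one candidate and one round")
    best, best_score, rounds = None, -1.0, 0
    for _ in range(max_rounds):
        rounds += 1
        evals = [evaluate(c, valid_calls) for c in population]
        for cand, ev in zip(population, evals):
            if ev.score > best_score:
                best, best_score = cand, ev.score
        if best_score == 1.0:   # converged: every line executes
            break
        population = [correct(c, ev, corrections)
                      for c, ev in zip(population, evals)]
    return best, best_score, rounds


if __name__ == "__main__":
    valid = {"create_geometry", "add_heat_transfer", "run_study"}
    fixes = {"add_heat_flow": "add_heat_transfer"}
    pop = [["create_geometry", "add_heat_flow", "run_study"],
           ["create_geometry", "solve_now"]]
    print(controller(pop, valid, fixes))
```

In this toy run the first candidate is repaired in the second round, while the second candidate never improves because no correction is known for its failing line; the loop then stops on convergence rather than on budget exhaustion.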
5. Repository-Level Feature Implementation: Benchmark Description
The second benchmark, presented in "FEA-Bench: A Benchmark for Evaluating Repository-Level Code Generation for Feature Implementation" (Li et al., 9 Mar 2025), focuses on the ability of LLMs to perform incremental software development in real-world repositories—specifically feature addition rather than bug-fixing.
Distinguishing features include:
- Tasks are drawn from 1,401 merged pull requests across 83 Python repositories (filtered by license, PR/test viability, and semantic intent for feature addition).
- Each task provides the PR description, signatures/docstrings of new functions/classes, and relevant context (commits, setup scripts).
- LLMs receive reference data (excluding solutions/tests) and must edit the repository so all provided unit tests pass post-edit.
Metrics and protocol:
- Resolved Ratio (Task Success Rate): Proportion of tasks where all tests pass after LLM patch/edits.
- Patch-Apply Rate: Fraction of generated patches successfully applied (syntactically/structurally correct diffs).
- Retrieval Precision/Recall: For BM25-based context selection (matching ground-truth edited files).
- Metrics stratified by prompt “hints” quality (brief/detailed), context scope, and output diff format.
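As a rough illustration of these definitions, the metrics can be computed from simple per-task records. The function names and input shapes below are assumptions made for the sketch, not the benchmark's actual harness; a task whose patch fails to apply is represented here by an empty list of test outcomes and therefore counts as unresolved.

```python
def resolved_ratio(task_test_results):
    """Share of tasks whose unit tests all pass after the model's edits.

    task_test_results: one list of booleans per task (True = test passed).
    An empty list (no tests ran, e.g. the patch did not apply) is unresolved.
    """
    if not task_test_results:
        return 0.0
    resolved = sum(1 for tests in task_test_results if tests and all(tests))
    return resolved / len(task_test_results)


def patch_apply_rate(applied_flags):
    """Share of generated patches that applied cleanly to the repository."""
    return sum(applied_flags) / len(applied_flags) if applied_flags else 0.0


def retrieval_precision_recall(retrieved_files, gold_files):
    """Precision and recall of retrieved files against ground-truth edited files."""
    retrieved, gold = set(retrieved_files), set(gold_files)
    hits = len(retrieved & gold)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


if __name__ == "__main__":
    print(resolved_ratio([[True, True], [True, False], [], [True]]))
    print(patch_apply_rate([True, False, True, True]))
    print(retrieval_precision_recall(["a.py", "b.py", "c.py", "d.py"],
                                     ["a.py", "e.py"]))
```

Retrieving more files without finding more of the ground-truth files leaves the hit count unchanged while the denominator of precision grows, so precision falls as the retrieved context widens.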
Current performance levels are low: even strong models such as DeepSeek-R1 and GPT-4o resolve only a small fraction of tasks, struggling to coordinate multi-file edits, maintain repository consistency, and synthesize nontrivial new logic. Scaling context size reduces retrieval precision, and stricter patch formatting further degrades performance. The success rate declines sharply with increasing functional complexity (the number of functions/classes added per PR).
6. Results, Limitations, and Insights
Together, the two benchmarks expose several persistent challenges across both domains:
- LLMs exhibit brittle handling of API surface area and domain-specific nomenclature; hallucinations and misparsing are common.
- Minor errors in geometry specification or solver tuning yield completely invalid outputs in engineering simulation contexts.
- In repository-level code generation, failure modes include missing imports, incorrect API usage, partial/non-atomic edits, and malformed patches.
- Automated, multi-step refinement methods (LLM agents or iterative patching) offer moderate gains in elementary metrics but do not solve strict correctness bottlenecks.
- For multiphysics FEA, explicit procedural "Plan" cues only marginally improve success rates over pure model-spec parsing.
7. Future Directions and Benchmark Evolution
Envisioned enhancements for FEABench include:
- Expanded problem sets (advanced CAD, time-dependent and multiphysics coupling, broader FEA and CAD platforms such as ANSYS and Abaqus).
- Richer benchmarking of cross-platform API generalization and integration.
- Novel agent-level interventions: improved prompt engineering, retrieval-augmented generation, enhanced semantic/contextual retrieval.
- In software engineering, leveraging iterative or multi-turn prompting, static analysis/type inference, more precise file retrieval, and support for additional programming languages and real-world repo artifacts.
- Continuous updates and public pipelines facilitating community-driven expansion.
The unifying thrust of FEA-Bench is to bring LLM evaluation closer to the demands of real scientific and engineering workflows, where end-to-end automation, robust domain grounding, and numerical precision are critical. This sets benchmarks that are not only execution-based but also reflective of production context and practical value—an essential step for advancing LLMs into deeper integration with scientific computing and automated software engineering (Mudur et al., 8 Apr 2025, Li et al., 9 Mar 2025).