CMPhysBench: LLM Benchmark for Condensed Matter
- CMPhysBench is a benchmark designed to evaluate LLMs on graduate-level condensed matter physics problems through complex, calculation-based questions.
- It covers six key domains (Magnetism, Superconductivity, Strongly Correlated Systems, Semiconductors, Theoretical Foundations, and Others) with diverse answer types such as equation, numeric, and expression.
- It introduces the Scalable Expression Edit Distance (SEED) metric, which enables fine-grained, partial-credit evaluation by comparing the expression structure of predicted and reference answers.
CMPhysBench is a benchmark designed to rigorously evaluate the problem-solving proficiency and domain-specific mathematical reasoning of LLMs in condensed matter physics. Distinguished from earlier physics benchmarks that largely emphasize introductory content and multiple-choice formats, CMPhysBench is composed of more than 520 graduate-level, calculation-focused questions spanning representative subfields and theoretical frameworks within condensed matter physics. It is coupled with a novel partial credit scheme—Scalable Expression Edit Distance (SEED)—that yields a fine-grained assessment of generated solutions. The dataset and code are publicly available at https://github.com/CMPhysBench/CMPhysBench, facilitating transparent evaluation and reproducible research in the LLM community (Wang et al., 25 Aug 2025).
1. Motivation and Benchmark Construction
CMPhysBench was developed to address specific limitations of physics-focused LLM benchmarks, which have previously centered on high school or undergraduate-level questions and frequently relied on answer formats that do not capture step-by-step mathematical reasoning. The benchmark's construction process involved extracting and adapting calculation problems from authoritative graduate-level condensed matter physics textbooks, subsequently curated and validated by Ph.D. students and postdoctoral researchers for scientific rigor and clarity.
All questions are open-ended and require the LLM to produce comprehensive solutions, including intermediate mathematical steps and physical interpretations. This exclusive focus on calculation problems ensures that a model’s ability to manipulate advanced formalism and apply nontrivial physical principles is directly measured, as opposed to mere recognition or selection tasks.
2. Taxonomy of Content and Answer Types
The CMPhysBench corpus is organized into six top-level domains: Magnetism, Superconductivity, Strongly Correlated Systems, Semiconductors, Theoretical Foundations, and Others. The first four domains encompass both classic and contemporary themes in condensed matter physics, while Theoretical Foundations covers essential concepts from quantum field theory, spontaneous symmetry breaking, and statistical mechanics. The Others category allows inclusion of peripheral yet relevant topics such as quantum mechanics and electrodynamics.
Beyond domain categorization, every problem is annotated according to one of five distinct answer types: tuple, equation, numeric, expression, and interval. This rigorous annotation reflects the diversity of problem formats encountered in graduate education and research, such as deriving critical exponents, computing band gaps, calculating magnetic susceptibilities, or presenting intervals for phase transitions.
| Domain | Example Subtopics | Answer Types |
|---|---|---|
| Magnetism | Spin chains, ferromagnetism | Equation, Numeric |
| Superconductivity | BCS theory, London equations | Tuple, Equation |
| Strongly Correlated Systems | Hubbard model, Mott transitions | Expression, Interval |
| Semiconductors | Band theory, effective mass | Numeric, Tuple |
| Theoretical Foundations | QFT, symmetry breaking, statistics | Equation, Expression |
| Others | Quantum mechanics, electrodynamics, QFT (outside the condensed matter core) | All types |
This organization ensures that both breadth and depth of content are tested, while diverse answer formats reflect the authentic demands of condensed matter practice.
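To make the annotation scheme concrete, the sketch below shows what a single problem record might look like once domain and answer-type labels are attached. The field names and the example problem are illustrative assumptions for exposition, not the released dataset schema.

```python
# Hypothetical CMPhysBench-style record; field names are illustrative only,
# not the schema of the released dataset.
example_problem = {
    "domain": "Superconductivity",
    "answer_type": "expression",  # one of: tuple, equation, numeric, expression, interval
    "question": (
        "Starting from the London equations, derive the penetration depth "
        "\\lambda_L of a superconductor with carrier density n_s."
    ),
    "ground_truth": r"\lambda_L = \sqrt{\frac{m}{\mu_0 n_s e^2}}",  # reference answer in LaTeX
}
```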
3. The SEED Evaluation Metric
Traditional binary scoring metrics are insufficient for evaluating solutions to open-ended calculation problems that may differ in notation, units, or structural formulation. CMPhysBench introduces the Scalable Expression Edit Distance (SEED) score, which leverages robust LaTeX parsing, physics-aware normalization, and abstract syntax tree (AST) representations of both predicted and ground-truth answers.
The SEED metric computes a tree-edit distance between the prediction and reference, allowing fine-grained (non-binary) partial credit for solutions that demonstrate partial correctness or minor deviations (such as unit mismatches or differences arising from physically equivalent transformations). Unit conversions and scientific-notation normalization are explicitly supported, and the metric is robust against noisy LaTeX rendering.
For instance, two expressions for the magnetic susceptibility that differ by a dimensional prefactor or a missing additive constant receive graded SEED scores reflecting their structural similarity and physical proximity. Earlier metrics such as Expression Edit Distance (EED) struggled with expression diversity and noisy LaTeX; by comparison, SEED more reliably captures the degree of structural and procedural correctness in advanced calculation-based answers.
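The reference SEED implementation ships with the released codebase; the following is only a minimal sketch of the underlying idea, assuming SymPy's LaTeX parser (which requires the antlr4 runtime) and the zss package for Zhang-Shasha tree edit distance, and omitting the physics-aware unit and notation normalization described above.

```python
# Minimal sketch of a SEED-style partial-credit score; NOT the authors' implementation.
# Assumes: sympy's LaTeX parser (antlr4-python3-runtime) and the zss package
# (Zhang-Shasha tree edit distance). Unit/notation normalization is omitted here.
from sympy import simplify
from sympy.parsing.latex import parse_latex
from zss import Node, simple_distance


def expr_to_tree(expr) -> Node:
    """Turn a SymPy expression into a zss tree labeled by operator/atom names."""
    label = type(expr).__name__ if expr.args else str(expr)
    node = Node(label)
    for arg in expr.args:
        node.addkid(expr_to_tree(arg))
    return node


def tree_size(node: Node) -> int:
    """Count nodes in a zss tree, used to normalize the edit distance."""
    return 1 + sum(tree_size(c) for c in Node.get_children(node))


def seed_score(pred_latex: str, gold_latex: str) -> float:
    """Partial-credit score in [0, 100] from a normalized tree edit distance."""
    try:
        pred = simplify(parse_latex(pred_latex))
        gold = simplify(parse_latex(gold_latex))
    except Exception:
        return 0.0                       # unparsable prediction earns no credit
    if pred == gold:                     # structurally identical after simplification
        return 100.0
    t_pred, t_gold = expr_to_tree(pred), expr_to_tree(gold)
    dist = simple_distance(t_pred, t_gold)
    norm = max(tree_size(t_pred), tree_size(t_gold))
    return max(0.0, 100.0 * (1.0 - dist / norm))
```

Under this sketch, a prediction that drops an additive constant or misstates a prefactor is penalized in proportion to how much of its expression tree must be edited, rather than being scored zero outright.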
4. Model Performance and Capability Gap
Empirical evaluation on CMPhysBench reveals a substantial capability gap for all existing public and proprietary LLMs. State-of-the-art models, exemplified by Grok-4, achieve an average SEED score of 36 and only 28% expert-labeled accuracy across the benchmark. Other leading models, including o3 and Gemini 2.5 Pro, exhibit similar limitations.
The low scores persist even after partial crediting, indicating that contemporary LLMs remain far from reliably solving nontrivial condensed matter physics problems that require multi-step symbolic reasoning, advanced mathematical manipulation, and domain-specific derivations. These results highlight a pronounced gap between general mathematical fluency and the nuanced expertise required for condensed matter theory.
A plausible implication is that models trained with broader scientific corpora may still struggle without condensed matter–specific pretraining or physics-aware architectural adaptations. The gap, observable in both aggregate SEED and accuracy measures, quantifies the distance remaining before LLMs can reliably assist in research-grade condensed matter computation.
5. Dataset Release, Evaluation Pipeline, and Extensibility
Both the CMPhysBench dataset and associated codebase—including question metadata, ground-truth solutions, and the SEED metric implementation—are openly available for academic and industrial research. This facilitates transparent benchmarking, reproducibility, and future extension.
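The released code defines its own entry points; purely for orientation, a hypothetical driver loop is sketched below, assuming the questions are stored as a JSON list with `question` and `ground_truth` fields (assumed names) and reusing the `seed_score` sketch from Section 3.

```python
# Hypothetical evaluation loop; the file name, field names, module name, and
# generate() stub are placeholders, not the released CMPhysBench API.
import json
from statistics import mean

from seed_sketch import seed_score  # the sketch from Section 3, assumed saved as seed_sketch.py


def generate(question: str) -> str:
    """Stand-in for an LLM call that returns a LaTeX answer string."""
    raise NotImplementedError


def evaluate(path: str = "cmphysbench.json") -> float:
    """Average SEED score of the model's answers over a JSON list of problems."""
    with open(path, encoding="utf-8") as f:
        problems = json.load(f)
    scores = [seed_score(generate(p["question"]), p["ground_truth"]) for p in problems]
    return mean(scores)
```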
Potential future directions, as noted by the authors, include:
- Integration of physics-aware verification into decoding (e.g., enforcing conservation laws or dimensional-consistency checks within generated solutions; see the sketch after this list).
- Coupling LLMs to symbolic mathematics engines to enhance their algebraic manipulation capabilities.
- Developing curricular benchmarks focused on canonical derivations and the synthesis of abstract theory with computational practice.
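As one concrete shape the first direction could take, a dimensional-consistency check on numeric answers is sketched below using the pint library; this illustrates the idea only and is not part of the benchmark.

```python
# Illustrative dimensional-consistency check using pint; not part of CMPhysBench.
import pint

ureg = pint.UnitRegistry()


def dimensionally_consistent(pred: str, ref: str) -> bool:
    """True if two quantity strings share the same physical dimensions."""
    return ureg.Quantity(pred).dimensionality == ureg.Quantity(ref).dimensionality


# A 3.2 eV prediction is dimensionally consistent with a reference in joules,
# but not with one in tesla:
assert dimensionally_consistent("3.2 eV", "5.1e-19 J") is True
assert dimensionally_consistent("3.2 eV", "2.5 T") is False
```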
Such improvements are expected to guide targeted retraining, curriculum design, and architectural innovation, narrowing the observed gap and promoting reliable cross-disciplinary application of LLMs in computational physics (Wang et al., 25 Aug 2025).
6. Research and Pedagogical Impact
CMPhysBench is positioned as both a diagnostic tool and a catalyst for research. The benchmark provides granular feedback on reasoning deficiencies, informs the design of physics-aligned LLM evaluation protocols, and enables quantifiable progress as models evolve. For educators and curriculum designers, CMPhysBench’s taxonomy and scoring schema offer insights into areas where LLMs excel or underperform, guiding instructional strategies for integrating AI tools into graduate-level training.
For the research community, CMPhysBench serves as a standardized and domain-tailored reference benchmark against which future LLM architectures, training strategies, and symbolic-augmented models can be assessed. Its design aligns with contemporary trends in physics-aware machine learning, offering a robust foundation for linking model diagnostics to practical, research-grade tasks—a necessary step toward scientific reasoning and real-world deployment in condensed matter physics.