Expression Edit Distance (EED) Score
- The Expression Edit Distance (EED) Score is a metric that quantifies the structural similarity between symbolic expressions by computing the minimal edit operations on their expression trees.
- It utilizes an adapted Zhang–Shasha tree-edit algorithm with subtree discounting to assess partial correctness in model-generated vs. reference answers.
- This continuous scoring method improves sample efficiency and provides more nuanced insight than traditional binary metrics by rewarding structurally similar, nearly correct solutions.
The Expression Edit Distance (EED) Score is an automated metric that quantifies the similarity between symbolic mathematical expressions using an edit-distance-based approach over expression trees. Developed to enable fine-grained, continuous assessment of model-generated symbolic answers—especially within demanding benchmarks like PHYBench—the EED Score addresses the limitations of binary correctness metrics by providing partial credit for partially correct or structurally similar responses.
1. Definition and Motivation
The EED Score is designed to measure the degree of alignment between two symbolic expressions, typically represented in LaTeX or another symbolic format, by calculating the minimum number of structural edits required to transform a model-generated answer into the reference solution. The metric operates over tree representations of mathematical expressions, capturing their hierarchical composition of operators, symbols, and functions.
This approach was introduced within the PHYBench benchmark for evaluating LLMs on complex physics reasoning tasks, where answers are often partially correct and contain subtle errors in structure or coefficient that binary metrics cannot account for. By outputting a continuous score reflecting the degree of closeness between the candidate and the ground truth, the EED Score enables more nuanced and informative evaluation of model reasoning (Qiu et al., 22 Apr 2025).
2. Mathematical Formulation
The EED Score relies on measuring the edit distance between the expression tree (abstract syntax tree, AST) of the ground truth expression, $T_{\text{gt}}$, and that of the candidate expression, $T_{\text{gen}}$. The scoring procedure is as follows:
- Symbolic Parsing and Simplification: Both expressions are parsed into symbolic trees using a computer algebra system (e.g., via SymPy), and simplified to normalize structurally equivalent forms.
- Tree Construction: The simplified expressions are represented as trees, with nodes corresponding to operators, operands, and sub-expressions.
- Extended Edit Distance Calculation: An adaptation of the Zhang–Shasha tree edit distance algorithm is used to compute the minimal sequence of node insertions, deletions, and updates that transforms $T_{\text{gen}}$ into $T_{\text{gt}}$.
- Relative Edit Distance: The normalized edit distance is computed as
$$r = \frac{d(T_{\text{gen}}, T_{\text{gt}})}{\mathrm{Size}(T_{\text{gt}})},$$
where $d(\cdot, \cdot)$ is the tree edit distance and $\mathrm{Size}(T_{\text{gt}})$ is the total number of nodes in the ground truth tree.
- Piecewise Linear Score Mapping: The relative distance $r$ is mapped to a score via
$$\mathrm{EED} = \begin{cases} 100 & \text{if } r = 0 \\ 60 - 100\,r & \text{if } 0 < r < 0.6 \\ 0 & \text{if } r \ge 0.6 \end{cases}$$
For exact matches, the score is 100. For partially correct answers (where $0 < r < 0.6$), a linearly decreasing score is assigned. Answers with significant structural divergence ($r \ge 0.6$) receive a score of zero. For example, an answer whose tree is 4 edits away from a 20-node ground truth tree has $r = 0.2$ and scores $60 - 100 \times 0.2 = 40$.
- Subtree Discounting: Edits involving larger subtrees (more than five nodes) are discounted, with the cost set to 60% of the total node-by-node edit cost. This accommodates cases where entire physical subcomponents are omitted or inserted, treating each such subtree as a single semantic operation. A code sketch of the full procedure follows this list.
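To make the procedure concrete, below is a minimal sketch of an EED-style scorer built from off-the-shelf components: SymPy for parsing and simplification, and the `zss` package for Zhang–Shasha tree edit distance. The helper names (`to_tree`, `tree_size`, `eed_score`) are illustrative, unit edit costs are assumed, and the subtree-discounting extension (which would require customizing the algorithm's cost model) is omitted.

```python
# Minimal sketch of an EED-style scorer (pip install sympy zss).
# Unit edit costs are assumed; PHYBench's subtree discounting would
# require customizing the Zhang-Shasha cost model and is omitted here.
import sympy as sp
from zss import Node, simple_distance

def to_tree(expr):
    """Convert a SymPy expression into a zss tree: interior nodes are
    labeled by the operator head (Add, Mul, Pow, ...), leaves by the
    symbol or number they hold."""
    if expr.args:
        node = Node(expr.func.__name__)
        for arg in expr.args:
            node.addkid(to_tree(arg))
        return node
    return Node(str(expr))

def tree_size(node):
    """Total number of nodes in a zss tree."""
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def eed_score(gt_expr, gen_expr):
    """Apply the piecewise mapping above to the relative edit distance."""
    t_gt = to_tree(sp.simplify(gt_expr))
    t_gen = to_tree(sp.simplify(gen_expr))
    r = simple_distance(t_gen, t_gt) / tree_size(t_gt)
    if r == 0:
        return 100.0
    return 60.0 - 100.0 * r if r < 0.6 else 0.0

m, g, x = sp.symbols("m g x")
print(eed_score(m * g * x, m * g * x))      # exact match -> 100.0
print(eed_score(m * g * x, 2 * m * g * x))  # one extra node -> 35.0
```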
3. Sample Efficiency and Statistical Properties
The EED Score confers substantially improved sample efficiency compared to binary scoring. Empirical analysis finds that the EED metric achieves approximately 204% of the sample efficiency of binary accuracy (standard deviation ~80%): the discriminatory power achievable with 500 problems scored by EED is equivalent to that of roughly 1,000 problems evaluated by binary accuracy. The continuous-valued scores and their broader variance distribution ensure that smaller performance differences between systems are detected with fewer evaluation samples (Qiu et al., 22 Apr 2025).
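The mechanism behind this gain can be illustrated with a small simulation (not drawn from the paper): two hypothetical models with a 5-point gap in true mean score are easier to separate when each problem yields a continuous score with low per-sample variance than when it yields a 0/1 outcome. All distributions below are assumptions chosen purely for illustration.

```python
# Illustrative simulation (not from the paper): separating two models
# with a 5-point gap in true mean score under binary vs. continuous scoring.
import numpy as np

rng = np.random.default_rng(0)
n = 500                  # problems per model
mu_a, mu_b = 0.50, 0.55  # hypothetical true mean scores

def sem(x):
    """Standard error of the sample mean."""
    return x.std(ddof=1) / np.sqrt(len(x))

# Binary scoring: each problem contributes 0 or 1.
bin_a = rng.binomial(1, mu_a, n)
bin_b = rng.binomial(1, mu_b, n)

# Continuous scoring: same means, lower per-sample variance
# (modeled here as Beta draws purely for illustration).
k = 10.0
cont_a = rng.beta(k * mu_a, k * (1 - mu_a), n)
cont_b = rng.beta(k * mu_b, k * (1 - mu_b), n)

for name, a, b in [("binary", bin_a, bin_b), ("continuous", cont_a, cont_b)]:
    gap = b.mean() - a.mean()
    noise = np.hypot(sem(a), sem(b))  # combined standard error
    print(f"{name:10s} gap = {gap:+.3f}, separation = {gap / noise:.1f} SE")
```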
4. Implementation and Practical Workflow
In practice, the EED Score is implemented as part of the PHYBench evaluation pipeline:
- Model outputs and ground truth answers are collected as LaTeX-encoded symbolic expressions.
- Both are parsed and simplified using tools like SymPy to produce comparable trees.
- The extended Zhang–Shasha tree edit distance algorithm is applied to these trees.
- The resulting edit distance is normalized and mapped to a score as described above.
The subtree discounting mechanism is incorporated during edit sequence computation, ensuring that structural operations on substantial expression components are not penalized disproportionately relative to finer-grained modifications.
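A sketch of the front end of such a pipeline, assuming answers arrive as LaTeX strings, might use SymPy's LaTeX parser (which requires the antlr4-python3-runtime package); the inputs shown are invented examples:

```python
# Front-end sketch: LaTeX answers -> simplified SymPy expressions.
# Requires: pip install sympy antlr4-python3-runtime
from sympy import simplify
from sympy.parsing.latex import parse_latex

gt_raw = r"\frac{m g \sin\theta}{2}"     # invented reference answer
gen_raw = r"\frac{1}{2} m g \sin\theta"  # invented model output

gt = simplify(parse_latex(gt_raw))
gen = simplify(parse_latex(gen_raw))

# Syntactically different but equivalent forms normalize to the same
# expression, so the subsequent tree edit distance is zero.
print(gt.equals(gen))  # True
```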
5. Role within PHYBench and Model Evaluation
PHYBench uses the EED Score to evaluate the reasoning performance of LLMs on a challenging set of 500 original physics problems. These problems often require multi-step manipulations and result in complex expressions, where models rarely achieve exact correctness. The EED Score distinguishes correctness gradations: minor mistakes yield modest penalties, while fundamental misunderstandings or structural failures result in low scores.
The metric enables:
- Finer differentiation among models that would otherwise cluster at zero or perfect accuracy in binary metrics.
- Continuous tracking of skill improvements and errors, revealing incremental advances in reasoning capability.
- Robust, data-efficient benchmarking across a wide range of question difficulties and answer complexities.
6. Comparison with Binary and Structural Metrics
Traditional metrics, such as accuracy or exact match, reduce the evaluation of symbolic answers to binary outcomes: fully correct or incorrect. Such metrics provide little insight into partial correctness, especially for complex, multi-stage answers or when a model's answer is nearly but not fully correct.
The EED Score’s tree-edit approach offers several key advantages:
- Continuity: Assigns partial credit proportional to the degree of structural similarity.
- Granularity: Differentiates between small, local errors and major conceptual gaps.
- Broad Distribution: Produces a more informative performance histogram, improving statistical reliability and ranking sensitivity.
By combining symbolic simplification and tree-edit algorithms, the EED Score avoids pitfalls where semantically equivalent but syntactically different expressions would otherwise be rated as incorrect.
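A small SymPy example illustrates the pitfall that the simplification step avoids; the expressions are invented for illustration:

```python
# Why simplification matters: these answers differ syntactically but
# are algebraically identical, so neither should be penalized.
import sympy as sp

x = sp.symbols("x")
reference = x + 1
candidate = (x**2 - 1) / (x - 1)

# Raw trees differ substantially; after simplification both reduce
# to x + 1 and the tree edit distance becomes zero.
print(sp.simplify(candidate) == sp.simplify(reference))  # True
```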
7. Implications and Broader Applications
The introduction of the EED Score in PHYBench sets a precedent for nuanced, structure-aware evaluation of symbolic outputs. Possible implications and extensions include:
- Application to other benchmarks where outputs are structured, such as theorem proving, code synthesis, chemistry, or logic tasks.
- Use as a training reward signal to encourage models to progress incrementally toward correct structured reasoning.
- Inspiration for the design of further edit-distance family metrics tailored to domain-specific structures (e.g., programming language ASTs or chemical reaction graphs).
This suggests that structure-aware edit distance approaches could play a central role in future evaluation frameworks for AI systems required to produce or manipulate complex symbolic representations.