Equality Graphs (E-Graphs) Overview
- Equality graphs are data structures that partition symbolic expressions into e-classes of semantically equivalent terms using e-nodes and shared subtrees.
- They enable equality saturation by exhaustively applying rewrite rules to capture all potential optimizations, thereby avoiding phase-ordering challenges.
- E-graphs improve efficiency in domains such as compiler optimization, symbolic regression, and combinatorial search by compacting redundant expression representations.
Equality graphs (e-graphs) are data structures designed for the simultaneous, compact representation of a vast number of equivalent symbolic expressions or program terms. Initially developed in the context of rewrite-based program optimization, e-graphs have since become foundational in compiler optimizations, symbolic regression, and search algorithms where symbolic equivalence and state abstraction dramatically impact efficiency and solution quality. By merging subexpressions identified as equivalent under a set of rewrite rules, e-graphs enable equality saturation—the process of exhaustively applying rewrites to discover all reachable forms without phase-ordering concerns. Recent research has realized the utility of e-graphs well beyond traditional rewriting, notably in pruning redundant exploration within Monte Carlo Tree Search (MCTS), supporting deep reinforcement learning (DRL) on symbolic spaces, and clustering decision nodes in combinatorial search domains.
1. Definition and Core Structure
An e-graph is a congruence-closed graph over terms, partitioned into e-classes, each representing a set of syntactically distinct but semantically equivalent expressions. Each e-class contains one or more e-nodes, which are operator applications whose children are references to other e-classes. This indirect, graph-based structure allows shared subtrees between e-classes, so a single e-graph can concisely encode exponentially many equivalent expressions. Merge operations maintain equivalence relations, using union-find over e-class identifiers, and a hash-consing table to canonicalize operator applications.
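The structure described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the class name `EGraph` and the methods `add`, `merge`, `find`, and `rebuild` are names chosen here, and the naive `rebuild` loop rescans the whole hash-cons table, whereas production implementations use more efficient incremental schemes.

```python
class EGraph:
    """Toy e-graph: union-find over e-class ids plus a hash-cons table."""

    def __init__(self):
        self.parent = []    # union-find parent pointers, one per e-class id
        self.hashcons = {}  # canonical e-node (op, child ids...) -> e-class id

    def find(self, a):
        # Follow parent pointers to the class representative (path halving).
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def canonicalize(self, node):
        # Replace each child id by its current representative.
        op, *children = node
        return (op, *(self.find(c) for c in children))

    def add(self, op, *children):
        # Hash-consing: an e-node that already exists is never duplicated.
        node = self.canonicalize((op, *children))
        if node in self.hashcons:
            return self.find(self.hashcons[node])
        cid = len(self.parent)
        self.parent.append(cid)
        self.hashcons[node] = cid
        return cid

    def merge(self, a, b):
        a, b = self.find(a), self.find(b)
        if a != b:
            self.parent[b] = a
            self.rebuild()
        return self.find(a)

    def rebuild(self):
        # Restore congruence closure: if two e-nodes become identical after
        # canonicalization, their e-classes must be merged as well.
        changed = True
        while changed:
            changed = False
            table = {}
            for node, cid in self.hashcons.items():
                node, cid = self.canonicalize(node), self.find(cid)
                other = table.get(node)
                if other is not None and self.find(other) != cid:
                    self.parent[self.find(other)] = cid
                    changed = True
                table[node] = cid
            self.hashcons = table


g = EGraph()
a, b = g.add("a"), g.add("b")
fa, fb = g.add("f", a), g.add("f", b)
g.merge(a, b)                      # declare a and b equivalent
print(g.find(fa) == g.find(fb))    # f(a) and f(b) are now congruent: True
```

Because e-nodes point at e-classes rather than at concrete subterms, the single merge of `a` and `b` is enough for `f(a)` and `f(b)` to fall into the same class.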
Formally, given a context-free grammar and a set of rewrite rules (e.g., distributivity, commutativity, trigonometric and logarithmic identities), the e-graph can be initialized from a term tree, expanded via matched applications of the rewrite rules, and queried for e-classes containing given expressions:
- E-class: A set of expressions equivalent under the rewrite rules.
- E-node: An operator referencing its child e-classes.
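Symbolic equivalence is defined by

$$e_1 \equiv_{\mathcal{R}} e_2 \iff e_1 \leftrightarrow_{\mathcal{R}}^{*} e_2,$$

where $\mathcal{R}$ is the set of rewrite rules and $\leftrightarrow_{\mathcal{R}}^{*}$ denotes zero or more applications of rewrites in $\mathcal{R}$.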
2. Equality Saturation and Extraction
Equality saturation is the core methodology enabled by e-graphs: rewrite rules are applied until the e-graph captures every rewrite opportunity. In each iteration, for every e-class and each rewrite rule, every match of the rule's left-hand side is instantiated as new e-nodes for the right-hand side, and the result is merged with the matched class as necessary. This process continues until no further matches are possible or a resource limit is reached (e.g., a cap on the total number of e-nodes). Extraction then traverses the saturated e-graph to select the expression that minimizes a cost metric (e.g., the shortest AST), using either a greedy procedure or integer linear programming (ILP).
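The extraction step can be illustrated with a small greedy sketch. It assumes the saturated e-graph is already given as a plain dictionary from e-class id to a list of e-nodes, and that the cost metric is AST size; the function name `extract_smallest` and this dictionary encoding are assumptions made for the example, not an interface from the cited work.

```python
import math


def extract_smallest(classes, root):
    """Greedy bottom-up extraction of the smallest term in class `root`.

    classes: dict mapping e-class id -> list of e-nodes, where each e-node
             is a tuple (op, child_class_id, ...).
    Returns (cost, term); assumes `root` contains at least one finite term.
    """
    best = {cid: (math.inf, None) for cid in classes}
    changed = True
    while changed:  # iterate to a fixed point; costs only ever decrease
        changed = False
        for cid, nodes in classes.items():
            for op, *children in nodes:
                cost = 1 + sum(best[c][0] for c in children)
                if cost < best[cid][0]:
                    best[cid] = (cost, (op, *children))
                    changed = True

    def build(cid):
        # Rebuild a concrete term from the cheapest e-node of each class.
        op, *children = best[cid][1]
        return (op, *(build(c) for c in children)) if children else op

    return best[root][0], build(root)


# Hand-written e-graph for (x * 2) / 2 after rewrites such as
# (x * y) / z -> x * (y / z), y / y -> 1 and x * 1 -> x.
classes = {
    0: [("x",), ("*", 0, 3), ("/", 2, 1)],  # x, x * 1, (x * 2) / 2
    1: [("2",)],
    2: [("*", 0, 1)],                       # x * 2
    3: [("1",), ("/", 1, 1)],               # 1, 2 / 2
}
print(extract_smallest(classes, 0))  # (1, 'x')
```

Class 0 refers to itself through `x * 1`, which is why the costs are computed by fixed-point iteration rather than by plain recursion.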
Advantages of equality saturation include eliminating the phase-ordering problem typical in sequential term rewriting and enabling global optimality across all reachable rewrites. However, if saturation is incomplete due to resource limits, the phase-ordering problem reappears during e-graph construction: the choice and sequence of rules applied then affect the result (He et al., 2023).
3. E-graphs in Symbolic Regression and Monte Carlo Tree Search
In symbolic regression, the search space is vast due to the combinatorics of grammar-based expression generation. Traditional search methods—MCTS, DRL, LLMs—treat each expression as a distinct output, leading to redundant exploration when many expressions are functionally identical. The EGG-SR framework (Jiang et al., 8 Nov 2025) embeds e-graphs directly into search algorithms, compactly representing equivalence classes and pruning redundant search.
EGG-MCTS integrates e-graphs into Monte Carlo Tree Search for symbolic regression:
- EGG-MCTS workflow: At each node (partial rule sequence), a local e-graph is built and saturated under the rewrite system. At backpropagation, rewards from rollouts are propagated not just to the traversed node, but also to all nodes corresponding to equivalent expressions sampled from the e-graph.
- Pseudocode: EGG-MCTS extends standard MCTS by this reward propagation mechanism using e-graph-derived equivalence.
- Regret Bound: By merging equivalent paths, EGG-MCTS reduces the effective branching factor, yielding a provably tighter regret bound than standard MCTS.
- Implementation: E-graph operations (match, substitute, merge) are reported to scale linearly. Hash-consing and union-find enable efficient management. Overheads for e-graph construction and extraction are negligible versus rollout and coefficient fitting.
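The equivalence-aware backpropagation step can be sketched as follows. This is an illustration of the idea only, not the EGG-MCTS pseudocode: the `Stats` class, the `backpropagate` signature, and the `equivalents` callback (a stand-in for sampling equivalent expressions from a saturated e-graph) are all hypothetical names introduced here.

```python
from collections import defaultdict


class Stats:
    """Visit count and accumulated reward for one search-tree node."""

    def __init__(self):
        self.visits = 0
        self.total_reward = 0.0

    @property
    def mean(self):
        return self.total_reward / self.visits if self.visits else 0.0


def backpropagate(stats, path, reward, equivalents):
    """Credit `reward` to each node on `path` and to its equivalent nodes.

    stats:       mapping node key -> Stats
    path:        node keys traversed from the root to the expanded leaf
    equivalents: callable, node key -> iterable of equivalent node keys
                 (hypothetical stand-in for an e-graph lookup)
    """
    for node in path:
        # Standard MCTS would update only `node`; here equivalent nodes
        # receive the same update, so they share rollout evidence.
        for key in {node, *equivalents(node)}:
            stats[key].visits += 1
            stats[key].total_reward += reward


# Toy equivalence table standing in for e-graph queries.
EQUIV = {"x+y": ["y+x"], "y+x": ["x+y"]}
stats = defaultdict(Stats)
backpropagate(stats, ["root", "x+y"], 0.8, lambda n: EQUIV.get(n, []))
print(stats["y+x"].visits, stats["y+x"].mean)  # 1 0.8
```

A node that was never traversed (`y+x`) still receives statistics, which is how merged paths reduce the amount of search spent on functionally identical expressions.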
4. E-graph-Guided Rewrite Planning and the Phase-Ordering Problem
MCTS-GEB (He et al., 2023) demonstrates how MCTS can guide e-graph construction itself, instead of the naive round-robin firing of rewrites, which is sensitive to the infamous phase-ordering problem when the e-graph cannot be saturated due to resource constraints. Here, e-graph construction is framed as an MDP:
- State: Current e-graph (tracked by the sequence of applied rewrite rules).
- Action: Choice of rewrite rule to apply.
- Transition: Deterministic update of the e-graph by matching and applying the chosen rule.
- Reward: Improvement in cost (e.g., extracted AST size) after simulating extractors following random rule applications.
The use of MCTS allows the planner to focus the construction budget (limited by the e-node cap) on high-reward rewrites, which can produce dramatically smaller extracted expressions (up to 49x shorter) than baseline solutions. Parallel rollouts and replay of action sequences avoid storing all intermediate e-graphs. MCTS-GEB introduces overhead, but this is bounded and acceptable for non-latency-sensitive applications.
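The MDP framing can be made concrete with a toy environment. The sketch below makes several simplifying assumptions that are not taken from MCTS-GEB: a capped set of equivalent terms stands in for the e-graph, the reward is the immediate drop in the smallest term size rather than a rollout-based estimate, and the names `RewriteMDP`, `RULES`, and the three rules are invented for the example.

```python
def size(t):
    """AST size of a term given as a nested tuple (op, arg, ...) or a leaf."""
    return 1 if not isinstance(t, tuple) else 1 + sum(size(c) for c in t[1:])


def rewrite_everywhere(rule, t):
    """Yield every term obtained by applying `rule` at one position of t."""
    out = rule(t)
    if out is not None:
        yield out
    if isinstance(t, tuple):
        for i in range(1, len(t)):
            for sub in rewrite_everywhere(rule, t[i]):
                yield t[:i] + (sub,) + t[i + 1:]


def mul_one(t):   # x * 1 -> x
    if isinstance(t, tuple) and t[0] == "*" and t[2] == 1:
        return t[1]


def div_self(t):  # x / x -> 1
    if isinstance(t, tuple) and t[0] == "/" and t[1] == t[2]:
        return 1


def reassoc(t):   # (x * y) / z -> x * (y / z)
    if (isinstance(t, tuple) and t[0] == "/"
            and isinstance(t[1], tuple) and t[1][0] == "*"):
        return ("*", t[1][1], ("/", t[1][2], t[2]))


RULES = {"mul_one": mul_one, "div_self": div_self, "reassoc": reassoc}


class RewriteMDP:
    """State: set of known-equivalent terms plus the rule history."""

    def __init__(self, term, cap=50):
        self.terms = {term}   # crude stand-in for an e-graph
        self.history = []     # sequence of applied rule names
        self.cap = cap        # stand-in for the e-node limit

    def cost(self):
        return min(size(t) for t in self.terms)

    def step(self, action):
        """Apply one rule everywhere; return the improvement in cost."""
        before = self.cost()
        new = {u for t in self.terms
               for u in rewrite_everywhere(RULES[action], t)}
        for u in sorted(new, key=repr):  # deterministic order under the cap
            if len(self.terms) >= self.cap:
                break
            self.terms.add(u)
        self.history.append(action)
        return before - self.cost()


env = RewriteMDP(("/", ("*", "x", 2), 2))  # (x * 2) / 2, size 5
rewards = [env.step(a) for a in ["reassoc", "div_self", "mul_one"]]
print(rewards, env.cost())  # [0, 2, 2] 1
```

Applying `mul_one` first would earn no reward, because nothing matches it yet. Under a tight budget, that kind of ordering sensitivity is what a planner is meant to manage.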
5. State Abstraction, Clustering, and Elastic E-graph Methods
Elastic MCTS (Xu et al., 2022) generalizes the e-graph concept from symbolic expressions to game state abstraction. Here, nodes are clustered dynamically by behavioral similarity, based on reward and transition function proximity (approximate MDP homomorphism). A state abstraction function maps concrete states to abstract clusters; clusters are merged at a fixed interval of MCTS steps and are split back into their concrete nodes once an iteration threshold is reached. Such elastic abstraction allows early-stage search to benefit from compression, followed by full-fidelity exploration.
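Concretely, clustering merges nodes $s_1$, $s_2$ into the same abstract cluster if, for all actions $a$,

$$\left| R(s_1, a) - R(s_2, a) \right| \le \eta_R \quad \text{and} \quad \sum_{s'} \left| T(s' \mid s_1, a) - T(s' \mid s_2, a) \right| \le \eta_T,$$

where $R$ and $T$ are the reward and transition functions and $\eta_R$, $\eta_T$ are the reward and transition tolerances.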
Abstraction yields an order-of-magnitude reduction in search-tree size and large empirical gains in combinatorial games.
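A merge test of this kind can be sketched in Python. The sketch assumes the criterion compares, for every action, the absolute reward difference and the summed absolute difference of next-state probabilities against two tolerances; the function name `approx_equivalent`, the parameter names `eta_r` and `eta_t`, and the dictionary encoding of the model are assumptions for illustration.

```python
def approx_equivalent(s1, s2, actions, reward, transition, eta_r, eta_t):
    """Return True if s1 and s2 may share an abstract cluster.

    reward:     callable (state, action) -> float
    transition: callable (state, action) -> dict next_state -> probability
    eta_r:      tolerance on the reward difference
    eta_t:      tolerance on the summed transition-probability difference
    """
    for a in actions:
        if abs(reward(s1, a) - reward(s2, a)) > eta_r:
            return False
        p1, p2 = transition(s1, a), transition(s2, a)
        support = set(p1) | set(p2)
        gap = sum(abs(p1.get(n, 0.0) - p2.get(n, 0.0)) for n in support)
        if gap > eta_t:
            return False
    return True


# Toy model with a single action "a".
R = {("s1", "a"): 1.0, ("s2", "a"): 1.05, ("s3", "a"): 0.0}
T = {
    ("s1", "a"): {"t": 1.0},
    ("s2", "a"): {"t": 0.9, "u": 0.1},
    ("s3", "a"): {"u": 1.0},
}


def reward(s, a):
    return R[(s, a)]


def transition(s, a):
    return T[(s, a)]


print(approx_equivalent("s1", "s2", ["a"], reward, transition, 0.1, 0.5))  # True
print(approx_equivalent("s1", "s3", ["a"], reward, transition, 0.1, 0.5))  # False
```

Setting both tolerances to zero recovers exact matching, while larger tolerances trade fidelity for a smaller abstract tree.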
6. Empirical Results and Impact Across Domains
Key empirical findings from the referenced literature include:
- Symbolic Regression: EGG-MCTS reduces normalized mean squared error (NMSE) by 1–2 orders of magnitude over vanilla MCTS. For example, on a noiseless trigonometric regression benchmark, standard MCTS reaches an NMSE of $0.033$, which EGG-MCTS improves on. Node visitation and search depth are both improved (Jiang et al., 8 Nov 2025).
- Expression Simplification and Program Optimization: MCTS-GEB achieves up to 49x shorter extracted expressions in competitive logic-simplification benchmarks (He et al., 2023).
- Game Search: Elastic MCTS compresses trees by a factor of 10 and boosts win rates over unit-ordered MCTS baselines in complex, high-branching board games (Xu et al., 2022).
- Efficiency: E-graph overhead (memory and run time) is negligible in symbolic regression and symbolic rewriting scenarios. In Elastic MCTS, abstraction adds only 18 ms to each game move.
7. Limitations and Practical Considerations
E-graphs scale well due to subexpression sharing: representation size is polynomial in the number of rewrites, despite an exponential number of actual variants. However, memory can grow rapidly for large or deep term trees. Resource (e-node) limits are required in practice. The choice of rewrite set and policy for constructing and saturating e-graphs remains critical for solution quality, particularly in incomplete-saturation regimes where planner-driven approaches (EGG-MCTS, MCTS-GEB) are essential. For complex MDPs and strategy games, carefully tuned abstraction parameters (tolerances, merge/split cycles) are required to achieve optimal trade-offs between search efficiency and fidelity.
E-graphs have established a new foundation for symbolic reasoning tasks across AI—compilers, program optimizers, symbolic regression engines, and planning agents—wherever symbolic equivalence and state abstraction play central roles. Existing implementations leverage Python, hash-consing, union-find, and optional graph visualization tools such as Graphviz. Empirical, theoretical, and architectural advances continue to extend the reach and impact of e-graphs in symbolic AI.