
Hierarchical Attention Generates Better Proofs (2504.19188v1)

Published 27 Apr 2025 in cs.LG, cs.AI, cs.CL, and cs.LO

Abstract: LLMs have shown promise in formal theorem proving, but their token-level processing often fails to capture the inherent hierarchical nature of mathematical proofs. We introduce **Hierarchical Attention**, a regularization method that aligns LLMs' attention mechanisms with mathematical reasoning structures. Our approach establishes a five-level hierarchy from foundational elements to high-level concepts, ensuring structured information flow in proof generation. Experiments demonstrate that our method improves proof success rates by 2.05% on miniF2F and 1.69% on ProofNet while reducing proof complexity by 23.81% and 16.50% respectively. The code is available at https://github.com/Car-pe/HAGBP.

Summary

  • The paper introduces a novel Hierarchical Attention method that aligns LLMs' token focus with a five-level proof structure in the Lean environment.
  • Experiments on the miniF2F and ProofNet benchmarks show pass-rate improvements of up to 2.05% and proof-complexity reductions of over 16%.
  • Ablation studies confirm that layer-wise adaptation and hierarchical guidance outperform traditional approaches in formal theorem proving.

LLMs have shown promise in formal theorem proving, but their token-level processing struggles to capture the hierarchical nature inherent in mathematical proofs. The paper "Hierarchical Attention Generates Better Proofs" (2504.19188) introduces Hierarchical Attention, a regularization method designed to align LLMs' attention mechanisms with the structural properties of mathematical reasoning, specifically within the Lean theorem prover environment.

The core idea is to impose a structured information flow constraint on the LLM's attention based on a five-level hierarchy identified in mathematical statements:

  1. Context Layer (T_0): Background information.
  2. Case Layer (T_1): Pattern matching and case analysis.
  3. Type Layer (T_2): Type declarations and definitions.
  4. Instance Layer (T_3): Instance declarations and examples.
  5. Goal Layer (T_4): The theorem or proposition to be proved.

This hierarchy follows a natural partial order: context ≺ case ≺ type ≺ instance ≺ goal. The method guides attention flow between tokens t_i and t_j based on their hierarchical levels:

  • Unrestricted: allowed between tokens at the same level (level(t_i) = level(t_j)).
  • Guided: preferred flow from lower to higher levels (level(t_i) < level(t_j)).
  • Limited: attention flow from higher to lower levels (level(t_i) > level(t_j)) is discouraged.
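These three regimes can be captured by a per-pair "violation" mask derived from token levels. The function below is an illustrative sketch in plain Python (the name and representation are my own, not the paper's): it marks the positions where attention would flow from a higher level to a lower one.

```python
def flow_violation_mask(levels):
    """Return an n x n boolean matrix M where M[i][j] is True when
    attention from token t_i to token t_j violates the hierarchy,
    i.e. level(t_i) > level(t_j) (the discouraged direction).

    `levels` is a per-token list of integers 0..4 for T_0..T_4.
    Illustrative sketch only, not the paper's implementation.
    """
    n = len(levels)
    return [[levels[i] > levels[j] for j in range(n)] for i in range(n)]

# Example: three tokens at levels [context, type, goal].
mask = flow_violation_mask([0, 2, 4])
# mask[2][0] is True: a goal-level token attending to a context-level
# token is in the "limited" direction; mask[0][2] is False.
```

Same-level and lower-to-higher pairs are left unmasked, matching the "unrestricted" and "guided" regimes above.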

The implementation involves two steps:

  1. Extract Flow Pattern: A rule-based parsing algorithm identifies the hierarchical level of each component in the Lean theorem text using syntactic cues such as `case`, `Type`, `:`, and `.`.
  2. Guide Attention: A flow loss L_flow is introduced during training, penalizing attention weights that violate the hierarchical constraints (i.e., attention from higher to lower levels). This loss is computed per layer l and weighted by a layer-wise adaptation factor α_l = 1 - l/L, which reduces the constraint strength in deeper layers. The final training objective combines the standard language modeling loss L_LM with the flow loss: L = L_LM + λ L_flow, where λ controls the regularization strength.
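The flow loss can be sketched as follows. This is an illustrative reconstruction under the stated α_l = 1 - l/L weighting, not the authors' code; `attn_layers` and `lam` are assumed names, and attention matrices are plain nested lists for clarity.

```python
def flow_loss(attn_layers, levels, lam=0.1):
    """Illustrative sketch of the flow-loss regularizer.

    attn_layers: list of L attention matrices, each an n x n list of
    attention weights for one layer.
    levels: per-token hierarchy levels (0..4 for T_0..T_4).
    lam: regularization strength (the paper's lambda).

    Penalizes attention placed on hierarchy-violating positions
    (level(t_i) > level(t_j)), scaled per layer by
    alpha_l = 1 - l/L so deeper layers are constrained less.
    """
    n = len(levels)
    num_layers = len(attn_layers)
    total = 0.0
    for l, attn in enumerate(attn_layers):
        alpha = 1.0 - l / num_layers
        penalty = sum(
            attn[i][j]
            for i in range(n) for j in range(n)
            if levels[i] > levels[j]
        )
        total += alpha * penalty
    return lam * total
```

In training, this term would simply be added to the language modeling loss, i.e. `loss = lm_loss + flow_loss(attn_layers, levels, lam)`.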

The method was evaluated by fine-tuning a Pythia-2.8B model on the LeanDojo Benchmark 4 dataset and testing its performance on miniF2F and ProofNet benchmarks using two evaluation strategies: best-first search (BFS) and single-pass sampling (SPS). The baseline for comparison was the LLMsTEP method.

The results demonstrate that Hierarchical Attention improves both the proof success rate (pass@K) and proof conciseness (measured by the average complexity ratio R_avg, where R_avg < 1 indicates shorter proofs).
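Assuming R_avg is the mean per-theorem ratio of the method's proof length to the baseline's over commonly solved theorems (the paper's exact definition may differ), the metric can be sketched as:

```python
def avg_complexity_ratio(method_lengths, baseline_lengths):
    """Hypothetical sketch of R_avg: the mean per-theorem ratio of
    proof length under the method vs. the baseline, computed over
    theorems both systems solve. R_avg < 1 means the method's proofs
    are shorter on average. Definition assumed, not from the paper.
    """
    ratios = [m / b for m, b in zip(method_lengths, baseline_lengths)]
    return sum(ratios) / len(ratios)
```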

  • On the miniF2F test set with BFS (K = 64), the pass rate increased by 2.05% (from 29.51% to 31.56%), and proof complexity was reduced by 23.81% (R_avg = 0.76). Similar improvements were observed on the miniF2F validation set.
  • On the ProofNet test set with BFS (K = 64), the pass rate increased by 1.69% (from 13.56% to 15.25%), and complexity was reduced by 16.50% (R_avg = 0.84). Similar gains were seen on the ProofNet validation set.
  • Even with the simpler SPS strategy, the method showed significant pass-rate improvements (e.g., 4.51% on the miniF2F test set at K = 64).

Attention pattern analysis confirmed that the method successfully enforces the limited flow constraint, significantly reducing attention from higher to lower levels compared to the baseline, even in layers without explicit constraints. This suggests that the hierarchical structure is internalized by the model. Ablation studies highlighted the benefits of the layer-wise adaptation for achieving better pass rates and showed that the fine-grained 5-level hierarchy generally performs better than a coarse-grained variant. A comparison with a baseline using explicit structural tags in the input showed that the attention guidance approach is significantly more effective.

Case studies illustrate how the method leads to more concise proofs by enabling the model to directly apply relevant information from lower levels to the goal, avoiding unnecessary intermediate steps.

The paper concludes that Hierarchical Attention is a promising direction for improving the mathematical reasoning capabilities of LLMs by explicitly guiding their attention according to the inherent structure of proofs. Limitations include the dependence of the hierarchy definition on Lean's semantics, the fixed nature of the hierarchy, and the lack of evaluation on larger, more advanced models. Ethical considerations include the use of public data and the need for human oversight when using AI for theorem proving.
