ASTormer: Structure-Aware Transformer for Text-to-SQL
- ASTormer is an AST structure-aware Transformer decoder that enhances text-to-SQL generation by integrating explicit tree structure with node type, parent rule, and depth embeddings.
- It employs both absolute and relative position embeddings to capture detailed structural features, thereby increasing syntactic generalization and execution accuracy.
- The model supports multiple traversal and node selection strategies, demonstrating superior effectiveness and efficiency across five popular text-to-SQL benchmarks.
ASTormer is an AST structure-aware Transformer decoder designed for text-to-SQL tasks, where the objective is to generate an executable SQL program given a natural language utterance and a database schema. ASTormer addresses the limitations of traditional grammar-based recurrent decoders by leveraging Transformer-based architectures tailored to incorporate explicit structural knowledge of the abstract syntax tree (AST) underlying SQL queries. The framework integrates node type, tree position, and parent rule—providing richer structure priors—while supporting multiple traversal strategies for AST generation. Empirical results on five text-to-SQL benchmarks demonstrate ASTormer's gains in both effectiveness and efficiency over competitive RNN-based and Transformer baselines (Cao et al., 2023).
1. Architectural Overview
ASTormer replaces RNN-based decoders, such as those utilizing LSTM cells with parent feeding (e.g., TranX), with an autoregressive Transformer decoding module. The overall architecture comprises two main components:
- Encoder: Any graph- or sequence-based encoder (e.g., RAT-SQL) processes the user question $Q$, table names $T$, and column names $C$. This produces an encoder memory matrix $X \in \mathbb{R}^{(|Q| + |T| + |C|) \times d}$, where $d$ is the embedding dimension.
- ASTormer Decoder: For each decoding step $j$, an unexpanded “frontier” node $n_j$ in the partially constructed AST is selected. The input embedding for $n_j$ is formed by summing the previous action embedding $a_{j-1}$, node-type embedding $e_{\text{type}}(n_j)$, parent-rule embedding $e_{\text{rule}}(n_j)$, and node depth embedding $e_{\text{depth}}(n_j)$, followed by Layer Normalization:
$$x_j = \mathrm{LayerNorm}\left(a_{j-1} + e_{\text{type}}(n_j) + e_{\text{rule}}(n_j) + e_{\text{depth}}(n_j)\right)$$
This embedding is passed through $L$ stacked ASTormer layers, each containing masked multi-head self-attention, cross-attention to encoder memory, and a feedforward network, resulting in the decoder state $h_j$. The output module predicts the next action: ApplyRule, SelectItem, or GenToken. The action is then applied to expand the AST; the process repeats until no frontier nodes remain.
2. AST Structural Encoding
2.1 Absolute Position Embeddings
ASTormer encodes critical node-level signals within the tree structure using absolute position embeddings:
- Node type ($e_{\text{type}}(n_j)$): Embeds the syntactic type of the frontier node $n_j$.
- Parent rule ($e_{\text{rule}}(n_j)$): Embeds the production rule applied at the parent node of $n_j$.
- Node depth ($e_{\text{depth}}(n_j)$): Encodes the tree depth for node $n_j$.
The input embedding at each decoding step is thus a sum of previous action and these absolute signals, subsequently normalized.
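The sum-then-normalize step can be illustrated with a minimal pure-Python sketch. The table names, keys, toy vectors, and the embedding dimension of 4 are all hypothetical, and the LayerNorm here omits the learned gain and bias that a real implementation would normally include.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def node_input_embedding(prev_action, node_type, parent_rule, depth, tables):
    """Sum the four looked-up embeddings element-wise, then apply LayerNorm."""
    parts = [
        tables["action"][prev_action],
        tables["type"][node_type],
        tables["rule"][parent_rule],
        tables["depth"][depth],
    ]
    summed = [sum(vals) for vals in zip(*parts)]
    return layer_norm(summed)

# Hypothetical toy embedding tables (assumption: dimension 4, hand-picked values).
TABLES = {
    "action": {"<start>": [0.1, 0.2, 0.3, 0.4]},
    "type": {"sql": [1.0, 0.0, 0.0, 0.0]},
    "rule": {"<none>": [0.0, 1.0, 0.0, 0.0]},
    "depth": {0: [0.0, 0.0, 0.5, 0.5]},
}

# Input embedding for the root node at the first decoding step.
x_root = node_input_embedding("<start>", "sql", "<none>", 0, TABLES)
```

Because all four signals are summed into a single vector, they must share the same dimension; concatenation would be an alternative design, but it is not what the text describes.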
2.2 Relative Position Embeddings
For every decoder position pair $(i, j)$, a relative position tuple $(\delta_i, \delta_j)$ is constructed as follows:
- Let LCA denote the lowest common ancestor of $n_i$ and $n_j$ in the AST.
- Set $\delta_i = \mathrm{depth}(n_i) - \mathrm{depth}(\mathrm{LCA})$ and $\delta_j = \mathrm{depth}(n_j) - \mathrm{depth}(\mathrm{LCA})$, each clamped within $[0, D]$ for a preset maximum $D$.
A lookup table $R$ maps each $(\delta_i, \delta_j)$ pair to a bias vector $r_{ij}$, which is then used throughout the self-attention computation to modify attention scores in a structure-aware manner. The encoding can capture ancestry (e.g., $(1, 0)$ implies $n_i$ is a child of $n_j$; $(1, 1)$ implies siblings).
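The tuple construction can be sketched as follows, assuming the tree is stored as a child-to-parent dictionary and each component is the node's distance to the lowest common ancestor. The function names, the toy AST, and the default clamp of 4 are hypothetical.

```python
def ancestors(node, parent):
    """Return the path from `node` up to the root, inclusive."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def relative_position(i, j, parent, max_dist=4):
    """Return (dist(i, LCA), dist(j, LCA)), each clamped to [0, max_dist]."""
    path_i = ancestors(i, parent)
    path_j = ancestors(j, parent)
    dist_on_path_j = {node: d for d, node in enumerate(path_j)}
    for dist_i, node in enumerate(path_i):
        # The first shared ancestor met while walking up from i is the LCA.
        if node in dist_on_path_j:
            return (min(dist_i, max_dist), min(dist_on_path_j[node], max_dist))
    raise ValueError("nodes are not in the same tree")

# Hypothetical toy AST: root -> {select, where}, select -> {col}.
PARENT = {"root": None, "select": "root", "where": "root", "col": "select"}

child_of = relative_position("select", "root", PARENT)   # child vs. its parent
siblings = relative_position("select", "where", PARENT)  # two children of root
```

With a clamp of `max_dist`, the lookup table needs at most `(max_dist + 1) ** 2` entries, regardless of how deep the AST grows.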
3. Attention Mechanisms and Modifications
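ASTormer employs a modified masked multi-head self-attention module augmented with structural bias. For head $h$ with head dimension $d_k$:
- The attention logit for position pair $(i, j)$ is:
$$e_{ij}^{(h)} = \frac{\left(x_i W_Q^{(h)}\right)\left(x_j W_K^{(h)} + r_{ij}\right)^\top}{\sqrt{d_k}}$$
$$\alpha_{ij}^{(h)} = \frac{\exp\left(e_{ij}^{(h)}\right)}{\sum_{j' \le i} \exp\left(e_{ij'}^{(h)}\right)}$$
Here $x_i$ is the layer input at position $i$ and $r_{ij} \in \mathbb{R}^{d_k}$ is the relative position bias vector for the pair. This can be condensed into matrix form as:
$$A^{(h)} = \mathrm{softmax}\left(\frac{Q^{(h)} \left(K^{(h)}\right)^\top + B^{(h)}}{\sqrt{d_k}} + M\right)$$
where $Q^{(h)}$ and $K^{(h)}$ stack the per-position queries $x_i W_Q^{(h)}$ and keys $x_j W_K^{(h)}$, $M$ is the causal mask ($0$ for $j \le i$, $-\infty$ otherwise), and $B^{(h)}$ with entries $B_{ij}^{(h)} = \left(x_i W_Q^{(h)}\right) r_{ij}^\top$ denotes the relative bias projections. Cross-attention to encoder memory $X$ is unchanged from standard Transformer formulations.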
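A minimal single-head sketch of structure-biased causal attention is shown below. It assumes the relative bias vector is added to the key before the dot product, which is one common way to inject relative position information; the exact placement in ASTormer may differ. The function name and the toy inputs are hypothetical, and the queries, keys, and values are taken as already projected.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def relative_self_attention(queries, keys, values, rel_bias):
    """Single-head causal attention with a key-side relative bias.

    queries/keys/values: lists of length-d_k vectors, one per decoder position.
    rel_bias[i][j]: length-d_k bias vector for the position pair (i, j).
    """
    d_k = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Causal mask: position i only attends to positions j <= i.
        logits = [
            dot(q, [k + r for k, r in zip(keys[j], rel_bias[i][j])]) / math.sqrt(d_k)
            for j in range(i + 1)
        ]
        peak = max(logits)  # subtract the max for numerical stability
        weights = [math.exp(logit - peak) for logit in logits]
        total = sum(weights)
        weights = [w / total for w in weights]
        outputs.append([
            sum(w * values[j][c] for j, w in enumerate(weights))
            for c in range(len(values[0]))
        ])
    return outputs

# Hypothetical toy inputs: two positions, d_k = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
ZERO = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]]]
# A large bias on pair (1, 0) pulls position 1 toward position 0.
BIASED = [[[0.0, 0.0], [0.0, 0.0]], [[0.0, 10.0], [0.0, 0.0]]]

out_plain = relative_self_attention(Q, K, V, ZERO)
out_biased = relative_self_attention(Q, K, V, BIASED)
```

With zero bias, position 1 attends mostly to itself; with the bias on the pair (1, 0), nearly all of its attention mass shifts to position 0, which is the mechanism by which tree relations can steer attention.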
4. Decoding Algorithm and Traversal Strategies
ASTormer supports multiple traversal paradigms and node selection strategies to maintain flexibility and performance:
- Frontier Management:
- Node Selection:
  - L2R (Left-to-Right): Deterministically choose nodes following the schema order.
  - Random: Uniformly sample one node during training, with all options expanded in the beam during inference.
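The two node selection strategies can be sketched as a small helper for the training-time choice. It assumes the frontier is already held as a list in the deterministic order; the function name and strategy labels are hypothetical.

```python
import random

def select_node(frontier, strategy="l2r", rng=None):
    """Pick the next frontier node to expand during training."""
    if not frontier:
        raise ValueError("no frontier nodes left to expand")
    if strategy == "l2r":
        # Deterministic: always take the first node in the fixed order.
        return frontier[0]
    if strategy == "random":
        # Uniform sample; pass a seeded random.Random for reproducibility.
        return (rng or random).choice(frontier)
    raise ValueError(f"unknown strategy: {strategy}")

first = select_node(["column", "table", "condition"])
sampled = select_node(["column", "table", "condition"], "random", random.Random(0))
```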
The generic decoding algorithm initializes the AST with a root and maintains a beam of $K$ partial trees. At each step, candidate expansions are produced for each active frontier node by predicting the action distribution $P(a_j \mid a_{<j}, X)$ over all valid actions for that node. Candidates are scored and the top $K$ are retained as the beam for the next step. Because node selection is decoupled from action prediction, the same procedure works with any of the selection strategies above during both training and inference.
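One beam step can be sketched as below, assuming each hypothesis is a pair of cumulative log-probability and action history. The `expand` callback stands in for the decoder's prediction over valid actions; `toy_expand` and its fixed probabilities are hypothetical, and handling of finished trees is omitted for brevity.

```python
import heapq
import math

def beam_step(beams, expand, beam_size):
    """Expand every partial hypothesis and keep the `beam_size` best.

    beams: list of (log_prob, actions) tuples.
    expand: callable mapping an action history to {action: probability},
            restricted to actions valid for the current frontier node.
    """
    candidates = []
    for log_prob, actions in beams:
        for action, prob in expand(actions).items():
            candidates.append((log_prob + math.log(prob), actions + [action]))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

def toy_expand(actions):
    # Hypothetical fixed distribution standing in for the model's output.
    return {"ApplyRule": 0.6, "SelectItem": 0.3, "GenToken": 0.1}

beam = [(0.0, [])]                      # start from the root with log-prob 0
beam = beam_step(beam, toy_expand, 2)   # after one step
beam2 = beam_step(beam, toy_expand, 2)  # after two steps
```

Scores are summed in log space so that long action sequences do not underflow.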
5. Action Types and Output Integration
At each decoding step, the output module chooses from three action types tailored to grammar-based SQL AST construction:
- ApplyRule: Expands a nonterminal via a production rule.
- SelectItem: Selects schema elements (e.g., tables, columns).
- GenToken: Generates terminal tokens as required.
Each action is conditioned on the decoded representation $h_j$ of the current frontier node. The symbolic AST is updated accordingly, ensuring syntactic well-formedness throughout the decoding process.
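How the three action types update the symbolic tree can be sketched with a toy AST. Everything here is hypothetical: the `Node` class, the single-rule grammar, and the scripted action sequence that stands in for model predictions. The frontier is walked depth-first and left to right, which is an assumption about traversal order.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_type: str
    value: object = None                      # rule name, schema item, or token
    children: list = field(default_factory=list)
    expanded: bool = False

# Hypothetical toy grammar: rule name -> child node types, in field order.
RULES = {"select_from": ["column", "table"]}

def frontier(root):
    """Return unexpanded nodes in depth-first, left-to-right order."""
    if not root.expanded:
        return [root]
    found = []
    for child in root.children:
        found.extend(frontier(child))
    return found

def apply_action(node, kind, arg):
    """Apply one action to a frontier node and mark it as expanded."""
    if kind == "ApplyRule":
        node.value = arg
        node.children = [Node(t) for t in RULES[arg]]  # new frontier nodes
    elif kind in ("SelectItem", "GenToken"):
        node.value = arg                               # leaf: no children added
    else:
        raise ValueError(f"unknown action type: {kind}")
    node.expanded = True

# Scripted actions standing in for the output module's predictions.
root = Node("sql")
script = [("ApplyRule", "select_from"), ("SelectItem", "name"), ("SelectItem", "singer")]
for kind, arg in script:
    apply_action(frontier(root)[0], kind, arg)
```

Since children are only ever created from grammar rules, every tree reachable through `apply_action` conforms to the grammar, and decoding terminates once `frontier(root)` is empty.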
6. Empirical Evaluation
Extensive experiments on five text-to-SQL benchmarks demonstrate that ASTormer surpasses previous RNN-based grammar decoders in both effectiveness—measured by execution accuracy and logical consistency—and efficiency, owing to its parallelizable and structure-aware Transformer design. The architecture is fully compatible with state-of-the-art encoder modules and supports a wide range of traversal and node selection variants (Cao et al., 2023).
A plausible implication is that ASTormer's explicit encoding of tree structure through both absolute and relative signals enables superior syntactic generalization and increases decoding efficiency compared to approaches without such priors.