ASTormer: Structure-Aware Transformer for Text-to-SQL

Updated 17 April 2026
  • ASTormer is an AST structure-aware Transformer decoder that enhances text-to-SQL generation by integrating explicit tree structure with node type, parent rule, and depth embeddings.
  • It employs both absolute and relative position embeddings to capture detailed structural features, thereby increasing syntactic generalization and execution accuracy.
  • The model supports multiple traversal and node selection strategies, demonstrating superior effectiveness and efficiency across five popular text-to-SQL benchmarks.

ASTormer is an AST structure-aware Transformer decoder designed for text-to-SQL tasks, where the objective is to generate an executable SQL program given a natural language utterance and a database schema. ASTormer addresses the limitations of traditional grammar-based recurrent decoders by leveraging Transformer-based architectures tailored to incorporate explicit structural knowledge of the abstract syntax tree (AST) underlying SQL queries. The framework integrates node type, tree position, and parent rule—providing richer structure priors—while supporting multiple traversal strategies for AST generation. Empirical results on five text-to-SQL benchmarks demonstrate ASTormer's gains in both effectiveness and efficiency over competitive RNN-based and Transformer baselines (Cao et al., 2023).

1. Architectural Overview

ASTormer replaces RNN-based decoders, such as those utilizing LSTM cells with parent feeding (e.g., TranX), with an autoregressive Transformer decoding module. The overall architecture comprises two main components:

  • Encoder: Any graph- or sequence-based encoder (e.g., RAT-SQL) processes the user question $Q$, table names $T$, and column names $C$. This produces an encoder memory matrix $X \in \mathbb{R}^{(|Q|+|T|+|C|) \times d}$, where $d$ is the embedding dimension.
  • ASTormer Decoder: At each decoding step $j$, an unexpanded “frontier” node $n_j$ in the partially constructed AST $y_{j-1}^a$ is selected. The input embedding for $n_j$ is formed by summing the previous action embedding $a_{j-1} \in \mathbb{R}^d$, the node-type embedding, the parent-rule embedding, and the node depth embedding, followed by Layer Normalization:

$\mathrm{LayerNorm}\big(a_{j-1} + e_{\text{type}}(n_j) + e_{\text{rule}}(n_j) + e_{\text{depth}}(n_j)\big)$

Here $e_{\text{type}}(n_j)$, $e_{\text{rule}}(n_j)$, and $e_{\text{depth}}(n_j)$ stand for the node-type, parent-rule, and depth embeddings of $n_j$. This embedding is passed through a stack of ASTormer layers, each containing masked multi-head self-attention, cross-attention to encoder memory, and a feedforward network, yielding the final representation of $n_j$. The output module uses this representation to predict the next action: ApplyRule, SelectItem, or GenToken. The action is then applied to expand the AST, and the process repeats until no frontier nodes remain.

2. AST Structural Encoding

2.1 Absolute Position Embeddings

ASTormer encodes critical node-level signals within the tree structure using absolute position embeddings:

  • Node type: Embeds the syntactic type of the frontier node $n_j$.
  • Parent rule: Embeds the production rule applied at the parent of $n_j$.
  • Node depth: Encodes the depth of $n_j$ in the tree.

The input embedding at each decoding step is thus the sum of the previous action embedding and these three absolute signals, followed by Layer Normalization.
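
The sum-then-normalize step can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the toy lookup tables, the dimension of 4, and the function names `layer_norm` and `input_embedding` are assumptions, and LayerNorm is shown without its learned gain and bias.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def input_embedding(prev_action, type_emb, rule_emb, depth_emb):
    """Sum the four d-dimensional signals element-wise, then apply LayerNorm."""
    summed = [a + t + r + d
              for a, t, r, d in zip(prev_action, type_emb, rule_emb, depth_emb)]
    return layer_norm(summed)

# Hypothetical toy lookup tables with d = 4 (values are arbitrary).
TYPE_TABLE = {"sql": [0.1, 0.2, 0.3, 0.4], "select": [0.5, -0.1, 0.0, 0.2]}
RULE_TABLE = {"sql -> select from": [0.0, 0.3, -0.2, 0.1]}
DEPTH_TABLE = {0: [0.0, 0.0, 0.0, 0.0], 1: [0.2, 0.1, -0.1, 0.3]}

prev_action = [0.4, -0.3, 0.2, 0.1]
x = input_embedding(
    prev_action,
    TYPE_TABLE["select"],
    RULE_TABLE["sql -> select from"],
    DEPTH_TABLE[1],
)
```

Because every signal is added into the same $d$-dimensional vector, the decoder input keeps a fixed width no matter how many structural features are attached to a node.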

2.2 Relative Position Embeddings

For every pair of decoder positions $(j, i)$, a relative position tuple is constructed from the corresponding AST nodes $n_j$ and $n_i$ as follows:

  • Let LCA denote the lowest common ancestor of $n_j$ and $n_i$ in the AST.
  • Take the path length from $n_j$ to the LCA and the path length from $n_i$ to the LCA, each clamped to a preset maximum.

A lookup table maps each such pair of path lengths to a bias vector, which is then used in the self-attention computation to modify attention scores in a structure-aware manner. The encoding captures ancestry and sibling relations: for example, the pair $(1, 0)$ means $n_j$ is a child of $n_i$, and $(1, 1)$ means the two nodes are siblings.
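
The LCA-based relative position can be sketched as follows. This is an illustrative reading of the construction, assuming the tuple is the pair of path lengths from each node to the LCA; the child-to-parent map `PARENT`, the toy tree, and the function names are hypothetical.

```python
def path_to_root(node, parent):
    """Return [node, parent(node), ..., root] using a child -> parent map."""
    path = [node]
    while parent[path[-1]] is not None:
        path.append(parent[path[-1]])
    return path

def relative_position(a, b, parent, max_dist):
    """Return (dist(a, LCA), dist(b, LCA)), each clamped to max_dist."""
    path_a = path_to_root(a, parent)
    index_in_a = {n: k for k, n in enumerate(path_a)}
    # Walk up from b; the first node that also lies on a's path is the LCA.
    for dist_b, n in enumerate(path_to_root(b, parent)):
        if n in index_in_a:
            return (min(index_in_a[n], max_dist), min(dist_b, max_dist))
    raise ValueError("nodes are not in the same tree")

# Hypothetical toy AST: sql -> {select, from}; select -> {col}; col -> {tok}
PARENT = {"sql": None, "select": "sql", "from": "sql", "col": "select", "tok": "col"}

child_of = relative_position("select", "sql", PARENT, max_dist=4)   # (1, 0)
siblings = relative_position("select", "from", PARENT, max_dist=4)  # (1, 1)
clamped = relative_position("tok", "from", PARENT, max_dist=2)      # (2, 1)
```

Clamping bounds the number of distinct tuples, so the bias lookup table stays small even for deep trees.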

3. Attention Mechanisms and Modifications

ASTormer employs a modified masked multi-head self-attention module augmented with structural bias. Within each head, the attention logit for a pair of decoder positions is the scaled query-key score adjusted by the relative position bias for that pair. The masked logits are then normalized with a softmax, and the per-pair computation can be written compactly in matrix form using projections of the relative bias. Cross-attention to encoder memory $X$ is unchanged from standard Transformer formulations.
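
One common way to inject a per-pair bias vector into self-attention is to add it to the key before the dot product, in the style of relative position representations (Shaw et al.). The text here does not confirm that ASTormer uses exactly this form, so the single-head sketch below is an assumption for illustration only; `biased_causal_attention` and the hand-written `BIAS` table are hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def biased_causal_attention(queries, keys, values, rel_bias):
    """Single-head masked attention; rel_bias[j][i] is added to key i for query j.

    Position j attends only to positions i <= j (causal mask).
    """
    d_k = len(queries[0])
    outputs, weights = [], []
    for j, q in enumerate(queries):
        logits = []
        for i in range(j + 1):  # causal mask: future positions are skipped
            biased_key = [k + b for k, b in zip(keys[i], rel_bias[j][i])]
            logits.append(dot(q, biased_key) / math.sqrt(d_k))
        alpha = softmax(logits)
        out = [sum(alpha[i] * values[i][c] for i in range(j + 1))
               for c in range(len(values[0]))]
        weights.append(alpha)
        outputs.append(out)
    return outputs, weights

# Toy inputs: 3 decoder positions, head dimension 2 (arbitrary values).
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
zero = [0.0, 0.0]
# rel_bias[j][i] would come from the relative-position lookup table; here it is hand-written.
BIAS = [[zero, zero, zero], [[0.0, 2.0], zero, zero], [zero, zero, zero]]

outputs, weights = biased_causal_attention(Q, K, V, BIAS)
```

In this toy run the bias on the pair (1, 0) raises the logit of position 0 above that of position 1 for the second query, which shows how a structural relation can redirect attention independently of content.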

4. Decoding Algorithm and Traversal Strategies

ASTormer supports multiple traversal paradigms and node selection strategies to maintain flexibility and performance:

  • Frontier Management:
    • DFS (Depth-First Search): Use a stack; the children of the node just expanded are pushed onto it.
    • BFS (Breadth-First Search): Use a queue; the children of the node just expanded are appended to its end.
  • Node Selection:
    • L2R (Left-to-Right): Deterministically choose nodes following the schema order.
    • Random: Uniformly sample one node during training, with all options expanded in the beam during inference.
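
The difference between the two frontier disciplines can be seen on a small fixed tree. The sketch below is illustrative only: the toy `CHILDREN` map is hypothetical, and in real decoding the children of a node become known only after the model predicts an action for it.

```python
from collections import deque

# Hypothetical toy AST given as node -> ordered children.
CHILDREN = {
    "sql": ["select", "from"],
    "select": ["col1", "col2"],
    "from": ["table"],
    "col1": [], "col2": [], "table": [],
}

def expansion_order(root, strategy):
    """Return the order in which nodes are expanded under DFS or BFS with L2R selection."""
    frontier = deque([root])
    order = []
    while frontier:
        if strategy == "dfs":
            node = frontier.pop()        # stack: take the most recently pushed node
            # Push children reversed so the leftmost child is expanded first.
            frontier.extend(reversed(CHILDREN[node]))
        elif strategy == "bfs":
            node = frontier.popleft()    # queue: take the oldest node
            frontier.extend(CHILDREN[node])
        else:
            raise ValueError(f"unknown strategy: {strategy}")
        order.append(node)
    return order

dfs_order = expansion_order("sql", "dfs")
bfs_order = expansion_order("sql", "bfs")
```

DFS finishes the whole `select` subtree before touching `from`, while BFS expands both clauses before any column.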

The generic decoding algorithm initializes the AST with a root node and maintains a beam of partial trees of fixed size. At each step, candidate expansions are produced for each active frontier node by predicting the action distribution over all valid actions for that node. Candidates are scored, and only the top-scoring hypotheses, up to the beam size, are retained for the next step. The same procedure therefore accommodates the different traversal and node selection regimes used in training and inference.
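
A single beam-pruning step can be sketched as below. This is a generic beam search step under stated assumptions, not the paper's code: hypotheses are scored by summed log-probability, and `toy_propose` is a hypothetical stand-in for the decoder's action distribution.

```python
import heapq
import math

def beam_step(beams, propose, beam_size):
    """Expand every partial hypothesis and keep the beam_size best by total log-probability.

    beams: list of (log_prob, actions) pairs.
    propose: function mapping an action list to [(action, prob), ...] valid next actions.
    """
    candidates = []
    for log_prob, actions in beams:
        for action, prob in propose(actions):
            candidates.append((log_prob + math.log(prob), actions + [action]))
    return heapq.nlargest(beam_size, candidates, key=lambda c: c[0])

def toy_propose(actions):
    """Hypothetical stand-in for the decoder: a fixed distribution over two actions."""
    return [("ApplyRule[a]", 0.6), ("ApplyRule[b]", 0.4)]

beams = [(0.0, [])]
beams = beam_step(beams, toy_propose, beam_size=2)
beams = beam_step(beams, toy_propose, beam_size=2)
```

In a full decoder, `propose` would also restrict candidates to the actions that are grammatically valid for the selected frontier node.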

5. Action Types and Output Integration

At each decoding step, the output module chooses from three action types tailored to grammar-based SQL AST construction:

  • ApplyRule: Expands a nonterminal via a production rule.
  • SelectItem: Selects schema elements (e.g., tables, columns).
  • GenToken: Generates terminal tokens as required.

Each action is conditioned on the decoded node representation produced by the final ASTormer layer. The symbolic AST is updated after every action, which keeps the partial program syntactically well-formed throughout decoding.
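
How the three action types update a symbolic AST can be sketched with a small data structure. Everything named here (`Node`, `RULES`, `apply_action`, the one-rule grammar, and the schema items) is hypothetical, and GenToken is simplified to emit a single token.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    """An AST node; `value` holds the applied rule or leaf content, `children` the expansion."""
    type: str
    value: object = None
    children: list = field(default_factory=list)

# Hypothetical toy grammar: rule name -> child node types.
RULES = {"sql -> select from": ["col", "tab"]}

def apply_action(node, kind, arg):
    """Apply one of the three action kinds to a frontier node; return new frontier nodes."""
    if kind == "ApplyRule":
        node.value = arg
        node.children = [Node(t) for t in RULES[arg]]
        return node.children
    if kind in ("SelectItem", "GenToken"):
        node.value = arg  # schema item or literal token; leaves add no frontier nodes
        return []
    raise ValueError(f"unknown action kind: {kind}")

root = Node("sql")
frontier = apply_action(root, "ApplyRule", "sql -> select from")
apply_action(frontier[0], "SelectItem", "singer.name")
apply_action(frontier[1], "SelectItem", "singer")
```

Since only ApplyRule creates children, and only those listed by the chosen rule, every intermediate tree conforms to the grammar by construction.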

6. Empirical Evaluation

Extensive experiments on five text-to-SQL benchmarks demonstrate that ASTormer surpasses previous RNN-based grammar decoders in both effectiveness—measured by execution accuracy and logical consistency—and efficiency, owing to its parallelizable and structure-aware Transformer design. The architecture is fully compatible with state-of-the-art encoder modules and supports a wide range of traversal and node selection variants (Cao et al., 2023).

A plausible implication is that ASTormer's explicit encoding of tree structure through both absolute and relative signals enables superior syntactic generalization and increases decoding efficiency compared to approaches without such priors.
