
Hierarchical Transformer-Based Pointer Network

Updated 5 December 2025
  • The paper introduces an advanced neural architecture that combines a Transformer backbone with causal hierarchical attention adapters and a pointer mechanism for enhanced structure modeling.
  • The integrated pointer network explicitly resolves coreference and reentrancy by pointing to antecedent positions, improving accuracy in structured sequence-to-sequence tasks.
  • Empirical results show state-of-the-art performance on AMR benchmarks and robust generalization to out-of-domain data, highlighting its practical impact on graph parsing.

A Hierarchical Transformer-Based Pointer Network is an advanced neural architecture designed to integrate structural locality and explicit coreference resolution into sequence-to-sequence tasks, exemplified by the CHAP model for AMR parsing. CHAP (Causal Hierarchical Attention and Pointers) combines a pre-trained Transformer backbone with novel architectural components: causal hierarchical attention adapters and an integrated pointer mechanism. These enhancements facilitate the modeling of hierarchical, tree-like structures while explicitly capturing coreference relations through targeted pointing operations. The architecture is optimized for generating linearized graph representations, such as Abstract Meaning Representation (AMR), achieving high fidelity in both structure prediction and reentrancy annotation (Lou et al., 2023).

1. Model Architecture

The architecture builds upon an off-the-shelf BART encoder, which converts the source sentence $x$ into a hidden-state sequence $H^E = (h^E_1, \ldots, h^E_L)$. The decoder is initialized from pretrained BART and is architecturally augmented in several ways:

  • Pointer-Encoder Module: For each target position $i$, the input incorporates standard token embeddings $E_t(t_i)$, positional embeddings $E_{pos}(i)$, and a learned pointer embedding $E_p(p_i)$. The pointer target $p_i$ is $-1$ for non-pointing tokens or indexes an antecedent position otherwise, and is embedded as

$$E_p(p_i) = \begin{cases} \mathbf{0}, & p_i = -1 \\ \mathrm{MLP}_p([E_t(t_{p_i}); E_{pos}(p_i)]), & p_i \ge 0 \end{cases}$$

yielding the per-position input $h^0_i = E_t(t_i) + E_{pos}(i) + E_p(p_i)$.

  • Causal Hierarchical Attention (CHA) Adapters: Each decoder layer receives an adapter block that applies hierarchical attention masks $M_{\mathrm{CHA}}$ implementing tree-structured visibility constraints, either in parallel with or after the standard self-attention.
  • Pointer-Net Readout: During decoding, a designated subset $H_{\mathrm{ptr}}$ of self-attention heads (e.g., 4 heads in the top layer) is tasked with computing raw attention logits $A^h_i \in \mathbb{R}^N$ over previous positions $j = 1, \ldots, i-1$. The pointer distribution is

$$P_{\mathrm{ptr}}(j \mid i) = \frac{1}{|H_{\mathrm{ptr}}|} \sum_{h \in H_{\mathrm{ptr}}} \mathrm{softmax}(A^h_i)_j \quad (1 \le j < i)$$

with the remainder of the decoder heads producing the standard token vocabulary softmax.

This modular structure enables the decoder to jointly model both token sequence prediction and hierarchical pointer-based reentrancy.
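As a concrete illustration of the pointer-encoder input above, the following is a minimal PyTorch sketch of $h^0_i = E_t(t_i) + E_{pos}(i) + E_p(p_i)$; the class name, dimensions, and MLP shape are illustrative assumptions, not details of the CHAP codebase.

```python
import torch
import torch.nn as nn

class PointerAwareDecoderInput(nn.Module):
    """Builds h^0_i = E_t(t_i) + E_pos(i) + E_p(p_i) for the linearized target.

    Illustrative sketch only: dimensions, module names, and the MLP shape are
    assumptions, not taken from the CHAP implementation.
    """

    def __init__(self, vocab_size: int = 50265, max_len: int = 1024, d_model: int = 768):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        # MLP_p maps the concatenation [E_t(t_{p_i}); E_pos(p_i)] back to d_model.
        self.pointer_mlp = nn.Sequential(
            nn.Linear(2 * d_model, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )

    def forward(self, tokens: torch.Tensor, pointers: torch.Tensor) -> torch.Tensor:
        # tokens:   (batch, seq) target token ids
        # pointers: (batch, seq) antecedent index p_i, or -1 for non-pointing tokens
        batch, seq = tokens.shape
        positions = torch.arange(seq, device=tokens.device).expand(batch, seq)

        tok = self.token_emb(tokens)          # E_t(t_i)
        pos = self.pos_emb(positions)         # E_pos(i)

        # Look up the antecedent's token/position embeddings. Clamp -1 to 0 so the
        # gather is valid, then zero the result where p_i = -1 (E_p(p_i) = 0 there).
        safe_ptr = pointers.clamp(min=0)
        ant_tok = torch.gather(tok, 1, safe_ptr.unsqueeze(-1).expand_as(tok))
        ant_pos = self.pos_emb(safe_ptr)
        ptr = self.pointer_mlp(torch.cat([ant_tok, ant_pos], dim=-1))
        ptr = ptr * (pointers >= 0).unsqueeze(-1).to(ptr.dtype)

        return tok + pos + ptr                # h^0_i
```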

2. Causal Hierarchical Attention Mechanism

Causal Hierarchical Attention (CHA) is a masking paradigm applied at the decoder layer to enforce explicit hierarchical structure and local token dependencies. During sequence generation, an explicit stack $S$ is maintained, representing the currently "visible" positions:

  • Expand Step: For non-compositional tokens, the current index $i$ is pushed onto $S$, and $M_{\mathrm{CHA}}[i,j] = 0$ is set for all $j \in S$, allowing attention to the entire stack.
  • Compose Step: For a subtree-completion token (denoted ")"), the stack is popped until the matching opening token is reached (position $k$). $M_{\mathrm{CHA}}[i,k] = 0$ and $M_{\mathrm{CHA}}[i,j] = 0$ are set for all $k < j < i$, after which $i$ is pushed onto $S$.
  • Mask Structure: All other entries remain at $-\infty$ to block attention (a procedural sketch of the mask construction follows below).
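The following is a minimal PyTorch sketch of how a top-down CHA mask could be constructed from a parenthesized linearization using the expand/compose steps just described; the bracket convention, the inclusion of self-attention, and the function name are assumptions rather than the paper's exact recipe.

```python
import torch

def build_cha_mask(tokens, open_tok="(", close_tok=")"):
    """Additive causal hierarchical attention mask for a parenthesized linearization.

    Returns an (N, N) tensor: 0 where attention is allowed, -inf elsewhere.
    Sketch of the top-down expand/compose procedure; the actual linearization and
    the single/double/bottom-up variants differ in detail.
    """
    n = len(tokens)
    mask = torch.full((n, n), float("-inf"))
    stack = []  # currently "visible" positions

    for i, tok in enumerate(tokens):
        if tok == close_tok:
            # Compose: pop until the matching opening token at position k, then
            # attend to k and to the positions generated inside the subtree.
            while stack and tokens[stack[-1]] != open_tok:
                stack.pop()
            k = stack.pop() if stack else 0
            mask[i, k:i] = 0.0
            mask[i, i] = 0.0  # self-attention kept as well (assumption)
            stack.append(i)   # the composed subtree is now a single visible item
        else:
            # Expand: push the current index, then attend to the whole stack.
            stack.append(i)
            for j in stack:
                mask[i, j] = 0.0

    return mask

# Example: a toy linearization "( a ( b ) )" yields a 6x6 mask.
print(build_cha_mask("( a ( b ) )".split()))
```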

Multiple CHA variants are implemented:

  • Top-Down ("↓↑single"): Standard compose/expand as described.
  • Top-Down Double ("↓↑double"): Compose and expand separated into distinct steps.
  • Bottom-Up ("↑"): Uses pointer-based subtree boundaries, with special tokens marking subtrees.

Within each decoder layer, a subset of attention heads operates under CHA-masked self-attention, preserving tree structure locality and reducing spurious dependencies during graph linearization.

3. Decoder Pointer Mechanism

At each generation step ii, the decoder yields two distributions:

  • Token Distribution: $P_{\mathrm{tok}}(t_i \mid t_{<i}, x)$, produced via the usual self-attention and cross-attention pathway.
  • Pointer Distribution: $P_{\mathrm{ptr}}(\cdot \mid i)$, the average over the specialized pointer heads.

Pointer supervision is applied only to positions with pointers ($p_i \ge 0$), using the loss:

$$L_{\mathrm{pointer}} = - \sum_{i : p_i \ge 0} \log P_{\mathrm{ptr}}(p_i \mid i)$$

Thus, the model is incentivized to correctly identify and point to antecedent positions in the output sequence corresponding to coreferential nodes or reentrant subgraphs.
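A minimal PyTorch sketch of the pointer readout and $L_{\mathrm{pointer}}$, assuming the raw logits of the designated pointer heads are available as a single tensor; the tensor layout and masking convention are assumptions.

```python
import torch
import torch.nn.functional as F

def pointer_loss(ptr_logits: torch.Tensor, pointer_targets: torch.Tensor):
    """Average the designated pointer heads and compute L_pointer.

    ptr_logits:      (batch, H_ptr, N, N) raw attention logits A^h_i, with future
                     and invalid positions assumed to be pre-masked to -inf.
    pointer_targets: (batch, N) antecedent index p_i, or -1 where no pointer exists.
    Returns (P_ptr, L_pointer).
    """
    # P_ptr(j | i): per-head softmax over positions j, averaged over pointer heads.
    p_ptr = F.softmax(ptr_logits, dim=-1).mean(dim=1)                    # (batch, N, N)

    # Supervise only positions with p_i >= 0, as in the loss above.
    has_ptr = (pointer_targets >= 0).to(p_ptr.dtype)                     # (batch, N)
    safe_targets = pointer_targets.clamp(min=0)
    picked = torch.gather(p_ptr, -1, safe_targets.unsqueeze(-1)).squeeze(-1)
    loss = -(torch.log(picked.clamp_min(1e-9)) * has_ptr).sum()
    return p_ptr, loss
```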

4. Training Protocol and Objective

Model parameters are updated to minimize a weighted sum of the standard sequence cross-entropy and the pointer loss:

$$L = L_{\mathrm{seq2seq}} + \alpha L_{\mathrm{pointer}}$$

where

$$L_{\mathrm{seq2seq}} = - \sum_{i=1}^N \log P_{\mathrm{tok}}(t_i \mid t_{<i}, x)$$

CHAP uses $\alpha = 0.075$ to balance the pointer-loss component. Training is conducted for 50,000 steps with batch size 16, the AdamW optimizer, and a cosine learning-rate schedule with 5,000 warm-up steps. The base model leverages BART-base, modifying all 12 decoder layers; the large model applies CHA adapters only to the top 2 decoder layers.
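A sketch of the objective and optimization setup under these hyperparameters; the learning rate and helper names are assumptions, and `get_cosine_schedule_with_warmup` from Hugging Face transformers is one way to realize the stated warm-up/cosine schedule.

```python
import torch
from transformers import get_cosine_schedule_with_warmup

ALPHA = 0.075  # pointer-loss weight reported for CHAP


def total_loss(l_seq2seq: torch.Tensor, l_pointer: torch.Tensor) -> torch.Tensor:
    """L = L_seq2seq + alpha * L_pointer."""
    return l_seq2seq + ALPHA * l_pointer


def make_optimizer_and_schedule(model: torch.nn.Module,
                                lr: float = 3e-5,        # learning rate is an assumption
                                total_steps: int = 50_000,
                                warmup_steps: int = 5_000):
    """AdamW with a cosine learning-rate schedule and 5k warm-up steps."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer, num_warmup_steps=warmup_steps, num_training_steps=total_steps
    )
    return optimizer, scheduler
```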

5. Empirical Results and Evaluation

CHAP is evaluated on AMR 2.0 and 3.0 benchmarks (standard splits: train, dev, test), as well as out-of-domain (OOD) sets including The Little Prince (TLP), BioAMR, and New3. The main metric is Smatch, averaged over 3 runs. Fine-grained metrics include NoWSD, Wikification, Concepts, NER, Negation, Unlabeled, Reentrancy, and SRL (all F1 scores).

Key experimental findings:

  • CHAP achieves state-of-the-art or near state-of-the-art Smatch scores on in-domain AMR 2.0/3.0 without external alignment or silver data.
  • Competitive out-of-domain generalization is observed (TLP, BioAMR).
  • The CHA mask and explicit pointer network are critical for modeling recursive, tree-structured locality and coreference.

6. Significance and Context

Translation-based AMR parsers have previously sidestepped explicit structure modeling, treating target graphs as free text. This approach lacks inductive bias for structural locality and introduces suboptimal handling of coreferences via token insertions. CHAP’s architectural advances—especially the CHA masking and pointer mechanism—directly address these limitations, enabling Transformer decoders to represent graph-structured data with hierarchical locality and explicit reentrancy.

A plausible implication is that similar hierarchical pointer-based enhancements can be adapted to other structured prediction tasks where recursive dependencies and coreferential links are prominent. The empirical performance on both in-domain and OOD benchmarks highlights the robustness and generalizability of integrating CHA and pointer modules in Transformer architectures (Lou et al., 2023).
