Hierarchical Transformer-Based Pointer Network

Updated 5 December 2025

The paper introduces an advanced neural architecture that combines a Transformer backbone with causal hierarchical attention adapters and a pointer mechanism for enhanced structure modeling.
The integrated pointer network explicitly resolves coreference and reentrancy by pointing to antecedent positions, improving accuracy in structured sequence-to-sequence tasks.
Empirical results show state-of-the-art performance on AMR benchmarks and robust generalization to out-of-domain data, highlighting its practical impact on graph parsing.

A Hierarchical Transformer-Based Pointer Network is an advanced neural architecture designed to integrate structural locality and explicit coreference resolution into sequence-to-sequence tasks, exemplified by the CHAP model for AMR parsing. CHAP (Causal Hierarchical Attention and Pointers) combines a pre-trained Transformer backbone with novel architectural components: causal hierarchical attention adapters and an integrated pointer mechanism. These enhancements facilitate the modeling of hierarchical, tree-like structures while explicitly capturing coreference relations through targeted pointing operations. The architecture is optimized for generating linearized graph representations, such as Abstract Meaning Representation (AMR), achieving high fidelity in both structure prediction and reentrancy annotation (Lou et al., 2023).

1. Model Architecture

The architecture builds upon an off-the-shelf BART encoder, which converts the source sentence $x$ into a hidden-state sequence $H^E = (h^E_1, \ldots, h^E_L)$ . The decoder is initialized from pretrained BART and is architecturally augmented in several ways:

Pointer-Encoder Module: For each target position $i$ , the input incorporates standard token embeddings $E_t(t_i)$ , positional embeddings $E_{pos}(i)$ , and a learned pointer embedding $E_p(p_i)$ . The pointer target $p_i$ is $-1$ for non-pointing tokens or indexes an antecedent position otherwise, and is embedded as

$E_p(p_i) = \begin{cases} \mathbf{0}, & p_i = -1 \ \mathrm{MLP}_p([E_t(t_{p_i}); E_{pos}(p_i)]), & p_i \ge 0 \end{cases}$

yielding per-position input $h^0_i = E_t(t_i) + E_{pos}(i) + E_p(p_i)$ .

Causal Hierarchical Attention (CHA) Adapters: Each decoder layer receives an adapter block that applies hierarchical attention masks $M_{\mathrm{CHA}}$ implementing tree-structured visibility constraints, either in parallel or following standard self-attention.
Pointer-Net Readout: During decoding, a specified subset $H_{\mathrm{ptr}}$ of self-attention heads (e.g., 4 heads in the top layer) is tasked with computing raw attention logits $A^h_i \in \mathbb{R}^N$ over previous positions $j=1\ldots i-1$ . The pointer distribution is

$P_{\mathrm{ptr}}(j \mid i) = \frac{1}{|H_{\mathrm{ptr}}|} \sum_{h \in H_{\mathrm{ptr}}} \mathrm{softmax}(A^h_i)_j \quad (1 \le j < i)$

with the remainder of the decoder heads producing the standard token vocabulary softmax.

This modular structure enables the decoder to jointly model both token sequence prediction and hierarchical pointer-based reentrancy.

2. Causal Hierarchical Attention Mechanism

Causal Hierarchical Attention (CHA) is a masking paradigm applied at the decoder layer to enforce explicit hierarchical structure and local token dependencies. During sequence generation, an explicit stack $S$ is maintained, representing currently "visible" positions:

Expand Step: For non-compositional tokens, the current index $i$ is pushed onto $S$ , and $M_{\mathrm{CHA}}[i,j]=0$ is set for all $j \in S$ , allowing attention to the entire stack.
Compose Step: For a subtree completion token (denoted by ")", the stack is popped until the matching opening token is reached (position $k$ ). $M_{\mathrm{CHA}}[i,k]=0$ and $M_{\mathrm{CHA}}[i,j]=0$ for all $k < j < i$ , after which $i$ is pushed onto $S$ .
Mask Structure: All other entries remain at $-\infty$ to block attention.

Multiple CHA variants are implemented:

Top-Down ("↓↑single"): Standard compose/expand as described.
Top-Down Double ("↓↑double"): Compose and expand separated into distinct steps.
Bottom-Up ("↑"): Uses pointer-based subtree boundaries, with special tokens marking subtrees.

Within each decoder layer, a subset of attention heads operates under CHA-masked self-attention, preserving tree structure locality and reducing spurious dependencies during graph linearization.

3. Decoder Pointer Mechanism

At each generation step $i$ , the decoder yields two distributions:

Token Distribution: $P_{\mathrm{tok}}(t_i \mid t_{<i}, x)$ produced via the usual self-attention and cross-attention pathway.
Pointer Distribution: $P_{\mathrm{ptr}}(. \mid i)$ as the average over specialized pointer heads.

Pointer supervision is applied only to positions with pointers ( $p_i \ge 0$ ), using the loss:

$L_{\mathrm{pointer}} = - \sum_{i:p_i\ge0} \log P_{\mathrm{ptr}}(p_i \mid i)$

Thus, the model is incentivized to correctly identify and point to antecedent positions in the output sequence corresponding to coreferential nodes or reentrant subgraphs.

4. Training Protocol and Objective

Model parameters are updated to minimize a weighted sum of the standard sequence cross-entropy and the pointer loss:

$L = L_{\mathrm{seq2seq}} + \alpha L_{\mathrm{pointer}}$

where

$L_{\mathrm{seq2seq}} = - \sum_{i=1}^N \log P_{\mathrm{tok}}(t_i \mid t_{<i}, x)$

CHAP uses $\alpha=0.075$ to balance the pointer loss component. Training is conducted for 50,000 steps with batch size 16, AdamW optimizer, and a cosine learning rate schedule with 5,000 warm-up steps. The base model leverages BART-base, modifying all 12 decoder layers; the large model applies CHA adapters only to the top 2 decoder layers.

5. Empirical Results and Evaluation

CHAP is evaluated on AMR 2.0 and 3.0 benchmarks (standard splits: train, dev, test), as well as out-of-domain (OOD) sets including The Little Prince (TLP), BioAMR, and New3. The main metric is Smatch, averaged over 3 runs. Fine-grained metrics include NoWSD, Wikification, Concepts, NER, Negation, Unlabeled, Reentrancy, and SRL (all F1 scores).

Key experimental findings:

CHAP achieves state-of-the-art or near state-of-the-art Smatch scores on in-domain AMR 2.0/3.0 without external alignment or silver data.
Competitive out-of-domain generalization is observed (TLP, BioAMR).
The CHA mask and explicit pointer network are critical for modeling recursive, tree-structured locality and coreference.

6. Significance and Context

Translation-based AMR parsers have previously sidestepped explicit structure modeling, treating target graphs as free text. This approach lacks inductive bias for structural locality and introduces suboptimal handling of coreferences via token insertions. CHAP’s architectural advances—especially the CHA masking and pointer mechanism—directly address these limitations, enabling Transformer decoders to represent graph-structured data with hierarchical locality and explicit reentrancy.

A plausible implication is that similar hierarchical pointer-based enhancements can be adapted to other structured prediction tasks where recursive dependencies and coreferential links are prominent. The empirical performance on both in-domain and OOD benchmarks highlights the robustness and generalizability of integrating CHA and pointer modules in Transformer architectures (Lou et al., 2023).

PDF Markdown Chat (Pro)

References (1)

AMR Parsing with Causal Hierarchical Attention and Pointers (2023)

Whiteboard

Generate a whiteboard explanation of this topic.

Follow Topic

Get notified by email when new papers are published related to Hierarchical Transformer-Based Pointer Network.