TextGraphFuseGAT: Vietnamese Token Classification

Updated 21 December 2025
  • The paper presents TextGraphFuseGAT, which combines PhoBERT embeddings with a fully connected GAT layer and a Transformer decoder for token-level classification.
  • It demonstrates strong empirical performance on Vietnamese NER and disfluency detection tasks, outperforming PhoBERT-only baselines.
  • The model effectively captures global semantic context and explicit token relations, addressing challenges in domain-specific vocabularies and complex entity boundaries.

TextGraphFuseGAT is a neural architecture for Vietnamese token-level classification that combines pretrained transformer representations from PhoBERT with a fully connected Graph Attention Network (GAT) layer and a Transformer-style self-attention fusion mechanism. It is designed to capture both deep contextual embeddings and explicit token-level relational information, and has demonstrated strong empirical performance on several Vietnamese sequence-labeling benchmarks, including specialized medical and speech datasets (Nguyen, 13 Oct 2025).

1. Architecture Overview

TextGraphFuseGAT processes a sentence of $n$ tokens $(x_1, x_2, \dots, x_n)$ using a multi-stage neural pipeline. First, tokens are byte-pair encoded and passed through PhoBERT to yield final hidden representations $H^{(0)} = [h_1^{(0)}, h_2^{(0)}, \dots, h_n^{(0)}] \in \mathbb{R}^{n \times d}$, where $d = 1024$ is PhoBERT$_{\mathrm{large}}$'s hidden size. A fully connected graph $G = (V, E)$ over all $n$ tokens is constructed, forming the basis for relational modeling. A multi-head GAT layer operates on $H^{(0)}$, outputting graph-enhanced embeddings. These embeddings are further refined by a TransformerDecoderLayer incorporating both self- and cross-attention before final classification via a tokenwise softmax head.

This fusion allows the model to combine PhoBERT's global semantic context, token-to-token interactions from the GAT, and re-attention mechanisms from the Transformer component.
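A minimal PyTorch sketch of this pipeline is given below. Class and variable names are illustrative assumptions rather than the authors' released code; `FullyConnectedGAT` is a placeholder for the dense graph attention layer sketched under Section 3.

```python
import torch.nn as nn
from transformers import AutoModel

class TextGraphFuseGATSketch(nn.Module):
    """Hypothetical skeleton: PhoBERT -> dense GAT -> Transformer decoder -> token classifier."""
    def __init__(self, num_labels: int, fused_dim: int = 256, decoder_heads: int = 4):
        super().__init__()
        self.encoder = AutoModel.from_pretrained("vinai/phobert-large")      # d = 1024
        self.gat = FullyConnectedGAT(in_dim=1024, head_dim=32, num_heads=8)  # K * d' = 256
        self.decoder = nn.TransformerDecoderLayer(
            d_model=fused_dim, nhead=decoder_heads, batch_first=True
        )
        self.classifier = nn.Linear(fused_dim, num_labels)

    def forward(self, input_ids, attention_mask):
        h0 = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        h1 = self.gat(h0)                      # token-to-token relational attention
        h2 = self.decoder(tgt=h1, memory=h1)   # self- and cross-attention refinement
        return self.classifier(h2)             # per-token logits over the BIO label set
```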

2. Input Encoding and Graph Construction

The PhoBERT encoder tokenizes input sequences at the sub-word level and produces contextualized embeddings per token, indexed as $h_i^{(0)}$. Over these, TextGraphFuseGAT constructs a fully connected directed graph

$$V = \{1, \dots, n\}, \qquad E = \{(i, j) : i, j \in V\} \cup \{(i, i) : i \in V\}.$$

This results in an adjacency matrix $A \in \{0, 1\}^{n \times n}$ with $A_{ij} = 1$ for all $i, j$, forming a dense graph. No explicit adjacency normalization is performed at this stage; normalization is handled within the GAT layer.

The fully connected topology ensures that all token pairs are considered for relational attention, providing the capacity to encode sentence-wide dependencies beyond sequential context.
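As a small illustration of this construction (a sketch only; the paper does not prescribe a particular tensor layout):

```python
import torch

def fully_connected_adjacency(n: int) -> torch.Tensor:
    """Dense adjacency with A[i, j] = 1 for all i, j, self-loops included."""
    return torch.ones(n, n, dtype=torch.long)

# Equivalent edge list over V = {0, ..., n-1}, including every (i, i):
n = 4
edge_index = torch.cartesian_prod(torch.arange(n), torch.arange(n)).t()  # shape (2, n * n)
```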

3. Graph Attention and Transformer Fusion

The GAT layer applies $K$ attention heads to the token representations. For each head $k$, GAT computes

$$e_{ij}^{(k)} = \mathrm{LeakyReLU}\!\left( \mathbf{a}_k^\top \big[ W_k h_i^{(l)} \,\Vert\, W_k h_j^{(l)} \big] \right), \qquad \alpha_{ij}^{(k)} = \frac{\exp(e_{ij}^{(k)})}{\sum_{m=1}^{n} \exp(e_{im}^{(k)})},$$

$$\widetilde{h}_i^{(k, l+1)} = \sigma\!\left( \sum_{j=1}^{n} \alpha_{ij}^{(k)} W_k h_j^{(l)} \right).$$

The outputs from all heads are concatenated:

$$h_i^{(l+1)} = \big\Vert_{k=1}^{K} \widetilde{h}_i^{(k, l+1)} \in \mathbb{R}^{K d'}.$$

For NER tasks, $K = 8$ and $d' = 32$; for disfluency detection, $K = 4$ and $d' = 64$. These choices control the size of the input to the subsequent Transformer decoder ($D = K d'$).
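A from-scratch sketch of this dense multi-head graph attention is shown below. Splitting $\mathbf{a}_k$ into source and target halves to score all token pairs at once, and using ELU for $\sigma$, are implementation assumptions not fixed by the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FullyConnectedGAT(nn.Module):
    """Multi-head GAT over a fully connected token graph (dense n x n attention)."""
    def __init__(self, in_dim: int = 1024, head_dim: int = 32, num_heads: int = 8):
        super().__init__()
        self.W = nn.ModuleList(nn.Linear(in_dim, head_dim, bias=False) for _ in range(num_heads))
        self.a = nn.ParameterList(
            nn.Parameter(0.1 * torch.randn(2 * head_dim)) for _ in range(num_heads)
        )

    def forward(self, h):                        # h: (batch, n, in_dim)
        heads = []
        for W_k, a_k in zip(self.W, self.a):
            z = W_k(h)                           # (batch, n, d')
            d = z.size(-1)
            # e_ij = LeakyReLU(a_k^T [z_i || z_j]); split a_k to score all pairs at once.
            e = F.leaky_relu(
                (z @ a_k[:d]).unsqueeze(-1) + (z @ a_k[d:]).unsqueeze(-2)
            )                                    # (batch, n, n)
            alpha = torch.softmax(e, dim=-1)     # normalize over all neighbors j
            heads.append(F.elu(alpha @ z))       # sigma taken to be ELU (assumption)
        return torch.cat(heads, dim=-1)          # (batch, n, K * d')
```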

The graph-enhanced embeddings are then input to a standard PyTorch TransformerDecoderLayer, which applies multi-head self-attention and cross-attention, followed by residual connections, layer normalization, and a two-layer feed-forward network. This Transformer layer serves to further contextualize the graph outputs and model higher-order dependencies.
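For concreteness, a minimal usage sketch with the decoder configuration reported in Section 5 ($D = 256$, 4 heads); feeding the GAT output as both `tgt` and `memory` is an assumption consistent with the fusion description, not a detail stated in the paper.

```python
import torch
import torch.nn as nn

decoder = nn.TransformerDecoderLayer(d_model=256, nhead=4, batch_first=True)
gat_out = torch.randn(2, 128, 256)             # (batch, seq_len, K * d')
fused = decoder(tgt=gat_out, memory=gat_out)   # self-/cross-attention, residuals, LayerNorm, FFN
```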

4. Classification Module and Loss

The output of the Transformer decoder for each token, $h_i^{\mathrm{dec}}$, is passed through a linear projection and softmax:

$$y_i = \mathrm{softmax}(W_o h_i^{\mathrm{dec}} + b_o) \in \mathbb{R}^{C},$$

where $C$ is the number of BIO-labeled classes. Training uses the standard cross-entropy loss, summed over all tokens and classes while ignoring subword-continuation and padding indices:

$$\mathcal{L} = -\sum_{i=1}^{n} \sum_{c=1}^{C} \mathbf{1}\{y_i = c\} \log(\hat{y}_{i, c}).$$

The model is optimized end-to-end with the AdamW optimizer, using dropout of 0.3 on attention and hidden representations, weight decay of 0.01, gradient clipping at 1.0, and a warm-up ratio of 0.1.
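A hedged sketch of the loss and optimizer setup, assuming the common HuggingFace convention of marking ignored subword and padding positions with label -100 (the paper does not state its masking convention); `TextGraphFuseGATSketch` is the hypothetical module from Section 1 and the label count is illustrative.

```python
import torch.nn as nn
from torch.optim import AdamW
from transformers import get_linear_schedule_with_warmup

criterion = nn.CrossEntropyLoss(ignore_index=-100)   # skip subword/padding positions

def token_classification_loss(logits, labels):
    # logits: (batch, n, C); labels: (batch, n) with -100 at ignored positions
    return criterion(logits.reshape(-1, logits.size(-1)), labels.reshape(-1))

model = TextGraphFuseGATSketch(num_labels=37)        # e.g. 18 entity types with BIO tags
optimizer = AdamW(model.parameters(), lr=3e-5, weight_decay=0.01)
num_training_steps = 1000                            # placeholder; depends on dataset size
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=int(0.1 * num_training_steps),  # warm-up ratio 0.1
    num_training_steps=num_training_steps,
)
# Per step: loss.backward(); torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0);
# optimizer.step(); scheduler.step(); optimizer.zero_grad()
```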

5. Hyperparameters, Training Regimen, and Datasets

The backbone is PhoBERT$_{\mathrm{large}}$ (jointly fine-tuned). GAT and decoder head counts are set to 8 ($K = 8$) for NER and 4 for disfluency detection. The Transformer decoder output dimensionality is $D = 256$ with 4 attention heads. Learning rates are task-dependent: $5 \times 10^{-5}$ (PhoNER), $3 \times 10^{-5}$ (VietMed-NER), and $2 \times 10^{-5}$ (Disfluency). Training uses batch size 16 and maximum sequence length 128, for 15 epochs (NER) or 10 epochs (Disfluency), with early stopping based on validation Micro-F1.
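These reported settings can be collected into a small configuration sketch (key names are illustrative; values follow the paper):

```python
TASK_CONFIGS = {
    "PhoNER-COVID19": dict(gat_heads=8, gat_head_dim=32, lr=5e-5, epochs=15),
    "VietMed-NER":    dict(gat_heads=8, gat_head_dim=32, lr=3e-5, epochs=15),
    "PhoDisfluency":  dict(gat_heads=4, gat_head_dim=64, lr=2e-5, epochs=10),
}
SHARED = dict(
    backbone="vinai/phobert-large", decoder_dim=256, decoder_heads=4,
    batch_size=16, max_seq_len=128, dropout=0.3, weight_decay=0.01,
    grad_clip=1.0, warmup_ratio=0.1, early_stopping_metric="micro_f1",
)
```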

Benchmarked datasets:

  • PhoNER-COVID19: Word-level named entity recognition in the COVID-19 domain.
  • PhoDisfluency: Speech disfluency detection.
  • VietMed-NER: Vietnamese medical spoken named entity recognition; features 18 entity types with BIO tagging and is notable for domain-specific vocabulary and expressions.

6. Empirical Results and Ablation Analysis

TextGraphFuseGAT achieves strong results across all three benchmarks, outperforming PhoBERT-only baselines:

  • PhoNER-COVID19: Micro-F1 = 0.984, Macro-F1 = 0.958
  • PhoDisfluency: RM-F1 = 0.978, IM-F1 = 0.993, Micro-F1 = 0.994
  • VietMed-NER: Precision = 0.892, Recall = 0.893, F1 = 0.893

On VietMed-NER, ablation results highlight the critical contributions of each architectural component:

| Model           | Precision | Recall | F1    |
|-----------------|-----------|--------|-------|
| PhoBERT-only    | 0.690     | 0.770  | 0.730 |
| PhoBERT + GAT   | 0.889     | 0.891  | 0.890 |
| Full (+Decoder) | 0.892     | 0.893  | 0.893 |

The introduction of the GAT module yields an F1 increase of +0.160 over PhoBERT-only, evidencing the impact of explicit relational modeling. The additional Transformer decoder provides smaller Micro-F1 gains but noticeably improves Macro-F1 and performance on rare entity types. This suggests that the decoder is particularly effective at modeling long-range and fine-grained dependencies (Nguyen, 13 Oct 2025).

7. Significance and Domain Impact

TextGraphFuseGAT demonstrates that explicitly modeling token-token relations via a fully connected GAT, fused with deep pretrained transformer embeddings, enhances token classification performance, particularly in settings with specialized vocabulary and complex entity boundaries such as medical and conversational speech text. The ablation analysis substantiates that each component (PhoBERT backbone, graph attention, and Transformer decoder) provides a distinct and additive contribution to overall accuracy.

Its evaluation on VietMed-NER, the first Vietnamese spoken medical NER dataset, establishes both the domain challenge and the method's capacity for robust generalization. A plausible implication is that similar fusion strategies could be adapted to other languages or domains where sequence-level and relational context must be jointly modeled (Nguyen, 13 Oct 2025).
