Quantum Graph Transformer (QGT)

Updated 1 April 2026

Quantum Graph Transformer (QGT) is a neural architecture that fuses quantum computing with transformer models for enhanced graph-structured data processing.
It leverages quantum algorithms and Hamiltonian models, such as Ising and Heisenberg, to compute global, expressive structural representations.
Reported results show QGTs matching or exceeding classical baselines on tasks ranging from molecular property prediction to NLP, although they can underperform on some simple-topology datasets.

A Quantum Graph Transformer (QGT) is a neural architecture for graph-structured data that combines principles from quantum information processing with modern transformer models. QGTs use quantum-computed features to build graph-wide structural representations, most often inside the aggregation or attention mechanism central to transformer-based graph neural networks. This approach gives access to expressive, global graph features that are theoretically and practically inaccessible to classical architectures, especially for capturing complex structural distinctions and entanglement patterns. QGTs have been instantiated in several forms, including quantum-computed aggregation matrices, quantum walk encodings, quantum positional encodings, and hybrid quantum-classical attention modules, and have demonstrated superior or competitive performance on synthetic, chemical, and real-world graph learning tasks (Thabet et al., 2022, Thabet et al., 2023, Thabet et al., 2024, Yu et al., 2024, Aktar et al., 9 Jun 2025, Tousi et al., 5 Nov 2025).

1. Core Architectural Principles

The defining aspect of a QGT is the use of quantum algorithms to generate structural graph features or aggregation weights, which are then used within a transformer-style neural network. The canonical architecture processes an input graph $G = (V, E)$ with $N = |V|$ nodes, each associated with a feature vector $x_i \in \mathbb{R}^d$, and possibly edge features $e_{ij}$ (Thabet et al., 2022). Each QGT layer typically involves two phases:

Classical projection: Node features are projected via learnable weights: $Q^l = H^l W_Q$, $K^l = H^l W_K$, $V^l = H^l W_V$, where $H^l$ is the matrix of node features at layer $l$.
Quantum aggregation (attention): A quantum processor (or quantum-inspired simulation) computes an $N \times N$ matrix $A(\theta)$ whose entries reflect quantum correlations (e.g., two-body Pauli operator expectations) derived from a quantum system whose Hamiltonian encodes the topology of $G$. The node update takes the form:

$$H^{l+1} = \sigma\big(A(\theta)\, V^l \,\|\, H^l\big)$$

where $\sigma$ is a nonlinearity and $\|$ denotes concatenation.

Multi-head attention variants assign independent quantum circuits and classical projections $W_Q^{(h)}, W_K^{(h)}, W_V^{(h)}$ to each head $h$. The architectural novelty is isolated to the generation and use of the matrix $A(\theta)$; the rest of the transformer remains classical (Thabet et al., 2022, Thabet et al., 2023).
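As a minimal sketch of the layer described in this section, the NumPy code below aggregates value vectors with a precomputed $N \times N$ matrix, concatenates the result with the input features, and applies a nonlinearity. The output projection `W_O`, the choice of ReLU, and the random row-stochastic stand-in for $A(\theta)$ are assumptions made for illustration; they are not taken from the cited papers.

```python
import numpy as np

def qgt_layer(H, A, W_V, W_O):
    """One QGT-style layer with a precomputed aggregation matrix.

    H   : (N, d) node features at the current layer
    A   : (N, N) aggregation matrix (quantum-computed in a real QGT;
          here it is simply an input array)
    W_V : (d, d_v) value projection
    W_O : (d_v + d, d_out) output projection (assumption, not from the source)
    """
    V = H @ W_V                                       # classical value projection
    aggregated = A @ V                                # aggregation weighted by A
    concat = np.concatenate([aggregated, H], axis=1)  # concatenate with the input
    return np.maximum(concat @ W_O, 0.0)              # ReLU as the nonlinearity

rng = np.random.default_rng(0)
N, d, d_v, d_out = 5, 4, 3, 6
H = rng.normal(size=(N, d))

# Stand-in for A(theta): a random row-stochastic matrix.
A = rng.random((N, N))
A = A / A.sum(axis=1, keepdims=True)

H_next = qgt_layer(H, A, rng.normal(size=(d, d_v)),
                   rng.normal(size=(d_v + d, d_out)))
print(H_next.shape)  # (5, 6)
```

Because only `A` depends on quantum computation, the same function could be used unchanged with a matrix measured on hardware, simulated, or loaded from a cache.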

2. Quantum Structural Feature Construction

Quantum aggregation in QGTs is instantiated by mapping the graph $G$ onto a quantum Hamiltonian $\hat{H}_G$. Three canonical models are:

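Ising model: $\hat{H}_G = \sum_{(i,j) \in E} Z_i Z_j$
XY model: $\hat{H}_G = \sum_{(i,j) \in E} \left( X_i X_j + Y_i Y_j \right)$
Heisenberg (XXZ) model: $\hat{H}_G = \sum_{(i,j) \in E} \left( X_i X_j + Y_i Y_j + \Delta\, Z_i Z_j \right)$, with anisotropy $\Delta$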

Here, $X_i, Y_i, Z_i$ are Pauli operators on qubit $i$. In the standard process, a parameterized quantum circuit of the form

$$|\psi(\theta)\rangle = \prod_{k=1}^{p} e^{-i \theta_k^{M} \hat{H}_M}\, e^{-i \theta_k^{G} \hat{H}_G}\, |\psi_0\rangle$$

with $\hat{H}_M = \sum_i X_i$ (the mixing Hamiltonian) generates the quantum state, whose two-point correlators $\langle O_i O_j \rangle$ supply the “quantum attention” features: $C^{(m)}_{ij}(\theta) = \langle \psi(\theta) | O^{(m)}_i O^{(m)}_j | \psi(\theta) \rangle$ for a chosen set of Pauli observables $O^{(m)}$. These are linearly combined using a trainable vector $w$ and possibly softmaxed for normalization: $A(\theta)_{ij} = \mathrm{softmax}_j\big( \sum_m w_m\, C^{(m)}_{ij}(\theta) \big)$ (Thabet et al., 2022).
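The construction above can be illustrated with a small state-vector simulation. The sketch below alternates evolution under an Ising graph Hamiltonian $\sum_{(i,j) \in E} Z_i Z_j$ with evolution under a mixing Hamiltonian $\sum_i X_i$, then reads out the $\langle Z_i Z_j \rangle$ correlators and normalizes them with a row-wise softmax. The initial state $|+\rangle^{\otimes N}$, the restriction to $ZZ$ correlators, the scalar weight `w`, and the function names are all assumptions for illustration; this is not the authors' implementation and it only scales to a handful of qubits.

```python
import numpy as np

def z_spins(n):
    """Array of shape (2**n, n): Z eigenvalue (+1 or -1) of each qubit per basis state."""
    idx = np.arange(2 ** n)
    bits = (idx[:, None] >> np.arange(n)[None, :]) & 1
    return 1 - 2 * bits

def zz_correlators(n, edges, gammas, betas):
    """<Z_i Z_j> after alternating Ising and mixing layers, starting from |+>^n."""
    spins = z_spins(n)
    # Diagonal of the Ising Hamiltonian sum_{(i,j) in E} Z_i Z_j.
    energy = sum(spins[:, i] * spins[:, j] for i, j in edges)
    psi = np.full(2 ** n, 2 ** (-n / 2), dtype=complex)
    for gamma, beta in zip(gammas, betas):
        psi = np.exp(-1j * gamma * energy) * psi           # exp(-i gamma H_G)
        rx = np.array([[np.cos(beta), -1j * np.sin(beta)],
                       [-1j * np.sin(beta), np.cos(beta)]])  # exp(-i beta X)
        psi = psi.reshape((2,) * n)
        # The same single-qubit gate acts on every qubit, so axis order is irrelevant.
        for q in range(n):
            psi = np.moveaxis(np.tensordot(rx, psi, axes=([1], [q])), 0, q)
        psi = psi.reshape(-1)
    probs = np.abs(psi) ** 2
    return np.einsum('b,bi,bj->ij', probs, spins, spins)

def row_softmax(M):
    E = np.exp(M - M.max(axis=1, keepdims=True))
    return E / E.sum(axis=1, keepdims=True)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # a 4-cycle
C = zz_correlators(4, edges, gammas=[0.4], betas=[0.3])
w = 1.0                                    # stand-in for a trainable weight
A = row_softmax(w * C)
print(np.round(C, 3))
```

With zero layers the state stays at $|+\rangle^{\otimes N}$ and the correlator matrix reduces to the identity, which is a convenient sanity check for the simulation.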

Quantum features can also be constructed as positional encodings using continuous-time quantum walks, ground-state correlators, or k-particle restricted dynamics (Thabet et al., 2023, Thabet et al., 2024, Yu et al., 2024). These structural features—either as per-pair encodings or attention biases—demonstrably offer increased expressive power over classical graph encodings.
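For the quantum-walk case, a single-particle continuous-time quantum walk can be simulated classically, which makes it easy to sketch. The code below computes the transition probabilities $|\langle j | e^{-iAt} | i \rangle|^2$ for a few evolution times and stacks them into per-pair features. Using the adjacency matrix as the walk Hamiltonian, the chosen times, and the function name are assumptions; the cited works may use a different generator or multi-particle walks.

```python
import numpy as np

def ctqw_encoding(adj, t):
    """Transition probabilities |<j| exp(-i * adj * t) |i>|^2 of a continuous-time quantum walk."""
    evals, evecs = np.linalg.eigh(adj)
    U = (evecs * np.exp(-1j * evals * t)) @ evecs.conj().T  # exp(-i adj t)
    return np.abs(U) ** 2

# Path graph on three nodes: 0 - 1 - 2
adj = np.array([[0.0, 1.0, 0.0],
                [1.0, 0.0, 1.0],
                [0.0, 1.0, 0.0]])

times = [0.5, 1.0, 2.0]
# Shape (N, N, len(times)): one feature vector per node pair.
features = np.stack([ctqw_encoding(adj, t) for t in times], axis=-1)
print(features.shape)  # (3, 3, 3)
```

Each slice `features[:, :, k]` is symmetric with rows summing to one, so it can be injected either as an attention bias or as an edge-level feature.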

3. Variants and Extensions

The QGT paradigm supports several architectural instantiations and related quantum-graph transformer designs:

Variant	Quantum Feature Type	Notable Application
Quantum-Computed Aggregation (Thabet et al., 2022)	Two-body correlators from QPU	Molecular property prediction, WL-hard graph classification
Quantum Positional Encodings (Thabet et al., 2023, Thabet et al., 2024)	Quantum walk, ground-state correlators	General graph benchmarks
Quantum Walk Transformer (Yu et al., 2024)	Discrete-time quantum walk encodings	Social/chemical graph classification
QGT for NLP (Aktar et al., 9 Jun 2025)	PQC-based quantum self-attention	Sentiment classification
Dual Attention QGT (Tousi et al., 5 Nov 2025)	Classical attention paths, not QPU	Quantum error mitigation

GQWformer (Yu et al., 2024) uses quantum walk-based encodings as additive biases in the transformer's self-attention logits, augmented by a recurrent module to model local structural sequence information. In hybrid QGTs, fixed (randomized) quantum circuits generate attention weights; only the classical heads and output projections are trained, which reduces the demand on quantum hardware and accelerates training (Thabet et al., 2022).
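The idea of an additive structural bias in the attention logits can be sketched as follows. This is standard scaled dot-product attention with an extra $N \times N$ bias term added before the softmax; in a quantum-walk transformer the bias would be derived from walk encodings, whereas here it is a placeholder array. The function name and the single-head setup are assumptions, and the recurrent module of GQWformer is not reproduced.

```python
import numpy as np

def biased_attention(H, W_Q, W_K, W_V, bias):
    """Single-head self-attention with an additive (N, N) bias on the logits."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    logits = Q @ K.T / np.sqrt(Q.shape[1]) + bias
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(1)
N, d, d_k = 4, 5, 3
H = rng.normal(size=(N, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))

# Placeholder bias: zero everywhere except one strongly suppressed pair.
bias = np.zeros((N, N))
bias[0, 1] = -1e9

out, weights = biased_attention(H, W_Q, W_K, W_V, bias)
print(out.shape, weights[0, 1])
```

A large negative bias drives the corresponding attention weight to zero, while a positive bias raises it, which is how structural encodings steer attention without changing the projections.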

In NLP, QGTs incorporate PQC-based quantum self-attention for graph-represented language data, substantially reducing parameter footprints relative to classical transformers and boosting sample efficiency in supervised tasks (Aktar et al., 9 Jun 2025).

4. Theoretical Expressivity and Motivation

Classical message passing neural networks (MPNNs) and many graph transformers are restricted in expressivity by the Weisfeiler-Lehman (WL) test hierarchy, leaving them incapable of distinguishing certain non-isomorphic graph pairs. QGTs, by virtue of embedding global, long-range quantum correlations—including those stemming from multi-particle interference and the structure of quantum ground states—can break symmetries that foil classical methods (Thabet et al., 2023, Thabet et al., 2022). For example, ground states of antiferromagnetic Ising Hamiltonians can differ on WL-equivalent graphs, and two-particle quantum walks can distinguish strongly regular graphs that random-walk-based methods cannot.
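The Ising example can be checked on a small pair of graphs. The sketch below compares a 6-cycle with two disjoint triangles: both are 2-regular, so 1-WL colour refinement assigns them identical colourings, yet the ground-state energy of the antiferromagnetic Ising Hamiltonian $\sum_{(i,j) \in E} Z_i Z_j$ differs because triangles are frustrated. The choice of graphs and the use of the ground-state energy as a proxy for ground-state features are assumptions for illustration; this pair is also separable by classical spectral methods, so the example shows a failure of 1-WL only, not a quantum advantage.

```python
from itertools import product

def wl_colors(n, edges, rounds=3):
    """Sorted multiset of 1-WL colours after a few refinement rounds."""
    nbrs = {v: [] for v in range(n)}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    colors = {v: () for v in range(n)}  # uniform initial colour
    for _ in range(rounds):
        colors = {v: (colors[v], tuple(sorted(colors[u] for u in nbrs[v])))
                  for v in range(n)}
    return sorted(colors.values())

def af_ising_ground_energy(n, edges):
    """Minimum of sum_{(i,j)} s_i s_j over all spin configurations (brute force)."""
    return min(sum(s[i] * s[j] for i, j in edges)
               for s in product((-1, 1), repeat=n))

cycle6 = [(i, (i + 1) % 6) for i in range(6)]
two_triangles = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]

print(wl_colors(6, cycle6) == wl_colors(6, two_triangles))  # True
print(af_ising_ground_energy(6, cycle6))         # -6
print(af_ising_ground_energy(6, two_triangles))  # -2
```

The 6-cycle is bipartite, so every edge can be anti-aligned, while each triangle must leave one edge aligned, giving an energy of -1 per triangle.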

The quantum walk dynamics on the lifted $k$-particle subspace simulate walks of hard-core bosons, producing features that classical random walks cannot reproduce without incurring infeasible computational cost. This underpins the claim that quantum attention mechanisms are strictly more powerful as aggregation/compositional operators for graph data.

5. Empirical Performance and Evaluation

QGTs outperform, match, or in some regimes underperform strong classical graph models depending on the task and dataset:

Synthetic benchmarks: On the “GraphCovers” task, QGT achieves near-zero training loss in about 20 epochs, where baseline graph transformers fail to bring the loss below 0.1 (Thabet et al., 2022).
Chemoinformatics: On QM7/QM9 datasets, QGT equals or surpasses GCN/SAGE/GAT baselines in root-mean-square error. The “randomized QGT” variant achieves roughly 50% lower error than classical counterparts on QM9 (Thabet et al., 2022).
NLP tasks: On five sentiment classification datasets, PQC-based QGTs exhibit 4–6% accuracy gains over classical graph transformers, with nearly 50% lower sample requirements for similar accuracy on Yelp data (Aktar et al., 9 Jun 2025).
Quantum error mitigation: QAGT-MLP, with dual-path attention (global and lightcone), yields 45–60% lower mean error and error variance than random forest baselines under matched quantum shot budgets (Tousi et al., 5 Nov 2025).
General graph classification: GQWformer consistently exceeds prior state-of-the-art methods; on PTC, the full quantum-walk+RNN design outperforms both ablated and classical baselines by 2.3–4.6 accuracy points (Yu et al., 2024).
Limitations: On the IAM letters dataset and small regular graphs, QGT performance can fall below that of optimized classical architectures, likely due to parameterization noise or irrelevance of global quantum features for simple topologies (Thabet et al., 2022).

6. Implementation and Scalability Considerations

QGT deployment requires a quantum processing unit (QPU) with $N$ qubits to match graph sizes; Trotterized Hamiltonian evolution and measurement of two-point correlators must be feasible within hardware coherence times. Simulations beyond 20–25 qubits are resource-intensive (about 1 million amplitudes for $N = 20$), with state-vector memory and runtime scaling exponentially with $N$ and growing further with layer depth $L$ and the number of attention heads (Thabet et al., 2022, Thabet et al., 2024).
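The simulation cost quoted above follows from simple arithmetic: a full state vector on $n$ qubits holds $2^n$ complex amplitudes. The helper below tabulates the count and the memory footprint, assuming 16 bytes per amplitude (double-precision complex); the function name is hypothetical, and real simulators may use lower precision or compressed representations.

```python
def statevector_memory(n_qubits, bytes_per_amplitude=16):
    """Return (number of amplitudes, bytes) for a dense n-qubit state vector."""
    amplitudes = 2 ** n_qubits
    return amplitudes, amplitudes * bytes_per_amplitude

for n in (20, 25, 30):
    amps, nbytes = statevector_memory(n)
    print(f"N={n}: {amps:,} amplitudes, {nbytes / 2**20:,.0f} MiB")
# N=20: 1,048,576 amplitudes, 16 MiB
# N=25: 33,554,432 amplitudes, 512 MiB
# N=30: 1,073,741,824 amplitudes, 16,384 MiB
```

Each additional qubit doubles both numbers, and the state must be evolved once per circuit layer and per attention head, which is why dense simulation becomes impractical for graphs of even a few dozen nodes.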

A pragmatic solution is “randomized QGT,” where quantum circuit parameters $\theta$ are fixed and only classical heads are trained, drastically lightening quantum resource demands and enabling scaling to larger datasets. Pre-computation of quantum features permits downstream training and inference to remain entirely classical, making QGTs feasible for real-world large-scale experiments (Thabet et al., 2024).

Quantum parameters are trained via parameter-shift rules or finite differences, but convergence is slow because optimizing through quantum circuit outputs is difficult. Edge-level and node-level quantum positional encodings should be pre-computed and injected as attention biases or feature concatenations; batch size and optimizer remain standard from classical deep learning practice (Thabet et al., 2024).
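The parameter-shift rule can be illustrated on the simplest possible circuit. For a single qubit rotated by $R_Y(\theta)$ from $|0\rangle$, the expectation $\langle Z \rangle = \cos\theta$, and the rule recovers the exact derivative $-\sin\theta$ from two shifted evaluations. This toy circuit and the function names are assumptions chosen for clarity; in QGT circuits a single angle multiplies a sum of Pauli terms, so the rule has to be applied term by term or in a generalized form.

```python
import numpy as np

def expectation_z(theta):
    """<Z> after applying RY(theta) to |0>; analytically equal to cos(theta)."""
    state = np.array([np.cos(theta / 2), np.sin(theta / 2)])
    return state[0] ** 2 - state[1] ** 2

def parameter_shift_grad(f, theta, shift=np.pi / 2):
    """Gradient of f for a gate generated by a Pauli operator with angle theta."""
    return (f(theta + shift) - f(theta - shift)) / (2 * np.sin(shift))

theta = 0.7
grad = parameter_shift_grad(expectation_z, theta)
print(grad, -np.sin(theta))  # the two values agree
```

Unlike finite differences, the shift is large and the result is exact up to measurement noise, but each parameter still costs two circuit evaluations per gradient step.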

7. Future Directions

Ongoing research explores:

End-to-end training of quantum parameters: Joint optimization of quantum circuit angles with classical parameters via parameter-shift differentiation or warm starts from classical GNNs (Thabet et al., 2023).
Higher-order and hierarchical quantum encodings: Including $k$-particle quantum walks for $k > 2$, local/hierarchical QPEs for large graphs, and entangled initial states for richer global features (Thabet et al., 2024).
Hardware demonstration and noise effects: Experiments on neutral-atom or superconducting QPUs to test quantum advantage in regimes inaccessible to classical simulation.
Integration with spectral/simplicial methods: For example, using correlations from states prepared by QAOA or other quantum algorithms as additional structural features.
Quantum-Classical co-training: Alternating training rounds between quantum circuit parameter refinement and classical model adaptation to mitigate the risk of barren plateaus or quantum noise impacting learning (Thabet et al., 2024).

These directions are motivated by the findings that quantum-induced structural priors can offer superior generalization, sample efficiency, and distinguishing power in graph learning tasks, especially in domains where global connectivity patterns and high-order correlations dominate.

References

(Thabet et al., 2022) Extending Graph Transformers with Quantum Computed Aggregation
(Thabet et al., 2023) Enhancing Graph Neural Networks with Quantum Computed Encodings
(Thabet et al., 2024) Quantum Positional Encodings for Graph Neural Networks
(Yu et al., 2024) GQWformer: A Quantum-based Transformer for Graph Representation Learning
(Aktar et al., 9 Jun 2025) Quantum Graph Transformer for NLP Sentiment Classification
(Tousi et al., 5 Nov 2025) QAGT-MLP: An Attention-Based Graph Transformer for Small and Large-Scale Quantum Error Mitigation