
LoRAP: Low-Rank Aggregation Prompting

Updated 28 January 2026
  • LoRAP is a prompt-based technique for mitigating quantization errors in graph neural networks by injecting lightweight, low-rank prompts at the aggregation level.
  • It applies post-aggregation prompting to decouple error correction from node features, leading to targeted and efficient quantization error recovery.
  • Integrated into quantization-aware training pipelines, LoRAP delivers significant performance gains with minimal additional parameters and computational overhead.

Low-Rank Aggregation Prompting (LoRAP) is a prompt-based technique for mitigating quantization errors in quantized graph neural networks (GNNs), designed to optimize quantization-aware training (QAT) pipelines while maintaining negligible computational overhead and minimal additional parameter count. LoRAP works by injecting lightweight, input-dependent, low-rank prompts into the quantized aggregation step of GNN message passing, thereby directly correcting quantization errors at the aggregation level rather than at the node-feature level. This design achieves theoretically grounded, empirically validated improvements in low-bit quantized GNN performance across a broad set of architectures and datasets, while also offering extensibility to other neural architectures and quantization regimes (Liu et al., 21 Jan 2026).

1. Mathematical Formulation of Low-Rank Aggregation Prompting

The LoRAP methodology operates within the canonical message-passing formulation for GNNs. At each layer $l$, node features $\mathbf{H}^{(l)} \in \mathbb{R}^{N \times d}$ are updated based on an aggregation of their neighbors, via

$$\mathbf{S}^{(l)} = \left[ \sum_{j \in \mathcal{N}(i)} \phi\left(\mathbf{h}_i^{(l)}, \mathbf{h}_j^{(l)}, \mathbf{e}_{ij}\right) \right]_{i=1}^{N},$$

with $\phi$ as the message function and $\mathcal{N}(i)$ the neighborhood of node $i$. In QAT, all real-valued tensors (features, weights, etc.) undergo fake quantization: $Q(X)$ for quantization, $DQ(X_q)$ for dequantization, parameterized by learned scale $S$ and zero-point $Z$.

LoRAP inserts a layer-wise, low-rank, input-dependent prompt into the dequantized aggregation result. Specifically, at each layer $l$, the prompt basis is parameterized as:

$$\mathbf{P}^{(l)} = \mathbf{P}_A^{(l)} \mathbf{P}_B^{(l)}, \quad \mathbf{P}_A^{(l)} \in \mathbb{R}^{k \times r}, \quad \mathbf{P}_B^{(l)} \in \mathbb{R}^{r \times d},$$

where $k \ll N$ and $r \ll d$, yielding $k$ prompt vectors of dimension $d$. A shared trainable mapping $\phi:\mathbb{R}^d \rightarrow \mathbb{R}^k$ produces, for each node $i$, a row-softmaxed set of coefficients $\boldsymbol{\alpha}_i^{(l)}$. The input-dependent prompt for node $i$ is then

$$\mathbf{p}_i^{(l)} = \left[\boldsymbol{\alpha}_i^{(l)}\right]^\top \mathbf{P}^{(l)}, \quad \text{with} \quad \boldsymbol{\alpha}_i^{(l)} = \mathrm{Softmax}\!\left(\phi(\mathbf{s}_i^{(l)})\right).$$

The prompted aggregation is formed as

$$\mathbf{P}_s^{(l)} = \mathrm{Softmax}\!\left(\phi(\widehat{\mathbf{S}}^{(l)})\right) \mathbf{P}_A^{(l)} \mathbf{P}_B^{(l)},$$

where the softmax is applied to each row. The sum $\widehat{\mathbf{S}}^{(l)} + \mathbf{P}_s^{(l)}$ (with $\widehat{\mathbf{S}}^{(l)}$ the dequantized aggregation) is requantized:

$$\mathbf{S}_q^{(l)*} = Q\!\left(\widehat{\mathbf{S}}^{(l)} + \mathbf{P}_s^{(l)}\right),$$

which replaces $\mathbf{S}_q^{(l)}$ in subsequent GNN computations. The result is a lightweight, expressive, node-specific correction for quantization error across message-passing layers (Liu et al., 21 Jan 2026).
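As a concrete illustration, the prompted-aggregation computation can be sketched in a few lines of NumPy. The single linear map `W_phi` standing in for the shared mapping $\phi$, and all dimensions, are illustrative choices rather than values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d, k, r = 6, 8, 4, 2              # nodes, feature dim, prompt count, rank

S_hat = rng.normal(size=(N, d))      # dequantized aggregation \hat{S}^{(l)}
W_phi = rng.normal(size=(d, k))      # stand-in for the shared mapping phi
P_A = rng.normal(size=(k, r))        # low-rank prompt factors
P_B = rng.normal(size=(r, d))

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

alpha = softmax(S_hat @ W_phi)       # per-node coefficients; each row sums to 1
P_s = alpha @ (P_A @ P_B)            # input-dependent low-rank prompt, shape (N, d)
S_star = S_hat + P_s                 # prompted aggregation, ready for requantization

assert P_s.shape == (N, d)
```

Note that only $r(k+d)$ prompt parameters plus the small mapping are trainable here; the $(N, d)$ prompt matrix is produced on the fly per input.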

2. Post-Aggregation vs. Pre-Aggregation Prompting

A fundamental design decision is whether to inject prompts before or after the aggregation operator. Node-feature prompting adds a learned vector pre-aggregation to node features. This approach, as in GPF, requires the prompt to invert the influence of the aggregation topology to cancel quantization-induced bias—a problem exacerbated by overlapping neighbor sets and highly coupled optimization dynamics. Theoretical analysis demonstrates that pre-aggregation prompting is entangled with the network topology, leading to slow and suboptimal optimization.

LoRAP improves upon this by applying prompts post-aggregation, which allows each node's prompt to directly address its own aggregated quantization error, effectively decoupling the optimization and enabling targeted, one-step correction. Formally, the optimal post-aggregation prompt is $\mathbf{P}^*_{\mathrm{post}} = -A\,\boldsymbol{\epsilon}_X$, where $A$ is the aggregation matrix and $\boldsymbol{\epsilon}_X$ is the quantization error on the node features. The addition of a low-rank basis and input-dependence further boosts expressivity, allowing for the approximation of aggregated quantization errors up to the residual singular-value spectrum, as established by Theorem 3 via the Eckart–Young theorem (Liu et al., 21 Jan 2026).
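The one-step cancellation property is easy to verify numerically: for any aggregation matrix $A$ and feature quantization error $\boldsymbol{\epsilon}_X$, adding $-A\,\boldsymbol{\epsilon}_X$ after aggregation exactly recovers the full-precision aggregation. A minimal sketch, using plain uniform rounding as a stand-in quantizer:

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 5, 3
A = rng.random((N, N))            # stand-in aggregation matrix (e.g. normalized adjacency)
X = rng.normal(size=(N, d))       # full-precision node features

# Crude uniform quantization of the features introduces an error eps_X.
scale = 0.5
X_q = np.round(X / scale) * scale
eps_X = X_q - X

S_q = A @ X_q                     # aggregation of quantized features
P_post = -A @ eps_X               # optimal post-aggregation prompt

# The prompt cancels the aggregated quantization error in one step,
# with no dependence on how neighborhoods overlap.
assert np.allclose(S_q + P_post, A @ X)
```

A pre-aggregation prompt would instead have to solve for a correction that, *after* being mixed by $A$, cancels the same error, which is exactly the topology entanglement described above.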

3. Integration into Quantization-Aware Training

LoRAP is implemented as a drop-in module within existing QAT execution cycles, with no requirement for novel losses or gradient tricks:

  • Quantize incoming node features: $\mathbf{H}_q \gets Q(\mathbf{H})$
  • Aggregate (quantized): $\mathbf{S}_q \gets Q(A\mathbf{H}_q)$
  • Dequantize aggregation: $\widehat{\mathbf{S}} \gets DQ(\mathbf{S}_q)$
  • Compute input-dependent prompt: apply the mapping and softmax to obtain scores; form the low-rank prompt matrix $\mathbf{P}_s$
  • Requantize prompt-augmented aggregation: $\mathbf{S}_q^* = Q(\widehat{\mathbf{S}} + \mathbf{P}_s)$
  • Continue with quantized update: $\mathbf{H}' = \gamma(\mathbf{H}_q, \mathbf{S}_q^*)$
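The execution cycle above can be sketched end to end in NumPy. The uniform affine quantizer and all shapes are illustrative assumptions, and this sketch dequantizes before aggregating so the arithmetic stays in float (the pipeline's notation applies $Q$ directly to $A\mathbf{H}_q$):

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, k, r, bits = 6, 8, 4, 2, 4  # illustrative sizes; INT4 fake quantization

def quantize(x, scale, zp, bits):
    # Uniform affine fake quantization: round, shift by zero-point, clip to range.
    return np.clip(np.round(x / scale) + zp, 0, 2**bits - 1)

def dequantize(xq, scale, zp):
    return (xq - zp) * scale

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

H = rng.normal(size=(N, d))       # node features
A = rng.random((N, N))            # stand-in aggregation matrix
W_phi = rng.normal(size=(d, k))   # stand-in for the shared mapping phi
P_A, P_B = rng.normal(size=(k, r)), rng.normal(size=(r, d))
scale, zp = 0.3, 2**(bits - 1)

H_q = quantize(H, scale, zp, bits)                                # quantize features
S_q = quantize(A @ dequantize(H_q, scale, zp), scale, zp, bits)   # quantized aggregation
S_hat = dequantize(S_q, scale, zp)                                # dequantize
P_s = softmax(S_hat @ W_phi) @ (P_A @ P_B)                        # low-rank prompt
S_q_star = quantize(S_hat + P_s, scale, zp, bits)                 # requantize S* = Q(S_hat + P_s)

assert S_q_star.shape == (N, d)
```

In an actual QAT run, `quantize`/`dequantize` would be differentiable fake-quant ops (straight-through estimator) and the update $\gamma$ would follow; here the loop stops at $\mathbf{S}_q^*$.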

Gradients propagate through fake quantization via the straight-through estimator into the quantizer parameters, network weights, and prompt matrices. Existing QAT frameworks, including Degree-Quant, $A^2Q$, and MixQ, are compatible with LoRAP (Liu et al., 21 Jan 2026).

4. Computational and Memory Overhead

LoRAP is designed for efficiency in both storage and runtime. Per-layer prompt parameters total $r(k+d)$ (e.g., with $k=20$, $r=2$, $d=128$, and $L=5$ layers, approximately 1,480 floats or 5.9 KB in total). Node-level prompting techniques such as GPF require $N \times d$ parameters, which is orders of magnitude more for large graphs.
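The parameter arithmetic for the quoted configuration can be checked directly (float32 storage assumed, with 1 KB taken as 1000 bytes to match the quoted figure):

```python
# Prompt-parameter count for the example configuration quoted above.
k, r, d, L = 20, 2, 128, 5

per_layer = r * (k + d)        # P_A is k x r, P_B is r x d
total = L * per_layer          # across all L layers
kb = total * 4 / 1000          # float32 storage, in KB

assert per_layer == 296
assert total == 1480
print(f"{total} floats ≈ {kb:.1f} KB")   # → 1480 floats ≈ 5.9 KB
```

By contrast, a node-level prompt of shape $N \times d$ for a hypothetical graph with $N = 10{,}000$ nodes at $d = 128$ would already need 1,280,000 parameters.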

During forward computation, LoRAP introduces $\mathcal{O}(Nkd)$ temporary score activations and $\mathcal{O}(Nkr + Nd)$ additional multiply-add operations per layer, which remains negligible for standard values of $k$ and $r$. Latency evaluations show an increase of less than $44.5\,\mu$s per layer (fused kernel), constituting under 10% overhead in INT4 scenarios compared to full-precision inference. The storage footprint remains $\mathcal{O}(Lr(k+d))$, and the dominant bottleneck is score computation when $N \gg 10^6$ (Liu et al., 21 Jan 2026).

5. Empirical Evaluation and Performance Gains

LoRAP achieves consistent accuracy improvements across multiple GNN architectures (GIN, GCN, GAT) and diverse datasets (Cora, Citeseer, MNIST, CIFAR-10, ZINC, Reddit-Binary, ogb-Arxiv, ogb-Products, ogbn-Mag) under different QAT paradigms and quantization bit-widths.

Selected INT4 results for GIN, GCN, and GAT on prominent node classification datasets are as follows:

Setup                         No-Prompt (%)   GPF-plus (%)   LoRAP (%)   Improvement
GIN+Cora (QAT)                44.2            45.6           45.1        +0.9
GIN+Citeseer (QAT)            18.7            21.6           22.8        +4.1
GIN+Cora (DQ)                 61.0            —              68.8        +7.8
GCN+Cora (DQ)                 72.5            —              78.0        +5.5
GAT+Cora (QAT)                52.8            —              66.8        +14.0
Reddit-Binary (GIN, W4A4)     52.4            53.8           69.6        +17.2

On mixed-bit $A^2Q$ setups, LoRAP matches or surpasses full-precision accuracy (e.g., GIN+Cora: LoRAP 78.5%, FP32 77.6%). In graph regression (ZINC, INT4 $A^2Q$), LoRAP reduces MAE from 0.414 (FP32) to 0.361 (+3.7% improvement). Ablation studies on MNIST/GIN ($A^2Q$, INT4) demonstrate that LoRAP recovers nearly all quantization-induced accuracy loss with just a fraction of additional parameters relative to full-rank prompting. Performance remains favorable on large-scale benchmarks (e.g., ogb-Arxiv: LoRAP 73.6% vs. FP32 71.7%) (Liu et al., 21 Jan 2026).

6. Theoretical Guarantees, Limitations, and Extensions

Theoretical guarantees: Post-aggregation prompting as realized in LoRAP is supported by rigorous analysis. Theorem 1 establishes the decoupling of quantization error correction from topology, while Theorem 3 proves that low-rank prompts can approximate the aggregated quantization error up to the tail singular values, controlled by the desired prompt rank $k$.
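The Eckart–Young bound underlying Theorem 3 can be demonstrated directly: truncating the SVD of an error matrix at a given rank leaves a Frobenius error exactly equal to the norm of the discarded tail singular values. A small NumPy check, with a random matrix standing in for the aggregated quantization error:

```python
import numpy as np

rng = np.random.default_rng(3)
E = rng.normal(size=(50, 30))          # stand-in for the aggregated error matrix

U, s, Vt = np.linalg.svd(E, full_matrices=False)
rank = 5
E_k = (U[:, :rank] * s[:rank]) @ Vt[:rank]   # best rank-5 approximation of E

# Eckart–Young: the Frobenius error of the best rank-k approximation
# equals the l2 norm of the tail singular values.
frob_err = np.linalg.norm(E - E_k)
tail = np.sqrt((s[rank:] ** 2).sum())
assert np.isclose(frob_err, tail)
```

Raising the rank shrinks the tail term, which is why the approximation quality of LoRAP's prompts is governed by the residual singular-value spectrum of the error.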

Limitations: While LoRAP requires only a small constant-factor parameter overhead, it imposes per-layer float32 prompt computations and necessitates both dequantization and requantization per layer. On extreme-scale graphs ($N \gtrsim 10^7$), the $\mathcal{O}(Nk)$ memory for prompt score activations may present challenges. Empirical evaluation has focused on standard aggregation operators; adaptation to sophisticated attention or custom message propagation requires further exploration.

Potential extensions: The method generalizes beyond GNNs. Low-rank, input-dependent prompts could be injected into quantized convolutional or self-attention outputs in CNNs or Transformers. LoRAP is also compatible with post-training quantization (PTQ) by fine-tuning only the prompt bases on a small calibration set. Mixed-precision or hierarchical prompting strategies (e.g., adaptive allocation of $k$ and $r$ per layer) are natural extensions for handling layer-specific quantization dynamics (Liu et al., 21 Jan 2026).

7. Relation to Low-Rank Prompting in Vision Graph Networks

The principle of low-rank decomposition for prompt parameterization has also been deployed in vision graph neural architectures. In "Vision Graph Prompting via Semantic Low-Rank Decomposition" (SLRP) (Ai et al., 7 May 2025), the observation is that semantically similar regions in graph-structured image data occupy a low-rank subspace. The prompt basis and injection points are constructed to respect graph topology, leveraging principal component analysis (PCA) to inform the rank and directionality of prompt matrices. While the target domain differs, the low-rank, parameter-efficient, structure-aware design is a commonality.

Both LoRAP and SLRP emphasize aligning the expressivity of the prompt space with the intrinsic, low-dimensional semantic or quantization error structure present in graph data, facilitating efficient adaptation or correction while maintaining stringent efficiency and scalability constraints (Liu et al., 21 Jan 2026, Ai et al., 7 May 2025).
