
Dynamic Bilinear Attention (DBA)

Updated 22 February 2026
  • Dynamic Bilinear Attention (DBA) is an attention mechanism that dynamically generates bilinear forms from input features to capture rich interactions while reducing computational cost.
  • It leverages learnable low-rank projections for adaptive compression, ensuring salient information is preserved in both Transformer and dynamic graph models.
  • Empirical results show DBA improves accuracy and memory efficiency in dynamic graphs and long-sequence modeling, outperforming traditional attention methods.

Dynamic Bilinear Attention (DBA) refers to a class of attention mechanisms characterized by the use of input-sensitive, dynamically constructed bilinear forms to enable information propagation or sequence modeling. DBA arises in multiple domains, notably dynamic graphs (Knyazev et al., 2019) and efficient Transformer architectures (Qin et al., 2022). It leverages learnable low-rank bilinear interactions to parameterize attention weights or event intensities, enabling adaptive, expressive, and efficient representations beyond what static or concatenation-based attentions can provide.

1. Bilinear Attention: Core Mechanisms

Dynamic Bilinear Attention replaces traditional concatenation-based vector interactions with a bilinear form. Given representations $x_i, x_j \in \mathbb{R}^d$, attention or compatibility scores are computed as

$$a_{ij} = x_i^\top W x_j$$

where $W \in \mathbb{R}^{d \times d}$ is a learned parameter matrix. This formulation supports richer feature interactions than linear layers applied to concatenated vectors. In graph-temporal contexts, bilinear attention weights are normalized via softmax over a node’s (dynamic) neighborhood:

$$\alpha_{ij} = \frac{\exp(a_{ij})}{\sum_{k \in \mathcal{N}_i} \exp(a_{ik})}$$

enabling localized, time-dependent information propagation (Knyazev et al., 2019).
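As a minimal NumPy sketch, the bilinear scoring and neighborhood softmax above can be written in a few lines (random weights stand in for the learned $W$; the node count and the neighborhood of node 0 are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                              # feature dimension
n = 5                              # number of nodes
X = rng.standard_normal((n, d))    # node features x_1..x_n
W = rng.standard_normal((d, d))    # bilinear matrix (learned; random here)

# Bilinear compatibility scores a_ij = x_i^T W x_j for all pairs
A = X @ W @ X.T                    # shape (n, n)

# Softmax over node 0's (hypothetical) neighborhood N_0 = {1, 2, 3}
neighbors = [1, 2, 3]
scores = A[0, neighbors]
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()               # attention weights over N_0, sum to 1
```

Subtracting the maximum before exponentiating is the standard numerically stable way to evaluate the softmax.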

In low-rank efficient Transformers, DBA generalizes this structure to full sequences, incorporating additional low-rank projections over both sequence length and hidden dimension to yield a bilinear softmax kernel in reduced space (Qin et al., 2022):

$$A = \mathrm{softmax}\left(\frac{Q_{\mathrm{db}} K_{\mathrm{db}}^\top}{\sqrt{d_{in}}}\right)$$

with $Q_{\mathrm{db}} = P_r(Q)\, Q W$, $K_{\mathrm{db}} = P_c(K)\, K W$, and learnable dynamic projections $P_r, P_c$.

2. Dynamic Construction and Input-Adaptivity

DBA distinguishes itself through input-sensitivity: bilinear projection matrices or attention patterns are not fixed but generated dynamically from the current input. In efficient Transformer settings, dynamic neural networks $\phi_r$, $\phi_c$ process $Q Q^\top$ and $K K^\top$ to generate sequence-compression matrices $P_r(Q), P_c(K) \in \mathbb{R}^{d_p \times n}$, along with analogous reconstructions for full-length output:

$$P_r(Q) = \phi_r(Q Q^\top), \qquad P_c(K) = \phi_c(K K^\top)$$

This allows the mechanism to allocate compression or attention capacity adaptively to the most informative or salient subsequence, preserving critical information that static projections may overlook, especially in datasets where salient regions differ per sample (Qin et al., 2022).
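A hedged sketch of this input-adaptive step: the networks $\phi_r, \phi_c$ below are hypothetical stand-ins (a single linear map over the sequence axis followed by a tanh), since the exact architecture is not reproduced here; the key point is that $P_r$ and $P_c$ change whenever $Q$ or $K$ changes:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, d_p = 16, 8, 4          # sequence length, hidden dim, compressed rank

Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))

# Hypothetical phi: a learned linear map over the sequence axis plus a
# nonlinearity; the paper's dynamic networks may differ in form.
W_phi_r = rng.standard_normal((d_p, n)) / np.sqrt(n)
W_phi_c = rng.standard_normal((d_p, n)) / np.sqrt(n)

def phi(W_phi, gram):
    # gram is QQ^T or KK^T, shape (n, n); output has shape (d_p, n)
    return np.tanh(W_phi @ gram)

P_r = phi(W_phi_r, Q @ Q.T)   # input-dependent: recomputed per sample
P_c = phi(W_phi_c, K @ K.T)
```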

In dynamic graphs, the latent attention graph is inferred at each timestep via a variational autoencoder (VAE), using neural message-passing schemes with bilinear layers to compute the posterior over multi-relational adjacency tensors $S^t$ conditioned on past embeddings (Knyazev et al., 2019).

3. Mathematical Formulation and Theoretical Properties

DBA formalizes the tradeoff between computational efficiency and representational fidelity by adopting joint low-rank projections on both sequence length and hidden state:

  1. Sequence Compression: Project queries and keys to rank $d_p \ll n$ via input-dependent $P_r, P_c$.
  2. Hidden Dimension Compression: Project to $d_{in} \ll d$ using a learnable $W$ (inspired by the Johnson–Lindenstrauss lemma for inner-product preservation).
  3. Bilinear Attention Map: Softmax is applied in the reduced space, maintaining the expressive power of full attention while saving computation:

$$\mathrm{DBA}(Q, K, V) = P_r'(Q)\; \mathrm{softmax}\!\left(\frac{Q_{\mathrm{db}} K_{\mathrm{db}}^\top}{\sqrt{d_{in}}}\right) P_V(V)$$

Theoretical analysis establishes that (a) sequence-length compression can be lossless under the formulation, and (b) the error in hidden-dimension approximation is tightly controlled probabilistically, with guarantees derived from extensions of the Johnson–Lindenstrauss lemma (Qin et al., 2022).
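Putting the three steps together, the following is a self-contained NumPy sketch of the reduced-space attention map and the full-length output reconstruction. Static random matrices stand in for the dynamic projections $P_r, P_c, P_V, P_r'$, which in DBA would be generated from the input:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, d_p, d_in = 32, 16, 8, 8   # seq length, hidden dim, compression ranks

Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))

# Stand-ins for the input-dependent projections (dynamic in real DBA)
P_r  = rng.standard_normal((d_p, n)) / np.sqrt(n)
P_c  = rng.standard_normal((d_p, n)) / np.sqrt(n)
P_V  = rng.standard_normal((d_p, n)) / np.sqrt(n)
P_rp = rng.standard_normal((n, d_p)) / np.sqrt(d_p)   # reconstruction P_r'
W    = rng.standard_normal((d, d_in)) / np.sqrt(d)    # JL-style hidden projection

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

Q_db = P_r @ Q @ W           # (d_p, d_in): sequence + hidden compression
K_db = P_c @ K @ W           # (d_p, d_in)
A = softmax(Q_db @ K_db.T / np.sqrt(d_in))   # (d_p, d_p) reduced attention map
out = P_rp @ A @ (P_V @ V)   # reconstruct to full length: (n, d)

print(out.shape)  # (32, 16)
```

Note that the softmax is computed over a $d_p \times d_p$ matrix rather than $n \times n$, which is where the savings come from.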

In the context of temporal point processes for graphs, bilinear attention is also used to parameterize event intensities, e.g.:

$$\lambda_k^t = \psi_k \log\left(1 + \exp\left(\frac{(z_i^{t^-})^\top \Omega_k\, z_j^{t^-}}{\psi_k}\right)\right)$$

where $\Omega_k$ is a learned bilinear matrix for event type $k$ (Knyazev et al., 2019).
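This intensity is a scaled softplus of the bilinear score, which keeps $\lambda_k^t$ strictly positive, as an intensity must be. A minimal sketch with random, untrained parameters:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 8
z_i = rng.standard_normal(d)            # embedding of node i just before t
z_j = rng.standard_normal(d)            # embedding of node j just before t
Omega_k = rng.standard_normal((d, d))   # bilinear matrix for event type k
psi_k = 0.5                             # positive scale (learned in the model)

score = z_i @ Omega_k @ z_j             # bilinear compatibility
lam = psi_k * np.log1p(np.exp(score / psi_k))   # scaled softplus intensity
assert lam > 0                          # intensities are always positive
```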

4. Algorithmic Implementation and Computational Complexity

DBA mechanisms entail several steps per layer:

  1. Dynamic Sequence Compression: $\mathcal{O}(n d d_p)$ to form $P_r$, $P_c$ and project $Q$, $K$.
  2. Hidden-Dimension Compression: $\mathcal{O}(d_p d d_{in})$ for multiplying by $W$.
  3. Bilinear Softmax and Output Reconstruction: $A \in \mathbb{R}^{d_p \times d_p}$ at cost $\mathcal{O}(d_p^2 d_{in})$; the final output is reconstructed to length $n$.
  4. Total Complexity: $\mathcal{O}(n d d_p + d_p^2 d_{in} + n d_p d)$, linear in input length for moderate $d_p, d_{in}$, a significant reduction from the $\mathcal{O}(n^2 d)$ of standard attention (Qin et al., 2022).
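To make the asymptotics concrete, a back-of-the-envelope FLOP comparison (the dimension values below are illustrative, not the paper's settings):

```python
# Illustrative sizes: long sequence, modest hidden dim, small compression ranks
n, d, d_p, d_in = 4096, 64, 128, 32

standard = n * n * d                        # O(n^2 d): full attention
dba = n * d * d_p + d_p**2 * d_in + n * d_p * d   # O(n d d_p + d_p^2 d_in + n d_p d)

print(standard / dba)  # ≈ 15.9 for these values
```

Because every DBA term is linear in $n$, the gap widens further as the sequence grows.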

In dynamic graphs, similar efficiency gains are realized in learning the sparse latent attention graph through end-to-end optimization of the VAE, with backpropagation flowing through Gumbel-softmax samples and bilinear edge attention (Knyazev et al., 2019).

5. Applications and Empirical Results

DBA has been validated empirically in both dynamic graph modeling and efficient long-sequence modeling:

  • Dynamic Graphs: The Latent Dynamic Graph (LDG) model uses a bilinear attention mechanism for temporal feature propagation and edge prediction. On the Social Evolution and GitHub datasets, switching to a bilinear intensity substantially reduced mean average rank (MAR): on Social Evolution, DyRep (concat) with CloseFriend gives MAR ≈ 16.0, reduced to MAR ≈ 11.0 with bilinear intensity. LDG with learned attention and bilinear layers achieved further improvements (MAR ≈ 12.7; HITS@10 up to 0.50) (Knyazev et al., 2019).
  • Transformers and Sequence Models: On Long-Range Arena (LRA), DBA achieved state-of-the-art mean accuracy (62.21%), outperforming the vanilla Transformer (58.57%) and other efficient Transformer baselines, with speedups up to $6.1\times$ at $n = 4\text{k}$ and only 9% of the vanilla model's memory. On UEA time-series benchmarks, DBA attained top accuracy (mean 73.9%). On VQA-v2, DBA improved overall accuracy while reducing parameter count (68.53% vs. 67.17% for MCAN, using only 12% of its parameters) (Qin et al., 2022).

6. Comparative Discussion, Interpretability, and Limitations

DBA contrasts with related efficient attention mechanisms:

| Method | Sequence Proj. | Input-Adaptive? | Hidden Proj. | Complexity | Notes |
| --- | --- | --- | --- | --- | --- |
| Linformer | Fixed $P_r, P_c$ | No | No | Linear | Static projections, less adaptive |
| Performer | N/A | N/A | Random | Linear | No sequence-length compression |
| DBA | Dynamic | Yes | JL-inspired | Linear | Both sequence and hidden compression |

DBA’s dynamic projections preserve per-sample informative regions, capture higher-order cross-attention, and maintain theoretical guarantees of information retention in compression. In dynamic graph settings, learned attention graphs (via DBA) are interpretable and match human-designated associations, achieving higher AUC against reference graphs (e.g. from AUC ≈ 65 to 76 under uniform prior; to 84 under sparse prior on Social Evolution) (Knyazev et al., 2019).

Limitations include the computational overhead of forming input-sensitive projection matrices, and the need to balance the compression ranks $(d_p, d_{in})$ against accuracy and speed. Suggested future work includes integrating relative positional encodings and refining the dynamic networks used to compute the projections (Qin et al., 2022).
