Dynamic Bilinear Attention (DBA)
- Dynamic Bilinear Attention (DBA) is an attention mechanism that dynamically generates bilinear forms from input features to capture rich interactions while reducing computational cost.
- It leverages learnable low-rank projections for adaptive compression, ensuring salient information is preserved in both Transformer and dynamic graph models.
- Empirical results show DBA improves accuracy and memory efficiency in dynamic graphs and long-sequence modeling, outperforming traditional attention methods.
Dynamic Bilinear Attention (DBA) refers to a class of attention mechanisms characterized by the use of input-sensitive, dynamically constructed bilinear forms to enable information propagation or sequence modeling. DBA arises in multiple domains, notably dynamic graphs (Knyazev et al., 2019) and efficient Transformer architectures (Qin et al., 2022). It leverages learnable low-rank bilinear interactions to parameterize attention weights or event intensities, enabling adaptive, expressive, and efficient representations beyond what static or concatenation-based attentions can provide.
1. Bilinear Attention: Core Mechanisms
Dynamic Bilinear Attention replaces traditional concatenation-based vector interactions with a bilinear form. Given representations $h_i, h_j \in \mathbb{R}^d$, attention or compatibility scores are computed as

$$e_{ij} = h_i^\top W\, h_j,$$

where $W \in \mathbb{R}^{d \times d}$ is a learned parameter matrix. This formulation supports richer feature interactions than linear layers applied to concatenated vectors. In graph-temporal contexts, bilinear attention weights are normalized via a softmax over a node's (dynamic) neighborhood $\mathcal{N}(i)$:

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k \in \mathcal{N}(i)} \exp(e_{ik})},$$

enabling localized, time-dependent information propagation (Knyazev et al., 2019).
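A minimal NumPy sketch of this neighborhood-normalized bilinear attention (the shapes, function names, and `neighbors` structure are illustrative assumptions, not the authors' implementation):

```python
import numpy as np

def bilinear_attention(h, W, neighbors):
    """h: (n, d) node embeddings; W: (d, d) learned bilinear matrix;
    neighbors: dict mapping node i -> list of neighbor indices."""
    alpha = {}
    for i, nbrs in neighbors.items():
        # Bilinear compatibility scores e_ij = h_i^T W h_j
        scores = np.array([h[i] @ W @ h[j] for j in nbrs])
        scores -= scores.max()            # numerical stability
        w = np.exp(scores)
        alpha[i] = w / w.sum()            # softmax over node i's neighborhood
    return alpha

rng = np.random.default_rng(0)
h = rng.standard_normal((4, 8))
W = rng.standard_normal((8, 8))
alpha = bilinear_attention(h, W, {0: [1, 2], 3: [0, 2]})
```

Each `alpha[i]` is a distribution over node `i`'s current neighbors; in a dynamic graph the `neighbors` map changes over time while `W` is shared.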
In low-rank efficient Transformers, DBA generalizes this structure to full sequences, incorporating additional low-rank projections over both the sequence length and the hidden dimension to yield a bilinear softmax kernel in a reduced space (Qin et al., 2022):

$$\mathrm{Attn}(Q, K, V) \approx \mathrm{softmax}\!\left(\frac{\tilde{Q}\tilde{K}^\top}{\sqrt{d'}}\right)\tilde{V},$$

with $\tilde{Q} = P_Q Q R$, $\tilde{K} = P_K K R$, $\tilde{V} = P_V V$, and learnable dynamic projections $P_Q, P_K, P_V \in \mathbb{R}^{k \times n}$ together with a hidden-dimension projection $R \in \mathbb{R}^{d \times d'}$.
2. Dynamic Construction and Input-Adaptivity
DBA distinguishes itself through input-sensitivity: bilinear projection matrices or attention patterns are not fixed but generated dynamically from the current input. In efficient Transformer settings, dynamic neural networks $f_Q$ and $f_K$ process $Q$ and $K$ to generate sequence-compression matrices $P_Q, P_K \in \mathbb{R}^{k \times n}$, along with analogous reconstruction matrices for full-length output:

$$P_Q = f_Q(Q), \qquad P_K = f_K(K).$$
This allows the mechanism to allocate compression or attention capacity adaptively to the most informative or salient subsequence, preserving critical information that static projections may overlook, especially in datasets where salient regions differ per sample (Qin et al., 2022).
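The input-dependent compression can be sketched as a small generating network followed by a position-wise softmax; the generator `dynamic_projection` and its softmax pooling below are assumptions for illustration, not the architecture from the paper:

```python
import numpy as np

def dynamic_projection(X, W_gen):
    """X: (n, d) input sequence; W_gen: (d, k) parameters of a small
    generating network. Returns P: (k, n), computed from X itself."""
    logits = X @ W_gen                            # (n, k) per-position scores
    logits -= logits.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    # Each of the k rows of P is a softmax distribution over the n positions,
    # so compression capacity concentrates on salient positions of this input
    return (weights / weights.sum(axis=0, keepdims=True)).T

rng = np.random.default_rng(1)
n, d, k = 16, 32, 4
Q = rng.standard_normal((n, d))
P_Q = dynamic_projection(Q, rng.standard_normal((d, k)))
Q_compressed = P_Q @ Q                            # (k, d) adaptive compression
```

Because `P_Q` depends on `Q`, two samples with salient content in different positions receive different compression matrices, which is the point of input-adaptivity.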
In dynamic graphs, the latent attention graph is inferred at each timestep via a variational autoencoder (VAE), using neural message-passing with bilinear layers to compute the posterior over multi-relational adjacency tensors, conditioned on past node embeddings (Knyazev et al., 2019).
3. Mathematical Formulation and Theoretical Properties
DBA formalizes the tradeoff between computational efficiency and representational fidelity by adopting joint low-rank projections on both the sequence length and the hidden dimension:
- Sequence Compression: Project queries and keys from length $n$ to rank $k$ via input-dependent $P_Q, P_K \in \mathbb{R}^{k \times n}$.
- Hidden Dimension Compression: Project from dimension $d$ to $d'$ using a learnable $R \in \mathbb{R}^{d \times d'}$ (inspired by the Johnson–Lindenstrauss lemma for inner-product preservation).
- Bilinear Attention Map: Softmax is applied in the reduced space, maintaining the expressive power of full attention while saving computation:

$$A = \mathrm{softmax}\!\left(\frac{(P_Q Q R)(P_K K R)^\top}{\sqrt{d'}}\right).$$
Theoretical analysis establishes that (a) sequence-length compression can be lossless under the formulation, and (b) the error in hidden-dimension approximation is tightly controlled probabilistically, with guarantees derived from extensions of the Johnson–Lindenstrauss lemma (Qin et al., 2022).
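Putting the two compressions together, a hedged NumPy sketch of the reduced-space attention (random stand-ins replace the dynamically generated $P_Q, P_K, P_V$; this is not the paper's implementation):

```python
import numpy as np

def dba_attention(Q, K, V, P_Q, P_K, P_V, R):
    """Q, K, V: (n, d); P_Q, P_K, P_V: (k, n); R: (d, d_prime)."""
    Qc = P_Q @ Q @ R                              # (k, d'): joint compression
    Kc = P_K @ K @ R                              # (k, d')
    Vc = P_V @ V                                  # (k, d)
    scores = Qc @ Kc.T / np.sqrt(Qc.shape[1])     # (k, k) bilinear map
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=-1, keepdims=True)            # softmax in reduced space
    return A @ Vc      # (k, d); reconstructed to length n by a later stage

rng = np.random.default_rng(2)
n, d, k, d_p = 12, 16, 3, 8
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
P_Q, P_K, P_V = (rng.standard_normal((k, n)) for _ in range(3))
R = rng.standard_normal((d, d_p)) / np.sqrt(d_p)  # JL-style projection
out = dba_attention(Q, K, V, P_Q, P_K, P_V, R)
```

The softmax is computed over a $k \times k$ map rather than $n \times n$, which is where the savings come from.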
In the context of temporal point processes on graphs, bilinear attention is also used to parameterize event intensities, e.g.

$$\lambda^{r}_{ij}(t) = f\!\left(h_i(t)^\top W_r\, h_j(t)\right),$$

where $W_r$ is a learned bilinear matrix for event type $r$ and $f$ ensures a non-negative rate (Knyazev et al., 2019).
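A sketch of such a bilinear intensity; the softplus nonlinearity is an assumed choice (in the style of DyRep-like models) to keep the rate non-negative, not necessarily the exact form in the source:

```python
import numpy as np

def event_intensity(h_i, h_j, W_r):
    """h_i, h_j: (d,) node embeddings at time t; W_r: (d, d) bilinear
    matrix for event type r. Returns a non-negative intensity."""
    score = h_i @ W_r @ h_j
    # Numerically stable softplus: log(1 + exp(x)) keeps the rate positive
    return np.log1p(np.exp(-abs(score))) + max(score, 0.0)

rng = np.random.default_rng(3)
lam = event_intensity(rng.standard_normal(8), rng.standard_normal(8),
                      rng.standard_normal((8, 8)))
```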
4. Algorithmic Implementation and Computational Complexity
DBA mechanisms entail several steps per layer:
- Dynamic Sequence Compression: apply $f_Q, f_K$ to form $P_Q, P_K \in \mathbb{R}^{k \times n}$, and project $\tilde{Q} = P_Q Q$, $\tilde{K} = P_K K$.
- Hidden-Dimension Compression: multiply the compressed representations by $R \in \mathbb{R}^{d \times d'}$ at cost $O(k d d')$.
- Bilinear Softmax and Output Reconstruction: softmax in the reduced space at cost $O(k^2 d')$; the final output is reconstructed to length $n$.
- Total Complexity: $O(nkd + k^2 d')$, linear in input length $n$ for moderate $k$, a significant reduction from the $O(n^2 d)$ of standard attention (Qin et al., 2022).
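The steps above can be checked with back-of-the-envelope FLOP counts (constants ignored; the values of $n$, $d$, $k$, $d'$ below are arbitrary):

```python
# Dominant-term FLOP comparison: standard attention is O(n^2 d),
# the compressed bilinear form is O(nkd + k^2 d').
n, d, k, d_prime = 4096, 512, 128, 64

standard = n * n * d                      # full n x n attention map
compressed = n * k * d + k * k * d_prime  # compression + reduced-space softmax

print(f"standard ~ {standard:.2e}, compressed ~ {compressed:.2e}")
print(f"ratio ~ {standard / compressed:.1f}x")
```

For these settings the reduced form needs roughly 30x fewer operations; the gap widens as $n$ grows, since only the $nkd$ term scales with sequence length.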
In dynamic graphs, similar efficiency gains are realized in learning the sparse latent attention graph through end-to-end optimization of the VAE, with backpropagation flowing through Gumbel-softmax samples and bilinear edge attention (Knyazev et al., 2019).
5. Applications and Empirical Results
DBA has been validated empirically in both dynamic graph modeling and efficient long-sequence modeling:
- Dynamic Graphs: The Latent Dynamic Graph (LDG) model uses a bilinear attention mechanism for temporal feature propagation and edge prediction. On Social Evolution and GitHub datasets, switching to bilinear intensity reduced mean average rank (MAR) dramatically, e.g. on Social Evolution, DyRep (concat) with CloseFriend gives MAR ≈ 16.0, reduced to MAR ≈ 11.0 with bilinear intensity. LDG with learned attention and bilinear layers achieved further improvements (MAR ≈ 12.7; HITS@10 up to 0.50) (Knyazev et al., 2019).
- Transformers and Sequence Models: On Long-Range Arena (LRA), DBA achieved state-of-the-art mean accuracy (62.21%), outperforming the vanilla Transformer (58.57%) and other efficient Transformer baselines, with substantial speedups at small compression rank $k$ and only 9% of the vanilla model's memory. On the UEA time-series benchmarks, DBA attained top accuracy (mean 73.9%). On VQA-v2, DBA improved overall accuracy while reducing parameter count (68.53% vs. 67.17% for MCAN, using only 12% of its parameters) (Qin et al., 2022).
6. Comparative Discussion, Interpretability, and Limitations
DBA contrasts with related efficient attention mechanisms:
| Method | Sequence Proj. | Input-Adaptive? | Hidden Proj. | Complexity | Notes |
|---|---|---|---|---|---|
| Linformer | Fixed | No | No | Linear | Static projections, less adaptive |
| Performer | N/A | N/A | Random | Linear | No sequence-length compression |
| DBA | Dynamic | Yes | JL-inspired | Linear | Both sequence and hidden compression |
DBA’s dynamic projections preserve per-sample informative regions, capture higher-order cross-attention, and maintain theoretical guarantees of information retention in compression. In dynamic graph settings, learned attention graphs (via DBA) are interpretable and match human-designated associations, achieving higher AUC against reference graphs (e.g. from AUC ≈ 65 to 76 under uniform prior; to 84 under sparse prior on Social Evolution) (Knyazev et al., 2019).
Limitations include the computational overhead of forming input-sensitive projection matrices, and the need to balance compression ranks against accuracy and speed trade-offs. Suggested future work includes integrating relative positional encodings and refining the dynamic networks used for projection computation (Qin et al., 2022).