DINT Transformer: Global and Differential Attention

Updated 23 March 2026
  • The paper introduces the DINT Transformer, which integrates differential local denoising with an integral global salience mechanism to overcome limitations in capturing distant token relationships.
  • It employs a unified parameter approach with strict row normalization, which improves numerical stability and reduces the number of free parameters; the model also reaches a given loss with fewer parameters and fewer training tokens.
  • Empirical evaluations demonstrate significant gains in language modeling, key retrieval, and in-context learning, with improvements up to 12% accuracy over comparable models.

DINT Transformer is a sequence-modeling architecture designed to address core deficiencies of the DIFF Transformer, specifically its inability to capture global context and its numerically unstable attention matrices stemming from a lack of enforced row normalization. DINT introduces an integral (global) component to the attention mechanism and links local (differential) and global information flows through strict normalization and a unified parameterization. Empirical evaluations demonstrate significant gains in accuracy and robustness over standard and DIFF Transformers in long-context language modeling, key information retrieval, and in-context learning (Cang et al., 29 Jan 2025).

1. Differential Attention Core

The DINT Transformer builds upon the DIFF Transformer’s “differential attention” mechanism, which is designed to suppress irrelevant context (attention noise) by forming two parallel attention distributions. For an input embedding matrix $X \in \mathbb{R}^{N \times d_{\mathrm{model}}}$, it computes two sets of queries and keys:

$$[Q_1; Q_2] = X W_Q, \quad [K_1; K_2] = X W_K, \quad V = X W_V$$

Let $\mathrm{Soft}(Q,K) = \mathrm{softmax}\left( Q K^\top/\sqrt{d} \right)$, where each row sums to one. The differential attention output is given by:

$$\mathrm{DiffAttn}(X) = \left( \mathrm{Soft}(Q_1, K_1) - \lambda\,\mathrm{Soft}(Q_2, K_2)\right)\, V$$

where $\lambda$ is a learnable, numerically stabilized scalar:

$$\lambda = \exp(A_{q1} \cdot A_{k1}) - \exp(A_{q2} \cdot A_{k2}) + \lambda_{\rm init}$$

with $A_{qi}, A_{ki} \in \mathbb{R}^d$ and a layer-dependent constant $\lambda_{\rm init} \in (0,1)$.

This architecture enhances local denoising, making the model resilient to contextually irrelevant signals, but on its own it is inherently incapable of introducing information from distant tokens.
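The differential attention formula above can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the authors' implementation: the function names (`softmax_rows`, `reparam_lambda`, `diff_attention`), the toy sizes, the value width of $2d$, and the absence of a causal mask are all assumptions made for the example.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def reparam_lambda(a_q1, a_k1, a_q2, a_k2, lam_init):
    """lambda = exp(a_q1 . a_k1) - exp(a_q2 . a_k2) + lam_init."""
    return np.exp(a_q1 @ a_k1) - np.exp(a_q2 @ a_k2) + lam_init

def diff_attention(X, W_Q, W_K, W_V, lam):
    """Single head: (Soft(Q1, K1) - lam * Soft(Q2, K2)) V, no causal mask."""
    d = W_Q.shape[1] // 2                        # width of each query/key half
    Q1, Q2 = np.split(X @ W_Q, 2, axis=-1)
    K1, K2 = np.split(X @ W_K, 2, axis=-1)
    V = X @ W_V
    A1 = softmax_rows(Q1 @ K1.T / np.sqrt(d))    # first attention map
    A2 = softmax_rows(Q2 @ K2.T / np.sqrt(d))    # second map, subtracted
    A_diff = A1 - lam * A2
    return A_diff @ V, A_diff

rng = np.random.default_rng(0)
N, d_model, d = 6, 16, 4                         # toy sizes (assumed)
X = rng.normal(size=(N, d_model))
W_Q = rng.normal(size=(d_model, 2 * d))
W_K = rng.normal(size=(d_model, 2 * d))
W_V = rng.normal(size=(d_model, 2 * d))          # value width 2d is an assumption

zeros = np.zeros(d)
lam = reparam_lambda(zeros, zeros, zeros, zeros, lam_init=0.8)  # equals lam_init here
out, A_diff = diff_attention(X, W_Q, W_K, W_V, lam)
print(A_diff.sum(axis=-1))                       # every row sums to 1 - lam
```

Because each softmax map has unit row sums, the rows of the differential map sum to $1 - \lambda$ rather than 1, which is the normalization gap the later sections address.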

2. Integral (Global) Salience Mechanism

To overcome the DIFF Transformer's limitation in modeling global dependencies, DINT introduces an “integral” term, extracting global importance scores from the signal attention matrix:

$$A^{(1)} = \mathrm{Soft}(Q_1, K_1) \in \mathbb{R}^{N\times N}$$

Column-wise averaging yields a global importance vector:

$$G[n] = \frac{1}{N} \sum_{m=1}^{N} A^{(1)}[m, n], \quad n = 1, \dots, N$$

$G \in \mathbb{R}^{N}$ encodes, for each position $n$, its overall salience seen through attention across the sequence. This global information is broadcast row-wise to align shapes for attention integration.
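As a rough illustration of the column-wise averaging and row-wise broadcast described above, the following NumPy sketch builds a small signal attention map and derives the global importance vector from it. The names `G` and `G_expanded`, the random inputs, and the toy sizes are assumptions for the example only.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
N, d = 5, 4                                   # toy sizes (assumed)
Q1 = rng.normal(size=(N, d))
K1 = rng.normal(size=(N, d))

A1 = softmax_rows(Q1 @ K1.T / np.sqrt(d))     # (N, N), each row sums to 1
G = A1.mean(axis=0)                           # column-wise average, shape (N,)
G_expanded = np.tile(G, (N, 1))               # every row is a copy of G

print(G.sum())                                # 1 up to floating-point rounding
print(G_expanded.shape)                       # (5, 5)
```

Since every row of the signal map sums to one, averaging over rows gives a vector whose entries also sum to one, so each row of the broadcast matrix is itself a valid distribution over positions.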

3. Differential–Integral Attention Synthesis and Normalization

DINT’s final attention computation forms a strictly row-normalized matrix by combining the differential and (row-broadcasted) integral terms:

$$A_{\mathrm{final}} = \mathrm{Soft}(Q_1, K_1) - \lambda\,\mathrm{Soft}(Q_2, K_2) + \lambda\, G_{\mathrm{expanded}}, \qquad \mathrm{DINTAttn}(X) = A_{\mathrm{final}}\, V$$

with $\lambda$ as defined above and $G_{\mathrm{expanded}} \in \mathbb{R}^{N \times N}$ the broadcast global vector, each row of which holds the column-wise average of $\mathrm{Soft}(Q_1, K_1)$. Using the same $\lambda$ for the subtracted term and the integral term, and tying this scalar across all attention heads within a layer, ensures:

$$\sum_{j=1}^{N} A_{\mathrm{final}}[i, j] = 1 - \lambda + \lambda = 1 \quad \text{for every row } i$$

This strict normalization improves gradient scale stability across layers and prevents drift.
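The row-normalization property can be checked numerically. The sketch below assumes the integral term is weighted by the same scalar as the subtracted attention map, which is what the unit row sum requires; the function name `dint_attention`, the random inputs, and the omission of causal masking and multi-head handling are assumptions of this example rather than details taken from the paper.

```python
import numpy as np

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)

def dint_attention(Q1, K1, Q2, K2, V, lam):
    """Combine differential and integral terms with one shared scalar lam."""
    d = Q1.shape[-1]
    A1 = softmax_rows(Q1 @ K1.T / np.sqrt(d))   # signal map
    A2 = softmax_rows(Q2 @ K2.T / np.sqrt(d))   # subtracted map
    G = A1.mean(axis=0, keepdims=True)          # (1, N) global importance
    A_final = A1 - lam * A2 + lam * G           # G broadcasts across rows
    return A_final @ V, A_final

rng = np.random.default_rng(0)
N, d, d_v = 6, 4, 8                             # toy sizes (assumed)
Q1, K1, Q2, K2 = (rng.normal(size=(N, d)) for _ in range(4))
V = rng.normal(size=(N, d_v))

out, A_final = dint_attention(Q1, K1, Q2, K2, V, lam=0.8)
print(A_final.sum(axis=-1))                     # each row sums to 1 up to rounding
```

Note that unit row sums do not imply non-negative entries: individual weights can still be negative wherever the subtracted map dominates.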

4. Unified Parameter and Normalization Design

DINT adopts a “unified parameter” approach, using a single scalar $\lambda$ for both the differential and integral terms in each layer and enforcing the same value across all heads. This yields three direct benefits:

  • Row-normalized attention matrices at every layer guarantee numerical stability.
  • Fewer free parameters (single scalar per layer) resulting in stronger regularization.
  • Simplified initialization with only a single $\lambda_{\rm init}$ to set per layer.

Additionally, group normalization is applied to each head’s output prior to concatenation, mitigating residual statistical mismatch. Empirical ablations (see Table 6 of the source) show DINT is substantially more robust than DIFF to the removal of GroupNorm (Cang et al., 29 Jan 2025).
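One way to picture per-head normalization before concatenation is sketched below. It treats each head as its own group and standardizes that head's features per token; the function name `headwise_group_norm`, the mean/variance form without learnable scale, and the deliberately mismatched head scales are assumptions for illustration, and the paper's exact normalization variant may differ.

```python
import numpy as np

def headwise_group_norm(head_outputs, eps=1e-5):
    """Standardize each head's features per token, then concatenate heads.

    head_outputs: array of shape (H, N, d_v), one slice per attention head.
    Returns an array of shape (N, H * d_v).
    """
    mean = head_outputs.mean(axis=-1, keepdims=True)
    var = head_outputs.var(axis=-1, keepdims=True)
    normed = (head_outputs - mean) / np.sqrt(var + eps)
    H, N, d_v = normed.shape
    # Move the token axis first, then lay the heads side by side.
    return normed.transpose(1, 0, 2).reshape(N, H * d_v)

rng = np.random.default_rng(0)
scales = np.array([1.0, 2.0, 5.0, 10.0]).reshape(4, 1, 1)  # heads on different scales
heads = rng.normal(size=(4, 6, 8)) * scales
merged = headwise_group_norm(heads)
print(merged.shape)                                        # (6, 32)
```

After this step every head contributes features with comparable statistics, regardless of how differently scaled the raw head outputs were.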

5. Training Regime, Implementation, and Complexity

DINT is architecturally matched to DIFF Transformer baselines: 3B parameters, 24 layers, hidden size of roughly 4K, feedforward size of roughly 11K, 8 attention heads of dimension 128. Optimization is performed with Adam (β₁ = 0.9, β₂ = 0.95), a learning rate of [value lost in extraction], batch size 4M tokens, weight decay 0.1, and no dropout. Both the differential and integral terms scale as $O(N^2)$ in sequence length $N$ per layer, with no use of sparse or low-rank approximations. DINT has been validated on models up to 13B parameters and sequence lengths up to 64K tokens without changes to the algorithm.
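To give a sense of what quadratic scaling means at the sequence lengths mentioned above, the following back-of-the-envelope calculation estimates the memory for one dense $N \times N$ attention map. The helper `attention_map_gib` is hypothetical, and the figures assume 16-bit entries and a fully materialized map; fused attention kernels typically avoid storing the whole map, so this is an upper-bound illustration only.

```python
def attention_map_gib(seq_len, bytes_per_entry=2):
    """Memory for one dense seq_len x seq_len attention map, in GiB."""
    return seq_len * seq_len * bytes_per_entry / 2**30

for n in (4096, 16384, 65536):
    print(n, attention_map_gib(n))
# 4096 0.03125
# 16384 0.5
# 65536 8.0
```

Going from 4K to 64K tokens multiplies the sequence length by 16 and the size of each map by 256, and this applies per head and per layer.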

6. Empirical Performance Across Task Domains

Language Modeling

  • On LM Eval Harness, DINT-3B achieves 62.2% average accuracy, compared to 60.6% for DIFF-3B and 57–58% for OpenLLaMA-v2 and StableLM (both 3B).
  • DINT reaches a given validation loss with 44% fewer parameters and 33% fewer training tokens compared to a standard Transformer.

Key Information Retrieval

  • On 4K-token contexts with up to 6 retrieval targets, DINT yields 0.88 accuracy versus DIFF (0.85) and the vanilla Transformer (0.55).
  • Scaling to 64K-token contexts, DINT maintains >0.9 accuracy across answer depths, exceeding DIFF by 12% and the standard Transformer by >50% in some settings.

In-Context Learning

  • On many-shot classification (datasets with 6 to 150 classes: TREC, TREC-Fine, Banking-77, Clinic-150), DINT outperforms DIFF by 2.8–4.3% absolute accuracy.
  • DINT shows reduced output variance under example order permutations, indicating increased robustness.

Ablation

  • DINT's loss is robust (a change on the order of 0.005) to the setting of $\lambda_{\rm init}$ across the tested values.
  • Removing GroupNorm has much less impact on DINT compared to DIFF.

7. Limitations, Prospects, and Extensions

The integral term in DINT reflects a column-wise averaging approach; more sophisticated statistics (variance, top-k) may offer orthogonal global encodings. The $O(N^2)$ complexity in sequence length constrains practical use on extremely long contexts; integrating sparse or low-rank approximations is a potential research avenue. While row normalization enforces stability, it also ties the weights of the differential and integral terms; allowing controlled decoupling could offer increased adaptability. Preliminary results indicate diminishing returns beyond 64K tokens, suggesting hybrid local–global attention schemes for broader scalability.

A plausible implication is that DINT’s combined differential-integral and normalization strategy delineates a new class of robust long-context transformers that both denoise attention locally and maintain coherent global token relevance, with potential applicability beyond the tested domains (Cang et al., 29 Jan 2025).

References

1. Cang et al., “DINT Transformer” (29 Jan 2025).
