DINT Transformer: Global and Differential Attention

Updated 23 March 2026

The paper introduces the DINT Transformer, which integrates differential local denoising with an integral global salience mechanism to overcome limitations in capturing distant token relationships.
It employs a unified parameter approach with strict row normalization, which enhances numerical stability and reduces the number of free parameters; the model also reaches a given validation loss with fewer parameters and training tokens than a standard Transformer.
Empirical evaluations demonstrate gains in language modeling, key retrieval, and in-context learning, with accuracy improvements of up to 12% over the DIFF Transformer on long-context retrieval.

DINT Transformer is a sequence-modeling architecture designed to address core deficiencies of the DIFF Transformer, specifically its inability to capture global context and its numerically unstable attention matrices stemming from a lack of enforced row normalization. DINT introduces an integral (global) component to the attention mechanism and links local (differential) and global information flows through strict normalization and a unified parameterization. Empirical evaluations demonstrate significant gains in accuracy and robustness over standard and DIFF Transformers in long-context language modeling, key information retrieval, and in-context learning (Cang et al., 29 Jan 2025).

1. Differential Attention Core

The DINT Transformer builds upon the DIFF Transformer’s “differential attention” mechanism, which is designed to suppress irrelevant context (attention noise) by forming two parallel attention distributions. For an input embedding matrix $X \in \mathbb{R}^{N \times d_{\mathrm{model}}}$, it computes two sets of queries and keys together with a single value matrix: $[Q_1; Q_2] = X W_Q, \quad [K_1; K_2] = X W_K, \quad V = X W_V$. Let $\mathrm{Soft}(Q,K) = \mathrm{softmax}\left( Q K^\top/\sqrt{d} \right)$, where each row sums to one. The differential attention output is given by $\mathrm{DiffAttn}(X) = \left( \mathrm{Soft}(Q_1, K_1) - \lambda\,\mathrm{Soft}(Q_2, K_2)\right)\, V$, where $\lambda$ is a learnable, numerically stabilized scalar, $\lambda = \exp(A_{q1} \cdot A_{k1}) - \exp(A_{q2} \cdot A_{k2}) + \lambda_{\rm init}$, with $A_{qi}, A_{ki} \in \mathbb{R}^d$ and a layer-dependent constant $\lambda_{\rm init} \in (0,1)$.
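The differential attention formula above can be sketched for a single head in dependency-free Python. This is a minimal illustration rather than the authors' implementation: it assumes the projected matrices `Q1`, `K1`, `Q2`, `K2`, `V` are already available as nested lists, and it takes `lam` as a plain float instead of the reparameterized learnable scalar. All function names are hypothetical.

```python
import math

def softmax_rows(scores):
    """Row-wise softmax with max-subtraction for numerical stability."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def soft(Q, K):
    """Soft(Q, K) = softmax(Q K^T / sqrt(d)); every row sums to one."""
    d = len(Q[0])
    scores = [[sum(q * k for q, k in zip(qi, kj)) / math.sqrt(d) for kj in K]
              for qi in Q]
    return softmax_rows(scores)

def diff_attention(Q1, K1, Q2, K2, V, lam):
    """Return the differential attention matrix and its product with V."""
    A1, A2 = soft(Q1, K1), soft(Q2, K2)
    # Subtract the scaled second attention map from the first.
    A = [[a1 - lam * a2 for a1, a2 in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    # Multiply the N x N matrix by the N x d_v value matrix.
    out = [[sum(a * v[c] for a, v in zip(row, V)) for c in range(len(V[0]))]
           for row in A]
    return A, out

# Toy inputs (N = 3 tokens, d = 2); the values are arbitrary.
Q1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K1 = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
Q2 = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
K2 = [[0.0, 1.0], [1.0, 1.0], [1.0, 0.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
lam = 0.8

A_diff, out = diff_attention(Q1, K1, Q2, K2, V, lam)
row_sums = [sum(r) for r in A_diff]  # each row sums to 1 - lam, not to 1
```

Because both softmax maps are row-stochastic, every row of the differential matrix sums to `1 - lam` rather than to one, which is the normalization gap the later sections address.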

This architecture enhances local denoising, which makes it resilient to contextually irrelevant signals, but it is inherently unable to incorporate global information from distant tokens.

2. Integral (Global) Salience Mechanism

To overcome the DIFF Transformer's limitation in modeling global dependencies, DINT introduces an “integral” term, extracting global importance scores from the signal attention matrix: $A^{(1)} = \mathrm{Soft}(Q_1, K_1) \in \mathbb{R}^{N\times N}$. Column-wise averaging yields a global importance vector: $G_j = \frac{1}{N} \sum_{i=1}^{N} A^{(1)}_{ij}$. The vector $G \in \mathbb{R}^{N}$ encodes, for each position $j$, its overall salience seen through attention across the sequence. This global information is broadcast row-wise to align shapes for attention integration.
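The column-wise averaging and row-wise broadcast described above can be sketched as follows. The function names and the toy matrix are illustrative assumptions; the only property the sketch relies on is that the signal attention matrix is row-stochastic.

```python
def global_importance(A1):
    """Column-wise mean of an N x N matrix: G[j] = (1/N) * sum_i A1[i][j]."""
    n = len(A1)
    return [sum(A1[i][j] for i in range(n)) / n for j in range(n)]

def broadcast_rows(G, n):
    """Repeat the vector G as each of n rows, giving an n x len(G) matrix."""
    return [list(G) for _ in range(n)]

# A toy row-stochastic attention matrix (each row sums to one).
A1 = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]

G = global_importance(A1)              # one salience score per position
G_exp = broadcast_rows(G, len(A1))     # same vector repeated on every row
```

Since every row of `A1` sums to one, the entries of `G` also sum to one, so each row of the broadcast matrix is itself a normalized distribution over positions.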

3. Differential–Integral Attention Synthesis and Normalization

DINT’s final attention computation forms a strictly row-normalized matrix by combining the differential and (row-broadcasted) integral terms: $\mathrm{DINTAttn}(X) = \left( \mathrm{Soft}(Q_1, K_1) - \lambda\,\mathrm{Soft}(Q_2, K_2) + \gamma\, G_{\rm exp} \right) V$, with $\lambda$ as defined above and $G_{\rm exp} \in \mathbb{R}^{N \times N}$ the broadcast global vector. Setting $\gamma = \lambda$ and tying these scalars across all attention heads within a layer ensures: $\sum_{j} \left[ \mathrm{Soft}(Q_1, K_1) - \lambda\,\mathrm{Soft}(Q_2, K_2) + \lambda\, G_{\rm exp} \right]_{ij} = 1 - \lambda + \lambda = 1$ for every row $i$. This strict normalization improves gradient scale stability across layers and prevents drift.
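A small numeric check of the combined attention matrix may help make the normalization concrete. The sketch below assumes two precomputed row-stochastic maps `A1` and `A2`, and it writes the integral weight as a separate argument `gamma`, a name chosen here for illustration. It is not the reference implementation.

```python
def dint_attention_matrix(A1, A2, lam, gamma):
    """Combine differential and integral terms into one N x N matrix."""
    n = len(A1)
    # Global importance: column-wise mean of the signal attention map.
    G = [sum(A1[i][j] for i in range(n)) / n for j in range(n)]
    # Differential term plus the global vector broadcast onto every row.
    return [[A1[i][j] - lam * A2[i][j] + gamma * G[j] for j in range(n)]
            for i in range(n)]

A1 = [
    [0.7, 0.2, 0.1],
    [0.1, 0.8, 0.1],
    [0.3, 0.3, 0.4],
]
A2 = [
    [0.2, 0.5, 0.3],
    [0.6, 0.2, 0.2],
    [0.1, 0.1, 0.8],
]
lam = 0.8

# Tie the integral weight to the differential weight.
A = dint_attention_matrix(A1, A2, lam, gamma=lam)
row_sums = [sum(row) for row in A]  # each row sums to one up to rounding
```

With the two weights tied, every row sum comes out as one up to floating-point rounding, matching the normalization property stated above.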

4. Unified Parameter and Normalization Design

DINT adopts a “unified parameter” approach, setting the weight of the integral term equal to the differential weight $\lambda$ in each layer and enforcing the same value across all heads. This yields three direct benefits:

Row-normalized attention matrices at every layer, which guarantee numerical stability.
Fewer free parameters (a single scalar per layer), resulting in stronger regularization.
Simplified initialization, with only a single $\lambda_{\rm init}$ to set per layer.
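To see why tying the two weights is what produces row normalization, it may help to write out the row sum with an untied integral weight. In the derivation below, $\gamma$ is a symbol assumed here for the integral weight, $G_j = \frac{1}{N}\sum_i \mathrm{Soft}(Q_1,K_1)_{ij}$ is the column-wise mean, and $G_{\rm exp}$ is the matrix whose every row equals $G$.

```latex
\begin{align*}
\sum_{j=1}^{N} G_j
  &= \frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{N} \mathrm{Soft}(Q_1,K_1)_{ij}
   = \frac{1}{N}\cdot N = 1, \\
\sum_{j=1}^{N} \left[\mathrm{Soft}(Q_1,K_1) - \lambda\,\mathrm{Soft}(Q_2,K_2)
  + \gamma\,G_{\rm exp}\right]_{ij}
  &= 1 - \lambda + \gamma .
\end{align*}
```

The row sum equals one exactly when $\gamma = \lambda$, so any decoupling of the two weights gives up strict row normalization.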

Additionally, group normalization is applied to each head’s output prior to concatenation, mitigating residual statistical mismatch. Empirical ablations (see Table 6 of the source) show DINT is substantially more robust than DIFF to the removal of GroupNorm (Cang et al., 29 Jan 2025).
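As a rough illustration of normalizing each head's output before concatenation, the sketch below treats every head as its own group and standardizes its output vector for one token. The exact normalization variant is not specified here, so the absence of learnable scale and shift parameters, the `eps` value, and the function names are all assumptions.

```python
import math

def normalize_head(vec, eps=1e-5):
    """Standardize one head's output vector to zero mean and unit variance."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def normalize_and_concat(head_outputs):
    """Normalize each head independently, then concatenate along features."""
    merged = []
    for vec in head_outputs:
        merged.extend(normalize_head(vec))
    return merged

# Two heads whose raw outputs differ in scale by a factor of 100.
heads = [
    [0.1, 0.2, 0.3, 0.4],
    [10.0, 20.0, 30.0, 40.0],
]
token_out = normalize_and_concat(heads)  # length 8, heads on a common scale
```

After normalization the two heads contribute features of nearly identical magnitude, which is the kind of statistical mismatch between heads that the per-head normalization is meant to reduce.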

5. Training Regime, Implementation, and Complexity

DINT is architecturally matched to DIFF Transformer baselines: 3B parameters, 24 layers, hidden size of roughly 4K, feedforward size of roughly 11K, and 8 attention heads of dimension 128. Optimization is performed with Adam (β₁ = 0.9, β₂ = 0.95), the learning rate specified in the source paper, batch size 4M tokens, weight decay 0.1, and no dropout. Both the differential and integral terms scale as $O(N^2)$ in the sequence length $N$ per layer, with no use of sparse or low-rank approximations. DINT has been validated on models up to 13B parameters and sequence lengths up to 64K tokens without changes to the algorithm.
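Because the attention maps are dense, the number of score entries grows with the square of the sequence length. The back-of-the-envelope sketch below counts entries per layer under stated assumptions: 8 heads as quoted above, two softmax maps per head, "4K" and "64K" read as 4096 and 65536 tokens, and full materialization of each map (fused attention kernels may avoid this in practice). The helper name is hypothetical.

```python
def score_entries(seq_len, num_heads=8, maps_per_head=2):
    """Attention-score entries per layer when every N x N map is dense."""
    return num_heads * maps_per_head * seq_len * seq_len

n_4k = 4 * 1024
n_64k = 64 * 1024

entries_4k = score_entries(n_4k)
entries_64k = score_entries(n_64k)

# A 16x longer context costs 16^2 = 256x more score entries.
ratio = entries_64k // entries_4k
```

The integral term adds only a length-N vector on top of the first softmax map, so it does not change the quadratic order of the count.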

6. Empirical Performance Across Task Domains

Language Modeling

On LM Eval Harness, DINT-3B achieves 62.2% average accuracy, compared to 60.6% for DIFF-3B and 57–58% for OpenLLaMA-v2 and StableLM (both 3B).
DINT reaches a given validation loss with 44% fewer parameters and 33% fewer training tokens compared to a standard Transformer.

Key Information Retrieval

On 4K-token contexts with up to 6 retrieval targets, DINT yields 0.88 accuracy versus DIFF (0.85) and the vanilla Transformer (0.55).
Scaling to 64K-token contexts, DINT maintains >0.9 accuracy across answer depths, exceeding DIFF by 12% and the standard Transformer by >50% in some settings.

In-Context Learning

On many-shot classification (datasets with 6 to 150 classes: TREC, TREC-Fine, Banking-77, Clinic-150), DINT outperforms DIFF by 2.8–4.3% absolute accuracy.
DINT shows reduced output variance under example order permutations, indicating increased robustness.
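One way such order robustness could be quantified is to score the same set of demonstrations under many random orderings and report the variance. The sketch below is an assumed protocol, not the one used in the source: `score_fn` stands in for a real model evaluation, and the two toy scorers exist only to show the extremes.

```python
import random
import statistics

def order_sensitivity(examples, score_fn, num_orders=20, seed=0):
    """Mean and population variance of score_fn over random orderings."""
    rng = random.Random(seed)
    scores = []
    for _ in range(num_orders):
        order = list(examples)
        rng.shuffle(order)          # permute the in-context demonstrations
        scores.append(score_fn(order))
    return statistics.mean(scores), statistics.pvariance(scores)

# Hypothetical (text, label) demonstrations.
demos = [("text a", 0), ("text b", 1), ("text c", 2), ("text d", 1)]

def first_label_scorer(order):
    """Toy order-sensitive scorer: depends only on the first demonstration."""
    return 1.0 if order[0][1] == 1 else 0.0

def invariant_scorer(order):
    """Toy order-invariant scorer: ignores the ordering entirely."""
    return 0.5

mean_s, var_s = order_sensitivity(demos, first_label_scorer)
mean_i, var_i = order_sensitivity(demos, invariant_scorer)
```

A model that is insensitive to demonstration order behaves like the invariant scorer, with variance at or near zero; larger variance indicates stronger dependence on ordering.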

Ablation

DINT's loss is robust (change within 0.005) across the tested settings of $\lambda_{\rm init}$.
Removing GroupNorm has much less impact on DINT than on DIFF.

7. Limitations, Prospects, and Extensions

The integral term in DINT reflects a column-wise averaging approach; more sophisticated statistics (variance, top-k) may offer orthogonal global encodings. The $O(N^2)$ complexity constrains practical use on extremely long contexts; integrating sparse or low-rank approximations is a potential research avenue. While row normalization enforces stability, it also ties the weights of the differential and integral terms; allowing controlled decoupling could offer increased adaptability. Preliminary results indicate diminishing returns beyond 64K tokens, suggesting hybrid local–global attention schemes for broader scalability.

A plausible implication is that DINT’s combined differential-integral and normalization strategy delineates a new class of robust long-context transformers that both denoise attention locally and maintain coherent global token relevance, with potential applicability beyond the tested domains (Cang et al., 29 Jan 2025).

References (1)

DINT Transformer (2025)
