More Expressive Attention with Negative Weights (2411.07176v3)

Published 11 Nov 2024 in cs.CL, cs.AI, and cs.LG

Abstract: We propose a novel attention mechanism, named Cog Attention, that enables attention weights to be negative for enhanced expressiveness, which stems from two key factors: (1) Cog Attention enhances parameter flexibility. For example, unlike traditional softmax attention heads that use a static output-value (OV) matrix to delete or copy inputs that the heads attend to, Cog Attention naturally learns to use the sign of dynamic query-key (QK) inner products to represent these operations. This enables Cog Attention to perform multiple operations simultaneously within a single head. Meanwhile, Cog Attention's OV matrix can focus more on refinement or modification. (2) Cog Attention enhances the model's robustness against representational collapse by preventing the ``over-squashing'' of earlier tokens into later positions. We develop Transformer-like models which use Cog Attention as attention modules, including decoder-only models at various scales for language modeling and U-ViT diffusion models for image generation. Experiments show that models using Cog Attention exhibit superior performance compared to those employing traditional softmax attention modules. Our approach suggests a promising research direction for rethinking and breaking the entrenched constraints of traditional softmax attention, such as the requirement for non-negative weights.

Summary

  • The paper introduces the Cog Attention mechanism, allowing negative weights to dynamically manage token deletion, copying, or retention.
  • It reformulates softmax attention by applying a sign-preserving nonlinearity, enhancing flexibility and interpretability.
  • Empirical evaluations show improved performance in language modeling and image generation while mitigating representational collapse.

The paper introduces a novel attention mechanism, "More Expressive Attention with Negative Weights", which rethinks the conventional softmax-based attention by allowing for negative attention weights. The authors argue that the standard softmax operation—by enforcing non-negativity and a normalized probability distribution—limits the expressiveness of Transformer models. Instead, the proposed mechanism, termed Cog Attention, shifts part of the processing (e.g., deletion or copying of tokens) from a static output transformation matrix (OV matrix) to the dynamic query–key (QK) inner products by permitting negative values.

The technical contributions and results can be summarized as follows:

  • Reformulation of the Attention Mechanism: The standard attention calculation is modified by redefining the nonlinearity applied to the QK inner products. Whereas softmax is given by

$$\text{softmax}(p_{i,j}) = \frac{\exp(p_{i,j} - m_i)}{\sum_{k=0}^{i} \exp(p_{i,k} - m_i)},$$

with $m_i = \max(p_i)$, the proposed scheme introduces the function

$$\phi(p_{i,j}) = \frac{s_{i,j} \cdot \exp(s_{i,j} \cdot p_{i,j} - m_i)}{\sum_{k=0}^{i} \lvert s_{i,k} \cdot \exp(s_{i,k} \cdot p_{i,k} - m_i) \rvert},$$

where $s_{i,j} = \operatorname{sign}(p_{i,j})$ and $m_i = \max(\lvert p_i \rvert)$. This formulation recovers the sign after the transformation to maintain gradient stability while ensuring sufficient kurtosis in the activation profile. Notably, the denominator normalizes by the sum of absolute values, thereby avoiding potential division-by-zero issues when negative contributions occur.
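A minimal PyTorch sketch of this nonlinearity is given below, assuming an input tensor whose last dimension indexes the attended positions for a single query; the function name `cog_attention_weights` is illustrative, and causal masking and the surrounding attention layer are omitted.

```python
import torch

def cog_attention_weights(p: torch.Tensor) -> torch.Tensor:
    """Sign-preserving attention nonlinearity phi (sketch of the formula above).

    p: raw query-key inner products, shape (..., seq_len).
    Returns signed weights whose absolute values sum to 1 along the last dimension.
    """
    s = torch.sign(p)                                        # s_{i,j} = sign(p_{i,j})
    m = p.abs().amax(dim=-1, keepdim=True)                   # m_i = max_j |p_{i,j}|
    unnorm = s * torch.exp(s * p - m)                        # signed, max-shifted exponential
    return unnorm / unnorm.abs().sum(dim=-1, keepdim=True)   # normalize by the sum of |.|
```

The resulting signed weights multiply the value vectors exactly as softmax weights would; the only change is that individual weights may be negative while their absolute values still sum to one.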

  • Enhanced Flexibility and Mechanistic Interpretability
    • A traditional softmax head can only assign non-negative weights, so a head whose static OV matrix deletes noisy tokens also suppresses the useful tokens it attends to, an adverse “friendly fire” effect.
    • In contrast, a Cog Attention head exhibits a decoupling between the attention weight sign and the output transformation, enabling selective elimination of distracting tokens while preserving informative ones. This is supported by lower Pearson correlation coefficients between the inner product values and attention weights, along with eigenvalue analyses that suggest the OV matrix in the Cog Attention head is less biased toward simple deletion, thus providing room for refinement.
  • Robustness Against Representational Collapse: The paper also examines the tendency of deep Transformer models to exhibit representational collapse, where the final token representations become too homogeneous. The authors hypothesize that negative weights reduce the effective information paths from early tokens to later ones, thereby mitigating over-squashing. Two tasks, “Finding a Zero” and “Counting Ones,” are used to quantitatively assess collapse via the relative $L_{\infty}$ norm differences between sequences that differ only in one initial token (a minimal sketch of this measurement appears after the applications bullet below). Empirical results show that models equipped with Cog Attention maintain significantly higher relative $L_{\infty}$ norm differences than traditional softmax-based models, indicating improved sensitivity to subtle differences in the input sequence.
  • Applications and Empirical Evaluations: Cog Attention is incorporated into Transformer-like architectures, referred to as Cogformer, and evaluated in two separate domains:
    • Language Modeling:

    Decoder-only models with approximately 141 million parameters are trained on the RedPajama dataset. Across various benchmark tasks (such as ARC-E, ARC-C, PIQA, SST-2, MNLI, MRPC, QQP, and RTE), Cogformer shows a consistent improvement in average accuracy compared to standard Transformers. Training loss curves over an initial phase of 50,000 steps indicate nearly identical convergence profiles once softmax is preserved in the first and last layers to stabilize training.

    • Image Generation:

    For diffusion-based image generation, a variant of U-ViT is constructed by replacing the internal softmax attention with Cog Attention (except at the first and last layers), yielding a model dubbed U-ViC. Empirical evaluations on CIFAR-10 (for unconditional generation) and MS-COCO (for text-conditioned generation) demonstrate improvements in Fréchet Inception Distance (FID) scores relative to the baseline U-ViT model. Although the implementation of Cog Attention incurs a modest time overhead due to extra operations (such as computing absolute values and additional multiplications), the improved expressiveness leads to tangible performance gains.
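Returning to the collapse measurement referenced under Robustness Against Representational Collapse above: collapse is probed by feeding the model two sequences that differ only in one early token and comparing their hidden states. The sketch below shows one plausible way to compute a relative $L_{\infty}$ gap; the exact normalization used in the paper may differ, and the function name is an illustrative assumption.

```python
import torch

def relative_linf_gap(h_a: torch.Tensor, h_b: torch.Tensor) -> float:
    """Relative L_inf gap between the final-position hidden states of two sequences
    that differ only in one early token (larger => less representational collapse).

    h_a, h_b: hidden states of shape (seq_len, d_model) produced by the same model
    for the two near-identical input sequences.
    """
    diff = (h_a[-1] - h_b[-1]).abs().max()   # L_inf norm of the difference
    scale = h_a[-1].abs().max()              # L_inf norm of the reference state
    return (diff / scale).item()
```

A collapsed model squeezes both inputs into nearly identical final representations, driving this ratio toward zero; the reported results indicate that Cog Attention keeps it substantially larger than softmax attention does.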

  • Discussion and Future Directions

    • Convergence considerations: The authors found that using Cog Attention across all layers may slow convergence during early training stages. They mitigate this by applying the traditional softmax in the first (and sometimes last) layer, ensuring a smoother optimization landscape initially (see the sketch after this list for one way such a layer-wise switch might look).
    • Attention Pattern Diversity: Analysis of attention maps reveals that Cog Attention produces more heterogeneous and diverse patterns—with reduced attention sink—compared to the sparsity often observed in vanilla Transformers. This indicates that more attention heads actively contribute to processing the input.
    • Potential Efficiency Trade-offs: The method incurs a slight computational overhead, and the authors note that further research is needed to optimize the efficiency of the proposed mechanism.
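As a concrete illustration of the convergence workaround, the sketch below wires the sign-preserving nonlinearity from the earlier sketch into a minimal causal single-head attention forward pass and falls back to ordinary softmax in the first and last layers. The function name, tensor shapes, and layer-selection rule are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def attention_forward(q, k, v, layer_idx: int, num_layers: int) -> torch.Tensor:
    """Minimal causal single-head attention: plain softmax in the first and last
    layers, the sign-preserving nonlinearity elsewhere (illustrative sketch).

    q, k, v: tensors of shape (seq_len, head_dim).
    """
    seq_len, head_dim = q.shape
    p = (q @ k.T) / head_dim ** 0.5                          # scaled QK inner products
    causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

    if layer_idx in (0, num_layers - 1):
        # Softmax attention in the boundary layers, which stabilizes early training.
        w = F.softmax(p.masked_fill(~causal, float("-inf")), dim=-1)
    else:
        # Cog-style signed weights: mask future positions, then normalize by sum of |.|.
        s = torch.sign(p)
        m = p.abs().amax(dim=-1, keepdim=True)
        unnorm = (s * torch.exp(s * p - m)).masked_fill(~causal, 0.0)
        w = unnorm / unnorm.abs().sum(dim=-1, keepdim=True)

    return w @ v                                             # aggregate value vectors
```

In the paper's models the first layer (and in some configurations the last) retains softmax; the index check above is just one way to express that choice.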

In summary, the paper presents a comprehensive analysis of how allowing negative attention weights can lead to improved expressiveness of attention mechanisms. By shifting some token-processing functions into the dynamic QK inner products and mitigating issues like representational collapse, the proposed Cog Attention mechanism demonstrates measurable benefits in both language modeling and image generation tasks.
