- The paper reformulates multi-head attention by integrating transformation norms with attention weights to more accurately capture code structure.
- The paper employs empirical analysis on CodeBERT using Java and Python snippets from CodeSearchNet, revealing a canceling effect between the attention components.
- The paper finds that scaled transformation norms align more closely with abstract syntax tree properties, offering deeper insights into model behavior.
The paper "Naturalness of Attention: Revisiting Attention in Code LLMs" introduces an analysis of attention mechanisms within LLMs of Code (LMC) such as CodeBERT. The paper posits that prior analyses focusing solely on attention weights overlook other pertinent factors within the Transformer architecture's multi-head attention mechanism. The authors present a reformulation of the multi-head attention mechanism to emphasize the importance of considering both attention weights and the transformation of input representations.
The authors conduct an empirical study using CodeBERT on Java and Python code snippets from the CodeSearchNet dataset. Their methodology involves extracting attention weights, transformation norms, and scaled transformation norms from each self-attention head (a minimal extraction sketch follows the research questions below). Tokens are categorized as keywords, identifiers, literals, operators, and special symbols based on the grammar of each language. The paper addresses two primary research questions:
- RQ1: How do the general trends across layers compare between attention weights (α) and the scaled transformation norms (αf(x))?
- RQ2: How does αf(x) align with the syntactic structure of source code compared to attention weights?
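To make these quantities concrete, the sketch below shows one way to pull per-layer attention weights and approximate scaled transformation norms out of CodeBERT with the Hugging Face `transformers` library. It is a minimal illustration rather than the paper's implementation: it uses the public `microsoft/codebert-base` checkpoint, a toy code snippet, and approximates f(x_j) by the per-head value vector of token j, omitting the per-head output projection that a complete definition of the transformation would include.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b):\n    return a + b"   # toy snippet for illustration
inputs = tokenizer(code, return_tensors="pt")

# Capture per-head value vectors from every self-attention module via forward hooks.
# NOTE: as a simplification, f(x_j) is taken to be the value vector of token j;
# the paper's transformation may also fold in the per-head output projection.
values = []

def grab_values(module, args, output):
    hidden = args[0]                                   # layer input (batch, seq, hidden)
    v = module.value(hidden)                           # value projection
    b, s, _ = v.shape
    v = v.view(b, s, module.num_attention_heads, module.attention_head_size)
    values.append(v.permute(0, 2, 1, 3))               # (batch, heads, seq, head_dim)

hooks = [layer.attention.self.register_forward_hook(grab_values)
         for layer in model.encoder.layer]
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
for h in hooks:
    h.remove()

# out.attentions[l][b, h, i, j] is the attention weight alpha_{i,j} of head h in layer l.
for layer_idx, (alpha, v) in enumerate(zip(out.attentions, values)):
    f_norm = v.norm(dim=-1)                            # ||f(x_j)|| per head and token
    scaled = alpha * f_norm.unsqueeze(2)               # alpha_{i,j} * ||f(x_j)||
    print(f"layer {layer_idx}: mean alpha={alpha.mean().item():.4f}, "
          f"mean scaled norm={scaled.mean().item():.4f}")
```

Averaging these maps per layer and per token category (keywords, identifiers, literals, operators, special symbols) yields the kind of layer-wise trends the study compares under RQ1.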
Key findings and observations include:
- RQ1 Results: The study finds that special tokens such as [CLS] (classification token) and [SEP] (separator token) receive higher average attention weights, consistent with prior analyses of BERT but contradicting prior work on code-specific BERT models. However, the contribution of these tokens, as measured by scaled transformation norms, is lower than that of identifiers and special symbols. The paper highlights a "canceling effect" between attention weights and transformation norms, where one factor may compensate for the other; for example, constant attention weights across layers may coincide with peaks or declines in the scaled transformation norms.
- The authors show that when CodeBERT finds no relevant information in the input, it assigns higher attention values to the special tokens, since the softmax forces the attention weights to sum to 1.
- RQ2 Results: The researchers assess the syntactic alignment of attention weights and scaled transformation norms using a metric pα(f), which measures the agreement between attention maps and property maps derived from Abstract Syntax Trees (ASTs). They define f(i,j) to return 1 if tokens i and j share the same parent in the AST, and 0 otherwise. The results indicate that scaled transformation norms generally exhibit better alignment with syntactic properties than attention weights alone, particularly in the earlier layers, although there are specific layers where attention weights show higher alignment.
pα(f) is formally defined as:
${p_\alpha}(f) = \frac{\sum\limits_{\mathbf{x} \in \mathbf{X}} \sum\limits_{i=1}^{|\mathbf{x}|} \sum\limits_{j=1}^{|\mathbf{x}|} f(i, j) \cdot \mathbbm{1}_{\alpha_{i,j} > \theta}}{\sum\limits_{\mathbf{x} \in \mathbf{X}} \sum\limits_{i=1}^{|\mathbf{x}|} \sum\limits_{j=1}^{|\mathbf{x}|} \mathbbm{1}_{\alpha_{i,j} > \theta}}$
where
- X is the set of code snippets
- |x| is the number of tokens in code snippet x
- f(i,j) returns 1 if tokens i and j share the same parent in the AST, 0 otherwise
- αi,j is the attention weight between tokens i and j
- θ is a threshold for high-confidence attention weights
- $\mathbbm{1}_{\alpha_{i,j} > \theta}$ is an indicator function that equals 1 if αi,j>θ, 0 otherwise
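As a concrete illustration, the sketch below computes pα(f) from precomputed per-snippet maps. It assumes you already have, for one attention head, either the raw attention matrices or the scaled-norm matrices, plus 0/1 "same AST parent" matrices built with an external parser (e.g., tree-sitter) and a mapping from CodeBERT subword tokens to AST leaves; the function name `p_alpha` and the threshold value 0.3 are illustrative, not taken from the paper.

```python
import numpy as np

def p_alpha(attention_maps, parent_maps, theta=0.3):
    """Fraction of high-confidence attention edges that connect tokens
    sharing the same AST parent, i.e. the p_alpha(f) metric defined above.

    attention_maps: list of (n, n) arrays, one per snippet, holding either raw
                    attention weights or scaled transformation norms for one head.
    parent_maps:    list of (n, n) 0/1 arrays; entry (i, j) is 1 iff tokens i and j
                    share the same parent node in the AST.
    theta:          high-confidence threshold (0.3 is an illustrative default).
    """
    hits, total = 0, 0
    for alpha, f in zip(attention_maps, parent_maps):
        mask = alpha > theta            # indicator 1_{alpha_{i,j} > theta}
        hits += int(f[mask].sum())      # numerator: high-confidence edges with f(i, j) = 1
        total += int(mask.sum())        # denominator: all high-confidence edges
    return hits / total if total else 0.0

# Tiny worked example: 4 entries exceed theta, 3 of them share an AST parent -> 0.75.
alpha = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.5, 0.4],
                  [0.2, 0.2, 0.6]])
same_parent = np.array([[1, 1, 0],
                        [1, 1, 0],
                        [0, 0, 1]])
print(p_alpha([alpha], [same_parent]))  # 0.75
```

Note that scaled transformation norms are not bounded in [0, 1] the way attention weights are, so applying the same fixed θ to both would in practice require normalizing the maps or choosing θ per map; the paper's exact thresholding is not reproduced here.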
The authors conclude that analyzing attention mechanisms in LMCs requires considering scaled transformation norms in addition to attention weights. The distinct behaviors of attention weights and scaled transformation norms suggest that each captures different aspects of code properties.
Future research directions involve expanding the study to other programming languages and to other LMCs such as GraphCodeBERT and CodeT5, as well as analyzing models trained with programming-language-oriented techniques.