- The paper reformulates multi-head attention by integrating transformation norms with attention weights to more accurately capture code structure.
- The paper employs empirical analysis on CodeBERT using Java and Python snippets from CodeSearchNet, revealing a canceling effect between the attention components.
- The paper finds that scaled transformation norms align more closely with abstract syntax tree properties, offering deeper insights into model behavior.
The paper "Naturalness of Attention: Revisiting Attention in Code LLMs" introduces an analysis of attention mechanisms within LLMs of Code (LMC) such as CodeBERT. The paper posits that prior analyses focusing solely on attention weights overlook other pertinent factors within the Transformer architecture's multi-head attention mechanism. The authors present a reformulation of the multi-head attention mechanism to emphasize the importance of considering both attention weights and the transformation of input representations.
The authors conduct an empirical study using CodeBERT on Java and Python code snippets from the CodeSearchNet dataset. Their methodology involves extracting attention weights, transformation norms, and scaled transformation norms from each self-attention head (a minimal extraction sketch follows the research questions below). Tokens are categorized as keywords, identifiers, literals, operators, and special symbols based on the grammar of each language. The paper addresses two primary research questions:
- RQ1: How do the general trends across layers compare between attention weights (α) and the scaled transformation norms (αf(x))?
- RQ2: How does αf(x) align with the syntactic structure of source code compared to attention weights?
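To make these quantities concrete, the sketch below shows one way to pull per-layer attention weights and approximate scaled transformation norms out of CodeBERT with the Hugging Face `transformers` library. It is a minimal illustration rather than the paper's implementation: it uses the public `microsoft/codebert-base` checkpoint, a toy code snippet, and approximates f(x_j) by the per-head value vector of token j, omitting the per-head output projection that a complete definition of the transformation would include.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModel.from_pretrained("microsoft/codebert-base")
model.eval()

code = "def add(a, b):\n    return a + b"   # toy snippet for illustration
inputs = tokenizer(code, return_tensors="pt")

# Capture per-head value vectors from every self-attention module via forward hooks.
# NOTE: as a simplification, f(x_j) is taken to be the value vector of token j;
# the paper's transformation may also fold in the per-head output projection.
values = []

def grab_values(module, args, output):
    hidden = args[0]                                   # layer input (batch, seq, hidden)
    v = module.value(hidden)                           # value projection
    b, s, _ = v.shape
    v = v.view(b, s, module.num_attention_heads, module.attention_head_size)
    values.append(v.permute(0, 2, 1, 3))               # (batch, heads, seq, head_dim)

hooks = [layer.attention.self.register_forward_hook(grab_values)
         for layer in model.encoder.layer]
with torch.no_grad():
    out = model(**inputs, output_attentions=True)
for h in hooks:
    h.remove()

# out.attentions[l][b, h, i, j] is the attention weight alpha_{i,j} of head h in layer l.
for layer_idx, (alpha, v) in enumerate(zip(out.attentions, values)):
    f_norm = v.norm(dim=-1)                            # ||f(x_j)|| per head and token
    scaled = alpha * f_norm.unsqueeze(2)               # alpha_{i,j} * ||f(x_j)||
    print(f"layer {layer_idx}: mean alpha={alpha.mean().item():.4f}, "
          f"mean scaled norm={scaled.mean().item():.4f}")
```

Averaging these maps per layer and per token category (keywords, identifiers, literals, operators, special symbols) yields the kind of layer-wise trends the study compares under RQ1.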
Key findings and observations include:
- RQ1 Results: The study finds that special tokens such as [CLS] (classification token) and [SEP] (separator token) receive higher average attention weights, consistent with prior analyses of BERT but contradicting prior work on code-specific BERT models. However, the contribution of these tokens, as measured by scaled transformation norms, is lower than that of identifiers and special symbols. The paper highlights a "canceling effect" between attention weights and transformation norms, where one factor may compensate for the other; for example, constant attention weights across layers may coincide with peaks or declines in the scaled transformation norms.
- The authors show that when CodeBERT finds no relevant information in the input, it assigns higher attention values to the special tokens, since the softmax forces the attention weights to sum to 1.
- RQ2 Results: The researchers assess the syntactic alignment of attention weights and scaled transformation norms using a metric pα(f), which measures the agreement between attention maps and property maps derived from Abstract Syntax Trees (ASTs). They define f(i,j) to return 1 if tokens i and j share the same parent in the AST, and 0 otherwise. The results indicate that scaled transformation norms generally exhibit better alignment with syntactic properties than attention weights alone, particularly in the earlier layers, although there are specific layers where attention weights show higher alignment.
pα(f) is formally defined as:
${p_\alpha}(f) = \frac{\sum\limits_{\mathbf{x} \in \mathbf{X}} \sum\limits_{i=1}^{|\mathbf{x}|} \sum\limits_{j=1}^{|\mathbf{x}|} f(i, j) \cdot \mathbbm{1}_{\alpha_{i,j} > \theta}}{\sum\limits_{\mathbf{x} \in \mathbf{X}} \sum\limits_{i=1}^{|\mathbf{x}|} \sum\limits_{j=1}^{|\mathbf{x}|} \mathbbm{1}_{\alpha_{i,j} > \theta}}$
where
- X is the set of code snippets
- |x| is the number of tokens in code snippet x
- f(i,j) returns 1 if tokens i and j share the same parent in the AST, 0 otherwise
- αi,j is the attention weight between tokens i and j
- θ is a threshold for high-confidence attention weights
- $\mathbbm{1}_{\alpha_{i,j} > \theta}$ is an indicator function that equals 1 if αi,j>θ, 0 otherwise
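As a concrete illustration, the sketch below computes pα(f) from precomputed per-snippet maps. It assumes you already have, for one attention head, either the raw attention matrices or the scaled-norm matrices, plus 0/1 "same AST parent" matrices built with an external parser (e.g., tree-sitter) and a mapping from CodeBERT subword tokens to AST leaves; the function name `p_alpha` and the threshold value 0.3 are illustrative, not taken from the paper.

```python
import numpy as np

def p_alpha(attention_maps, parent_maps, theta=0.3):
    """Fraction of high-confidence attention edges that connect tokens
    sharing the same AST parent, i.e. the p_alpha(f) metric defined above.

    attention_maps: list of (n, n) arrays, one per snippet, holding either raw
                    attention weights or scaled transformation norms for one head.
    parent_maps:    list of (n, n) 0/1 arrays; entry (i, j) is 1 iff tokens i and j
                    share the same parent node in the AST.
    theta:          high-confidence threshold (0.3 is an illustrative default).
    """
    hits, total = 0, 0
    for alpha, f in zip(attention_maps, parent_maps):
        mask = alpha > theta            # indicator 1_{alpha_{i,j} > theta}
        hits += int(f[mask].sum())      # numerator: high-confidence edges with f(i, j) = 1
        total += int(mask.sum())        # denominator: all high-confidence edges
    return hits / total if total else 0.0

# Tiny worked example: 4 entries exceed theta, 3 of them share an AST parent -> 0.75.
alpha = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.5, 0.4],
                  [0.2, 0.2, 0.6]])
same_parent = np.array([[1, 1, 0],
                        [1, 1, 0],
                        [0, 0, 1]])
print(p_alpha([alpha], [same_parent]))  # 0.75
```

Note that scaled transformation norms are not bounded in [0, 1] the way attention weights are, so applying the same fixed θ to both would in practice require normalizing the maps or choosing θ per map; the paper's exact thresholding is not reproduced here.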
The authors conclude that analyzing attention mechanisms in LMCs requires considering scaled transformation norms in addition to attention weights. The distinct behaviors of attention weights and scaled transformation norms suggest that each captures different aspects of code properties.
Future research directions involve expanding the study to other programming languages and to other LMCs such as GraphCodeBERT and CodeT5, as well as analyzing models trained with programming-language-oriented techniques.