Analysis and Interpretation of BERT using Perturbed Masking
The paper introduces a technique termed "Perturbed Masking" for analyzing and interpreting pre-trained language models, BERT in particular. Unlike prior probing methods that train additional parameters on linguistic tasks, this approach is parameter-free. It leverages BERT's masked language modeling (MLM) objective to recover syntactic patterns without explicit supervision or added parameters, thereby reducing the confounding effects associated with parameterized probes.
Methodology
The Perturbed Masking technique measures the impact that individual words exert on one another within BERT's contextual representations. For a pair of tokens (x_i, x_j), the sentence is perturbed in two stages: first x_i is replaced with the [MASK] token and its contextual embedding is recorded; then x_j is additionally masked and the change in x_i's embedding is measured. The size of this change quantifies how strongly x_j influences x_i, and collecting it over all token pairs yields an impact matrix. Two variants of the distance metric are used to compute the perturbation impact: the Euclidean distance between embeddings (Dist) and a probability-based difference (Prob).
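The Dist variant can be sketched with off-the-shelf tools. The snippet below is a minimal illustration using the Hugging Face `transformers` library rather than the authors' code; it works at the sub-token level and ignores details such as sub-word averaging and the handling of [CLS]/[SEP], so it should be read as an approximation of the two-stage procedure described above.

```python
# Minimal sketch of the two-pass masking idea (Dist variant), assuming the
# Hugging Face `transformers` API; not the authors' implementation.
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased").eval()
MASK_ID = tokenizer.mask_token_id

@torch.no_grad()
def embedding_at(input_ids, position):
    """Last-layer hidden state of the token at `position`."""
    out = model(input_ids.unsqueeze(0))
    return out.last_hidden_state[0, position]

@torch.no_grad()
def impact_matrix(sentence):
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    n = ids.size(0)
    f = torch.zeros(n, n)
    for i in range(n):
        masked_i = ids.clone()
        masked_i[i] = MASK_ID                      # first pass: mask x_i
        h_i = embedding_at(masked_i, i)
        for j in range(n):
            if j == i:
                continue
            masked_ij = masked_i.clone()
            masked_ij[j] = MASK_ID                 # second pass: also mask x_j
            h_ij = embedding_at(masked_ij, i)
            f[i, j] = torch.dist(h_i, h_ij)        # Euclidean distance (Dist)
    return f
```

Each entry f[i, j] then serves as a candidate edge weight when inducing tree structures in the experiments below.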
The same perturbation process extends naturally to spans: masking an entire span instead of a single token yields span-level impact scores, which allows the investigation of document-level discourse structure encoded within BERT.
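Building on the token-level sketch above (and reusing its `model` and `MASK_ID`), a hypothetical span-level variant simply masks every token of a span in both passes and averages the resulting distances. The paper's discourse experiments operate on Elementary Discourse Units in this spirit, though the exact aggregation may differ.

```python
@torch.no_grad()
def span_impact(ids, span_a, span_b):
    """Impact of span_b on span_a; each span is a (start, end) index pair."""
    masked_a = ids.clone()
    masked_a[span_a[0]:span_a[1]] = MASK_ID        # mask the whole target span
    base = model(masked_a.unsqueeze(0)).last_hidden_state[0, span_a[0]:span_a[1]]

    masked_ab = masked_a.clone()
    masked_ab[span_b[0]:span_b[1]] = MASK_ID       # additionally mask span_b
    pert = model(masked_ab.unsqueeze(0)).last_hidden_state[0, span_a[0]:span_a[1]]

    # Average Euclidean distance over the tokens of span_a.
    return torch.norm(base - pert, dim=-1).mean()
```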
Empirical Evaluation
The technique is applied across several linguistic tasks:
- Syntactic Analysis: Impact matrices extracted from token perturbation are decoded into dependency trees with algorithms such as Eisner and Chu-Liu/Edmonds (a decoding sketch follows this list). Comparisons against baselines such as right-chain trees show better recovery of syntactic structure, although by a modest margin.
- Constituency Parsing: Using a top-down parsing approach inspired by ON-LSTM (also sketched after this list), Perturbed Masking achieves competitive F1 scores on benchmarks such as WSJ10 and PTB23, identifying phrase and clause structure without explicit syntactic supervision.
- Discourse Analysis: The approach probes document structure via discourse dependency parsing, building impact matrices over Elementary Discourse Units (EDUs). Although it falls short of parsers trained with linguistic supervision, it indicates that BERT captures some longer-range discourse dependencies.
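For the dependency experiments, a non-projective tree can be decoded from the impact matrix with the Chu-Liu/Edmonds algorithm. The sketch below uses `networkx` and a randomly generated stand-in matrix; the paper also uses the Eisner algorithm and additional heuristics (e.g., for root selection), which are omitted here.

```python
# Sketch: decode a dependency tree from a (stand-in) impact matrix via
# Chu-Liu/Edmonds, using networkx's maximum spanning arborescence.
import networkx as nx
import numpy as np

def induce_dependency_tree(impact, root=0):
    """Return (head, dependent) arcs from impact[i][j] = effect of word j on word i."""
    n = impact.shape[0]
    g = nx.DiGraph()
    for head in range(n):
        for dep in range(n):
            if head != dep:
                g.add_edge(head, dep, weight=float(impact[dep, head]))
    # Force the chosen root by removing its incoming edges.
    g.remove_edges_from([(h, root) for h in range(n) if h != root])
    tree = nx.maximum_spanning_arborescence(g)
    return sorted(tree.edges())

demo = np.random.default_rng(0).random((5, 5))   # stand-in for a real impact matrix
print(induce_dependency_tree(demo))              # list of (head, dependent) arcs
```

For constituency parsing, the ON-LSTM-style decoder recursively splits a span at the position with the largest split score, on the assumption that high scores mark weak attachment between neighbours. Below is a bare-bones version of that recursion; how the split scores are derived from the impact matrix follows the paper and is not reproduced here.

```python
def top_down_parse(words, scores):
    """Recursively split a span; scores[k] scores the boundary between words[k] and words[k+1]."""
    if len(words) <= 1:
        return words[0] if words else None
    k = max(range(len(words) - 1), key=lambda i: scores[i])   # strongest split point
    left = top_down_parse(words[:k + 1], scores[:k])
    right = top_down_parse(words[k + 1:], scores[k + 1:])
    return (left, right)

print(top_down_parse(["the", "cat", "sat", "down"], [0.2, 0.9, 0.4]))
# (('the', 'cat'), ('sat', 'down'))
```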
Implications in Downstream Applications
The paper investigates whether BERT-derived syntactic trees are useful in downstream tasks, using Aspect Based Sentiment Classification (ABSC) as a test bed. Although the induced trees differ from those produced by supervised parsers such as spaCy, feeding them into a tree-based classifier yields comparable or improved performance, indicating that BERT has learned beneficial linguistic structure.
Conclusions and Future Directions
The Perturbed Masking approach provides an alternative way to examine how BERT encodes syntactic and discourse properties without the confound of additional probe parameters, which can themselves learn part of the task. Because it analyzes perturbations of BERT's own outputs, the technique stays close to the model's intrinsic operation and offers insight into how much language structure BERT acquires implicitly.
Future research could examine a broader range of linguistic phenomena and validate the induced structures on more downstream tasks to clarify their practical value. Additionally, improvements in unsupervised dependency parsing methods may sharpen the interpretability of pre-trained language models and inform the design of more linguistically-informed architectures.