A Study on Consistency in Pretrained Language Models
The paper "Measuring and Improving Consistency in Pretrained LLMs" addresses the crucial aspect of model consistency in Pretrained LLMs (PLMs), specifically their ability to maintain invariant behavior under paraphrased input while handling factual knowledge. The authors introduce ParaRel, a substantial dataset consisting of 328 English paraphrase patterns spanning 38 relations employed to scrutinize the consistency of PLMs such as BERT, RoBERTa, and ALBERT.
Key Findings
The central finding is that all evaluated PLMs exhibit poor consistency, with large variance across relations. Although these models perform well on many language tasks, they do not encode factual knowledge in a robustly consistent way. This shortcoming limits their suitability for roles resembling Knowledge Bases (KBs), which demand a high degree of consistency.
The paper measures consistency with cloze-style queries: paraphrases that express the same relation for the same subject should elicit the same predicted object from a consistent model. In practice, the PLMs frequently change their predictions across paraphrases, especially when the syntactic structure of the query varies.
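A minimal sketch of this kind of check is shown below, assuming the Hugging Face `transformers` library and a standard BERT masked-LM checkpoint; the paraphrase strings are illustrative, and comparing only the top-1 predictions is a simplification of the paper's evaluation.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

# Assumed setup: an off-the-shelf BERT masked-LM checkpoint.
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")
model = BertForMaskedLM.from_pretrained("bert-base-cased")
model.eval()

def top_prediction(cloze: str) -> str:
    """Return the model's top-ranked token for the single [MASK] slot in `cloze`."""
    inputs = tokenizer(cloze, return_tensors="pt")
    mask_pos = (inputs["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    with torch.no_grad():
        logits = model(**inputs).logits
    return tokenizer.decode(logits[0, mask_pos].argmax(dim=-1))

# Two paraphrases of the same factual query; a consistent model should
# fill the mask with the same object for both.
paraphrases = [
    "Barack Obama was born in [MASK].",
    "Barack Obama is originally from [MASK].",
]
predictions = [top_prediction(p) for p in paraphrases]
print(predictions, "consistent:", len(set(predictions)) == 1)
```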
Methodological Contributions
The authors also propose a method to improve consistency: continued pretraining with an additional consistency loss. This loss uses KL divergence to pull the predicted distributions for paraphrases of the same query toward one another. In their experiments, BERT trained with this objective shows markedly better consistency than the baseline.
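A minimal sketch of such a KL-based consistency term is given below in PyTorch; it follows the general idea of aligning the two paraphrases' predicted distributions, but the symmetric formulation and the weighting shown in the comment are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def consistency_loss(logits_a: torch.Tensor, logits_b: torch.Tensor) -> torch.Tensor:
    """Symmetric KL divergence between the mask-slot distributions predicted
    for two paraphrases of the same query (each logits tensor has shape (vocab_size,))."""
    log_p = F.log_softmax(logits_a, dim=-1)
    log_q = F.log_softmax(logits_b, dim=-1)
    # F.kl_div(input, target, log_target=True) computes KL(target || input),
    # so the sum below is KL(P || Q) + KL(Q || P).
    return (F.kl_div(log_q, log_p, reduction="sum", log_target=True)
            + F.kl_div(log_p, log_q, reduction="sum", log_target=True))

# During continued pretraining this term would be added to the usual
# masked-language-modeling loss, e.g.:
#   total_loss = mlm_loss + lambda_consistency * consistency_loss(logits_a, logits_b)
# where lambda_consistency is an assumed, tunable weight.
```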
Implications and Future Directions
This research has several implications for PLM development. First, it highlights an unmet expectation: consistency does not emerge from pretraining as an inherent property that transfers to downstream applications. Closing this gap would reduce the need for separate, consistency-targeted fixes in NLP systems.
Second, the work argues for more careful data selection in pretraining. The paper suggests that grounding in the primary text corpus, such as Wikipedia for BERT, may improve a model's consistency and accuracy, which calls into question the value of simply increasing pretraining data volume.
Looking ahead, the authors stress that consistency should extend to a broader range of linguistic transformations, such as negation, inference, and antonymy, and identify these as the next steps toward consistent, reliable language models.
With ParaRel, the paper provides a valuable resource for evaluating and improving the consistency of new PLMs. It invites further work on making consistency a core objective of language-model training, bridging the gap between pattern recognition, factual knowledge encoding, and the demands of real-world applications.