An Expert Analysis of "BERTs are Generative In-Context Learners"
The paper "BERTs are Generative In-Context Learners" offers a compelling exploration into the in-context learning capabilities of masked LLMs (MLMs), notably DeBERTa. This paper challenges the entrenched notion that MLMs lack the emergent ability to perform in-context learning—a domain traditionally dominated by causal LLMs such as GPT-3.
Contribution and Method
The core contribution of this paper is an inference technique that lets DeBERTa operate as a generative model without any additional training. By reformatting the sequence of input tokens, the authors show that DeBERTa can match, and on specific tasks surpass, GPT-3. Using only publicly available DeBERTa checkpoints, the paper demonstrates that the model is useful not just for encoding text but also for text generation and for ranking text completions.
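To make the idea concrete, here is a minimal sketch of how a masked language model can be driven as a left-to-right generator. It is not the authors' exact procedure: it assumes the Hugging Face transformers library, uses a generic MLM checkpoint (bert-base-cased) as a stand-in for DeBERTa, appends a fixed number of [MASK] placeholders per step, and decodes greedily.

```python
# Minimal sketch: greedy left-to-right generation with a masked language model.
# Assumptions (not from the paper): Hugging Face transformers, bert-base-cased
# standing in for DeBERTa, two appended [MASK] placeholders per step.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

def generate(prompt: str, max_new_tokens: int = 10) -> str:
    # Token ids for the prompt, keeping [CLS] but dropping the trailing [SEP].
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"][0, :-1]
    for _ in range(max_new_tokens):
        # Append [MASK] placeholders (plus [SEP]) so the model has room for a
        # continuation, then fill only the first masked position.
        masks = torch.full((2,), tokenizer.mask_token_id, dtype=torch.long)
        sep = torch.tensor([tokenizer.sep_token_id])
        inputs = torch.cat([ids, masks, sep]).unsqueeze(0)
        with torch.no_grad():
            logits = model(input_ids=inputs).logits[0, ids.shape[0]]
        next_id = logits.argmax(dim=-1, keepdim=True)
        ids = torch.cat([ids, next_id])  # greedy left-to-right decoding
    return tokenizer.decode(ids, skip_special_tokens=True)

print(generate("Paris is the capital of"))
```

Each step fills only the first masked position, so the loop mimics autoregressive decoding while still giving the bidirectional model a placeholder "future" to attend to.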
The paper organizes its methodology and evaluation into several sections:
- Method: This section describes two core techniques, text generation and ranking. For text generation, the authors append additional [MASK] tokens to accommodate the end-of-sequence token, enabling straightforward left-to-right autoregressive generation. The ranking method improves on previous pseudo-log-likelihood scoring by masking additional tokens, which reduces problems caused by local dependencies (a minimal sketch of this scoring idea appears after this list).
- Evaluation: DeBERTa's generative and in-context abilities are compared against GPT-3 on standard benchmarks, including SuperGLUE and other natural language understanding tasks, machine translation, and question-answering datasets.
- Scaling and Length Generalization: The paper evaluated how DeBERTa generalizes to longer sequences and how its performance scales with model size.
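As a rough illustration of the ranking side, the sketch below scores each candidate completion with pseudo-log-likelihood while also masking one token to the right of the scored position. This is a deliberately simplified stand-in for the paper's additional-masking scheme, again using a generic MLM checkpoint rather than DeBERTa.

```python
# Minimal sketch of pseudo-log-likelihood (PLL) ranking with extra masking.
# The exact masking window is an assumption, not the paper's precise scheme.
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased").eval()

def pll_score(text: str, extra_masks: int = 1) -> float:
    """Sum of log-probabilities of each token when it is masked out."""
    ids = tokenizer(text, return_tensors="pt")["input_ids"][0]
    total = 0.0
    for i in range(1, ids.shape[0] - 1):  # skip [CLS] and [SEP]
        masked = ids.clone()
        # Mask the scored token plus `extra_masks` tokens to its right, so the
        # prediction cannot lean on the immediately following context.
        masked[i : min(i + 1 + extra_masks, ids.shape[0] - 1)] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(input_ids=masked.unsqueeze(0)).logits[0, i]
        total += F.log_softmax(logits, dim=-1)[ids[i]].item()
    return total

# Rank candidate completions by their score under the masked LM.
candidates = ["The capital of France is Paris.", "The capital of France is Berlin."]
print(max(candidates, key=pll_score))
```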
Key Findings
- Language Understanding: DeBERTa outperformed a similarly-sized GPT-3 model on several SuperGLUE tasks. Specifically, DeBERTa showed notable strength in tasks like Boolean Questions (BoolQ) and CommitmentBank (CB), achieving higher accuracy and F1-scores.
- Language Modeling and Text Completion: DeBERTa demonstrated superior performance on tasks such as HellaSwag and StoryCloze, even though GPT-3 is designed for such tasks. DeBERTa's scaling behavior was on par with that of GPT-3.
- Machine Translation: Here, DeBERTa lagged behind GPT-3. The authors speculate that the relatively small and clean pretraining corpus of DeBERTa, which lacks multilingual data, might have contributed to this shortcoming.
- Question Answering and Commonsense Reasoning: While DeBERTa performed comparably on commonsense reasoning tasks, it was weaker at closed-book question answering, which depends heavily on recalling memorized facts; this fits the view that MLMs' strength lies in exploiting rich bidirectional context seen during training rather than in factual recall.
Implications and Speculation
The findings suggest that in-context learning is not exclusive to causal language models; rather, it is a more general phenomenon that also emerges in MLMs such as DeBERTa. This observation opens up new avenues for hybrid models that combine the strengths of masked and causal language models. Such hybrids could improve on tasks that require both generative capability and contextually grounded understanding.
The paper also points out practical limitations associated with MLMs in generative tasks, such as the inability to cache intermediate self-attention key and value vectors, resulting in slower inference times. Addressing these limitations in future research could lead to more efficient techniques for using MLMs in generative scenarios.
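The cost difference can be sketched with a back-of-the-envelope count of how many token positions must pass through the transformer layers during generation; the figures below are illustrative assumptions, not measurements from the paper.

```python
def causal_token_passes(prompt_len: int, new_tokens: int) -> int:
    # A causal LM encodes the prompt once and, thanks to cached self-attention
    # keys/values, processes only the single new token at each later step.
    return prompt_len + new_tokens

def mlm_token_passes(prompt_len: int, new_tokens: int) -> int:
    # A bidirectional MLM has no such cache: every step re-encodes the prompt
    # plus everything generated so far.
    return sum(prompt_len + t + 1 for t in range(new_tokens))

print(causal_token_passes(1000, 100))  # 1100 token passes through the layers
print(mlm_token_passes(1000, 100))     # 105050 token passes through the layers
```

For a 1,000-token prompt and 100 generated tokens, the cache-free MLM does roughly two orders of magnitude more per-layer work, which is consistent with the slower inference the paper reports.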
Future Developments
Future research should consider:
- Pretraining on Larger Corpora: Scaling up the pretraining corpus and incorporating multilingual data could enhance performance in tasks where DeBERTa currently falls behind.
- Hybrid Training Objectives: Exploring combined training approaches that leverage the bidirectional context of MLMs and the autoregressive nature of causal models could yield models with superior performance across a broader range of tasks.
- Optimization Techniques: Further work on mitigating the inference bottlenecks described above and on supporting longer context lengths will be crucial for deploying MLMs in practical generative applications.
Conclusion
This paper convincingly demonstrates that MLMs like DeBERTa have untapped potential for in-context learning, challenging prevailing assumptions in the field. Through systematic evaluation and a novel inference technique, it provides strong evidence that MLMs can be competitive with causal language models, pointing to fertile ground for future research on hybrid models and more efficient inference.