BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model

Published 11 Feb 2019 in cs.CL and cs.LG | arXiv:1902.04094v2

Abstract: We show that BERT (Devlin et al., 2018) is a Markov random field language model. This formulation gives way to a natural procedure to sample sentences from BERT. We generate from BERT and find that it can produce high-quality, fluent generations. Compared to the generations of a traditional left-to-right language model, BERT generates sentences that are more diverse but of slightly worse quality.

Citations (334)

Summary

  • The paper demonstrates that reframing BERT as a Markov random field enables text generation through Gibbs sampling.
  • The methodology employs the pseudo log-likelihood in place of the joint likelihood, avoiding the intractable partition function over all possible sequences.
  • Experimental results reveal that BERT generates coherent and diverse text, outperforming traditional autoregressive models in output variety.

BERT as a Markov Random Field Language Model

The paper "BERT has a Mouth, and It Must Speak: BERT as a Markov Random Field Language Model" proposes a novel interpretation of BERT, framing it as a Markov random field language model (MRF-LM). This perspective makes BERT usable for text generation, beyond its traditional role in language understanding tasks. The authors leverage BERT's masked language modeling objective to derive a sentence-sampling procedure based on Gibbs sampling.
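The Gibbs sampling loop can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `toy_conditional` below is a uniform stand-in for BERT's masked-LM head, which in the actual procedure supplies the distribution over words at a masked position given the rest of the sentence.

```python
import random

def gibbs_sample(seq, conditional, n_iters=10, rng=None):
    """Generate by Gibbs sampling: repeatedly pick a position,
    treat it as masked, and resample it from the conditional
    distribution over the vocabulary given the other positions."""
    rng = rng or random.Random(0)
    seq = list(seq)
    for _ in range(n_iters):
        t = rng.randrange(len(seq))      # position to resample
        probs = conditional(seq, t)      # p(w_t | rest of sentence)
        words, weights = zip(*probs.items())
        seq[t] = rng.choices(words, weights=weights)[0]
    return seq

# Toy conditional over a tiny vocabulary; a real implementation
# would run a BERT forward pass with position t replaced by [MASK].
VOCAB = ["cat", "dog", "sat", "ran"]
def toy_conditional(seq, t):
    return {w: 1.0 for w in VOCAB}       # uniform stand-in

out = gibbs_sample(["cat", "sat", "cat"], toy_conditional, n_iters=5)
print(out)  # a length-3 sequence over VOCAB
```

In the paper's non-sequential variant, the position to resample is chosen uniformly at random, as above; sweeping positions left to right is another option the authors consider.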

Overview and Methodology

BERT, initially developed for language understanding tasks, uses a masked language modeling objective wherein words are predicted based on their surrounding context. This differs from traditional autoregressive LLMs that predict the next word given its preceding words. The authors address the challenge of employing BERT as a generative model by treating it as an MRF-LM, where the joint distribution of a sentence is captured through potential functions defined over a fully-connected graph of words.
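Concretely, writing a sentence as $X = (x_1, \dots, x_T)$, a sketch of the MRF formulation consistent with the description above assigns each position a potential whose log is the masked-LM score of the observed word:

```latex
\log \phi_t(X) = \mathbf{1}(x_t)^\top f_\theta(X_{\setminus t}),
\qquad
p_\theta(X) = \frac{1}{Z(\theta)} \prod_{t=1}^{T} \phi_t(X),
```

where $f_\theta(X_{\setminus t})$ denotes BERT's output logits at position $t$ with that position masked, $\mathbf{1}(x_t)$ is the one-hot vector of word $x_t$, and $Z(\theta)$ is the partition function summing over all length-$T$ sequences, which is what makes the exact joint intractable.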

Central to this approach is the pseudo log-likelihood, which replaces the intractable joint log-likelihood (whose partition function sums over all possible sequences) with the sum of each word's conditional log-probability given the rest of the sentence. Maximizing this objective trains BERT position by position rather than over the entire sequence at once.
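The pseudo log-likelihood described above can be sketched in a few lines. As before, the uniform conditional is a hypothetical stand-in for BERT's masked-LM distribution:

```python
import math

def pseudo_log_likelihood(seq, conditional):
    """PLL(X) = sum_t log p(x_t | x_{\\t}): score each word under
    the model's conditional with that position masked, sum the logs."""
    total = 0.0
    for t, w in enumerate(seq):
        probs = conditional(seq, t)   # distribution over the vocab at t
        total += math.log(probs[w])
    return total

# Toy conditional over a 4-word vocabulary (a real implementation
# would call BERT with position t replaced by [MASK]).
def uniform_conditional(seq, t):
    vocab = ["the", "cat", "sat", "down"]
    return {w: 1.0 / len(vocab) for w in vocab}

pll = pseudo_log_likelihood(["the", "cat", "sat"], uniform_conditional)
print(round(pll, 4))  # 3 * log(1/4) = -4.1589
```

Because each term conditions only on the other observed words, the partition function over all sequences never needs to be computed.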

Results

The experimental evaluation highlights BERT's capability to generate coherent and diverse text, even when benchmarked against models like OpenAI's GPT. The diversity is particularly noteworthy: BERT's samples exhibit lower self-BLEU scores, indicating less n-gram overlap among samples and thus fewer repetitive patterns. The authors attribute this to the bidirectional context used during generation, in contrast to GPT's unidirectional, left-to-right generation.
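To make the diversity metric concrete, here is a pared-down stand-in for self-BLEU (the paper uses standard multi-reference BLEU with higher-order n-grams and brevity penalty; this sketch keeps only the core idea of measuring n-gram overlap between each sample and the rest):

```python
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def self_bleu(samples, n=2):
    """Simplified self-BLEU: for each sample, the fraction of its
    n-grams that also appear in any other sample, averaged over
    samples. Lower values indicate more diverse generations."""
    scores = []
    for i, s in enumerate(samples):
        others = set()
        for j, o in enumerate(samples):
            if j != i:
                others.update(ngrams(o, n))
        grams = ngrams(s, n)
        if not grams:
            continue
        hits = sum(1 for g in grams if g in others)
        scores.append(hits / len(grams))
    return sum(scores) / len(scores)

repetitive = [["a", "b", "c"]] * 3
diverse = [["a", "b", "c"], ["d", "e", "f"], ["g", "h", "i"]]
print(self_bleu(repetitive), self_bleu(diverse))  # 1.0 0.0
```

A set of identical samples scores 1.0 (maximal overlap), while fully distinct samples score 0.0, matching the intuition that lower self-BLEU means more diverse output.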

Although the generations score slightly worse than GPT's on metrics such as perplexity, they remain reasonably fluent. Human evaluations confirm this observation, rating BERT's outputs only slightly less fluent than GPT's.

Implications and Future Directions

This framing of BERT as an MRF-LM has significant implications for both theoretical exploration and practical applications. Theoretically, it underpins the possibility of using BERT in probabilistic models and generative frameworks, bridging the gap between language understanding and generation tasks. Practically, this allows for the deployment of BERT in diverse applications that require sentence generation, enrichment, and ranking based on the scores derived from Gibbs sampling.

For future work, the paper suggests exploring more advanced MCMC sampling methods that are computationally efficient and capable of handling variable sentence lengths robustly. Improved sampling strategies that do not necessitate a full forward pass through the network with each iteration would be beneficial in reducing computational costs and increasing the flexibility of BERT-like models in practical applications.

Conclusion

This paper offers a compelling reinterpretation of BERT, extending its scope from a language understanding paradigm to a generative modeling framework. By demonstrating that BERT operates as a Markov random field LLM, the authors open a pathway for utilizing pretrained BERT architectures in diverse generative tasks. The proposed approach, focusing on Gibbs sampling, provides a foundation for enhancing text generation models by exploiting the rich, bidirectional context available in BERT, thereby contributing significant advancements to the field of natural language processing.


Authors (2)

GitHub

  1. GitHub - nyu-dl/bert-gen (325 stars)  
