Investigate Bidirectional Pretraining in Mistral Models
Ascertain whether Mistral-7B and related Mistral models were pre-trained using bidirectional attention mechanisms such as prefix language modeling, and rigorously analyze why enabling bidirectional attention without any additional training yields token representations with high cosine similarity to their causal counterparts, as well as strong unsupervised performance.
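A minimal sketch of the comparison described above, assuming the Hugging Face checkpoint name `mistralai/Mistral-7B-v0.1` and a recent `transformers` version that accepts custom 4D attention masks; the input sentence and the choice of comparing only the last hidden layer are illustrative, not taken from the original analysis.

```python
# Compare per-token hidden states from Mistral-7B under causal vs. fully
# bidirectional attention, with no additional training.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="eager",  # implementations like flash-attn may ignore custom masks
)
model.eval()

text = "The quick brown fox jumps over the lazy dog."  # illustrative input
inputs = tokenizer(text, return_tensors="pt")
seq_len = inputs["input_ids"].shape[1]

with torch.no_grad():
    # 1) Default forward pass: causal (left-to-right) attention.
    causal_reps = model(**inputs).last_hidden_state[0]          # (seq_len, hidden)

    # 2) Bidirectional pass: an all-zeros additive 4D mask lets every token
    #    attend to every other token, assuming the installed transformers
    #    version applies a user-supplied 4D mask as-is.
    bidir_mask = torch.zeros(1, 1, seq_len, seq_len, dtype=model.dtype)
    bidir_reps = model(
        input_ids=inputs["input_ids"],
        attention_mask=bidir_mask,
    ).last_hidden_state[0]

# Per-token cosine similarity between the two sets of representations;
# values close to 1 would mirror the reported observation.
cos = torch.nn.functional.cosine_similarity(
    causal_reps.float(), bidir_reps.float(), dim=-1
)
print(cos)
print(cos.mean())
```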
References
Based on these findings (we replicate these results for other inputs and other Mistral models in \Cref{sec:appendix:analysis}) and the strong unsupervised results for Mistral-7B with bidirectional attention, we speculate that Mistral models are pre-trained with some form of bidirectional attention, e.g., prefix language modeling \citep{raffel-etal-2020-t5} -- at least for some parts of their training. We leave a more detailed investigation of this intriguing behavior for future work.
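For reference, a small sketch of how a prefix-LM attention mask differs from a purely causal one; the sequence length and prefix length are arbitrary, and nothing here reflects documented details of Mistral's actual pre-training setup.

```python
# Illustrative prefix-LM vs. causal attention masks (True = attention allowed).
import torch

seq_len, prefix_len = 6, 3  # arbitrary example sizes

# Causal mask: each token attends only to itself and earlier tokens.
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

# Prefix-LM mask: tokens inside the prefix attend bidirectionally within the
# prefix; tokens after the prefix remain causal.
prefix_lm = causal.clone()
prefix_lm[:prefix_len, :prefix_len] = True

print(causal.int())
print(prefix_lm.int())
```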