Investigate Bidirectional Pretraining in Mistral Models

Determine whether Mistral-7B and related Mistral models were pre-trained with some form of bidirectional attention, such as prefix language modeling, and rigorously analyze why enabling bidirectional attention without any additional training (i) leaves Mistral-7B's token representations nearly unchanged, i.e., causal and bidirectional representations have high cosine similarity, and (ii) yields strong unsupervised embedding performance.

Background

The paper observes that enabling bidirectional attention without training substantially changes hidden representations for S-LLaMA-1.3B and LLaMA-2-7B, but not for Mistral-7B, where representations under causal and bidirectional attention remain highly similar across layers and positions.
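This comparison can be reproduced with a minimal sketch using Hugging Face transformers: run the same input through the model once with the default causal mask and once with an all-zeros additive 4D attention mask (so every token may attend to every other token), then measure per-layer cosine similarity between the two sets of hidden states. The model name, the custom 4D-mask trick (its handling is version-dependent in transformers; recent versions accept an additive mask where 0 means "attend"), and the mean aggregation over positions are illustrative assumptions, not the authors' exact setup.

```python
# Sketch (not the paper's code): compare Mistral-7B hidden states under causal
# vs. fully bidirectional attention, with no additional training.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "mistralai/Mistral-7B-v0.1"  # assumed checkpoint for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, attn_implementation="eager"
)
model.eval()

text = "LLM2Vec turns decoder-only LLMs into text encoders."
enc = tok(text, return_tensors="pt")
seq_len = enc["input_ids"].shape[1]

with torch.no_grad():
    # Run 1: default causal (unidirectional) attention.
    causal_out = model(**enc, output_hidden_states=True)

    # Run 2: bidirectional attention, emulated by passing an all-zeros additive
    # 4D mask of shape (batch, 1, q_len, kv_len). NOTE: recent transformers
    # versions use a user-supplied 4D mask as-is (0 = attend, large negative =
    # masked); older versions used a different convention, so verify this for
    # the version you run.
    bidir_mask = torch.zeros(1, 1, seq_len, seq_len, dtype=model.dtype)
    bidir_out = model(
        input_ids=enc["input_ids"],
        attention_mask=bidir_mask,
        output_hidden_states=True,
    )

# Per-layer cosine similarity between causal and bidirectional token representations.
for layer, (h_c, h_b) in enumerate(
    zip(causal_out.hidden_states, bidir_out.hidden_states)
):
    sim = torch.nn.functional.cosine_similarity(h_c, h_b, dim=-1)  # (1, seq_len)
    print(f"layer {layer:2d}: mean cosine similarity = {sim.mean().item():.3f}")
```

Under the paper's observation, this similarity stays high across layers for Mistral-7B but drops substantially for S-LLaMA-1.3B and LLaMA-2-7B.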

This unusual stability correlates with Mistral-7B’s strong out-of-the-box performance on embedding tasks when bidirectional attention is enabled. The authors therefore hypothesize that some form of bidirectional attention (e.g., prefix language modeling) may have been used during Mistral’s pretraining and explicitly defer a detailed investigation to future work.
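For reference, prefix language modeling, the pretraining variant the authors speculate about, differs from standard causal language modeling only in its attention mask: tokens inside a chosen prefix attend to each other bidirectionally, while the remaining tokens attend causally. A small illustrative sketch follows; the function name and the True-means-attend boolean convention are our own choices, not anything specified in the paper.

```python
import torch

def prefix_lm_mask(seq_len: int, prefix_len: int) -> torch.Tensor:
    """Boolean attention mask of shape (seq_len, seq_len); True = may attend."""
    mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))  # causal part
    mask[:prefix_len, :prefix_len] = True  # full bidirectional attention within the prefix
    return mask

# Example: 6-token sequence with a 3-token prefix.
# Rows are query positions, columns are key positions; 1 = attention allowed.
print(prefix_lm_mask(6, 3).int())
```

If Mistral-7B encountered such a mask during part of its pretraining, its representations would already be adapted to non-causal attention patterns, which would be consistent with bidirectional attention at inference time barely changing them.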

References

Based on these findings (we replicate these results for other inputs and other Mistral models in the appendix) and the strong unsupervised results for Mistral-7B with bidirectional attention, we speculate that Mistral models are pre-trained with some form of bidirectional attention, e.g., prefix language modeling (Raffel et al., 2020), at least for some parts of their training. We leave a more detailed investigation of this intriguing behavior for future work.

LLM2Vec: Large Language Models Are Secretly Powerful Text Encoders (2404.05961 - BehnamGhader et al., 9 Apr 2024) in Section 4.2 ("Why does bidirectional attention without training work for Mistral models?")