Evaluating Infilling Capabilities in Autoregressive LLMs
Autoregressive LLMs have seen significant advances, particularly in open-ended text generation. Among these models, causal decoder-only architectures such as the GPT series have become the dominant paradigm, outperforming encoder-only and encoder-decoder alternatives in this setting. However, these models lack a crucial capability: text infilling, where the model generates text conditioned on both the preceding and the following context.
This paper introduces a method to equip causal decoder-only models with fill-in-the-middle (FIM) capabilities. The core idea is a simple data transformation: a middle span of text within a document is moved to the end, so the model practices infilling during ordinary left-to-right training. The authors then investigate whether this transformation harms the model's standard left-to-right generative ability and confirm, through extensive experiments and benchmarks, that it does not.
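To make the transformation concrete, here is a minimal sketch in Python. It uses three sentinel strings (`<PRE>`, `<SUF>`, `<MID>`) standing in for the special tokens the paper adds to the vocabulary, and it operates on raw strings rather than token IDs, so it illustrates the idea rather than reproducing the authors' implementation.

```python
import random

# Hypothetical sentinel strings standing in for the paper's special tokens;
# a real pipeline would operate on token IDs, not raw strings.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Cut a document at two random character positions and move the middle
    span to the end (PSM order: prefix, suffix, middle)."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model is still trained with an ordinary left-to-right objective on
    # this string, but it now learns to predict `middle` given both sides.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```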
Strong Numerical Results and Key Contributions
FIM-for-Free Property
In a pivotal finding, the authors demonstrate what they term the "FIM-for-free" property: training models on a significant proportion of FIM-transformed data does not adversely affect their left-to-right generative performance. The claim is validated by training models with various proportions of FIM transformation (up to 90%) and evaluating them on standard autoregressive benchmarks. The left-to-right test loss of models trained with FIM matched that of models trained without it, indicating that the infilling capability is learned essentially for free.
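A hedged sketch of how such a mixture could be assembled during data preparation, building on the `fim_transform` helper above; the function name `prepare_corpus` is illustrative, while `fim_rate` (the fraction of documents transformed) is the knob varied in the paper's ablations:

```python
import random

def prepare_corpus(documents, fim_rate=0.5, seed=0):
    """Apply the FIM transformation to roughly `fim_rate` of the documents;
    the remainder stay as ordinary left-to-right text.  The FIM-for-free
    result is that left-to-right test loss stays essentially unchanged for
    rates up to around 90%."""
    rng = random.Random(seed)
    return [
        fim_transform(doc, rng) if rng.random() < fim_rate else doc
        for doc in documents
    ]
```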
Extensive Hyperparameter Ablations
The authors meticulously explore several hyperparameters; a code sketch illustrating some of these choices follows the list:
- FIM Rate: Higher rates up to 90% improve infilling capabilities without degrading left-to-right generation.
- Transformation Implementation: Context-level FIM generally outperforms document-level FIM.
- Order of Concatenation: The paper finds that placing the suffix before the prefix (SPM mode: suffix, prefix, middle) is generally more effective than the canonical PSM mode (prefix, suffix, middle).
- Span Selection: Selecting the middle span at random character positions is more effective than selecting it at line or token boundaries.
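The sketch below exposes two of these choices, the PSM/SPM ordering and the span granularity, on top of the sentinels defined earlier. The SPM layout shown is the naive reordering; the paper's SPM variant places the sentinels slightly differently so that PSM and SPM examples can be trained jointly, and context-level FIM (applying the transform after documents are tokenized and packed into training contexts) is not shown here.

```python
import random

def fim_transform_ablations(document: str, rng: random.Random,
                            mode: str = "PSM", span_level: str = "char") -> str:
    """FIM transform with the ordering and span-granularity choices exposed."""
    if span_level == "char":
        cuts = sorted(rng.randint(0, len(document)) for _ in range(2))
    elif span_level == "line":
        # Restrict cut points to line boundaries.
        bounds = ([0] + [k + 1 for k, c in enumerate(document) if c == "\n"]
                  + [len(document)])
        cuts = sorted(rng.choice(bounds) for _ in range(2))
    else:
        raise ValueError(f"unsupported span level: {span_level}")
    i, j = cuts
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    if mode == "PSM":    # prefix, suffix, middle
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    elif mode == "SPM":  # suffix, prefix, middle (naive layout)
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
    raise ValueError(f"unsupported mode: {mode}")
```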
Finetuning vs. Pretraining
A notable insight from the paper is that pretraining and finetuning differ sharply in how efficiently they instill FIM capability:
- Pretraining: Embedding FIM capability during pretraining is computationally efficient and retains the model's left-to-right generation capabilities.
- Finetuning: Retrofitting FIM capability through finetuning requires substantial additional compute resources without achieving the same level of performance as models trained with FIM from scratch.
Practical and Theoretical Implications
Practical Implications
- Training Efficiency: The FIM-for-free property suggests that training future LLMs should routinely incorporate a mixture of left-to-right and FIM-transformed data to enhance their versatility without extra cost.
- Robust Performance: Using randomly selected spans, particularly at the character level, introduces robustness, enabling these models to handle real-world scenarios where infill regions do not align neatly with token boundaries (illustrated in the sketch below).
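As a concrete illustration of the token-boundary point, the snippet below cuts a line of code mid-identifier and shows that the resulting fragments tokenize differently from any contiguous split of the original token stream. The `tiktoken` GPT-2 encoding is used purely for illustration; it is not necessarily the tokenizer used in the paper.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # illustrative tokenizer choice

text = "def fibonacci(n):"
cut = 7                                    # character-level cut inside "fibonacci"
prefix, suffix = text[:cut], text[cut:]    # "def fib" / "onacci(n):"

# The fragment token sequences are not a contiguous slice of the full line's
# tokens, so an infilling model must learn to join spans that start and end
# in the middle of tokens.
print(enc.encode(text))
print(enc.encode(prefix))
print(enc.encode(suffix))
```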
Theoretical Implications
- Attention Mechanisms: The research highlights the importance of understanding how the effective context patterns induced by FIM training (infilling realized within a purely autoregressive, causally masked setup) affect the learning dynamics of LLMs.
- Bidirectional Context Utilization: FIM training implicitly allows models to leverage future context, an attribute generally absent in canonical left-to-right generators, indicating avenues for architectural adaptations.
Future Directions
The paper proposes several future directions:
- Enhanced Span Selection: Leveraging semantically meaningful spans could further improve infilling performance.
- Steerable Generation: Techniques such as reinforcement learning from human feedback (RLHF) and instruction following could align the model's outputs more closely with user intent.
- Multiple Infilling Slots: Investigating how models can handle multiple infilling regions within a single context could broaden application scenarios.
- Evaluation and Real-World Applications: Developing benchmarks that better simulate real-world infilling tasks, particularly for natural language, remains crucial.
Conclusion
This paper establishes autoregressive models as efficient generators for diverse text completion tasks, including infilling. The FIM-for-free property offers a compelling argument for adopting FIM training as a new standard, ensuring that LLMs are equipped with versatile capabilities without sacrificing traditional performance metrics. The findings and methodologies provided pave the way for future exploration and operational deployment of more adaptable and robust LLMs.