Evaluating Infilling Capabilities in Autoregressive LLMs
Autoregressive LLMs have seen significant advances, particularly in open-ended text generation. Among these models, causal decoder-only architectures such as the GPT series have become the dominant paradigm, outperforming encoder-only and encoder-decoder alternatives in this setting. However, these models lack a crucial capability: text infilling, where the model generates text conditioned on both the preceding and the following context.
This paper introduces a method to equip causal decoder-only models with fill-in-the-middle (FIM) capabilities. The core idea is a simple data transformation: a middle span of text within a document is moved to the end, so the model practices infilling during ordinary left-to-right training. The authors then investigate whether this transformation harms the model's standard left-to-right generative ability and confirm, through extensive experiments and benchmarks, that it does not.
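To make the transformation concrete, here is a minimal sketch in Python. It uses three sentinel strings (`<PRE>`, `<SUF>`, `<MID>`) standing in for the special tokens the paper adds to the vocabulary, and it operates on raw strings rather than token IDs, so it illustrates the idea rather than reproducing the authors' implementation.

```python
import random

# Hypothetical sentinel strings standing in for the paper's special tokens;
# a real pipeline would operate on token IDs, not raw strings.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"

def fim_transform(document: str, rng: random.Random) -> str:
    """Cut a document at two random character positions and move the middle
    span to the end (PSM order: prefix, suffix, middle)."""
    i, j = sorted(rng.randint(0, len(document)) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model is still trained with an ordinary left-to-right objective on
    # this string, but it now learns to predict `middle` given both sides.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
```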
Strong Numerical Results and Key Contributions
FIM-for-Free Property
In a pivotal finding, the authors demonstrate what they term the "FIM-for-free" property: training models on a significant proportion of FIM-transformed data does not adversely affect their left-to-right generative performance. The claim is validated by training models with various proportions of FIM transformation (up to 90%) and evaluating them on standard autoregressive benchmarks. The left-to-right test loss of models trained with FIM matched that of models trained without it, indicating that the infilling capability is learned essentially for free.
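A hedged sketch of how such a mixture could be assembled during data preparation, building on the `fim_transform` helper above; the function name `prepare_corpus` is illustrative, while `fim_rate` (the fraction of documents transformed) is the knob varied in the paper's ablations:

```python
import random

def prepare_corpus(documents, fim_rate=0.5, seed=0):
    """Apply the FIM transformation to roughly `fim_rate` of the documents;
    the remainder stay as ordinary left-to-right text.  The FIM-for-free
    result is that left-to-right test loss stays essentially unchanged for
    rates up to around 90%."""
    rng = random.Random(seed)
    return [
        fim_transform(doc, rng) if rng.random() < fim_rate else doc
        for doc in documents
    ]
```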
Extensive Hyperparameter Ablations
The authors meticulously explore several hyperparameters; a code sketch illustrating some of these choices follows the list:
- FIM Rate: Higher rates up to 90% improve infilling capabilities without degrading left-to-right generation.
- Transformation Implementation: Context-level FIM generally outperforms document-level FIM.
- Order of Concatenation: The paper finds that placing the suffix before the prefix (SPM mode: suffix, prefix, middle) is generally more effective than the canonical PSM mode (prefix, suffix, middle).
- Span Selection: Selecting the middle span at random character positions is more effective than selecting it at line or token boundaries.
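The sketch below exposes two of these choices, the PSM/SPM ordering and the span granularity, on top of the sentinels defined earlier. The SPM layout shown is the naive reordering; the paper's SPM variant places the sentinels slightly differently so that PSM and SPM examples can be trained jointly, and context-level FIM (applying the transform after documents are tokenized and packed into training contexts) is not shown here.

```python
import random

def fim_transform_ablations(document: str, rng: random.Random,
                            mode: str = "PSM", span_level: str = "char") -> str:
    """FIM transform with the ordering and span-granularity choices exposed."""
    if span_level == "char":
        cuts = sorted(rng.randint(0, len(document)) for _ in range(2))
    elif span_level == "line":
        # Restrict cut points to line boundaries.
        bounds = ([0] + [k + 1 for k, c in enumerate(document) if c == "\n"]
                  + [len(document)])
        cuts = sorted(rng.choice(bounds) for _ in range(2))
    else:
        raise ValueError(f"unsupported span level: {span_level}")
    i, j = cuts
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    if mode == "PSM":    # prefix, suffix, middle
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"
    elif mode == "SPM":  # suffix, prefix, middle (naive layout)
        return f"{SUF}{suffix}{PRE}{prefix}{MID}{middle}"
    raise ValueError(f"unsupported mode: {mode}")
```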
Finetuning vs. Pretraining
A notable insight from the paper is that pretraining and finetuning differ sharply in how efficiently they instill FIM capability:
- Pretraining: Embedding FIM capability during pretraining is computationally efficient and retains the model's left-to-right generation capabilities.
- Finetuning: Retrofitting FIM capability through finetuning requires substantial additional compute resources without achieving the same level of performance as models trained with FIM from scratch.
Practical and Theoretical Implications
Practical Implications
- Training Efficiency: The FIM-for-free property suggests that training future LLMs should routinely incorporate a mixture of left-to-right and FIM-transformed data to enhance their versatility without extra cost.
- Robust Performance: Using randomly selected spans, particularly at the character level, introduces robustness, enabling these models to handle real-world scenarios where infill regions do not align neatly with token boundaries (illustrated in the sketch below).
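As a concrete illustration of the token-boundary point, the snippet below cuts a line of code mid-identifier and shows that the resulting fragments tokenize differently from any contiguous split of the original token stream. The `tiktoken` GPT-2 encoding is used purely for illustration; it is not necessarily the tokenizer used in the paper.

```python
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # illustrative tokenizer choice

text = "def fibonacci(n):"
cut = 7                                    # character-level cut inside "fibonacci"
prefix, suffix = text[:cut], text[cut:]    # "def fib" / "onacci(n):"

# The fragment token sequences are not a contiguous slice of the full line's
# tokens, so an infilling model must learn to join spans that start and end
# in the middle of tokens.
print(enc.encode(text))
print(enc.encode(prefix))
print(enc.encode(suffix))
```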
Theoretical Implications
- Attention Mechanisms: The research highlights the importance of understanding how the effective context patterns induced by FIM training (infilling realized within a purely autoregressive, causally masked setup) affect the learning dynamics of LLMs.
- Bidirectional Context Utilization: FIM training implicitly allows models to leverage future context, an attribute generally absent in canonical left-to-right generators, indicating avenues for architectural adaptations.
Future Directions
The paper proposes several future directions:
- Enhanced Span Selection: Leveraging semantically meaningful spans could further improve infilling performance.
- Steerable Generation: Techniques such as reinforcement learning from human feedback (RLHF) and instruction following could align the model's outputs more closely with user intent.
- Multiple Infilling Slots: Investigating how models can handle multiple infilling regions within a single context could broaden application scenarios.
- Evaluation and Real-World Applications: Developing benchmarks that better simulate real-world infilling tasks, particularly for natural language, remains crucial.
Conclusion
This paper establishes autoregressive models as efficient generators for diverse text completion tasks, including infilling. The FIM-for-free property offers a compelling argument for adopting FIM training as a new standard, ensuring that LLMs are equipped with versatile capabilities without sacrificing traditional performance metrics. The findings and methodologies provided pave the way for future exploration and operational deployment of more adaptable and robust LLMs.