- The paper presents the Suri dataset and I-ORPO method to enhance LLMs' ability to follow complex, multi-constraint instructions in long-form text generation.
- It leverages backtranslation and synthetically corrupted instructions to fine-tune models, resulting in texts averaging roughly 5,000 tokens with improved coherence.
- Ranking experiments show that Suri-I-ORPO outperforms baseline models by at least 10% in distinguishing correct from corrupted instructions, and human evaluators prefer its generations over Suri-SFT's.
Suri: Multi-constraint Instruction Following for Long-form Text Generation
The paper "Suri: Multi-constraint Instruction Following for Long-form Text Generation" presents an in-depth paper on enhancing the instruction-following capabilities of LLMs in the context of generating long-form text. Authored by Chau Minh Pham, Simeng Sun, and Mohit Iyyer from the University of Massachusetts Amherst, the paper explores complex, multi-constraint instruction following—a topic that has been underexplored in the domain of LLM-based text generation.
Overview
The paper introduces Suri, a dataset comprising 20,000 human-written long-form texts accompanied by LLM-generated backtranslated instructions containing multiple complex constraints. The Suri dataset is notable for its combination of long-form outputs (up to 5,024 tokens) and intricate, multi-faceted instructions. This union offers a unique opportunity to fine-tune LLMs to follow complex directives over extended textual spans, a feat that traditional datasets like Alpaca have not achieved.
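To make the dataset's shape concrete, the snippet below sketches what a single Suri example might look like. This is a minimal illustration only: the field names, the instruction text, and the constraint wording are assumptions for exposition, not the released dataset's actual schema.

```python
# Hypothetical structure of one Suri example (field names and contents are
# illustrative; they do not reproduce the released dataset's schema).
example = {
    # Backtranslated main instruction generated by an LLM from the gold text.
    "instruction": "Write a chapter of a mystery novel set in a coastal town.",
    # Multiple fine-grained constraints the gold text already satisfies.
    "constraints": [
        "Narrate from the first-person perspective of the detective.",
        "Introduce exactly three suspects before the midpoint of the chapter.",
        "End the chapter on an unresolved discovery.",
    ],
    # Human-written long-form text (up to ~5,000 tokens) drawn from corpora
    # such as ChapterBreak, Books3, or RedPajama-Data-v2.
    "response": "The fog rolled off the harbor before dawn ...",
}
```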
Methodology
The authors outline the creation and utilization of the Suri dataset through two primary contributions:
- Dataset Construction:
- The dataset includes long human-written texts sourced from existing corpora such as ChapterBreak, Books3, and RedPajama-Data-v2.
- Backtranslation techniques are used to generate comprehensive instructions for these texts, followed by the creation of synthetically corrupted instructions to facilitate preference tuning (see the prompt sketch after this list).
- Alignment Using I-ORPO:
- The paper introduces Instructional Odds Ratio Preference Optimization (I-ORPO), a variant of the ORPO algorithm.
- I-ORPO uses synthetically corrupted instructions instead of dispreferred responses, which proves to be a robust alignment strategy for LLMs when human preference data on long-form text are impractical to obtain (see the loss sketch after this list).
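As a rough illustration of the dataset-construction step above, the templates below show how the backtranslation and corruption prompts might be phrased. These are hypothetical templates written for exposition; the paper's actual prompt wording is not reproduced here.

```python
# Hypothetical prompt templates for the two LLM calls used to build Suri
# (backtranslation, then corruption); wording here is illustrative only.
BACKTRANSLATE_PROMPT = """\
Read the document below and write the instruction it best satisfies:
one main goal plus a numbered list of specific stylistic and content
constraints that the document already follows.

Document:
{document}
"""

CORRUPT_PROMPT = """\
Minimally edit each constraint below so that the document would no longer
satisfy it, while keeping the wording and length as close to the original
as possible.

Constraints:
{constraints}
"""
```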
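The following is a minimal sketch of an I-ORPO-style objective, assuming ORPO's length-normalized odds-ratio formulation with the gold response held fixed and only the instruction swapped between its intact and corrupted versions. The helper names, the Hugging Face-style model interface, and the default λ are assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def token_logprobs(model, input_ids, labels):
    """Mean per-token log-probability of the response, given the instruction.

    `labels` should be -100 on instruction positions so that only response
    tokens contribute (standard causal-LM masking).
    """
    logits = model(input_ids).logits[:, :-1, :]  # predict token t+1 from t
    labels = labels[:, 1:]
    mask = labels != -100
    logps = torch.gather(
        F.log_softmax(logits, dim=-1), 2, labels.clamp(min=0).unsqueeze(-1)
    ).squeeze(-1)
    return (logps * mask).sum(-1) / mask.sum(-1)

def i_orpo_loss(model, batch, lam=0.1):
    """Same gold response y, contrasted under the intact instruction (w)
    and the corrupted instruction (l)."""
    logp_w = token_logprobs(model, batch["input_ids_w"], batch["labels_w"])
    logp_l = token_logprobs(model, batch["input_ids_l"], batch["labels_l"])

    # Log-odds ratio of the length-normalized likelihoods, as in ORPO.
    log_odds = (logp_w - logp_l) - (
        torch.log1p(-torch.exp(logp_w)) - torch.log1p(-torch.exp(logp_l))
    )
    ratio_loss = -F.logsigmoid(log_odds).mean()

    nll_loss = -logp_w.mean()  # standard SFT term on the intact pair
    return nll_loss + lam * ratio_loss
```

Because the corrupted instruction differs from the intact one only by small constraint edits, the odds-ratio term pushes the model to assign the gold response a higher likelihood under the instruction it actually satisfies.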
Key Findings
Evaluations and Results:
- The Suri-I-ORPO and Suri-SFT (supervised fine-tuning) models produced texts averaging 5,100 and 4,800 tokens, respectively, markedly longer than those generated by baseline models such as Mistral-7B-Instruct-v0.2 and Llama-3-8B-Instruct.
- Human evaluations show a strong preference for Suri-I-ORPO over Suri-SFT in terms of coherence, informativeness, and readability.
- The models maintained low levels of n-gram repetition, indicating sustained textual quality even for long-form outputs (a simple repetition check is sketched after this list).
- Ranking accuracy experiments revealed that Suri-I-ORPO achieved a significant improvement (at least 10%) over baseline models in distinguishing between correct and corrupted instructions (see the ranking sketch after this list).
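A simple way to measure the repetition behavior mentioned above is the fraction of n-grams in a generation that repeat an earlier n-gram. The sketch below uses this common definition; the paper's exact repetition statistic may differ.

```python
from collections import Counter

def ngram_repetition(tokens, n=4):
    """Share of n-grams that repeat an n-gram seen earlier in the same text;
    lower is better (illustrative definition)."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    repeated = sum(c - 1 for c in counts.values())
    return repeated / len(ngrams)
```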
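One plausible way to compute such a ranking accuracy, reusing the token_logprobs helper from the I-ORPO sketch above, is to count how often the model assigns the gold response a higher likelihood under the intact instruction than under its corrupted counterpart. The data-field names below are illustrative, and the paper's exact evaluation setup may differ.

```python
def ranking_accuracy(model, examples):
    """Fraction of examples where the gold response is more likely under the
    intact instruction (w) than under the corrupted one (l).

    Each example is assumed to be a single-item batch of tensors.
    """
    correct = 0
    for ex in examples:
        lp_intact = token_logprobs(model, ex["input_ids_w"], ex["labels_w"])
        lp_corrupt = token_logprobs(model, ex["input_ids_l"], ex["labels_l"])
        correct += int(lp_intact.item() > lp_corrupt.item())
    return correct / len(examples)
```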
Implications
Theoretical Implications:
- The introduction of multi-constraint instructions combined with extended text generation challenges existing paradigms in LLM fine-tuning and necessitates more complex alignment techniques.
- The success of I-ORPO suggests that leveraging corrupted instructions as negative feedback is a viable approach in the absence of human preference data, a scenario often encountered in real-world applications.
Practical Implications:
- Suri-I-ORPO's ability to generate coherent long-form text with intricate constraints could significantly benefit industries requiring detailed report generation, creative writing, and comprehensive content creation.
- The methods proposed could be adapted for LLMs in other languages and genres, broadening the utility of advanced instruction-following models.
Future Directions
Examining how different LLM architectures respond to fine-tuning using the Suri dataset could yield further insights into model-specific intricacies. Additionally, exploring the influence of surface features, such as instruction length, and varying the degree of constraint violations could refine the I-ORPO method. Lastly, testing these models on shorter-context tasks would help understand any trade-offs associated with optimizing for long-form generation.
Conclusion
"Suri: Multi-constraint Instruction Following for Long-form Text Generation" offers a comprehensive methodology and dataset for enhancing LLM capabilities in following complex instructions over long textual spans. By introducing the Suri dataset and the I-ORPO alignment method, the authors provide valuable contributions to the field of AI-driven text generation, paving the way for more advanced and nuanced LLM applications.