Aligning LLMs with Individual Preferences via Interaction
The paper introduces a methodology for aligning LLMs with individual user preferences through dynamic, multi-turn interaction. Rather than relying on generalized alignment strategies built around broad principles such as helpfulness, harmlessness, and honesty, the authors focus on cultivating the ability of LLMs to infer a user's unstated, personal preferences from the conversation and adapt their responses accordingly.
Technical Approach:
- Persona Construction: The authors construct diverse user personas to guide LLM interactions. Starting from 20 seed profiles, they use iterative self-generation with GPT-4o followed by semantic filtering to expand the pool to 3,310 distinct personas (a filtering sketch follows this list). Each persona combines a detailed profile, the conversation topics it gravitates toward, and personality traits that shape its communication style.
- Data Formation: Using these personas, a multi-turn preference dataset is built with a multi-LLM collaboration framework: one model role-plays the user, persona-revealing information is extracted from the dialogue to condition personalized response generation, and four LLMs produce both 'preferred' and 'rejected' responses at each turn. The result is a tree-structured dataset suitable for training.
- Training and Fine-Tuning Techniques: The preference dataset supports a two-phase training methodology:
- Supervised Fine-Tuning (SFT): The model is first fine-tuned on only the 'preferred' responses, establishing a base model able to infer and follow user preferences during interaction.
- Preference Optimization: The SFT model is then further optimized with Direct Preference Optimization (DPO), using the pairwise 'preferred'/'rejected' data to refine how responses are calibrated to the inferred preferences (a minimal loss sketch follows this list).
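The persona-expansion step is essentially a generate-then-filter loop. Below is a minimal sketch of how such iterative self-generation with semantic filtering could look, assuming the OpenAI Python client for the GPT-4o calls, sentence-transformers embeddings, and a cosine-similarity threshold as the deduplication criterion; the paper does not specify its exact filtering implementation, so these choices are illustrative assumptions.

```python
# Sketch of the iterative persona-expansion loop with semantic filtering.
# Assumptions (not from the paper): OpenAI client for GPT-4o, sentence-transformers
# embeddings, and a cosine-similarity threshold as the dedup criterion.
import torch
from openai import OpenAI
from sentence_transformers import SentenceTransformer, util

client = OpenAI()
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def propose_persona(examples: list[str]) -> str:
    """Ask GPT-4o for one new persona, conditioned on a few existing ones."""
    prompt = (
        "Here are example user personas:\n"
        + "\n".join(examples)
        + "\nWrite one new, clearly different persona "
        "(profile, preferred topics, personality traits)."
    )
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

def expand_personas(seeds: list[str], target: int = 3310,
                    sim_threshold: float = 0.85) -> list[str]:
    pool = list(seeds)
    embs = embedder.encode(pool, convert_to_tensor=True)
    while len(pool) < target:
        candidate = propose_persona(pool[-5:])
        cand_emb = embedder.encode(candidate, convert_to_tensor=True)
        # Semantic filter: keep only personas sufficiently different from the pool.
        if util.cos_sim(cand_emb, embs).max().item() < sim_threshold:
            pool.append(candidate)
            embs = torch.cat([embs, cand_emb.unsqueeze(0)], dim=0)
    return pool
```

A real pipeline would also batch the generation calls and persist the pool, but the kept/rejected decision shown here is the core of the semantic filter.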
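For the second training stage, the DPO objective itself is standard. The sketch below shows the pairwise loss over a batch of (preferred, rejected) pairs, assuming the summed log-probabilities of each response under the policy and a frozen reference model have already been computed; the beta value and any handling of the tree-structured conversations are illustrative, not taken from the paper.

```python
# Minimal sketch of the standard DPO pairwise loss. Inputs are per-example summed
# log-probabilities of the chosen (preferred) and rejected responses under the
# policy and a frozen reference model. beta=0.1 is an illustrative default, not
# the paper's reported setting.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Log-ratio of policy vs. reference for each response.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # -log sigmoid(beta * margin) pushes the policy toward the preferred response.
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```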
Evaluation and Benchmarks:
The authors also present ALOE (ALign with custOmized prEferences), a benchmark of 100 curated cases with well-defined metrics for measuring LLMs' capacity for preference alignment. Evaluation uses an LLM-as-a-Judge to score alignment level. On this benchmark, the proposed methodology substantially improves the personalized alignment capabilities of mainstream LLMs, with average relative improvements in alignment level of up to 32%.
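In practice, this kind of LLM-as-a-Judge evaluation reduces to scoring each model reply against the hidden persona with a judge prompt. The sketch below is a minimal illustration, assuming an OpenAI-style judge model and a 1-5 alignment rubric; ALOE's actual prompt, metric definitions, and score scale are not reproduced here.

```python
# Sketch of an LLM-as-a-Judge scoring loop for preference alignment.
# Assumptions (not from the paper): an OpenAI-style judge, a 1-5 rubric, and a
# plain-text persona description.
import re
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are evaluating whether an assistant's reply is aligned with
the user's (possibly unstated) preferences.

User persona:
{persona}

Conversation so far:
{history}

Assistant reply to judge:
{reply}

Rate the alignment level from 1 (ignores the persona) to 5 (fully tailored).
Answer with a single integer."""

def judge_alignment(persona: str, history: str, reply: str,
                    judge_model: str = "gpt-4o") -> int:
    resp = client.chat.completions.create(
        model=judge_model,
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            persona=persona, history=history, reply=reply)}],
    )
    match = re.search(r"[1-5]", resp.choices[0].message.content)
    return int(match.group()) if match else 1  # fall back to lowest score if unparsable
```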
Implications and Future Directions:
This research highlights the potential of LLMs to deliver personalized user experiences, an increasingly important consideration in AI alignment research. The ability to infer and dynamically adapt to individual user preferences can improve interaction quality and inclusivity across user demographics. The framework is designed to scale, suggesting a practical path toward training LLMs for more nuanced human-LLM interaction. Future work could extend conversations to more turns to capture deeper conversational nuance and expand the benchmark for broader evaluation coverage.