Show, Don't Tell: Aligning Language Models with Demonstrated Feedback (2406.00888v1)

Published 2 Jun 2024 in cs.CL and cs.HC

Abstract: LLMs are aligned to emulate the collective voice of many, resulting in outputs that align with no one in particular. Steering LLMs away from generic output is possible through supervised finetuning or RLHF, but requires prohibitively large datasets for new ad-hoc tasks. We argue that it is instead possible to align an LLM to a specific setting by leveraging a very small number ($<10$) of demonstrations as feedback. Our method, Demonstration ITerated Task Optimization (DITTO), directly aligns LLM outputs to a user's demonstrated behaviors. Derived using ideas from online imitation learning, DITTO cheaply generates online comparison data by treating users' demonstrations as preferred over output from the LLM and its intermediate checkpoints. We evaluate DITTO's ability to learn fine-grained style and task alignment across domains such as news articles, emails, and blog posts. Additionally, we conduct a user study soliciting a range of demonstrations from participants ($N=16$). Across our benchmarks and user study, we find that win-rates for DITTO outperform few-shot prompting, supervised fine-tuning, and other self-play methods by an average of 19% points. By using demonstrations as feedback directly, DITTO offers a novel method for effective customization of LLMs.

An Overview of "Show, Don't Tell: Aligning LLMs with Demonstrated Feedback"

Authored by Omar Shaikh, Michelle Lam, Joey Hejna, Yijia Shao, Michael Bernstein, and Diyi Yang from Stanford University, this paper explores a pragmatic approach to aligning LLMs through user-provided demonstrations as feedback. The method, named Demonstration ITerated Task Optimization (DITTO), tailors LLM outputs to user-specific preferences using fewer than ten demonstrations.

Introduction and Motivation

LLMs are conventionally trained for general purposes, leading to outputs that often lack specificity for niche applications or personal preferences. Traditional methods like supervised fine-tuning and Reinforcement Learning from Human Feedback (RLHF) are effective but necessitate vast amounts of data, making them impractical for ad-hoc customization tasks. This paper addresses this limitation by proposing a method that leverages a minimal number of user demonstrations to achieve significant customization.

Methodology: Demonstration ITerated Task Optimization (DITTO)

DITTO is built on the premise of using a few user-provided examples to guide the LLM's outputs. Key aspects of DITTO include:

  1. Data Collection and Initialization:
    • Demonstrations: Users provide a small set of demonstrations, each illustrating the desired behavior for specific prompts.
    • Initial Policy $\pi_0$: A supervised fine-tuning (SFT) procedure on these demonstrations initializes the policy $\pi_0$.
  2. Generating Comparisons:
    • Online Comparisons: By treating demonstrations as preferred over current model outputs, DITTO cheaply generates comparison data online. As training progresses, successive policies $\pi_t$ supply progressively stronger negative samples for these comparisons.
    • Replay and Intermodel Comparisons: Unlike other self-play methods, DITTO uses comparisons not only between the expert demonstrations and the current policy but also among policies from different training iterations (a minimal sketch of this pairing scheme follows the list).
  3. Iterative Training:
    • Policy Updates: Using Direct Preference Optimization (DPO), DITTO iteratively samples from the current policy and trains on the resulting comparisons; a fixed reference policy in the DPO objective stabilizes training and limits drift from the initial model.
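
To make the comparison-construction step concrete, below is a minimal Python sketch of how DITTO-style preference pairs might be assembled from a handful of demonstrations and a growing list of policy checkpoints. The `sample_fn` helper and the checkpoint objects are placeholders standing in for actual model sampling; this is not the authors' released code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple
import random

@dataclass
class Pair:
    prompt: str
    chosen: str    # preferred completion
    rejected: str  # dispreferred completion

def build_ditto_pairs(
    demos: List[Tuple[str, str]],             # (prompt, expert demonstration)
    checkpoints: List[object],                # policies pi_0 ... pi_t, oldest first
    sample_fn: Callable[[object, str], str],  # placeholder: sample_fn(policy, prompt) -> completion
    n_samples: int = 4,
) -> List[Pair]:
    """Assemble DITTO-style comparisons:
    - expert demonstrations are preferred over samples from any checkpoint;
    - samples from later checkpoints are preferred over samples from earlier ones
      (the "replay" / intermodel comparisons).
    """
    pairs: List[Pair] = []
    for prompt, demo in demos:
        # Draw a few completions from each intermediate policy.
        samples = {
            i: [sample_fn(policy, prompt) for _ in range(n_samples)]
            for i, policy in enumerate(checkpoints)
        }
        # Expert demonstration beats every policy sample.
        for outs in samples.values():
            pairs.extend(Pair(prompt, chosen=demo, rejected=out) for out in outs)
        # A later checkpoint's sample beats an earlier checkpoint's sample.
        for earlier in range(len(checkpoints)):
            for later in range(earlier + 1, len(checkpoints)):
                for good, bad in zip(samples[later], samples[earlier]):
                    pairs.append(Pair(prompt, chosen=good, rejected=bad))
    random.shuffle(pairs)
    return pairs
```

In the full loop, each batch of pairs would feed a standard DPO update (for example, with an off-the-shelf DPO trainer), the updated policy would be appended to `checkpoints`, and the process would repeat.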

Experimental Evaluation

Benchmarks

Evaluation spans diverse author-specific datasets, including emails, blog posts, and news articles. Performance is quantified as pairwise win rates judged by GPT-4, which selects the output most similar to the author's human-written text. The results show that DITTO outperforms traditional fine-tuning, few-shot prompting, and self-play methods such as SPIN.
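
As an illustration of how such head-to-head win rates could be computed, the sketch below tallies pairwise judgments from a judge model; `judge_prefers_first` is a hypothetical wrapper around a GPT-4 call (not the authors' evaluation harness), and the presentation order is randomized per pair to reduce position bias.

```python
import random
from typing import Callable, List, Tuple

def win_rate(
    pairs: List[Tuple[str, str, str]],  # (prompt, candidate output, baseline output)
    judge_prefers_first: Callable[[str, str, str], bool],  # hypothetical GPT-4-as-judge wrapper
    seed: int = 0,
) -> float:
    """Fraction of pairwise comparisons the candidate wins against the baseline."""
    rng = random.Random(seed)
    wins = 0
    for prompt, candidate, baseline in pairs:
        # Randomize which output is shown first so the judge cannot exploit position bias.
        if rng.random() < 0.5:
            wins += judge_prefers_first(prompt, candidate, baseline)
        else:
            wins += not judge_prefers_first(prompt, baseline, candidate)
    return wins / len(pairs)
```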

User Study

A user study further validates DITTO in real-world settings. Participants provide demonstrations for a personalized email-writing task and then assess the generated outputs. Results show a significant preference for DITTO over zero-shot/few-shot prompting and supervised fine-tuning, indicating higher user satisfaction with the customized outputs.

Implications and Future Directions

Practical Implications

The research underscores the efficacy of using minimal data for substantial customization, with clear implications for building adaptive, personalized AI systems. The approach is particularly relevant for dynamic and subjective tasks where broad generalization falls short.

Theoretical Implications

The framework contributes to imitation learning literature, demonstrating that leveraging online comparisons and intermodel data can significantly enhance model alignment. By extrapolating beyond the demonstrated behavior, DITTO exemplifies effective policy optimization in data-constrained environments.

Speculation on Future Developments

Future research might explore refining DITTO further by optimizing data sampling strategies, reducing computational overhead, and integrating dynamic adaptation mechanisms to continuously refine user preferences. Additionally, examining the interplay between demonstration quality and alignment efficacy could yield insights into optimizing the demonstration collection process.

Conclusion

"Show, Don't Tell: Aligning LLMs with Demonstrated Feedback" presents an efficient and scalable solution for personalizing LLMs. By exploiting a small number of demonstrations, DITTO provides a compelling alternative to conventional customization techniques, paving the way for more accessible and context-specific AI applications. This research not only advances the state-of-the-art in model alignment but also sets a precedent for future AI development tuned to individual user needs.

Authors (6)
  1. Omar Shaikh (23 papers)
  2. Michelle Lam (5 papers)
  3. Joey Hejna (19 papers)
  4. Yijia Shao (18 papers)
  5. Michael Bernstein (23 papers)
  6. Diyi Yang (151 papers)