This paper introduces PUGC (Preference Alignment using User-Generated Content), a novel framework for aligning LLMs with human preferences by leveraging implicit signals found in unlabeled user-generated content (UGC). Traditional alignment methods like RLHF and DPO rely on costly, manually curated preference data or data generated by powerful LLMs like GPT-4, which limits scalability and domain flexibility. PUGC aims to overcome these limitations by using abundant and diverse UGC as a scalable source of preference information.
The core idea of PUGC is to transform UGC into preference data pairs $(x, y_w, y_l)$, where $x$ is a user query, $y_w$ is a preferred response, and $y_l$ is a rejected response. The process involves several steps:
- Instruction Generation: An LLM (specifically, the base or instruct SFT model, such as Llama-3-70B-Instruct in the paper's experiments) generates a potential "reader query" $x$ from a piece of UGC $d$. This query represents the question the original UGC author might have been implicitly answering. The prompt for this step is designed to elicit a self-contained query based on the UGC context (see Prompt for Reader Question Generation in Appendix). The transformation can be expressed as $x = \mathcal{M}(\phi(d))$, where $\phi$ constructs the prompt and $\mathcal{M}$ is the generating LLM.
- Instruction Filtering: To ensure the generated instruction $x$ is relevant and answerable by the UGC $d$, the same LLM is used to filter the instruction-UGC pair. A filtering prompt (see Prompt for Question Filtering in Appendix) checks whether the UGC contains sufficient relevant information for the instruction. Only pairs where the filter function $f(x, d)$ returns True are kept, forming a set of high-quality instructions $\mathcal{X}$.
- Response Sampling: For each filtered instruction $x \in \mathcal{X}$, the policy model $\pi_\theta$ samples $N$ candidate responses $\{y_1, \dots, y_N\}$. The experiments use temperature 0.8 and nucleus sampling.
- Response Scoring: A reward model (RM) $r$ scores each sampled response $y_i$ for the instruction $x$. Crucially, PUGC feeds the original UGC $d$ as a reference text to the reward model alongside the instruction and response, leveraging the implicit user preferences embedded in the UGC as guidance. The scoring is $s_i = r(x, y_i, d)$. The paper uses Prometheus-7b-v2.0 [kim2024prometheus] as the RM, noting its ability to incorporate reference answers, and applies self-consistency decoding (N=8).
- Preference Data Formation: For each instruction $x$, the sampled response with the highest score is selected as the preferred response $y_w$ and the response with the lowest score as the rejected response $y_l$. This forms the preference pair $(x, y_w, y_l)$ used for downstream preference tuning (e.g., DPO, SimPO). Ties in scoring are resolved by favoring shorter high-scoring responses and longer low-scoring responses to mitigate length bias. An end-to-end sketch of this pipeline is given after this list.
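The following is a minimal, end-to-end sketch of the data-generation pipeline described above. It is an illustration under stated assumptions, not the authors' implementation: the prompt templates are stand-ins for the paper's appendix prompts, the `instruct_llm` / `policy_llm` / `reward_llm` callables are hypothetical wrappers around whatever backends are available, and `n_samples` and `top_p` are assumed values (only temperature 0.8 and the N=8 self-consistency judge calls come from the paper).

```python
import statistics
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: generate(prompt, **sampling_kwargs) -> str.
Generate = Callable[..., str]

# Placeholder prompts; the paper's actual prompts are in its appendix.
QUERY_PROMPT = (
    "Read the following user-generated text and write the single, "
    "self-contained question its author is implicitly answering.\n\n{ugc}"
)
FILTER_PROMPT = (
    "Question: {query}\n\nText: {ugc}\n\n"
    "Does the text contain enough relevant information to answer the "
    "question? Answer True or False."
)
SCORE_PROMPT = (
    "Instruction: {query}\n\nResponse: {response}\n\n"
    "Reference text (implicit user preference): {ugc}\n\n"
    "Rate the response from 1 to 5 given the reference. Output only a number."
)


@dataclass
class PreferencePair:
    prompt: str
    chosen: str
    rejected: str


def build_pair(
    ugc: str,
    instruct_llm: Generate,   # e.g. Llama-3-70B-Instruct for query generation/filtering
    policy_llm: Generate,     # the policy model being aligned
    reward_llm: Generate,     # reference-aware RM, e.g. Prometheus-7b-v2.0
    n_samples: int = 5,       # assumed value
    n_judge: int = 8,         # self-consistency decoding over judge calls
) -> Optional[PreferencePair]:
    # 1) Instruction generation: derive a reader query from the UGC.
    query = instruct_llm(QUERY_PROMPT.format(ugc=ugc), temperature=0.0)

    # 2) Instruction filtering: keep only queries the UGC can answer.
    verdict = instruct_llm(FILTER_PROMPT.format(query=query, ugc=ugc), temperature=0.0)
    if "true" not in verdict.lower():
        return None

    # 3) Response sampling from the policy model (top_p assumed).
    responses = [
        policy_llm(query, temperature=0.8, top_p=0.95) for _ in range(n_samples)
    ]

    # 4) Reference-guided scoring: the UGC is passed to the RM as a reference,
    #    and several judge samples are averaged for self-consistency.
    def score(response: str) -> float:
        prompt = SCORE_PROMPT.format(query=query, response=response, ugc=ugc)
        votes = [float(reward_llm(prompt, temperature=1.0)) for _ in range(n_judge)]
        return statistics.mean(votes)

    scored: List[Tuple[float, str]] = [(score(r), r) for r in responses]

    # 5) Pair formation: highest score is chosen, lowest is rejected; on ties,
    #    prefer a shorter chosen and a longer rejected to curb length bias.
    chosen = max(scored, key=lambda t: (t[0], -len(t[1])))[1]
    rejected = min(scored, key=lambda t: (t[0], -len(t[1])))[1]
    if chosen == rejected:
        return None
    return PreferencePair(prompt=query, chosen=chosen, rejected=rejected)
```

In use, `build_pair` would be mapped over the filtered UGC corpus (e.g., the 60k Dolma documents), and the resulting `PreferencePair` records feed directly into DPO or SimPO training.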
PUGC is designed to be compatible with various offline preference optimization algorithms such as DPO [rafailov2024direct] and SimPO [meng2024simpo]. The preference data generated by the PUGC pipeline is used to train the policy model $\pi_\theta$.
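For context, both optimizers consume the same $(x, y_w, y_l)$ pairs. Their standard objectives, as defined in the cited papers (restated here for convenience, not taken from this summary's source), are:

$$
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta;\pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi_\theta(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)} - \beta\log\frac{\pi_\theta(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right]
$$

$$
\mathcal{L}_{\mathrm{SimPO}}(\pi_\theta) = -\,\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\frac{\beta}{|y_w|}\log\pi_\theta(y_w\mid x) - \frac{\beta}{|y_l|}\log\pi_\theta(y_l\mid x) - \gamma\right)\right]
$$

Here $\pi_{\mathrm{ref}}$ is the frozen SFT reference model, $\sigma$ the logistic function, $\beta$ a scaling hyperparameter, and $\gamma$ SimPO's target reward margin; SimPO is reference-free and length-normalizes the implicit reward.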
The experiments in the paper demonstrate the practical effectiveness of PUGC using 60k pieces of high-quality UGC from the Dolma dataset [soldaini-etal-2024-dolma].
- Performance Benchmarks: On Alpaca Eval 2.0, models trained with PUGC+DPO achieved a state-of-the-art length-controlled win rate of 35.93% with Mistral-7B-Instruct, significantly outperforming baselines trained on UltraFeedback data (26.56%). PUGC also showed consistent gains with SimPO. On MT-Bench, PUGC generally outperformed baselines (Table 1).
- Reward Quality: Analysis shows that using the UGC as a reference during reward scoring substantially improves the reward model's agreement with GPT-4-Turbo and human judgments (Figure 3, Appendix A, B). The reward scores are also more fine-grained (Appendix A).
- Data Quantity vs. Quality: Increasing the quantity of UGC data leads to significant performance improvements, while PUGC remains robust to variations in UGC quality within tested ranges (Figure 4).
- Domain-Specific Alignment: Using domain-specific UGC (Goodreads book reviews), PUGC effectively aligns the model to perform better on tasks related to that domain, achieving a 7% higher win rate on book review prompts compared to training with general UGC (Figure 5).
- Fine-grained Analysis: PUGC shows notable performance gains on Alpaca Eval categories common in UGC, such as general knowledge, historical topics, writing tasks (reviews, letters), critique, and hypothetical scenarios (Figure 6). It performs particularly well on more complex instructions and those requiring longer responses.
- Theory of Mind: PUGC significantly enhances the model's theory of mind capabilities, outperforming other open-source models and approaching GPT-4 performance on the BigGen Bench ToM evaluation (Table 2, Appendix G). This is attributed to the rich implicit preference signals related to user intentions, beliefs, and emotions in UGC.
- Ablation Studies: Removing the UGC reference during reward scoring results in a large performance drop, confirming its importance. The choice of reward model is also critical, with Prometheus performing much better than Skywork-Reward-Llama-3.1-8B, likely because Prometheus was trained to incorporate reference answers (Table 3, Appendix C).
- Online Iterative Training: Initial experiments suggest that using PUGC in an online iterative training setting can further improve performance (Table 4).
- Safety: Despite UGC potentially containing unsafe content, models aligned with PUGC maintain or slightly improve safety performance compared to the SFT baseline and are safer than models aligned with UltraFeedback on-policy data (Table 5, Appendix F).
PUGC provides a scalable and cost-effective method to generate high-quality preference data by extracting implicit user preferences from unlabeled UGC. This approach is particularly valuable for domain-specific alignment where curated data is scarce.
Implementation Considerations:
- Data Sourcing: Requires access to large quantities of UGC. The paper uses subsets of the Dolma corpus but notes applicability to platforms like Reddit, StackExchange, reviews, etc. Ethical considerations regarding privacy and consent for using public UGC must be addressed.
- LLM for Instruction Generation/Filtering: The quality of the generated instructions depends on the capability of the LLM used. While the paper found instruction quality less critical than other factors, using a strong model is beneficial.
- Reward Model Selection: The reward model must be capable of effectively utilizing reference text (the original UGC) for scoring. The paper highlights Prometheus-7b-v2.0 as suitable due to its training procedure. Developing RMs specifically trained for reference-based evaluation on diverse UGC domains is a potential area for improvement.
- Computational Resources: Training preference models like DPO or SimPO requires significant GPU resources; the experiments were conducted on a node with 8x NVIDIA A100-SXM4-40GB GPUs. The data generation pipeline itself also requires compute for instruction generation, filtering, response sampling, and reward scoring. A minimal training sketch follows this list.
- Domain Applicability: PUGC is effective where relevant UGC is available. Its performance might be limited in domains like complex math or coding if corresponding high-quality UGC with implicit preference signals is scarce, as seen in the benchmarks (Appendix H).
- Safety Mitigation: Although empirical results show robustness, integrating explicit safety mechanisms during the UGC processing or training phase could further enhance safety.
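As a minimal illustration of the downstream tuning step, the sketch below feeds PUGC-style pairs into an off-the-shelf DPO trainer. It assumes the `trl` and `datasets` libraries and the `PreferencePair` records from the earlier pipeline sketch; the checkpoint name and hyperparameters are placeholders, and trainer argument names vary across `trl` versions, so treat this as a sketch rather than the authors' training setup.

```python
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer


def to_preference_dataset(pairs):
    """Convert PreferencePair records into the prompt/chosen/rejected
    column format expected by preference trainers."""
    return Dataset.from_dict({
        "prompt": [p.prompt for p in pairs],
        "chosen": [p.chosen for p in pairs],
        "rejected": [p.rejected for p in pairs],
    })


model_name = "mistralai/Mistral-7B-Instruct-v0.2"  # placeholder checkpoint
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

config = DPOConfig(
    output_dir="pugc-dpo",
    beta=0.1,                         # assumed value; tune per setup
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
)
trainer = DPOTrainer(
    model=model,
    args=config,
    train_dataset=to_preference_dataset(pairs),  # pairs built by the PUGC pipeline
    processing_class=tokenizer,       # older trl releases use `tokenizer=` instead
)
trainer.train()
```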
The authors provide their code and dataset at https://zhaoxuan.info/PUGC.github.io/. The detailed prompts used for instruction generation, filtering, and reward scoring are included in the appendix (Appendix D, E).