Fine-tuning language models to find agreement among humans with diverse preferences (2211.15006v1)

Published 28 Nov 2022 in cs.LG and cs.CL

Abstract: Recent work in large language modeling (LLMs) has used fine-tuning to align outputs with the preferences of a prototypical user. This work assumes that human preferences are static and homogeneous across individuals, so that aligning to a single "generic" user will confer more general alignment. Here, we embrace the heterogeneity of human preferences to consider a different challenge: how might a machine help people with diverse views find agreement? We fine-tune a 70 billion parameter LLM to generate statements that maximize the expected approval for a group of people with potentially diverse opinions. Human participants provide written opinions on thousands of questions touching on moral and political issues (e.g., "should we raise taxes on the rich?"), and rate the LLM's generated candidate consensus statements for agreement and quality. A reward model is then trained to predict individual preferences, enabling it to quantify and rank consensus statements in terms of their appeal to the overall group, defined according to different aggregation (social welfare) functions. The model produces consensus statements that are preferred by human users over those from prompted LLMs (>70%) and significantly outperforms a tight fine-tuned baseline that lacks the final ranking step. Further, our best model's consensus statements are preferred over the best human-generated opinions (>65%). We find that when we silently constructed consensus statements from only a subset of group members, those who were excluded were more likely to dissent, revealing the sensitivity of the consensus to individual contributions. These results highlight the potential to use LLMs to help groups of humans align their values with one another.

Authors (11)
  1. Michiel A. Bakker (11 papers)
  2. Martin J. Chadwick (4 papers)
  3. Hannah R. Sheahan (2 papers)
  4. Michael Henry Tessler (13 papers)
  5. Lucy Campbell-Gillingham (5 papers)
  6. Jan Balaguer (8 papers)
  7. Nat McAleese (11 papers)
  8. Amelia Glaese (14 papers)
  9. John Aslanides (16 papers)
  10. Matthew M. Botvinick (14 papers)
  11. Christopher Summerfield (22 papers)
Citations (183)

Summary

  • The paper introduces a novel approach by fine-tuning a 70B-parameter LLM to generate consensus statements that align with diverse human preferences.
  • It employs a reward model to predict individual opinions and rank candidate statements using social welfare functions (sketched after this list), outperforming prompting-only baselines.
  • The study highlights the potential to reduce polarization by integrating diverse views, marking a significant advance in aligning AI with heterogeneous human values.
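The aggregation step can be made concrete with a small example. The snippet below is a minimal sketch, not the paper's implementation: it shows two standard social welfare functions, a utilitarian mean and a Rawlsian minimum, applied to per-member approval scores of the kind a reward model might predict for candidate consensus statements. The function names and the specific scores are illustrative assumptions.

```python
# Hypothetical sketch: aggregating per-member predicted approvals into a group
# score. The paper ranks candidates with social welfare functions; the exact
# functions used there may differ from these two standard choices.
import numpy as np

def utilitarian_welfare(approvals: np.ndarray) -> float:
    """Mean predicted approval across group members (utilitarian aggregation)."""
    return float(np.mean(approvals))

def rawlsian_welfare(approvals: np.ndarray) -> float:
    """Predicted approval of the least-satisfied member (egalitarian max-min)."""
    return float(np.min(approvals))

# Example: three candidate statements scored for a group of four members.
candidate_approvals = np.array([
    [0.9, 0.8, 0.4, 0.7],  # highest mean, but one member is left unhappy
    [0.7, 0.7, 0.6, 0.7],  # lower mean, more evenly acceptable
    [0.5, 0.5, 0.5, 0.9],
])

for scores in candidate_approvals:
    print(utilitarian_welfare(scores), rawlsian_welfare(scores))
```

The two functions can rank the same candidates differently: the utilitarian mean favors the first statement, while the Rawlsian minimum favors the second, which is exactly the kind of trade-off the choice of aggregation function controls.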

Analyzing Fine-Tuning of LLMs for Consensus Generation

The paper "Fine-tuning LLMs to find agreement among humans with diverse preferences" by Michiel A. Bakker et al. presents an innovative approach to improving the alignment of LLMs with the heterogeneous preferences of human groups. Unlike traditional methods, which often assume homogeneous user preferences, this paper embraces diversity and explores how LLMs can assist in consensus-building among people with divergent views.

To address this challenge, the researchers fine-tuned Chinchilla, a 70-billion-parameter LLM, to generate consensus statements that maximize group approval. The statements were conditioned on written opinions provided by human participants on moral and political questions, and participants rated the machine-generated consensus statements for agreement and quality. A central component is a reward model trained to predict individual preferences; it ranks candidate consensus statements by their appeal to the group, measured through different social welfare (aggregation) functions.
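The generate-then-rank loop described above can be sketched as follows. This is an illustration under assumed interfaces: `sample_statement`, `predict_approval`, and `welfare` are hypothetical stand-ins for the fine-tuned generator, the reward model, and a social welfare function, not the paper's actual code; only the overall structure (sample candidates, score each one per member, aggregate, pick the best) follows the description.

```python
# Minimal sketch of ranking sampled consensus statements by group welfare.
from typing import Callable, List

def select_consensus(
    question: str,
    member_opinions: List[str],
    sample_statement: Callable[[str, List[str]], str],   # fine-tuned generator (assumed)
    predict_approval: Callable[[str, str, str], float],  # reward model: (question, opinion, statement) -> approval
    welfare: Callable[[List[float]], float],             # social welfare / aggregation function
    num_candidates: int = 16,
) -> str:
    """Sample candidate consensus statements and return the one with highest group welfare."""
    best_statement, best_score = None, float("-inf")
    for _ in range(num_candidates):
        candidate = sample_statement(question, member_opinions)
        # Score the candidate once per group member, then aggregate.
        approvals = [predict_approval(question, opinion, candidate)
                     for opinion in member_opinions]
        score = welfare(approvals)
        if score > best_score:
            best_statement, best_score = candidate, score
    return best_statement
```

Swapping the `welfare` argument (e.g., mean vs. minimum) changes which candidate wins, which is how the different aggregation functions enter the pipeline.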

Key findings indicate that the model's consensus statements were preferred over those from prompted LLM baselines in more than 70% of comparisons, and over the best human-written opinions in more than 65% of comparisons. Additionally, when consensus statements were constructed from the opinions of only a subset of group members, the excluded members were more likely to dissent, revealing the sensitivity of the consensus to individual contributions.
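The exclusion probe can be approximated in the same hypothetical setting. Note that the paper measures actual dissent from excluded human participants; the sketch below only estimates the effect with the reward model's predicted approvals, and it reuses the `select_consensus` and `predict_approval` stand-ins from the previous sketch.

```python
# Sketch: build a consensus from a subset of opinions, then compare predicted
# approval for included vs. excluded members (assumed interfaces as above).
def exclusion_gap(question, member_opinions, excluded_idx,
                  sample_statement, predict_approval, welfare):
    included = [op for i, op in enumerate(member_opinions) if i != excluded_idx]
    consensus = select_consensus(question, included,
                                 sample_statement, predict_approval, welfare)
    excluded_approval = predict_approval(question, member_opinions[excluded_idx], consensus)
    included_approvals = [predict_approval(question, op, consensus) for op in included]
    # A positive gap suggests the excluded member is predicted to agree less
    # with the consensus than the members whose opinions were used to build it.
    return sum(included_approvals) / len(included_approvals) - excluded_approval
```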

The implications of this research are considerable in both theoretical and practical domains of AI. Theoretically, it challenges the pervasive assumption of static user preferences, proposing a more nuanced interaction between AI systems and human values. Practically, it suggests a framework where AI systems promote not only understanding but also alignment among diverse human beliefs, potentially mitigating polarization exacerbated by technology.

Future developments could focus on extending these findings to larger groups, scaling the model's usefulness further. Moreover, exploring the interaction of different social welfare functions could illuminate their impact on consensus generation, addressing one of the paper's highlighted limitations. This research not only broadens the scope of AI alignment but also marks a step forward in crafting AI systems that can facilitate human consensus in contentious domains.