WebGPT: Browser-Assisted Question-Answering with Human Feedback
The paper presents WebGPT, an approach to improving long-form question-answering (LFQA) by fine-tuning a large language model (LLM) to interact with a text-based web-browsing environment. It shows how integrating web-based information retrieval with subsequent refinement through human feedback can significantly enhance the LFQA capabilities of a pre-trained model, specifically GPT-3.
Introduction and Motivation
Current LFQA systems have shown limitations in generating high-quality answers, primarily due to challenges in retrieving relevant information and synthesizing it into coherent responses. Previous methods have often handled either retrieval or synthesis well but have struggled to integrate the two effectively. WebGPT addresses these challenges by using an existing web-search API (Bing) for document retrieval and fine-tuning GPT-3 for synthesis. The key innovation of WebGPT lies in its environment design and training framework, which combine imitation learning, reinforcement learning (RL), and human feedback to improve the accuracy and coherence of responses.
Environment Design
At the core of WebGPT is a custom-built text-based web-browsing environment where a model interacts with web pages to retrieve and synthesize information. The environment includes capabilities for performing search queries, navigating through search results, selecting relevant links, quoting text, and formulating final answers. Human demonstrators initially perform these tasks in a graphical interface, creating a dataset of demonstrations that the model uses for behavior cloning (BC).
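The command-driven interaction can be pictured with a short sketch. The command names (`Search`, `Click`, `Quote`, `End`), the `model.next_command`/`model.compose_answer` interface, and the `search_api` helper below are illustrative assumptions rather than the paper's actual command set or prompt format; the sketch only shows how a model might issue commands, collect quotes, and terminate with an answer.

```python
# Minimal sketch of a text-based browsing loop, loosely following the paper's
# description. All names here (BrowserState, next_command, search_api, ...)
# are hypothetical placeholders, not the paper's implementation.
from dataclasses import dataclass, field

@dataclass
class BrowserState:
    question: str
    page_text: str = ""                           # currently displayed page or search results
    quotes: list = field(default_factory=list)    # references collected while browsing
    history: list = field(default_factory=list)   # transcript of issued commands

def browse(question, model, search_api, max_steps=30):
    state = BrowserState(question=question)
    for _ in range(max_steps):
        command = model.next_command(state)       # e.g. "Search <query>", "Quote <text>", ...
        state.history.append(command)
        if command.startswith("Search"):
            state.page_text = search_api.results(command.removeprefix("Search").strip())
        elif command.startswith("Click"):
            state.page_text = search_api.open_link(command.removeprefix("Click").strip())
        elif command.startswith("Quote"):
            state.quotes.append(command.removeprefix("Quote").strip())
        elif command.startswith("End"):
            break                                  # stop browsing and compose the final answer
    return model.compose_answer(question, state.quotes)
```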
Training Methodologies
The training process for WebGPT involves several stages (illustrative code sketches of each stage follow the list):
- Behavior Cloning (BC): The model is fine-tuned with supervised learning on the demonstrations collected from humans interacting with the web-browsing environment, teaching it to mimic human browsing behavior.
- Reward Modeling (RM): A reward model is trained on comparisons between pairs of model-generated answers to the same question. Human labelers provide preference judgments, and the reward model learns to predict which answer a labeler would prefer, enabling subsequent optimization.
- Reinforcement Learning (RL): The model's browsing and answering policy is further fine-tuned with Proximal Policy Optimization (PPO). The reward-model score at the end of each episode, combined with a KL-divergence penalty against the BC policy, guides the optimization.
- Rejection Sampling (Best-of-n): This technique involves sampling multiple answers and selecting the highest-scoring one according to the trained reward model. This procedure requires no further training but uses additional inference-time compute.
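A behavior-cloning step of this kind is typically plain next-token cross-entropy on the demonstration transcripts. The sketch below assumes a `model` that returns next-token logits and a mask marking the demonstrator-produced tokens; these interface details are assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def bc_loss(model, demo_tokens, target_mask):
    """Behavior cloning: supervised next-token prediction on human demonstrations.

    demo_tokens: token ids of a demonstration transcript, shape (batch, seq).
    target_mask: 1 where the token was produced by the demonstrator (commands,
                 final answer) and should be imitated, 0 elsewhere.
    `model` is assumed to return logits of shape (batch, seq, vocab).
    """
    logits = model(demo_tokens[:, :-1])           # predict each next token
    targets = demo_tokens[:, 1:]
    loss = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        reduction="none",
    )
    mask = target_mask[:, 1:].reshape(-1).float()
    return (loss * mask).sum() / mask.sum()       # average only over imitated tokens
```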
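The reward model can be trained with the standard pairwise preference loss used for human-comparison data; the paper trains on such comparisons, though the exact architecture and batching below are assumptions. A minimal sketch:

```python
import torch.nn.functional as F

def preference_loss(reward_model, preferred_batch, rejected_batch):
    """Pairwise preference loss: the labeler-preferred answer should score higher.

    `reward_model` is assumed to map an encoded (question, answer) pair to a
    scalar score per example, shape (batch,).
    """
    r_pref = reward_model(preferred_batch)
    r_rej = reward_model(rejected_batch)
    # -log sigmoid(r_pref - r_rej) is minimized when preferred answers score higher
    return -F.logsigmoid(r_pref - r_rej).mean()
```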
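For the RL stage, the reward described above can be assembled as a per-step KL penalty toward the BC policy plus the reward-model score added at the final token; the per-token formulation and the `kl_coef` value are assumptions about details the summary does not pin down.

```python
import torch

def shaped_rewards(rm_score, policy_logprobs, bc_logprobs, kl_coef=0.1):
    """Per-token rewards for PPO: a KL penalty at every step, plus the
    reward-model score at the end of the episode.

    rm_score: scalar reward-model score for the completed answer.
    policy_logprobs, bc_logprobs: log-probs of the taken tokens under the
        current policy and the behavior-cloned policy (1-D tensors, same length).
    kl_coef: strength of the KL penalty (illustrative value).
    """
    # Penalize tokens where the policy is more confident than the BC policy;
    # this is the usual single-sample estimate of the per-token KL term.
    rewards = -kl_coef * (policy_logprobs - bc_logprobs)
    rewards[-1] = rewards[-1] + rm_score          # terminal reward from the reward model
    return rewards
```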
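Rejection sampling (best-of-n) is the simplest stage to sketch: draw n candidate answers and keep the one the reward model scores highest. The `generate_answer` and `reward_model` callables below are stand-ins for whatever sampling and scoring interfaces are available.

```python
def best_of_n(question, generate_answer, reward_model, n=16):
    """Sample n candidate answers and return the one the reward model prefers.

    generate_answer(question) -> answer string (stochastic sampling assumed).
    reward_model(question, answer) -> scalar score.
    """
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward_model(question, answer))
```

As noted above, this trades additional inference-time compute for answer quality without any further training.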
Evaluation and Results
WebGPT demonstrates substantial improvements in two key evaluations on the ELI5 dataset:
- Comparison with Human Demonstrators: The best WebGPT model's answers are preferred 56% of the time over those written by human demonstrators, suggesting performance that is competitive with, or better than, humans working in the same browsing environment.
- Comparison with ELI5 Reddit Answers: When compared with the top-voted answers from Reddit, the best model's answers are preferred 69% of the time, significantly surpassing previous benchmarks.
The evaluation also extends to TruthfulQA, where the WebGPT models outperform base GPT-3 models, particularly in balancing truthfulness and informativeness, thus indicating an enhanced capability in handling adversarial questions.
Implications and Future Directions
The enhancements in LFQA demonstrated by WebGPT have significant implications. Practically, such systems can provide more accurate answers supported by references, which is crucial for applications requiring reliable automated responses. Theoretically, the integration of human feedback into the training paradigm marks a promising direction for improving model interpretability and aligning outputs with human evaluative standards.
Speculation on Future Developments
Future research can build on the findings of WebGPT by exploring:
- Adversarial Training: Incorporating adversarially selected questions to further enhance the robustness of information retrieval and synthesis.
- Exploration in RL: Refining exploration strategies in RL to better align with human evaluative metrics and further reduce overoptimization risks.
- Cross-disciplinary Criteria: Developing more robust and epistemically sound factual accuracy criteria to guide training and evaluation.
In summary, the WebGPT paper details a significant advancement in LFQA by demonstrating how a synergistic approach combining web-based retrieval, GPT-3 fine-tuning, and extensive human feedback can produce human-competitive answers. The paper not only marks progress in practical AI capabilities but also opens avenues for more sophisticated and reliable AI-driven information systems.