
WebGPT: Browser-assisted question-answering with human feedback (2112.09332v3)

Published 17 Dec 2021 in cs.CL, cs.AI, and cs.LG

Abstract: We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.

Citations (1,044)

Summary

  • The paper presents WebGPT, a novel method that integrates browser-based retrieval with GPT-3 fine-tuning via imitation and reinforcement learning.
  • It leverages human feedback to train a reward model that selects high-quality answers, outperforming both human demonstrators and Reddit benchmarks.
  • The study demonstrates enhanced long-form question-answering performance, paving the way for more reliable and accurate automated responses.

WebGPT: Browser-Assisted Question-Answering with Human Feedback

The paper presents WebGPT, a novel approach to improving long-form question-answering (LFQA) by fine-tuning a large language model to interact within a text-based web-browsing environment. It shows how integrating web-based information retrieval with subsequent refinement through human feedback can significantly enhance the LFQA capabilities of a pre-trained model, in this case GPT-3.

Introduction and Motivation

Current LFQA systems have shown limitations in generating high-quality answers, primarily due to challenges in retrieving relevant information and synthesizing it into coherent responses. Previous methods have often handled either retrieval or synthesis well but have struggled to integrate the two effectively. WebGPT addresses these challenges by using an existing web-search API (Bing) for document retrieval and fine-tuning GPT-3 for synthesis. The key innovation of WebGPT lies in its environment design and training framework, which combines imitation learning, reinforcement learning (RL), and human feedback to improve the accuracy and coherence of responses.

Environment Design

At the core of WebGPT is a custom-built text-based web-browsing environment where a model interacts with web pages to retrieve and synthesize information. The environment includes capabilities for performing search queries, navigating through search results, selecting relevant links, quoting text, and formulating final answers. Human demonstrators initially perform these tasks in a graphical interface, creating a dataset of demonstrations that the model uses for behavior cloning (BC).
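A minimal sketch of what such a text-based action interface might look like is given below. The command names, observation format, and stubbed search/fetch functions are illustrative assumptions rather than the authors' implementation; the point is only that the model issues plain-text commands, the environment returns a text observation, and quoted passages accumulate as references for the final answer.

```python
# Illustrative sketch of a text-based browsing interface in the spirit of the
# environment described in the paper. Command names and the stubbed
# search/fetch functions are placeholders, not the authors' implementation.

from dataclasses import dataclass, field


def search_web(query: str) -> str:
    """Placeholder for a web-search API call (the paper uses Bing)."""
    return f"[search results for: {query}]"


def fetch_page(link_id: str) -> str:
    """Placeholder for fetching a result page and stripping it to plain text."""
    return f"[plain-text contents of link {link_id}]"


@dataclass
class BrowserState:
    question: str
    page_text: str = ""                          # current text observation
    quotes: list = field(default_factory=list)   # references collected so far
    done: bool = False


def step(state: BrowserState, command: str) -> BrowserState:
    """Apply one model-issued text command to the browsing state."""
    if command.startswith("Search "):
        state.page_text = search_web(command[len("Search "):])
    elif command.startswith("Clicked on link "):
        state.page_text = fetch_page(command[len("Clicked on link "):])
    elif command.startswith("Quote: "):
        state.quotes.append(command[len("Quote: "):])
    elif command == "End: Answer":
        state.done = True                        # browsing ends; answer phase begins
    return state


state = BrowserState(question="Why is the sky blue?")
for cmd in ["Search why is the sky blue",
            "Clicked on link 1",
            "Quote: Rayleigh scattering affects shorter wavelengths more.",
            "End: Answer"]:
    state = step(state, cmd)
print(state.quotes)  # the references the final answer must cite
```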

Training Methodologies

The training process for WebGPT involves several stages:

  1. Behavior Cloning (BC): The model undergoes supervised fine-tuning on demonstrations produced by humans interacting with the web-browsing environment, teaching it to imitate human browsing behavior.
  2. Reward Modeling (RM): A reward model is trained on comparisons between pairs of model-generated answers. Human labelers provide preference judgments, and the reward model learns to predict them, giving a scalar proxy for answer quality that later stages optimize against.
  3. Reinforcement Learning (RL): Using Proximal Policy Optimization (PPO), the model further fine-tunes its browsing and answering policy. The reward-model score at the end of each episode, combined with a KL penalty from the BC model, provides the training signal.
  4. Rejection Sampling (Best-of-n): Multiple answers are sampled and the highest-scoring one according to the trained reward model is returned. This requires no further training but uses additional inference-time compute. A compact sketch of the reward-model loss, the shaped RL reward, and this selection step follows the list.
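
The snippet below is an illustrative simplification under stated assumptions: the reward model and policy are stand-in stubs, the pairwise loss is the standard formulation used in related preference-learning work, and the KL penalty is applied per episode rather than per token as in the paper. It only captures the scoring and selection logic, not full browse-and-answer rollouts.

```python
# Sketch of (a) the pairwise reward-model loss, (b) a KL-shaped RL reward, and
# (c) best-of-n rejection sampling. All model calls are placeholders.

import math
import random


def reward_model(question: str, answer: str) -> float:
    """Placeholder: a learned scalar score predicting human preference."""
    return random.random()


def generate_answer(question: str) -> str:
    """Placeholder: one sampled browse-and-answer rollout from the policy."""
    return f"sampled answer {random.randint(0, 10_000)} to: {question}"


def rm_pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Pairwise comparison loss: push the human-preferred answer's score above
    the rejected one's (cross-entropy over a sigmoid of the score difference)."""
    return -math.log(1.0 / (1.0 + math.exp(-(score_preferred - score_rejected))))


def shaped_reward(rm_score: float, logp_policy: float, logp_bc: float,
                  beta: float = 0.1) -> float:
    """End-of-episode RL reward: reward-model score minus a KL-style penalty
    keeping the policy close to the behavior-cloned model (the paper applies
    the penalty per token; this is a simplified episode-level version)."""
    return rm_score - beta * (logp_policy - logp_bc)


def best_of_n(question: str, n: int = 16) -> str:
    """Sample n candidate answers and keep the one the reward model scores
    highest. No extra training is needed; the cost is n times more inference."""
    candidates = [generate_answer(question) for _ in range(n)]
    return max(candidates, key=lambda a: reward_model(question, a))


print(best_of_n("Why is the sky blue?", n=4))
```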

Evaluation and Results

WebGPT demonstrates substantial improvements in two key evaluations on the ELI5 dataset:

  1. Comparison with Human Demonstrators: The best WebGPT model's answers are preferred 56% of the time over those written by human demonstrators, suggesting performance that is competitive with or superior to humans using the same browsing environment.
  2. Comparison with ELI5 Reddit Answers: When compared with the top-voted answers from Reddit, WebGPT's best model generates preferred answers 69% of the time, significantly surpassing previous benchmarks.

The evaluation also extends to TruthfulQA, where the WebGPT models outperform base GPT-3 models, particularly in balancing truthfulness with informativeness, indicating improved handling of adversarial questions.

Implications and Future Directions

The enhancements in LFQA demonstrated by WebGPT's approach have significant implications. Practically, such systems can provide more accurate and referenced information, which is crucial for applications requiring reliable automated responses. Theoretically, the integration of human feedback into training paradigms marks a promising direction in improving model interpretability and aligning outputs with human evaluative standards.

Speculation on Future Developments

Future research can build on the findings of WebGPT by exploring:

  • Adversarial Training: Incorporating adversarially selected questions to further enhance the robustness of information retrieval and synthesis.
  • Exploration in RL: Refining exploration strategies in RL to better align with human evaluative metrics and further reduce overoptimization risks.
  • Cross-disciplinary Criteria: Developing more robust and epistemically sound factual accuracy criteria to guide training and evaluation.

In summary, the WebGPT paper details a significant advancement in LFQA by demonstrating how a synergistic approach combining web-based retrieval, GPT-3 fine-tuning, and extensive human feedback can produce human-competitive answers. The paper not only marks progress in practical AI capabilities but also opens avenues for more sophisticated and reliable AI-driven information systems.
