- The paper shows WikiChat achieving 97.3% factual accuracy in simulated conversations through a seven-stage pipeline that integrates information retrieval with LLM generation.
- It distills the system into a 7-billion-parameter LLaMA model, cutting latency and operational costs for secure, real-world deployments.
- A hybrid evaluation combining human factuality judgments with automatic conversationality metrics shows WikiChat outperforming baselines such as GPT-4 in accuracy and conversational quality.
An Examination of WikiChat: Mitigating Hallucinations in LLM-Based Chatbots
The paper "WikiChat: Stopping the Hallucination of LLM Chatbots by Few-Shot Grounding on Wikipedia" addresses a critical challenge in the deployment of LLM chatbots: their tendency to produce inaccurate or misleading responses, commonly referred to as hallucinations. The authors present WikiChat, a novel few-shot LLM-based chatbot that is grounded on the English Wikipedia corpus, designed to enhance factual accuracy, conversationality, and response latency.
WikiChat's architecture is a seven-stage pipeline that combines information retrieval (IR) techniques with the generative capabilities of LLMs, curating relevant, factually accurate information from which responses are formulated. The pipeline begins with a query generation stage that uses the conversational context to retrieve pertinent passages from Wikipedia; the retrieved text is then summarized and filtered down to the facts relevant to the current turn. A key innovation is that WikiChat also lets the LLM draft a response from its own parametric knowledge, extracts the individual claims in that draft, and fact-checks each claim against the retrieved evidence, discarding anything unsupported before the final response is drafted and refined. The sketch below outlines this flow.
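To make the pipeline concrete, here is a minimal Python sketch of one conversational turn. The helpers `llm` and `search_wikipedia`, along with all prompt strings, are hypothetical placeholders of my own, not the paper's few-shot prompts or retrieval stack; the stage ordering follows the paper's description.

```python
def llm(prompt: str) -> str:
    """Placeholder for a single few-shot-prompted LLM call (assumption)."""
    raise NotImplementedError

def search_wikipedia(query: str, k: int = 3) -> list[str]:
    """Placeholder for retrieval over a Wikipedia index (assumption)."""
    raise NotImplementedError

def wikichat_turn(history: list[str], user_utterance: str) -> str:
    dialogue = "\n".join(history + [f"User: {user_utterance}"])

    # Stage 1: generate a search query from the conversational context.
    query = llm(f"Generate a search query for this conversation:\n{dialogue}")

    # Stage 2: retrieve candidate passages from Wikipedia.
    passages = search_wikipedia(query)

    # Stage 3: summarize and filter the passages into relevant facts.
    evidence = llm("Extract the facts relevant to the conversation from:\n"
                   + "\n".join(passages))

    # Stage 4: let the LLM draft a response from its parametric knowledge.
    draft = llm(f"Respond to this conversation:\n{dialogue}")

    # Stage 5: break the draft into individual factual claims.
    claims = llm(f"List each factual claim in:\n{draft}").splitlines()

    # Stage 6: fact-check every claim against the evidence; keep only
    # those the evidence supports.
    supported = [c for c in claims
                 if "SUPPORTED" in llm(f"Evidence: {evidence}\nClaim: {c}\n"
                                       "Answer SUPPORTED or REFUTED.")]

    # Stage 7: draft the final response from the curated facts, then
    # refine it for relevance, naturalness, and temporal correctness.
    final = llm("Write a response using only these facts:\n"
                + "\n".join(supported) + f"\nConversation:\n{dialogue}")
    return llm(f"Refine for naturalness, relevance, and recency:\n{final}")
```

Note how stages 4 to 6 let the model contribute knowledge beyond the retrieved passages while still forcing every claim through verification, which is what distinguishes WikiChat from plain retrieval-augmented generation.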
One of WikiChat's notable achievements is the distillation of its capabilities into a 7-billion-parameter LLaMA model, substantially reducing latency and operational cost while maintaining high-quality output. The distilled model runs the same pipeline and offers a pragmatic answer to the latency and privacy concerns that have traditionally hindered the use of third-party LLM APIs in sensitive environments.
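One plausible reading of this distillation step is per-stage supervised fine-tuning: the GPT-4-based pipeline acts as a teacher, its (stage input, stage output) pairs become training targets, and the LLaMA student learns to stand in for the teacher LLM inside the same pipeline. The sketch below follows that reading; the model name, data fields, and hyperparameters are illustrative assumptions, not the paper's exact recipe.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"  # stand-in; the paper fine-tuned LLaMA-7B
tokenizer = AutoTokenizer.from_pretrained(MODEL)
student = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
student.train()

# Hypothetical distillation data: one example per pipeline-stage call made
# by the teacher (query generation, claim extraction, fact-checking, ...).
stage_examples = [
    {"stage_input": "Generate a search query for this conversation:\n...",
     "teacher_output": "2023 Academy Awards Best Picture"},
]

optimizer = torch.optim.AdamW(student.parameters(), lr=2e-5)
for ex in stage_examples:  # batch size 1 for brevity
    # Concatenate the stage prompt and the teacher's output; train with
    # the standard causal-LM loss (masking prompt tokens is a common refinement).
    text = ex["stage_input"] + ex["teacher_output"] + tokenizer.eos_token
    enc = tokenizer(text, truncation=True, max_length=2048, return_tensors="pt")
    out = student(input_ids=enc["input_ids"], labels=enc["input_ids"])
    out.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```

Because the student only has to imitate well-scoped stage behaviors rather than memorize facts, a 7B model can plausibly substitute for a much larger teacher without the accuracy collapse one would expect from distilling open-ended generation.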
A significant contribution of this research is its hybrid evaluation methodology, which combines human evaluation of factual accuracy with automatic assessment of conversationality metrics. This multi-faceted approach gives a comprehensive picture of WikiChat's performance, showing it outperforming existing retrieval-based and LLM chatbots in factual accuracy, especially on head, tail, and recent knowledge topics. WikiChat reaches 97.3% factual accuracy in simulated conversations, a substantial improvement over baseline models such as GPT-4.
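A small sketch of how such hybrid scoring can be aggregated: factual accuracy is the fraction of extracted claims that human annotators verified, while conversationality sub-scores come from an automatic judge. The data layout, the `llm_judge` callable, and the exact metric names here are paraphrased assumptions, not the paper's evaluation code.

```python
from statistics import mean

def factual_accuracy(annotated_turns: list[dict]) -> float:
    """Fraction of claims that human annotators marked as verified."""
    claims = [c for turn in annotated_turns for c in turn["claims"]]
    return sum(c["verified"] for c in claims) / len(claims)

def conversationality(turns: list[dict], llm_judge,
                      metrics=("relevance", "informativeness", "naturalness",
                               "non-repetitiveness", "temporal correctness")):
    """Average automatic-judge rating per conversational quality metric."""
    return {m: mean(llm_judge(turn["response"], metric=m) for turn in turns)
            for m in metrics}
```

Splitting the work this way plays to each evaluator's strength: humans are trusted on claim-level factuality, where automatic judges are least reliable, while scalable automatic scoring covers the subjective conversational qualities across many simulated dialogues.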
On the practical side, the deployment of WikiChat indicates a promising path forward for chatbots in knowledge-intensive domains where accuracy and trust are paramount. That the model can be distilled effectively also opens avenues for tailor-made deployments across diverse applications, including those requiring enhanced confidentiality or bespoke knowledge corpora.
Theoretically, the paper underscores the potential for improving LLM reliability through the strategic integration of IR methods with LLM generative capabilities. The architecture not only mitigates hallucinations but also maintains user engagement, preserving conversational naturalness without compromising factual rigor.
Looking forward, the developments showcased in WikiChat could inform future research in several areas: extending the model to other language corpora, integrating real-time updates, and exploring the scalability of similar architectures in specialized fields like healthcare or legal advisory where factual accuracy is non-negotiable. Additionally, further advancements may focus on refining the model to handle multi-turn dialogues more efficiently, enhancing its applicability in complex conversational AI systems.
Overall, WikiChat represents a significant advancement in building reliable, engaging, and efficient chatbots. It demonstrates an impactful combination of LLMs with robust grounding mechanisms, marking a step forward in the quest to mitigate one of the most pressing challenges faced by the AI community today—hallucination in conversational models. This paper provides a compelling example of how grounded approaches could transform LLM-based systems, ultimately driving adoption and trust in AI-driven responses.