- The paper provides a comprehensive review of hallucinations in NLG by categorizing them as intrinsic and extrinsic and identifying key data and training challenges.
- It reviews both statistical and model-based metrics for assessing the faithfulness of generated text across diverse tasks.
- Mitigation strategies, including data cleaning and architectural modifications, are discussed to improve output reliability and guide future research.
Survey of Hallucination in Natural Language Generation
The paper "Survey of Hallucination in Natural Language Generation" by Ziwei Ji et al. offers an insightful overview of the hallucinatory phenomena in Natural Language Generation (NLG), specifically focusing on recent advances in Transformer-based models. Hallucinations in NLG refer to instances where the generated text diverges from the source material, producing text that is either factually incorrect or irrelevant. Despite advancements in generating coherent and fluent text, these hallucinations pose significant challenges across various NLP applications, such as summarization, dialogue systems, machine translation, generative question answering (GQA), data-to-text generation, and vision-language tasks. This survey systematically reviews current progress, metrics, and mitigation methods, and also addresses future research directions.
Definitions and Categorization
Hallucinations in NLG are broadly categorized into intrinsic and extrinsic types. Intrinsic hallucinations directly contradict the source input, whereas extrinsic hallucinations add content that cannot be verified against the source at all, regardless of whether it happens to be factually correct. This categorization underpins the analysis of hallucination across the different NLG tasks covered in the survey.
Contributors to Hallucination
The primary contributors to hallucination are identified as data-related issues and deficiencies in training and inference processes:
- Data-Related Issues: Noise introduced by heuristic data collection, together with inherent source-target divergence in certain NLG tasks, can lead to hallucinations; such mismatches between source and target train models to produce unfaithful outputs (a minimal divergence check is sketched after this list).
- Training and Inference Deficiencies: These include imperfect representation learning, erroneous decoding strategies, exposure bias, and parametric knowledge biases that cause models to favor memorized information over the provided input.
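To make the source-target mismatch concrete, the following is a minimal sketch (not taken from the survey) of a heuristic divergence check: it flags training pairs whose target shares few tokens with its source, a common symptom of noisy, heuristically collected data. The tokenization and the 0.4 threshold are illustrative assumptions.

```python
import re

def content_tokens(text):
    """Lowercase word tokens; a crude stand-in for proper tokenization and stopword removal."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def source_support(source, target):
    """Fraction of target tokens that also appear in the source (0 = fully divergent)."""
    target_tokens = content_tokens(target)
    return len(target_tokens & content_tokens(source)) / max(len(target_tokens), 1)

def flag_divergent_pairs(pairs, threshold=0.4):
    """Return (source, target) training pairs whose target is poorly supported by its source.

    The 0.4 cutoff is an arbitrary illustrative choice, not a value from the paper.
    """
    return [(s, t) for s, t in pairs if source_support(s, t) < threshold]
```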
Metrics for Measuring Hallucination
The survey categorizes hallucination metrics into statistical and model-based approaches:
- Statistical Metrics: Metrics like PARENT and its variants measure lexical overlap between generated text and source/reference texts but fall short in capturing syntactic and semantic variations.
- Model-Based Metrics: These utilize Information Extraction (IE)-based, QA-based, NLI-based, and LM-based methods to evaluate faithfulness. Each approach has strengths and limitations, such as error propagation from the underlying IE or QA models and domain-adaptation issues for NLI models; a minimal NLI-based scoring sketch follows this list.
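As an illustration of the model-based family, below is a minimal sketch of NLI-based faithfulness scoring: the source is treated as the premise, the generated text as the hypothesis, and the entailment probability serves as a faithfulness proxy. It assumes the Hugging Face transformers library and an off-the-shelf MNLI checkpoint (roberta-large-mnli); the label ordering is that checkpoint's convention and should be verified for other models.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"  # any MNLI-style model works; label order may differ
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
model.eval()

def entailment_score(source: str, generated: str) -> float:
    """Probability that the source entails the generated text (higher = more faithful)."""
    inputs = tokenizer(source, generated, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1).squeeze()
    return probs[2].item()  # index 2 = "entailment" for roberta-large-mnli
```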
Mitigation Methods
The survey delineates two main categories of mitigation methods: data-related and modeling/inference techniques:
- Data-Related Methods: These involve building faithful datasets from scratch, cleaning existing datasets to remove noise, and augmenting inputs with explicit knowledge to improve the alignment between input and output; a minimal cleaning filter is sketched after this list.
- Modeling and Inference Techniques: Approaches such as architectural modifications, planning/sketching, reinforcement learning, multi-task learning, and constrained generation strategies are explored to enhance faithfulness and reduce hallucination. Post-processing methods such as generate-then-refine are also employed to correct unfaithful content.
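The data-cleaning idea can be reduced to a simple filter over training pairs, parameterized by any faithfulness scorer such as the entailment sketch above. The 0.5 threshold is an illustrative assumption rather than a value recommended by the survey.

```python
def clean_dataset(pairs, score_fn, threshold=0.5):
    """Keep only (source, target) pairs whose target the scorer judges as supported.

    score_fn: any callable mapping (source, target) -> float, e.g. entailment_score above;
    the 0.5 cutoff is illustrative and would be tuned per task in practice.
    """
    return [(src, tgt) for src, tgt in pairs if score_fn(src, tgt) >= threshold]

# e.g. cleaned = clean_dataset(train_pairs, entailment_score, threshold=0.6)
```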
Task-Specific Insights
Abstractive Summarization
Hallucination in summarization surfaces as summaries that add information absent from the source or contradict it. Fine-grained error typologies distinguish semantic frame errors, discourse errors, and content verifiability errors, and model-based metrics, particularly entailment- and IE-based ones, are especially useful for evaluating such hallucinations.
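A simple proxy in this space is an entity-level check: named entities that appear in the summary but never in the source are likely extrinsic hallucinations. The sketch below assumes spaCy with the en_core_web_sm model installed; surface-string matching is a deliberate simplification.

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # assumes the small English model is installed

def unsupported_entities(source: str, summary: str) -> list[str]:
    """Named entities mentioned in the summary but absent from the source text."""
    source_lower = source.lower()
    return [ent.text for ent in nlp(summary).ents
            if ent.text.lower() not in source_lower]
```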
Dialogue Generation
Both open-domain and task-oriented dialogue systems exhibit self-consistency and external consistency issues. Methods like persona conditioning, retrieval-augmented generation, and response refinement help mitigate hallucinations in dialogue models.
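Retrieval augmentation can be illustrated with a minimal sketch: rank a small knowledge store against the dialogue history using TF-IDF similarity and prepend the best passages to the prompt, giving the response model evidence to ground its reply. It assumes scikit-learn; the prompt format and the downstream response model are placeholders, not anything prescribed by the survey.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve(history: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k knowledge passages most similar to the dialogue history."""
    matrix = TfidfVectorizer().fit_transform(passages + [history])
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    return [passages[i] for i in scores.argsort()[::-1][:k]]

def grounded_prompt(history: str, passages: list[str]) -> str:
    """Prepend retrieved evidence so a (hypothetical) response model can stay faithful."""
    evidence = "\n".join(retrieve(history, passages))
    return f"knowledge: {evidence}\ndialogue: {history}\nresponse:"
```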
Generative Question Answering
GQA systems hallucinate when generating long-form answers that must synthesize content from multiple documents. Current mitigation focuses on stronger retrieval models and on frameworks that aggregate evidence from the retrieved documents more faithfully.
Data-to-Text Generation
Data-to-text generation suffers from the structural and semantic gap between structured inputs (tables, graphs, records) and natural language output. Planning/skeleton-based methods, entity-centric metrics, and hybrid models using variational Bayesian techniques or uncertainty-aware beam search are employed to mitigate hallucinations.
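An entity-centric check for table-to-text can be sketched by verifying that every number surfacing in the generated text actually occurs in the source record. This is a minimal illustration rather than one of the survey's metrics, and the flat field-to-value record format is an assumption.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"

def unsupported_values(record: dict[str, str], text: str) -> list[str]:
    """Numbers in the generated text that do not appear anywhere in the source record."""
    allowed = set(re.findall(NUMBER, " ".join(record.values())))
    return [n for n in re.findall(NUMBER, text) if n not in allowed]

# e.g. unsupported_values({"team": "Hawks", "points": "102"}, "The Hawks scored 108 points.")
# -> ["108"], flagging a likely hallucinated number
```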
Implications and Future Directions
Addressing hallucinations is critical for enhancing the trustworthiness and applicability of NLG systems in real-world scenarios. Future research directions include:
- Developing fine-grained and generalizable metrics that can adapt to different tasks.
- Incorporating human cognitive perspectives to improve automatic evaluation methods.
- Exploring reasoning capabilities of models to reduce intrinsic hallucinations.
- Investigating task-specific nuances such as handling numerical data in reports.
- Enhancing controllability to balance faithfulness against diversity and informativeness.
Conclusion
This comprehensive survey by Ji et al. provides a robust framework for understanding and addressing hallucinations in NLG. It underscores the need for refined metrics, effective mitigation strategies, and continued research toward more accurate and reliable NLG systems.