CHARP: Conversation History AwaReness Probing for Knowledge-grounded Dialogue Systems (2405.15110v1)
Abstract: In this work, we dive deep into FaithDial, a popular knowledge-grounded dialogue benchmark that focuses on faithfulness. We show that a significant portion of the FaithDial data contains annotation artifacts, which may bias models towards completely ignoring the conversation history. We therefore introduce CHARP, a diagnostic test set designed for an improved evaluation of hallucinations in conversational models. CHARP measures not only hallucination but also the models' compliance with the conversation task. Our extensive analysis reveals that models perform poorly on CHARP primarily because they fail to effectively attend to and reason over the conversation history. Furthermore, the evaluation methods of FaithDial fail to capture these shortcomings, as they neglect the conversational history. Our findings indicate that there is substantial room for contribution in both dataset creation and hallucination evaluation for knowledge-grounded dialogue, and that CHARP can serve as a tool for monitoring progress in this research area. CHARP is publicly available at https://huggingface.co/datasets/huawei-noah/CHARP
Authors: Abbas Ghaddar, David Alfonso-Hermelo, Philippe Langlais, Mehdi Rezagholizadeh, Boxing Chen, Prasanna Parthasarathi