Is There No Such Thing as a Bad Question? H4R: HalluciBot For Ratiocination, Rewriting, Ranking, and Routing (2404.12535v3)
Abstract: Hallucination continues to be one of the most critical challenges in the institutional adoption journey of LLMs. While prior studies have primarily focused on the post-generation analysis and refinement of outputs, this paper centers on the effectiveness of queries in eliciting accurate responses from LLMs. We present HalluciBot, a model that estimates a query's propensity to hallucinate before generation, without invoking any LLMs during inference. HalluciBot can serve as a proxy reward model for query rewriting, offering a general framework to estimate query quality based on accuracy and consensus. In essence, HalluciBot investigates how poorly constructed queries can lead to erroneous outputs; moreover, by employing query rewriting guided by HalluciBot's empirical estimates, we demonstrate that 95.7% output accuracy can be achieved for Multiple Choice questions. The training procedure for HalluciBot consists of perturbing 369,837 queries n times, employing n+1 independent LLM agents, sampling an output from each query, conducting a Multi-Agent Monte Carlo simulation on the sampled outputs, and training an encoder classifier. The idea of perturbation is the outcome of our ablation studies, which measure the increase in output diversity (+12.5 agreement spread) obtained by perturbing a query in lexically different but semantically similar ways. Therefore, HalluciBot paves the way to ratiocinate (76.0% test F1 score, 46.6% in saved computation on hallucinatory queries), rewrite (+30.2% positive class transition from hallucinatory to non-hallucinatory), rank (+50.6% positive class transition from hallucinatory to non-hallucinatory), and route queries to effective pipelines.
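The Multi-Agent Monte Carlo labeling step described in the abstract (sample one output per perturbed query from independent agents, then score accuracy and consensus) can be sketched as follows. This is a minimal illustrative sketch: the function name `monte_carlo_label`, the 0.5 threshold, and the exact labeling rule are assumptions, not the paper's published implementation.

```python
from collections import Counter

def monte_carlo_label(sampled_outputs, gold_answer, accuracy_threshold=0.5):
    """Derive a hallucination training label from Monte Carlo samples.

    sampled_outputs: one answer per independent LLM agent, each agent
    responding to a different lexical perturbation of the same query
    (plus the original query), as in the n+1-agent setup.
    Returns (empirical_accuracy, agreement, is_hallucinatory).
    """
    n = len(sampled_outputs)
    # Empirical accuracy: fraction of sampled answers matching the gold answer.
    empirical_accuracy = sum(1 for a in sampled_outputs if a == gold_answer) / n
    # Agreement: fraction of samples that match the modal (consensus) answer,
    # a proxy for the "agreement spread" the abstract reports.
    modal_count = Counter(sampled_outputs).most_common(1)[0][1]
    agreement = modal_count / n
    # A query whose samples are mostly wrong is labeled hallucinatory; this
    # binary label would then supervise the encoder classifier.
    is_hallucinatory = empirical_accuracy < accuracy_threshold
    return empirical_accuracy, agreement, is_hallucinatory
```

For example, six samples of which five agree on the correct answer yield high accuracy and agreement (a non-hallucinatory label), while six scattered, mostly wrong samples yield a hallucinatory label; the encoder classifier then learns to predict this label from the query text alone, so no LLM is invoked at inference time.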