
Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses (2410.22349v1)

Published 15 Oct 2024 in cs.IR, cs.AI, cs.CL, cs.CY, and cs.HC

Abstract: LLM-based applications are graduating from research prototypes to products serving millions of users, influencing how people write and consume information. A prominent example is the appearance of Answer Engines: LLM-based generative search engines supplanting traditional search engines. Answer engines not only retrieve relevant sources to a user query but synthesize answer summaries that cite the sources. To understand these systems' limitations, we first conducted a study with 21 participants, evaluating interactions with answer vs. traditional search engines and identifying 16 answer engine limitations. From these insights, we propose 16 answer engine design recommendations, linked to 8 metrics. An automated evaluation implementing our metrics on three popular engines (You.com, Perplexity.ai, BingChat) quantifies common limitations (e.g., frequent hallucination, inaccurate citation) and unique features (e.g., variation in answer confidence), with results mirroring user study insights. We release our Answer Engine Evaluation benchmark (AEE) to facilitate transparent evaluation of LLM-based applications.

Analyzing the Sociotechnical Dynamics of Answer Engines in AI-Based Search

The paper "Search Engines in an AI Era: The False Promise of Factual and Verifiable Source-Cited Responses" presents a comprehensive paper on the limitations and societal implications of Answer Engines. As LLMs become increasingly integrated into daily information retrieval tasks, they are metamorphosing from research instruments into influential technologies. This transformation demands an acute understanding of their utility and impact beyond the surface level, especially within the sociotechnical framework that this paper examines.

Key Findings from the Usability Study

The authors conducted an audit-centric usability study involving 21 participants, comparing answer engines with traditional search engines. Through this study, they identified 16 core limitations of answer engines, grouped by the four main components of an answer engine: the generated answer text, citations, sources, and user interface. Three of the most consequential limitations are:

  1. Lack of Objective Detail and Balance: Participants noted that answers often lacked necessary depth and presented one-sided perspectives, limiting the exploration of diverse views, particularly for opinionated or debate-oriented queries.
  2. Confidence and Improper Source Attribution: The study found that answer engines often exhibited unjustified confidence in their responses and frequently misattributed citations (see the sketch after this list). This gap warrants caution regarding the trustworthiness and factuality of the information these engines present.
  3. User Autonomy and Source Transparency: Participants reported little control over source selection and verification, a consequence of a largely opaque system architecture. This inadequacy undermines user trust and autonomy in verifying information accuracy.
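
To make the confidence concern in item 2 concrete, here is a minimal, purely illustrative sketch (not the paper's method and not the AEE benchmark code) of one way answer confidence could be approximated: measuring how rarely an answer hedges. The HEDGES word list and the notion of a hedging rate are assumptions introduced for demonstration only.

```python
# Illustrative proxy for "answer confidence": answers containing almost no
# hedging language are treated as highly confident. The HEDGES set is a
# hypothetical example, not drawn from the paper.
HEDGES = {
    "may", "might", "could", "possibly", "reportedly",
    "suggests", "appears", "likely", "unclear",
}

def hedging_rate(answer: str) -> float:
    """Fraction of tokens that are hedging expressions (a rough proxy)."""
    tokens = [t.strip(".,;:!?").lower() for t in answer.split()]
    if not tokens:
        return 0.0
    return sum(t in HEDGES for t in tokens) / len(tokens)

if __name__ == "__main__":
    confident = "Aspirin is the best treatment and always works."
    hedged = "Aspirin may help, but the evidence suggests results are unclear."
    print(f"confident answer: {hedging_rate(confident):.3f}")  # 0.000
    print(f"hedged answer:    {hedging_rate(hedged):.3f}")     # noticeably higher
```

A production audit would instead elicit calibrated confidence from the model or use trained classifiers; the point here is only that unjustified confidence is measurable in principle.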

Quantitative Evaluation Metrics and Results

Building on insights from the user study, the authors propose eight evaluation metrics for systematically assessing answer engines, covering aspects such as citation accuracy, statement relevance, and source necessity. Applying this framework to three popular answer engines (You.com, Perplexity.ai, and BingChat) revealed substantial room for improvement: the engines frequently generate one-sided and overconfident answers, with Perplexity notably underperforming due to heightened confidence regardless of the nature of the question.
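
As a rough illustration of how a metric like citation accuracy might be computed, the sketch below checks each cited statement against the passage it cites. The lexical-overlap scorer and the 0.5 threshold are stand-in assumptions; the paper's automated evaluation is model-based, so treat this as the shape of the computation rather than the authors' implementation.

```python
def _tokens(text: str) -> set:
    """Lowercased, punctuation-stripped word set."""
    return {w.strip(".,;:!?").lower() for w in text.split()}

def support_score(statement: str, passage: str) -> float:
    """Crude lexical-overlap proxy for 'does the cited source support the
    claim?'. A real pipeline would use an entailment (NLI) model here."""
    s = _tokens(statement)
    return len(s & _tokens(passage)) / max(len(s), 1)

def citation_accuracy(cited_statements, sources, threshold=0.5) -> float:
    """Fraction of (statement, source_id) pairs whose cited source
    supports the statement under the proxy scorer."""
    if not cited_statements:
        return 0.0
    supported = sum(
        support_score(stmt, sources.get(sid, "")) >= threshold
        for stmt, sid in cited_statements
    )
    return supported / len(cited_statements)

if __name__ == "__main__":
    sources = {1: "The study enrolled 21 participants who compared two search interfaces."}
    statements = [
        ("The study enrolled 21 participants.", 1),    # supported by source 1
        ("The study ran for two years in total.", 1),  # unsupported: misattribution
    ]
    print(f"citation accuracy: {citation_accuracy(statements, sources):.2f}")  # 0.50
```

Swapping the overlap proxy for an NLI entailment score is the usual design choice in published citation-verification work; everything else in the loop stays the same.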

Broader Implications

From a practical perspective, the findings underscore the need for continuous evaluation and transparency as these systems permeate sociotechnical domains such as healthcare and education. Theoretically, the work invites reflection on the evolution of answer engines into more comprehensive decision-making tools. As these technologies mature, their influence on users' critical thinking and information-verification practices demands scrutiny.

Future Developments

Looking forward, the field may advance through improved interaction models that incorporate human feedback and richer contextual understanding. Establishing robust governance structures and policies around AI applications remains crucial for mitigating bias and maintaining ethical standards in information dissemination.

Conclusion

This paper emphasizes the importance of developing answer engines that are not only effective at generating useful information but also aligned with ethical, transparent practices that support user empowerment. By conducting a meticulous audit, the authors contribute substantially to the discourse on AI-driven information retrieval systems, setting a precedent for evaluating future AI technologies and their integration into societal frameworks.

Authors (5)
  1. Pranav Narayanan Venkit
  2. Philippe Laban
  3. Yilun Zhou
  4. Yixin Mao
  5. Chien-Sheng Wu