Bias and Unfairness in Information Retrieval Systems: New Challenges in the LLM Era (2404.11457v2)
Abstract: With the rapid advancement of LLMs, information retrieval (IR) systems, such as search engines and recommender systems, have undergone a significant paradigm shift. This evolution, while heralding new opportunities, introduces emerging challenges, particularly in terms of bias and unfairness, which may threaten the information ecosystem. In this paper, we present a comprehensive survey of existing work on the emerging and pressing bias and unfairness issues that arise when LLMs are integrated into IR systems. We first unify bias and unfairness issues as distribution mismatch problems, laying the groundwork for categorizing various mitigation strategies through distribution alignment. Subsequently, we systematically delve into the specific bias and unfairness issues arising from three critical stages of LLM integration into IR systems: data collection, model development, and result evaluation. In doing so, we meticulously review and analyze recent literature, focusing on the definitions, characteristics, and corresponding mitigation strategies associated with these issues. Finally, we identify and highlight open problems and challenges for future work, aiming to inspire researchers and stakeholders in the IR field and beyond to better understand and mitigate bias and unfairness issues in IR in this LLM era. We also maintain a GitHub repository of relevant papers and resources in this rising direction at https://github.com/KID-22/LLM-IR-Bias-Fairness-Survey.
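To make the unifying perspective concrete, here is a minimal illustrative formulation (a sketch in our own notation, not the survey's exact definitions): a bias or unfairness issue is viewed as a mismatch between the distribution P_M actually induced by the data, model, or evaluation stage and a target distribution P_T encoding the desired behavior; mitigation then amounts to aligning the two.

```latex
% Illustrative sketch only. P_M: distribution induced by the IR system
% (e.g., exposure over items/providers, or an LLM ranker's outputs);
% P_T: target distribution encoding the desired behavior;
% D: a divergence measure, e.g., the KL divergence.
\min_{\theta} \; D\bigl(P_M(\cdot \mid \theta) \,\|\, P_T\bigr),
\qquad
D_{\mathrm{KL}}(P_M \,\|\, P_T) = \sum_{x} P_M(x) \log \frac{P_M(x)}{P_T(x)}.
```

Under this view, different choices of P_T recover different issues: a factual or unbiased reference distribution for data- and model-level biases, and a fair exposure distribution over user or item groups for unfairness in ranking and recommendation.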
Authors: Sunhao Dai, Chen Xu, Shicheng Xu, Liang Pang, Zhenhua Dong, Jun Xu