
Do Large Language Models Rank Fairly? An Empirical Study on the Fairness of LLMs as Rankers (2404.03192v2)

Published 4 Apr 2024 in cs.IR and cs.CL

Abstract: The integration of LLMs into information retrieval has prompted a critical reevaluation of fairness in text-ranking models. LLMs such as the GPT models and Llama2 have shown strong performance on natural language understanding tasks, and prior work (e.g., RankGPT) has demonstrated that LLMs outperform traditional ranking models on the ranking task. However, their fairness remains largely unexplored. This paper presents an empirical study that evaluates these LLMs on the TREC Fair Ranking dataset, focusing on the representation of binary protected attributes such as gender and geographic location, which are historically underrepresented in search outcomes. Our analysis examines how these LLMs handle queries and documents related to these attributes, aiming to uncover biases in their ranking behavior. We assess fairness from both the user and the content perspective, contributing an empirical benchmark for evaluating LLMs as fair rankers.
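The content-side fairness the abstract describes is commonly quantified by how a ranking distributes position-discounted exposure across protected groups (in the spirit of the fairness-of-exposure line of work cited below). The following is a minimal illustrative sketch of such a measure, not the paper's actual evaluation protocol; the function name and the DCG-style discount are assumptions for illustration.

```python
import math

def group_exposure(ranking, group_of):
    """Share of position-discounted exposure each group receives.

    ranking:  list of document ids, best-ranked first.
    group_of: dict mapping document id -> protected-group label.
    Rank i (0-based) contributes exposure 1 / log2(i + 2),
    the standard DCG position discount.
    """
    totals = {}
    for i, doc in enumerate(ranking):
        weight = 1.0 / math.log2(i + 2)
        group = group_of[doc]
        totals[group] = totals.get(group, 0.0) + weight
    norm = sum(totals.values())
    return {g: v / norm for g, v in totals.items()}

# A ranking that places all group-A documents first concentrates
# exposure on A even though both groups have two documents each.
shares = group_exposure(
    ["d1", "d2", "d3", "d4"],
    {"d1": "A", "d2": "A", "d3": "B", "d4": "B"},
)
```

Comparing each group's exposure share against its share of (equally relevant) documents gives a simple disparity signal of the kind such fairness benchmarks report.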

References (56)
  1. Persistent anti-muslim bias in large language models. In AIES ’21: AAAI/ACM Conference on AI, Ethics, and Society, Virtual Event, USA, May 19-21, 2021, pages 298–306. ACM.
  2. Designing fair ranking schemes. In Proceedings of the 2019 International Conference on Management of Data, page 1259–1276, New York, NY, USA. Association for Computing Machinery.
  3. Constitutional AI: Harmlessness from AI feedback.
  4. Fairness in recommendation ranking through pairwise comparisons. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, page 2212–2220, New York, NY, USA. Association for Computing Machinery.
  5. Equity of attention: Amortizing individual fairness in rankings. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, page 405–414, New York, NY, USA. Association for Computing Machinery.
  6. Language models are few-shot learners. In Advances in Neural Information Processing Systems, volume 33, pages 1877–1901. Curran Associates, Inc.
  7. Sparks of artificial general intelligence: Early experiments with GPT-4. CoRR, abs/2303.12712.
  8. Ranking with fairness constraints. In 45th International Colloquium on Automata, Languages, and Programming, 2018, July 9-13, 2018, Prague, Czech Republic, volume 107, pages 28:1–28:15. Schloss Dagstuhl - Leibniz-Zentrum für Informatik.
  9. Marked personas: Using natural language prompts to measure stereotypes in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 1504–1532. Association for Computational Linguistics.
  10. Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. In International Conference on Computer Vision.
  11. Overview of the trec 2021 fair ranking track. In The Thirtieth Text REtrieval Conference (TREC 2021) Proceedings.
  12. RealToxicityPrompts: Evaluating neural toxic degeneration in language models. In Findings of the Association for Computational Linguistics: EMNLP 2020.
  13. Fabian Haak and Philipp Schaer. 2022. Auditing search query suggestion bias through recursive algorithm interrogation. In 14th ACM Web Science Conference 2022, WebSci ’22, page 219–227, New York, NY, USA. Association for Computing Machinery.
  14. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations.
  15. Social biases in NLP models as barriers for persons with disabilities. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5491–5501, Online. Association for Computational Linguistics.
  16. Mistral 7B.
  17. Text-to-text multi-view learning for passage re-ranking. In SIGIR, pages 1803–1807. ACM.
  18. Jon Kleinberg and Manish Raghavan. 2018. Selection Problems in the Presence of Implicit Bias. In 9th Innovations in Theoretical Computer Science Conference, volume 94 of Leibniz International Proceedings in Informatics, pages 33:1–33:17, Dagstuhl, Germany. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
  19. Gender bias and stereotypes in large language models. In Proceedings of The ACM Collective Intelligence Conference, CI ’23, page 12–24, New York, NY, USA. Association for Computing Machinery.
  20. iFair: Learning individually fair data representations for algorithmic decision making. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pages 1334–1345.
  21. Holistic evaluation of language models. CoRR, abs/2211.09110.
  22. Diversified subgraph query generation with group fairness. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, WSDM ’22, page 686–694, New York, NY, USA. Association for Computing Machinery.
  23. Fine-tuning llama for multi-stage text retrieval. CoRR, abs/2310.08319.
  24. Zero-shot listwise document reranking with a large language model. CoRR, abs/2305.02156.
  25. PEFT: State-of-the-art parameter-efficient fine-tuning methods. https://github.com/huggingface/peft.
  26. CrowS-pairs: A challenge dataset for measuring social biases in masked language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1953–1967, Online. Association for Computational Linguistics.
  27. Rodrigo Nogueira and Kyunghyun Cho. 2020. Passage re-ranking with BERT.
  28. Document ranking with a pretrained sequence-to-sequence model. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 708–718, Online. Association for Computational Linguistics.
  29. Multi-stage document ranking with BERT. CoRR, abs/1910.14424.
  30. OpenAI. 2023. GPT-4 technical report.
  31. Training language models to follow instructions with human feedback.
  32. BBQ: A hand-built bias benchmark for question answering. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland, May 22-27, 2022, pages 2086–2105. Association for Computational Linguistics.
  33. Red teaming language models with language models. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 3419–3448, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  34. The expando-mono-duo design pattern for text ranking with pretrained sequence-to-sequence models. CoRR, abs/2101.05667.
  35. Large language models are effective text rankers with pairwise ranking prompting. CoRR, abs/2306.17563.
  36. Aida Ramezani and Yang Xu. 2023. Knowledge of cultural moral norms in large language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 428–446. Association for Computational Linguistics.
  37. Improving passage retrieval with zero-shot question generation. In EMNLP, pages 3781–3797. Association for Computational Linguistics.
  38. Nlpositionality: Characterizing design biases of datasets and models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2023, Toronto, Canada, July 9-14, 2023, pages 9080–9102. Association for Computational Linguistics.
  39. Large language models are not yet human-level evaluators for abstractive summarization. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 4215–4233, Singapore. Association for Computational Linguistics.
  40. Ashudeep Singh and Thorsten Joachims. 2018. Fairness of exposure in rankings. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’18, page 2219–2228, New York, NY, USA. Association for Computing Machinery.
  41. Online set selection with fairness and diversity constraints. In Proceedings of the EDBT Conference.
  42. Is ChatGPT good at search? Investigating large language models as re-ranking agents. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14918–14937, Singapore. Association for Computational Linguistics.
  43. Found in the middle: Permutation self-consistency improves listwise ranking in large language models. CoRR, abs/2310.07712.
  44. Llama 2: Open foundation and fine-tuned chat models.
  45. Decodingtrust: A comprehensive assessment of trustworthiness in GPT models. CoRR, abs/2306.11698.
  46. A meta-learning approach to fair ranking. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’22, page 2539–2544, New York, NY, USA. Association for Computing Machinery.
  47. A unified meta-learning framework for fair ranking with curriculum learning. IEEE Transactions on Knowledge and Data Engineering.
  48. Balanced ranking with diversity constraints. In Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, pages 6035–6042. International Joint Conferences on Artificial Intelligence Organization.
  49. Measuring fairness in ranked outputs. In Proceedings of the 29th International Conference on Scientific and Statistical Database Management, New York, NY, USA. Association for Computing Machinery.
  50. FA*IR: A fair top-k ranking algorithm. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management, page 1569–1578, New York, NY, USA. Association for Computing Machinery.
  51. Meike Zehlike and Carlos Castillo. 2020. Reducing disparate exposure in ranking: A learning to rank approach. In Proceedings of The Web Conference 2020, WWW ’20, page 2849–2855, New York, NY, USA. Association for Computing Machinery.
  52. Matching code and law: Achieving algorithmic fairness with optimal transport. Data Min. Knowl. Discov., 34(1):163–200.
  53. Fairness in ranking, Part I: Score-based ranking. ACM Comput. Surv., 55(6).
  54. Is ChatGPT fair for recommendation? Evaluating fairness in large language model recommendation. In Proceedings of the 17th ACM Conference on Recommender Systems, RecSys ’23, page 993–999, New York, NY, USA. Association for Computing Machinery.
  55. Beyond yes and no: Improving zero-shot LLM rankers via scoring fine-grained relevance labels. CoRR, abs/2310.14122.
  56. Open-source large language models are strong zero-shot query likelihood models for document ranking. In EMNLP, pages 8807–8817. Association for Computational Linguistics.
Authors (5)
  1. Yuan Wang (251 papers)
  2. Xuyang Wu (31 papers)
  3. Hsin-Tai Wu (12 papers)
  4. Zhiqiang Tao (26 papers)
  5. Yi Fang (151 papers)
Citations (4)
