CEB: Compositional Evaluation Benchmark for Fairness in Large Language Models (2407.02408v1)

Published 2 Jul 2024 in cs.CL and cs.LG

Abstract: As LLMs are increasingly deployed to handle various NLP tasks, concerns regarding the potential negative societal impacts of LLM-generated content have also arisen. To evaluate the biases exhibited by LLMs, researchers have recently proposed a variety of datasets. However, existing bias evaluation efforts often focus on only a particular type of bias and employ inconsistent evaluation metrics, leading to difficulties in comparison across different datasets and LLMs. To address these limitations, we collect a variety of datasets designed for the bias evaluation of LLMs, and further propose CEB, a Compositional Evaluation Benchmark that covers different types of bias across different social groups and tasks. The curation of CEB is based on our newly proposed compositional taxonomy, which characterizes each dataset along three dimensions: bias types, social groups, and tasks. By combining the three dimensions, we develop a comprehensive evaluation strategy for bias in LLMs. Our experiments demonstrate that the levels of bias vary across these dimensions, thereby providing guidance for the development of specific bias mitigation methods.
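
The core idea is that every evaluation item sits at the intersection of the taxonomy's three dimensions, so the benchmark is effectively a grid of (bias type, social group, task) cells. Below is a minimal Python sketch of how such a compositional grid could be enumerated; the dimension values and the `EvalConfig` name are illustrative assumptions based on the paper's framing, not definitions quoted from the abstract.

```python
# Hypothetical sketch: enumerate the compositional grid of evaluation
# cells implied by a three-dimensional taxonomy. The specific labels
# below are assumed examples, not the paper's exact category set.
from dataclasses import dataclass
from itertools import product

BIAS_TYPES = ["stereotyping", "toxicity"]                             # assumed
SOCIAL_GROUPS = ["age", "gender", "race", "religion"]                 # assumed
TASKS = ["recognition", "selection", "continuation", "conversation"]  # assumed

@dataclass(frozen=True)
class EvalConfig:
    """One cell of the compositional benchmark grid."""
    bias_type: str
    social_group: str
    task: str

# Crossing the three dimensions yields the full set of evaluation cells,
# which is what allows bias levels to be compared along each dimension.
configs = [EvalConfig(b, g, t) for b, g, t in product(BIAS_TYPES, SOCIAL_GROUPS, TASKS)]
print(len(configs))  # 2 x 4 x 4 = 32 cells in this illustrative grid
```

Scoring one metric per cell and then aggregating along any single dimension is what would make results comparable across datasets and models, which is the gap in prior bias benchmarks that the abstract identifies.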

Authors (6)
  1. Song Wang (313 papers)
  2. Peng Wang (831 papers)
  3. Tong Zhou (124 papers)
  4. Yushun Dong (47 papers)
  5. Zhen Tan (68 papers)
  6. Jundong Li (126 papers)
Citations (4)