- The paper introduces BAID, a benchmark that systematically evaluates bias in AI-generated text detectors across sociolinguistic factors.
- It employs a diagnostic dataset of over 208k text pairs to compare authentic human writing with AI rewrites conditioned on seven bias axes.
- Results reveal significant subgroup disparities in precision, recall, and F1 scores, highlighting the need for bias-aware, fair detector designs.
Systematic Bias Evaluation of AI-Generated Text Detectors with BAID
Motivation and Context
The proliferation of LLMs like GPT-4 and LLaMA has driven substantial adoption of AI-generated text detectors in education, publishing, and online moderation. These detectors primarily operate under a binary paradigm, classifying texts as either human- or AI-generated. While substantial effort has gone into improving detection accuracy, systematic evaluation of fairness across sociolinguistic dimensions has received far less attention. Empirical evidence (Liang et al., 2023) has shown that certain detectors disproportionately misclassify non-native English writing as AI-generated, exposing real harms in deployment. This paper proposes the Bias Assessment of AI Detectors (BAID) benchmark to systematically address these gaps.
Benchmark Design and Methodology
BAID constructs a comprehensive, large-scale diagnostic dataset (208k+ pairs) targeting seven bias axes: demographics (race/ethnicity, gender, socioeconomic and disability status, ELL), age, educational grade level, dialect, formality, political leaning, and topic. For each subgroup, authentic human-written texts are collected and matched with subgroup-conditioned AI rewrites generated by LLMs (GPT-4.1, Claude Sonnet 3.7) under carefully crafted prompts that enforce semantic invariance while varying stylistic attributes.
Rigorous data validation ensures high semantic alignment (cosine similarity > 0.85) and filters out low-quality generations, preserving subgroup signal integrity. By explicitly controlling for content and subgroup writing style, BAID isolates bias effects attributable to linguistic and sociocultural factors.
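The paper reports a cosine-similarity filter (> 0.85) for semantic alignment but does not specify the embedding model, so the sketch below assumes embedding vectors are supplied by some external encoder; the function names are illustrative, not BAID's:

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def keep_pair(human_emb, rewrite_emb, threshold=0.85):
    """Retain a human/AI pair only if semantic alignment clears the threshold."""
    return cosine_similarity(human_emb, rewrite_emb) > threshold
```

Filtering at the pair level keeps the human original and its rewrite semantically matched, so any detector disparity can be attributed to style rather than content.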
Evaluation Protocol
Four open-source detectors spanning neural (Desklib, E5-small, Radar) and statistical (ZipPy) architectures are benchmarked using default thresholds. Performance metrics (precision, recall, F1) are calculated per subgroup on human-written text (primary focus for fairness), and bias analyses are conducted along each dimension.
Notably, the evaluation isolates subgroup effects strictly in human-authored samples—synthetic subgroup conditioning in AI outputs is treated as a calibration assessment, not inherent bias measurement.
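The per-subgroup bookkeeping can be sketched as follows. The label convention (treating "AI-generated" as the positive class, so a flagged human text counts as a false positive) is an assumption, since the paper's exact accounting is not given:

```python
from collections import defaultdict

def subgroup_metrics(records):
    """Compute precision/recall/F1 per subgroup.

    records: iterable of (subgroup, true_label, predicted_label),
    labels being 'human' or 'ai'. 'ai' is the positive class, so a
    human-written text flagged as AI counts as a false positive.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for subgroup, truth, pred in records:
        c = counts[subgroup]
        if truth == "ai" and pred == "ai":
            c["tp"] += 1
        elif truth == "human" and pred == "ai":
            c["fp"] += 1
        elif truth == "ai" and pred == "human":
            c["fn"] += 1
    metrics = {}
    for subgroup, c in counts.items():
        p = c["tp"] / (c["tp"] + c["fp"]) if c["tp"] + c["fp"] else 0.0
        r = c["tp"] / (c["tp"] + c["fn"]) if c["tp"] + c["fn"] else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        metrics[subgroup] = {"precision": p, "recall": r, "f1": f1}
    return metrics
```

Under this convention, a subgroup whose human writing is frequently flagged shows up as depressed precision for that subgroup, which is exactly the fairness signal BAID targets.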
Empirical Results
- Precision: Neural detectors (Desklib, E5) achieve high precision (0.97–0.99) on demographic and grade-level subgroups, but exhibit sharp declines for dialectal and informal writing (e.g., Desklib drops to 0.44 for Singlish and 0.16 for GenZ). ZipPy's statistical compression approach exhibits lower and less consistent precision, particularly on short and non-standard texts.
- Recall: Desklib sustains robust recall (0.83–0.96) on demographic/grade-level axes, but underperforms for dialects (0.12–0.35) and informal registers. E5's recall is markedly low for demographic/political groups (0.03–0.45) but high for certain dialects. ZipPy achieves high recall for age, dialect, and topic (0.95–0.99), but collapses on demographic and grade-level texts, reflecting sensitivity to text length and lexical diversity.
- F1: Desklib and Radar rank highest for aggregate F1 on standard demographic axes, yet both suffer substantial drops on dialect and formality. ZipPy's F1 is lowest overall for demographic and grade-level, but closes the gap in dialectal/topic subgroups due to recall dominance.
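ZipPy's compression-based idea can be illustrated with a toy heuristic: a sample that compresses much better when appended to a corpus of known AI text shares statistical structure with that corpus. This zlib sketch is only in the spirit of ZipPy (the real tool uses LZMA against its own shipped prelude), and `AI_PRELUDE` here is a hypothetical stand-in:

```python
import zlib

# Hypothetical prelude of known AI-generated text; the real ZipPy ships its own.
AI_PRELUDE = ("The rapid advancement of artificial intelligence has transformed "
              "numerous industries, offering unprecedented opportunities. " * 20)

def compression_score(sample, prelude=AI_PRELUDE):
    """How much better the sample compresses when appended to an AI prelude.

    Higher scores mean the sample shares more statistical structure with
    the prelude, i.e., looks more AI-like under this toy heuristic.
    """
    baseline = len(zlib.compress(prelude.encode()))
    combined = len(zlib.compress((prelude + sample).encode()))
    marginal = combined - baseline           # bytes the sample adds
    alone = len(zlib.compress(sample.encode()))
    return 1.0 - marginal / alone            # savings from shared structure
```

For short inputs, `marginal` and `alone` are only a few bytes, so the score is dominated by compression overhead, which illustrates the length sensitivity observed in ZipPy's results.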
Bias Characterization
- Disparities in detection efficacy are not uniform; underrepresented or non-standard subgroups (AAVE, Singlish, GenZ) consistently display reduced recall and F1, indicating elevated false-positive risk (authentic writing flagged as AI-generated) when texts diverge from canonical norms.
- Dialect and informal style constitute the axes of maximal bias—existing detectors are not equitably robust in these settings.
- Statistical methods (ZipPy) exhibit input-length dependency, which introduces additional bias in detection results for shorter or idiosyncratic texts.
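A disparity summary of the kind these findings imply can be sketched as a best/worst spread over per-subgroup scores; the input shape (subgroup name mapped to a dict of metric values) and the report format are illustrative, not the paper's:

```python
def disparity_report(metrics, key="recall"):
    """Summarize subgroup disparity for one metric: best, worst, and gap.

    metrics: dict mapping subgroup name -> {"recall": ..., ...}.
    """
    ranked = sorted(metrics.items(), key=lambda kv: kv[1][key])
    worst_name, worst = ranked[0]
    best_name, best = ranked[-1]
    return {
        "best": (best_name, best[key]),
        "worst": (worst_name, worst[key]),
        "gap": best[key] - worst[key],
    }
```

Reporting the gap alongside aggregate scores makes the subgroup failures visible that a single pooled F1 would hide.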
Additional Insights
- Evaluations on AI-generated texts (prompt-conditioned for subgroup style) reaffirm detectors’ reliability for synthetic outputs (uniformly high recall), but do not reveal meaningful subgroup bias as these texts do not encode genuine demographic identity.
- Detector architecture and input characteristics (length, style) interact non-trivially, undermining simple cross-system comparisons.
Limitations
The benchmark analysis is constrained to open-source, English-only detectors, excluding large-scale commercial systems and multilingual contexts. Detector-specific sensitivity to length, prompt design, and pretraining-corpus biases limits direct comparability. Bias types not yet accounted for (e.g., code-switching, cross-lingual writing) merit further study.
Implications and Future Directions
The findings provide clear evidence that aggregate accuracy statistics mask critical fairness failures at subgroup level. Existing detectors are inadequately calibrated for linguistic and demographic diversity, inadvertently exposing underrepresented writers to increased misclassification risk. This necessitates:
- Bias-aware detector design and auditing prior to real-world deployment, especially in high-stakes contexts (education, publishing).
- Expansion of detector training corpora to include diverse writing styles, dialectal variants, and non-standard registers.
- Dynamic thresholding and meta-learning approaches for context-sensitive calibration.
- Development of robust, multilingual, and hybrid models that generalize across sociolinguistic boundaries.
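One simple form of the dynamic thresholding suggested above is to calibrate a per-subgroup decision threshold against a target false-positive rate on held-out human writing. A sketch, where the function name, `max_fpr`, and the score convention (higher = more AI-like, flag if score exceeds threshold) are assumptions:

```python
def calibrate_threshold(human_scores, max_fpr=0.01):
    """Lowest threshold whose false-positive rate on held-out human-written
    scores stays within max_fpr; texts scoring strictly above it are flagged."""
    ordered = sorted(human_scores, reverse=True)
    allowed = int(max_fpr * len(ordered))   # how many human texts may be flagged
    if allowed >= len(ordered):
        return min(ordered)
    # Exactly `allowed` distinct scores lie strictly above this value.
    return ordered[allowed]
```

Applied per subgroup, e.g. `{g: calibrate_threshold(s) for g, s in scores_by_group.items()}`, this equalizes false-positive exposure across groups at the cost of a group-dependent operating point.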
BAID establishes a scalable, transparent methodology for auditing AI detectors and should be extended to further languages and model architectures as the detection landscape evolves.
Conclusion
BAID provides a systematic, large-scale benchmark for evaluating the fairness of AI-generated text detectors across salient bias axes. The empirical evidence reveals that subgroup disparities—particularly in recall—pose tangible equity risks in detector deployments. Neural architectures perform more stably than statistical ones, but none are immune to bias across all dimensions. This work underscores the necessity for bias-aware evaluation, diverse training data, and architectural innovation to ensure equitable, reliable performance of AI detectors in diverse real-world settings.
References:
- BAID: A Benchmark for Bias Assessment of AI Detectors (2512.11505)
- GPT detectors are biased against non-native English writers (Liang et al., 2023)
- ZipPy: Fast method to classify text as AI- or human-generated
- RADAR: Robust AI-Text Detection via Adversarial Learning (Hu et al., 2023)
- Desklib AI Text Detector v1.01
- E5-small LoRA AI-generated Detector