- The paper introduces a comprehensive benchmark (C-ReD) that incorporates real-world, domain-diverse prompts to improve Chinese AI-generated text detection.
- It employs nine distinct LLMs, combining international and Chinese models, to generate challenging data for both zero-shot and supervised detection methods.
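- Fine-tuning on C-ReD raises supervised-detector AUROC (RoBERTa-large stays above 0.96 across test domains and generators), while the results also show that detection difficulty depends strongly on domain and generator, especially for advanced, coherent AI-generated content.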
C-ReD: A Benchmark Addressing Chinese AI-Text Detection with Real-World Prompt Diversity
Motivation and Context
The proliferation of LLMs capable of producing fluent, context-sensitive Chinese text has intensified risks associated with AI-generated content in high-stakes and open-domain applications, including phishing, plagiarism, and disinformation campaigns. Chinese detection research is hampered by three primary deficiencies in existing benchmarks: (1) insufficient generator diversity, with datasets typically featuring only ChatGPT-based outputs and neglecting the rapidly growing ecosystem of domestic LLMs; (2) domain and prompt homogeneity, often restricted to simple QA-style content; and (3) unrealistic application scenarios, due to artificial or template-driven prompt construction. As a result, the literature lacks a robust testbed that reflects the heterogeneity of real-world usage in the Chinese linguistic context.
C-ReD Dataset Construction
C-ReD introduces systematic advances along several axes of benchmark design. Human-authored text is sourced from five prominent domains where LLM deployment is operationally significant: news, Q&A, film reviews, composition (Gaokao essays), and academic writing, with an additional set for Traditional Chinese news. Texts are aligned for length and register through comprehensive preprocessing.
AI-generated counterparts are produced using nine LLMs: four state-of-the-art international models (GPT-3.5-turbo, GPT-4o, Gemini-2.5-Flash, Claude-3.5-Haiku) and five leading Chinese commercial and open-weight models (Deepseek-V3, Deepseek-R1 (CoT-mode), Qwen2.5, Qwen3, Doubao-1.5-Pro). All models are accessed exclusively via API in a black-box setup that mirrors actual cloud inference constraints.
Prompt design is domain-adapted and constructed through iterative LLM-assisted development, with manual quality control and targeted variation across genre and stylistic axes, so that prompts match realistic deployment conditions. An extensive automated and manual filtering pipeline standardizes both human and AI corpora, enforcing constraints on text length, semantic homogeneity, and linguistic purity. The result is a dataset of ~129k samples with balanced source/model representation and a unified schema supporting robust attribution and provenance analysis.
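The filtering step can be illustrated with a minimal sketch. The thresholds, the `text` field name, and the use of a CJK character ratio as a proxy for linguistic purity are all assumptions made for illustration; the summary does not state the actual rules, and the semantic homogeneity check is not modeled here.

```python
import re

# Hypothetical thresholds; the source does not give the actual values.
MIN_CHARS = 50
MAX_CHARS = 2000
MIN_CJK_RATIO = 0.7

# CJK Unified Ideographs (basic block only, as a rough purity proxy).
CJK = re.compile(r"[\u4e00-\u9fff]")


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK ideographs."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if CJK.match(c)) / len(chars)


def keep_sample(text: str) -> bool:
    """Apply the (assumed) length and purity constraints to one text."""
    n = len(text.strip())
    return MIN_CHARS <= n <= MAX_CHARS and cjk_ratio(text) >= MIN_CJK_RATIO


def filter_corpus(samples):
    """Keep samples that pass both checks, dropping exact duplicates."""
    seen = set()
    kept = []
    for sample in samples:
        text = sample["text"].strip()
        if text in seen or not keep_sample(text):
            continue
        seen.add(text)
        kept.append(sample)
    return kept
```

Applying the same function to both the human and the AI corpora is what keeps the two sides comparable in length and script composition.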
Experimental Protocols and Detectors
C-ReD's benchmarking is organized around three evaluation paradigms:
- Zero-shot detectors are tested using statistical methods (Log-Likelihood, Entropy, Log-Rank, LRR, Fast-DetectGPT, Lastde/Lastde++, DNA-DetectLLM), evaluated under fixed reference model conditions, and without model/domain labels at inference.
- Supervised detectors encompass both legacy systems (OpenAI Detector, RoBERTa) and advanced architectures (RADAR, ReMoDetect, IMBD), trained and evaluated both in-distribution (ID; seen generators) and out-of-distribution (OOD; held-out generators).
- LLM-as-Detector protocols leverage prompt-based binary classification using LLMs themselves, and assess the effect of context provision (zero-shot, few-shot, and descriptive stylistic prompts).
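To make the simplest zero-shot statistics listed above concrete, the sketch below computes Log-Likelihood, Entropy, Log-Rank, and LRR from per-token log-probability distributions. It assumes those distributions have already been obtained from some reference model (how they are obtained is out of scope), and it takes LRR to be the ratio of negative total log-likelihood to total log-rank, which is an assumption about the variant used in the benchmark. The function and key names are hypothetical.

```python
import math


def zero_shot_scores(log_probs, token_ids):
    """Compute simple zero-shot detection statistics for one text.

    log_probs: one list per position with finite log-probabilities over
               the vocabulary, as produced by a reference model (assumed).
    token_ids: index of the observed token at each position.
    """
    if not token_ids or len(log_probs) != len(token_ids):
        raise ValueError("need one distribution per observed token")

    sum_ll = sum_ent = sum_lr = 0.0
    for dist, tok in zip(log_probs, token_ids):
        # Log-likelihood of the observed token.
        sum_ll += dist[tok]
        # Shannon entropy of the predictive distribution.
        sum_ent += -sum(math.exp(lp) * lp for lp in dist)
        # Rank of the observed token (1 = most likely), then its log.
        rank = 1 + sum(1 for lp in dist if lp > dist[tok])
        sum_lr += math.log(rank)

    n = len(token_ids)
    return {
        "log_likelihood": sum_ll / n,
        "entropy": sum_ent / n,
        "log_rank": sum_lr / n,
        # Assumed LRR form: -sum(log p) / sum(log rank).
        "lrr": -sum_ll / sum_lr if sum_lr > 0 else math.inf,
    }
```

Each statistic yields a single scalar per text, so a detector only needs a threshold, or, for AUROC, no threshold at all.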
Robustness is further assessed via cross-domain generalization (training/testing on different content verticals with fixed generators), prompt complexity ablation (in academic writing), and cross-script transferability (Simplified to Traditional Chinese). Generalization to external datasets (e.g., M4) is directly quantified.
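A cross-domain split with fixed generators might be set up as in the sketch below. The `domain` and `generator` field names, and the convention that human-written samples carry `generator=None`, are hypothetical; the actual C-ReD schema is not given in this summary.

```python
def cross_domain_split(samples, train_domains, test_domains, generators=None):
    """Split samples so that training and test domains do not overlap.

    samples:    dicts with (assumed) keys "domain" and "generator";
                human-written samples are assumed to have generator=None.
    generators: optional set of generator names to keep fixed across both
                splits; human samples are always kept.
    """
    train_domains, test_domains = set(train_domains), set(test_domains)
    if train_domains & test_domains:
        raise ValueError("train and test domains must be disjoint")

    def allowed(sample):
        gen = sample["generator"]
        return gen is None or generators is None or gen in generators

    train = [s for s in samples if s["domain"] in train_domains and allowed(s)]
    test = [s for s in samples if s["domain"] in test_domains and allowed(s)]
    return train, test
```

Holding the generator set fixed isolates the effect of the domain shift from the effect of unseen generators.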
Performance is measured via accuracy and AUROC. AUROC is threshold-independent, so it captures a detector's ranking quality under class imbalance and under generator/domain shifts.
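As a reference for the two metrics, the sketch below computes AUROC through its pairwise-ranking interpretation (the probability that a randomly chosen AI-generated text scores higher than a randomly chosen human text) and accuracy at a fixed threshold. The label convention (1 = AI-generated, 0 = human) and the 0.5 default threshold are assumptions; the quadratic loop is chosen for clarity rather than speed.

```python
def auroc(labels, scores):
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    labels: 1 for AI-generated, 0 for human-written (assumed convention).
    scores: higher means "more likely AI-generated".
    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Fraction of samples classified correctly at a fixed threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(int(p == y) for p, y in zip(preds, labels)) / len(labels)
```

For example, with labels `[0, 0, 1, 1]` and scores `[0.1, 0.4, 0.35, 0.8]`, three of the four positive/negative pairs are ordered correctly, so AUROC is 0.75 regardless of where a decision threshold is placed.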
Results and Empirical Findings
Domain/Generator Influence
Detection difficulty is highly domain- and generator-dependent. The Q&A and Film Review domains exhibit strong statistical divergences between human and machine text, yielding high AUROC for both zero-shot and supervised methods. Conversely, text in the News and especially Academic Writing domains generated by advanced LLMs (e.g., Deepseek-R1, Qwen3) is markedly harder to detect. Outputs from reasoning-intensive, highly coherent generators substantially reduce discriminability, with AUROCs on challenging domains dropping sharply for standard detectors.
Significant between-generator variability is observed: models employing intricate reasoning and human-like rhetorical structures (Deepseek-R1 with CoT) obscure typical AI-generation signatures. Zero-shot detectors relying on architectural priors fail in these cases due to reference-style mismatch. Legacy supervised models (RoBERTa trained on older data) also fail on modern generative distributions, highlighting domain/model obsolescence as a key bottleneck.
Efficacy of Dataset Diversity
Fine-tuning on C-ReD consistently improves both ID and OOD detection performance for supervised detectors. For example, RoBERTa-large's AUROC after fine-tuning is consistently >0.96 on all test domains and generators, including held-out commercial LLMs such as Claude-3.5-Haiku and Gemini-2.5-Flash, indicating improved generalization and reduced overfitting to specific generators or domains. However, minor but persistent gaps between ID and OOD performance indicate that unseen generators with new stylistic baselines remain an open challenge.
Strong transfer performance is further corroborated by significant AUROC increases (in some cases >0.2) on the out-of-domain M4 Chinese dataset (ChatGPT/davinci-003 generations), confirming the practical utility of fine-tuning on C-ReD for real-world deployment.
Domain and Prompt Generalization
Cross-domain generalization is sensitive to both source and target domains. Training on information-dense, less stylized domains (News, Academic Writing) yields the most stable transfer. Domains with a narrow stylistic focus (Film Review, Q&A) generalize poorly, confirming the importance of domain diversity and task realism for robust detection.
Ablation studies reveal that for strong LLMs (GPT-4o, Qwen2.5), prompt complexity has minimal impact on detectability by fine-tuned detectors: the difference in AUROC between original and simplified prompts is <4%, non-significant for most models. This demonstrates that highly capable LLMs can match human stylistic complexity even with minimal prompt guidance, further complicating the detection task.
LLMs as Detectors
Out-of-the-box (zero-shot prompt) LLMs fail at reliable discrimination, often underperforming classical statistical methods. Performance improves substantially with context-augmented and stylistically descriptive prompts, indicating that LLMs encode relevant style features but require explicit cueing to utilize them for detection. Nevertheless, variability remains, particularly on deep-reasoning or structurally complex outputs, limiting the immediate viability of LLMs for fully-automated Chinese AI-text detection.
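The three prompting conditions can be sketched as prompt templates plus a reply parser. Everything in this block is hypothetical: the instruction wording, the stylistic hint, and the function names are illustrative stand-ins, not the prompts used in the paper, and the call to an actual LLM is deliberately left out.

```python
# Hypothetical instruction and style hint; the paper's prompts are not
# reproduced in this summary.
INSTRUCTION = (
    "Decide whether the following Chinese text was written by a human or "
    "generated by an AI model. Answer with exactly one word: Human or AI."
)
STYLE_HINT = (
    "Consider stylistic cues such as uniform sentence rhythm, generic "
    "phrasing, and overly balanced structure."
)


def build_prompt(text, mode="zero_shot", examples=None):
    """Build a detection prompt for one of three (assumed) conditions.

    mode:     "zero_shot", "few_shot", or "descriptive".
    examples: list of (text, label) pairs, used only in few-shot mode.
    """
    parts = [INSTRUCTION]
    if mode == "descriptive":
        parts.append(STYLE_HINT)
    elif mode == "few_shot":
        for ex_text, ex_label in examples or []:
            parts.append(f"Text: {ex_text}\nAnswer: {ex_label}")
    elif mode != "zero_shot":
        raise ValueError(f"unknown mode: {mode}")
    parts.append(f"Text: {text}\nAnswer:")
    return "\n\n".join(parts)


def parse_verdict(reply):
    """Map a model reply to 1 (AI), 0 (human), or None if unparseable."""
    cleaned = reply.strip().lower()
    if cleaned.startswith("human"):
        return 0
    if cleaned.startswith("ai"):
        return 1
    return None
```

Returning `None` for unparseable replies keeps refusals and off-format answers out of the accuracy computation instead of silently counting them as one class.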
Robustness Across Chinese Scripts
Evaluations on Traditional Chinese news yield parallel findings: supervised detectors fine-tuned on C-ReD transfer well, and generator domain remains the critical determinant of success or failure. Under language/script shifts, detection challenges are largely model-intrinsic rather than orthographic.
Limitations and Prospective Directions
C-ReD's coverage is nonetheless bounded: it is monolingual and focused on cataloged use cases. With the pace of LLM development and increasing generator diversity (including adversarial or style-transfer mutations aimed at evading detectors), continual benchmark updates will be necessary. Further improvements should address long-tail writing styles, domain expansion (including code, legal, social media genres), and adversarial robustness. Expansion to multilingual and code-switched data is warranted to meet broader practical demands.
Conclusion
C-ReD provides the most comprehensive Chinese benchmark for AI-generated text detection to date, addressing critical shortfalls of legacy datasets in model diversity, domain realism, and prompt authenticity. Its empirical analysis reveals the centrality of both domain and generator characteristics to detection difficulty, the limits of zero-shot and legacy models under contemporary generative distributions, and the strong practical gains enabled by diverse, realistic training data. C-ReD establishes a rigorous foundation for future Chinese AI-text detection research, development, and evaluation, with direct relevance to security, academic integrity, and information quality control domains.