GPABench2: Benchmark for ChatGPT Content Detection
- GPABench2 is a large-scale dataset structured to differentiate human-written abstracts from GPT-generated text across Computer Science, Physics, and Humanities & Social Sciences.
- It employs systematic prompt engineering with tasks including direct generation, completion from human-seeded text, and post-generation polishing to simulate diverse academic writing scenarios.
- The CheckGPT framework leverages RoBERTa-large embeddings with BiLSTM and attention layers to achieve near-perfect detection accuracy, even under adversarial conditions.
GPABench2 is a large-scale open benchmarking dataset designed for rigorous evaluation of the detectability of ChatGPT-generated content within academic scientific abstracts. It systematically addresses the challenge of distinguishing human-authored text from GPT-3.5-turbo–written, –completed, and –polished text in the context of research literature across multiple disciplines and writing scenarios (Liu et al., 2023).
1. Dataset Composition and Construction
GPABench2 comprises 2,385,000 abstracts, structured to enable fine-grained comparative analysis of writing provenance across three primary domains—Computer Science (CS), Physics (PHX), and Humanities & Social Sciences (HSS). The benchmark is composed of four categories:
- HUM: 150,000 fully human-written abstracts (50,000 per domain), sampled from pre-2019 arXiv (CS, PHX) and SSRN (HSS).
- GPT-WRI: 600,000 abstracts produced directly by ChatGPT from titles (4 prompts × 3 domains × 50,000).
- GPT-CPL: 600,000 GPT completions, where ChatGPT is provided the first half of a human-written abstract and tasked with completing the remainder.
- GPT-POL: 600,000 GPT-polished abstracts, where ChatGPT is invoked to edit and polish human drafts.
An additional advanced prompt test set consists of 435,000 GPT-generated abstracts created using 10 distinct prompting techniques. Each writing scenario (WRI, CPL, POL) uses four zero-shot prompts for balanced domain representation. Data collection employs a pipeline of seeding, prompt engineering, API-based GPT-3.5-turbo generation, strict post-generation filtering (removing abstracts shorter than 50 or longer than 500 words and enforcing coherent completion length), and automatic provenance labeling during dataset construction.
Default training/validation/testing splits use an 80%/20% division per task, domain, and prompt; unified and cross-domain models adhere to this partition, with an option for fine-tuning on 5% of target-domain data in cross-domain transfer assessments.
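The length filter and 80/20 split described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline; the `"text"` field name and the `seed` parameter are hypothetical, while the 50–500-word bounds and the 80% split follow the text:

```python
import random

def word_count(text: str) -> int:
    # Simple whitespace tokenization; the paper's exact counting rule is not specified.
    return len(text.split())

def filter_and_split(abstracts, seed=0, train_frac=0.8):
    """Drop abstracts outside the 50-500 word range, then make an 80/20 split.

    `abstracts` is a list of dicts with a hypothetical "text" field.
    """
    kept = [a for a in abstracts if 50 <= word_count(a["text"]) <= 500]
    rng = random.Random(seed)
    rng.shuffle(kept)
    cut = int(len(kept) * train_frac)
    return kept[:cut], kept[cut:]
```

In the benchmark this partition is applied separately per task, domain, and prompt, so each stratum keeps its own 80/20 ratio.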
2. Benchmark Tasks and Evaluation Protocols
GPABench2 underpins a suite of benchmark classification tasks:
- Binary Classification:
- Task 1: HUM vs. GPT-WRI
- Task 2: HUM vs. GPT-CPL (evaluating only the GPT-completed segment)
- Task 3: HUM vs. GPT-POL
- Multi-class Classification:
- HUM, GPT-WRI, GPT-CPL, GPT-POL.
Evaluation metrics for binary tasks are explicitly defined (with GPT as the positive class):
- Accuracy: $\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}$
- Precision (GPT as positive): $P = \frac{TP}{TP + FP}$
- Recall (True Positive Rate): $R = \frac{TP}{TP + FN}$
- F1 Score: $F_1 = \frac{2PR}{P + R}$
- Area Under ROC Curve (AUC): the area under the TPR-vs-FPR curve across classification thresholds.
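With GPT as the positive class (label 1), the first four metrics reduce to counts over the confusion matrix. A minimal sketch:

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 with GPT (label 1) as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    acc = (tp + tn) / len(y_true)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return acc, prec, rec, f1
```

AUC additionally requires per-example scores rather than hard labels, so it is omitted from this sketch.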
Baselines include hand-engineered linguistic/statistical feature models (NELA: stylometric, complexity, bias, affect, morality, events; readability; POS-specific frequencies) evaluated with Gradient-Boosted Trees over an 80/20 split.
3. Detection Frameworks: CheckGPT and Baselines
The primary detection system, CheckGPT, is a neural architecture leveraging frozen RoBERTa-large representations combined with a task-specific classification head. The processing pipeline consists of:
- Tokenization: Byte-level BPE (max length 512).
- Encoding: Pre-trained RoBERTa-large, yielding contextual token embeddings $H \in \mathbb{R}^{L \times 1024}$ for an input of $L$ tokens.
- Classification Head: Two BiLSTM layers (hidden size 256) followed by hierarchical attention, concatenation, dropout, and a final dense layer with softmax activation for 2-class outputs.
Formally, $\hat{y} = \mathrm{softmax}(W a + b)$, where $a$ is the attended representation, with cross-entropy loss for optimization: $\mathcal{L} = -\sum_{c} y_c \log \hat{y}_c$.
Training utilizes AdamW, cosine-annealing scheduling, mixed-precision scaling, dropout, and early stopping on validation loss. Only the classification head (4M parameters) is updated. Comparative baselines include NELA+GB, fine-tuned BERT, and fine-tuned RoBERTa models.
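The softmax-plus-cross-entropy objective can be illustrated numerically in pure Python; this is a generic sketch of the standard formulation, not the authors' code, and the logits below are illustrative:

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]  # subtract max for numerical stability
    s = sum(exps)
    return [e / s for e in exps]

def cross_entropy(logits, label):
    """Negative log-likelihood of the true class under softmax(logits)."""
    return -math.log(softmax(logits)[label])
```

A confident correct prediction (e.g., logits `[2.0, 0.0]` with label 0) yields a small loss, while the same logits with label 1 yield a large one, which is what drives the gradient updates on the classification head.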
4. Empirical Results
Task- and domain-specific F1 performance from Table 4.1 is summarized below (Prompt 1 only):
| | CS | PHX | HSS |
|---|---|---|---|
| **Task 1 (WRI)** | | | |
| NELA+GB | 0.965 | 0.980 | 0.963 |
| BERT-FT | 0.999 | 0.999 | 0.998 |
| RoBERTa-FT | 0.999 | 0.999 | 0.997 |
| CheckGPT | 0.999 | 1.000 | 0.999 |
| **Task 2 (CPL)** | | | |
| NELA+GB | 0.901 | 0.918 | 0.896 |
| BERT-FT | 0.992 | 0.983 | 0.992 |
| RoBERTa-FT | 0.970 | 0.995 | 0.995 |
| CheckGPT | 0.996 | 0.995 | 0.995 |
| **Task 3 (POL)** | | | |
| NELA+GB | 0.774 | 0.794 | 0.798 |
| BERT-FT | 0.983 | 0.984 | 0.966 |
| RoBERTa-FT | 0.981 | 0.993 | 0.967 |
| CheckGPT | 0.993 | 0.994 | 0.993 |
Key findings: detection is strongest for GPT-WRI (F1 > 0.99), slightly reduced for GPT-CPL, and most challenging for GPT-POL, where feature-based baselines fall below 0.80 while CheckGPT still reaches F1 ≈ 0.99. Human evaluators perform near random chance (49% accuracy without reference; 62% with paired samples). Off-the-shelf detectors (e.g., GPTZero, ZeroGPT, OpenAI's classifier) achieve less than 50% accuracy on this benchmark.
CheckGPT demonstrates high transferability: cross-domain F1 ≥ 0.978 without any data from the target domain, and fine-tuning on 5% of target data yields F1 ≥ 0.97 across all domain pairs. Prompt generalization is strong (TPR ≥ 97.7% on unseen prompts). Adversarial robustness is evaluated via rephrasing, sentence mixing, targeted copyediting, and prompt manipulation; CheckGPT retains TPR > 89% in most cases, though heavy lexical artifact removal (top-10 word substitutions) reduces TPR to 70.5% on Task 3.
5. Robustness and Limitations
GPABench2 provides systematic adversarial stress testing:
| Attack (TPR, %) | Task 1 (CS) | Task 2 (PHX) | Task 3 (HSS) |
|---|---|---|---|
| Rephrase×2 | 100.0 | 99.9 | 99.9 |
| Mix-F₁ | 99.98 | 99.54 | 91.52 |
| CE-Top10 | 89.3 | 88.5 | 72.6 |
| PromptEng | 99.99 | 99.19 | 95.92 |
CheckGPT demonstrates high recall even under aggressive paraphrasing (rephrase ×2), but targeted artifact removal can induce significant performance drops. The framework shows limited efficacy for very short texts (e.g., single-sentence Q&A) and highly informal writing genres.
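The targeted copyediting attack can be sketched as dictionary-driven substitution of words that are disproportionately frequent in GPT output. The substitution table below is a hypothetical illustration; the paper derives its actual top-N list from corpus statistics over GPT versus human abstracts:

```python
# Hypothetical substitution table standing in for the corpus-derived top-N list.
SUBSTITUTIONS = {
    "moreover": "besides",
    "comprehensive": "broad",
    "delve": "dig",
    "furthermore": "also",
    "novel": "new",
}

def copyedit_attack(text: str) -> str:
    """Replace characteristic words to strip surface-level GPT artifacts."""
    out = []
    for word in text.split():
        key = word.lower().strip(".,;:")
        out.append(SUBSTITUTIONS.get(key, word))
    return " ".join(out)
```

Even this crude surface-level edit targets exactly the lexical artifacts the CE-Top10 row measures, which is why removing them costs the detector far more recall than wholesale rephrasing does.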
6. Significance, Recommendations, and Future Directions
GPABench2 constitutes the largest open benchmark for LLM-generated academic text detection, with balanced coverage of composition, completion, and polishing tasks across three scholarly domains and systematic prompt diversification. The CheckGPT detection framework is compact (~4M trainable parameters), model-agnostic, and robust, yet shows some susceptibility to heavy human post-editing and strong artifact removal.
Recommendations and open research directions include:
- Extension of benchmarks to additional LLMs (e.g., GPT-4, multi-model detection).
- Adoption of dynamic domain adaptation to account for emergent prompts and genres.
- Exploration of hybrid approaches combining watermarking and learning-based detectors.
- Investigation into detection resilience under substantive human revision and paraphrase.
A plausible implication is that, while state-of-the-art classifiers effectively distinguish current GPT-3.5 content in long-form domains, future detection may require continuous methodological adaptation to evolving LLM outputs and their integration into increasingly human-like workflows (Liu et al., 2023).