GPABench2: Benchmark for ChatGPT Content Detection

Updated 26 January 2026
  • GPABench2 is a large-scale dataset structured to differentiate human-written abstracts from GPT-generated text across Computer Science, Physics, and Humanities & Social Sciences.
  • It employs systematic prompt engineering with tasks including direct generation, completion from human-seeded text, and post-generation polishing to simulate diverse academic writing scenarios.
  • The CheckGPT framework leverages RoBERTa-large embeddings with BiLSTM and attention layers to achieve near-perfect detection accuracy, even under adversarial conditions.

GPABench2 is a large-scale open benchmarking dataset designed for rigorous evaluation of the detectability of ChatGPT-generated content within academic scientific abstracts. It systematically addresses the challenge of distinguishing human-authored text from GPT-3.5-turbo–written, –completed, and –polished text in the context of research literature across multiple disciplines and writing scenarios (Liu et al., 2023).

1. Dataset Composition and Construction

GPABench2 comprises 2,385,000 abstracts, structured to enable fine-grained comparative analysis of writing provenance across three primary domains: Computer Science (CS), Physics (PHX), and Humanities & Social Sciences (HSS). The core benchmark is formed from four categories:

  • HUM: 150,000 fully human-written abstracts (50,000 per domain), sampled from pre-2019 arXiv (CS, PHX) and SSRN (HSS).
  • GPT-WRI: 600,000 abstracts produced directly by ChatGPT from titles (4 prompts × 3 domains × 50,000).
  • GPT-CPL: 600,000 GPT completions, where ChatGPT is given the first half of a human-written abstract and tasked with completing the remainder.
  • GPT-POL: 600,000 GPT-polished abstracts, where ChatGPT is invoked to edit and polish human drafts.

An additional advanced prompt test set consists of 435,000 GPT-generated abstracts created using 10 distinct prompting techniques. Each writing scenario (WRI, CPL, POL) uses four zero-shot prompts for balanced domain representation. Data collection employs a pipeline of seeding, prompt engineering, API-based GPT-3.5-turbo generation, strict post-generation filtering (removal of abstracts <50 or >500 words, and enforcement of coherent completion length), and automatic provenance labeling by construction.
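The length filter in the pipeline above can be sketched as a simple predicate (the 50/500-word thresholds come from the description; the function name and word-counting convention are illustrative):

```python
def keep_abstract(text: str, min_words: int = 50, max_words: int = 500) -> bool:
    """Return True if the abstract passes the post-generation length filter:
    at least 50 and at most 500 whitespace-delimited words."""
    n_words = len(text.split())
    return min_words <= n_words <= max_words
```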

Default training/validation/testing splits use an 80%/20% division per task, domain, and prompt; unified and cross-domain models adhere to this partition, with an option for fine-tuning on 5% of target-domain data in cross-domain transfer assessments.
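A deterministic per-group 80/20 split of the kind described could look like the following; this is a sketch under stated assumptions, as the authors' exact splitting code is not reproduced in the source:

```python
import random

def split_80_20(items, seed=0):
    """Shuffle deterministically, then return (train, test) with an 80%/20% split.
    Intended to be applied once per (task, domain, prompt) group."""
    rng = random.Random(seed)
    indices = list(range(len(items)))
    rng.shuffle(indices)
    cut = int(0.8 * len(items))
    train = [items[i] for i in indices[:cut]]
    test = [items[i] for i in indices[cut:]]
    return train, test
```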

2. Benchmark Tasks and Evaluation Protocols

GPABench2 underpins a suite of benchmark classification tasks:

  • Binary Classification:
    • Task 1: HUM vs. GPT-WRI
    • Task 2: HUM vs. GPT-CPL (evaluating only the GPT-completed segment)
    • Task 3: HUM vs. GPT-POL
  • Multi-class Classification:
    • HUM, GPT-WRI, GPT-CPL, GPT-POL.

Evaluation metrics for binary tasks are explicitly defined:

  • Accuracy: \(\mathrm{Acc} = \frac{TP + TN}{TP + TN + FP + FN}\)
  • Precision (GPT as positive class): \(\mathrm{Prec} = \frac{TP}{TP + FP}\)
  • Recall (True Positive Rate): \(\mathrm{Rec} = \frac{TP}{TP + FN}\)
  • F1 Score: \(\mathrm{F1} = \frac{2\,\mathrm{Prec}\,\mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}\)
  • Area Under the ROC Curve: \(\mathrm{AUC} = \int_0^1 \mathrm{TPR}\bigl(\mathrm{FPR}^{-1}(t)\bigr)\,dt\)
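The first four metrics reduce to straightforward arithmetic on confusion-matrix counts; a minimal implementation (AUC is omitted because it requires ranked scores rather than counts):

```python
def binary_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    """Compute Accuracy, Precision, Recall, and F1, with GPT as the positive class."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp)
    rec = tp / (tp + fn)
    f1 = 2 * prec * rec / (prec + rec)
    return {"acc": acc, "prec": prec, "rec": rec, "f1": f1}
```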

Baselines include hand-engineered linguistic/statistical feature models (NELA: stylometric, complexity, bias, affect, morality, events; readability; POS-specific frequencies) evaluated with Gradient-Boosted Trees over an 80/20 split.
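For intuition, two of the simplest surface features such baselines draw on (average sentence length and type-token ratio) can be computed as below; the actual NELA feature set is far richer, and these particular features are chosen for illustration only:

```python
import re

def surface_features(text: str) -> dict:
    """Toy stylometric features: average words per sentence and type-token ratio."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = text.lower().split()
    avg_sentence_len = len(words) / len(sentences)
    type_token_ratio = len(set(words)) / len(words)
    return {"avg_sentence_len": avg_sentence_len,
            "type_token_ratio": type_token_ratio}
```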

3. Detection Frameworks: CheckGPT and Baselines

The primary detection system, CheckGPT, is a neural architecture leveraging frozen RoBERTa-large representations combined with a task-specific classification head. The processing pipeline consists of:

  1. Tokenization: Byte-level BPE (max length 512).
  2. Encoding: Pre-trained RoBERTa-large, yielding \(\mathbf{E} = [e_1, \dots, e_n]\) with \(e_i \in \mathbb{R}^{1024}\).
  3. Classification Head: Two BiLSTM layers (hidden size 256) followed by hierarchical attention, concatenation, dropout (\(p = 0.5\)), and a final dense layer with softmax activation for 2-class outputs.

Formally:

\begin{align*}
\mathbf{X} &= \mathrm{Tok}(s) \\
\mathbf{E} &= \mathrm{Enc}(\mathbf{X}) \\
h^{(1)} &= \mathrm{BiLSTM}_1(\mathbf{E}), \quad r^{(1)} = \sum_i \alpha^{(1)}_i\, h^{(1)}_i \\
h^{(2)} &= \mathrm{BiLSTM}_2(h^{(1)}), \quad r^{(2)} = \sum_i \alpha^{(2)}_i\, h^{(2)}_i \\
r &= [r^{(1)} \oplus r^{(2)}] \\
\hat{y} &= \mathrm{Softmax}(W r + b)
\end{align*}

with cross-entropy loss for optimization: \(\mathcal{L} = -\bigl[\, y_0 \log \hat{y}_0 + y_1 \log \hat{y}_1 \,\bigr]\)
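The attention pooling \(r = \sum_i \alpha_i h_i\) used after each BiLSTM layer can be sketched in plain Python; the scoring function here (a dot product with a learned vector \(w\)) is one common parameterization and is an assumption, as the paper's exact attention form is not reproduced in the source:

```python
import math

def attention_pool(h, w):
    """Pool a sequence of hidden states h (list of equal-length vectors) into a
    single vector r = sum_i alpha_i * h_i, with alpha = softmax(w . h_i)."""
    scores = [sum(wj * hj for wj, hj in zip(w, vec)) for vec in h]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    alphas = [e / z for e in exps]
    dim = len(h[0])
    return [sum(alphas[i] * h[i][d] for i in range(len(h))) for d in range(dim)]
```

With all scores equal, the pooled vector reduces to the mean of the hidden states, which is a useful sanity check.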

Training utilizes AdamW (\(\mathrm{LR} = 2 \times 10^{-4}\)), cosine-annealing scheduling, mixed-precision scaling, dropout, and early stopping on validation loss. Only the classification head (~4M parameters) is updated. Comparative baselines include NELA+GB, fine-tuned BERT, and fine-tuned RoBERTa models.
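The cosine-annealing schedule referenced above decays the learning rate from \(2 \times 10^{-4}\) toward a minimum along a half-cosine; a standard formulation is given below (the exact variant used, e.g. with or without warm restarts, is not specified in the source):

```python
import math

def cosine_annealed_lr(step: int, total_steps: int,
                       lr_max: float = 2e-4, lr_min: float = 0.0) -> float:
    """Standard cosine annealing: returns lr_max at step 0, lr_min at total_steps."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * step / total_steps))
```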

4. Empirical Results

Task- and domain-specific F1 performance from Table 4.1 is summarized below (Prompt 1 only):

                CS      PHX     HSS
Task 1 (WRI)
  NELA+GB       0.965   0.980   0.963
  BERT-FT       0.999   0.999   0.998
  RoBERTa-FT    0.999   0.999   0.997
  CheckGPT      0.999   1.000   0.999
Task 2 (CPL)
  NELA+GB       0.901   0.918   0.896
  BERT-FT       0.992   0.983   0.992
  RoBERTa-FT    0.970   0.995   0.995
  CheckGPT      0.996   0.995   0.995
Task 3 (POL)
  NELA+GB       0.774   0.794   0.798
  BERT-FT       0.983   0.984   0.966
  RoBERTa-FT    0.981   0.993   0.967
  CheckGPT      0.993   0.994   0.993

Key findings: detection is strongest for GPT-WRI (F1 > 0.99), slightly lower for GPT-CPL, and hardest for GPT-POL, where baseline F1 drops sharply even though CheckGPT still achieves F1 ≈ 0.99. Human evaluators perform near random chance (49% accuracy without a reference sample; 62% with paired samples). Off-the-shelf detectors (e.g., GPTZero, ZeroGPT, OpenAI’s classifier) achieve less than 50% accuracy on this benchmark.

CheckGPT demonstrates high transferability: cross-domain F1 ≥ 0.978 without data from the target domain; fine-tuning on 5% of target data yields F1 ≥ 0.97 across all pairs. Prompt generalization is strong (TPR ≥ 97.7% when testing unseen prompts). Adversarial robustness is evaluated via rephrasing, sentence mixing, targeted copyediting, and prompt manipulation; CheckGPT retains TPR > 89% in most cases, though heavy lexical artifact removal (10 top-N substitutions) reduces TPR to 70.5% (Task 3).

5. Robustness and Limitations

GPABench2 provides systematic adversarial stress testing:

Attack (TPR, %)   Task 1 (CS)   Task 2 (PHX)   Task 3 (HSS)
Rephrase ×2           100.0          99.9           99.9
Mix-F₁                 99.98         99.54          91.52
CE-Top10               89.3          88.5           72.6
PromptEng              99.99         99.19          95.92

CheckGPT demonstrates high recall even under aggressive paraphrasing (rephrase ×2), but targeted artifact removal can induce significant performance drops. The framework shows limited efficacy for very short texts (e.g., single-sentence Q&A) and highly informal writing genres.
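The targeted-copyediting attack (CE-Top10) can be sketched as replacing the most frequent machine-flavored words in a text with human-preferred alternatives; the substitution lexicon below is purely hypothetical, and the paper's actual word lists and editing procedure may differ:

```python
from collections import Counter

def copyedit_top_n(text: str, substitutions: dict, n: int = 10) -> str:
    """Replace occurrences of the n most frequent substitutable words in `text`
    with their alternatives from `substitutions` (a hypothetical lexicon
    mapping lowercase GPT-signature words to human-preferred replacements)."""
    words = text.split()
    freq = Counter(w.lower() for w in words if w.lower() in substitutions)
    targets = {w for w, _ in freq.most_common(n)}
    return " ".join(substitutions[w.lower()] if w.lower() in targets else w
                    for w in words)
```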

6. Significance, Recommendations, and Future Directions

GPABench2 constitutes the largest open benchmark for LLM-generated academic text detection, with parity across composition, completion, and polishing tasks, three scholarly domains, and systematic prompt diversification. The CheckGPT detection framework is compact (~4M trainable parameters), model-agnostic, and robust, yet demonstrates some susceptibility to high-degree human post-editing and strong artifact removal.

Recommendations and open research directions include:

  • Extension of benchmarks to additional LLMs (e.g., GPT-4, multi-model detection).
  • Adoption of dynamic domain adaptation to account for emergent prompts and genres.
  • Exploration of hybrid approaches combining watermarking and learning-based detectors.
  • Investigation into detection resilience under substantive human revision and paraphrase.

A plausible implication is that, while state-of-the-art classifiers effectively distinguish current GPT-3.5 content in long-form domains, future detection may require continuous methodological adaptation to evolving LLM outputs and their integration into increasingly human-like workflows (Liu et al., 2023).
