DetectGPT: Zero-shot AI Text Detection
- DetectGPT is a zero-shot AI text detector that leverages the local curvature of language model log-probabilities to differentiate machine-generated from human-authored text.
- It evaluates semantic perturbations of passages to calculate a normalized detection score that reflects the statistical signature of AI-generated content.
- The method demonstrates high AUROC performance across domains while facing challenges like computational cost, model dependency, and vulnerability to evasion techniques.
DetectGPT is a zero-shot machine-generated text detector that leverages the local curvature of an LLM’s log-probability surface to distinguish between LLM-generated and human-authored text. The method is based on the empirical observation that, for a given LLM, its own generations tend to reside in regions of high negative curvature with respect to the model's probability function, while human-written text is more likely to occupy regions with lower or non-negative curvature. DetectGPT does not require supervised classifier training, labeled datasets, or watermarking; it operates by sampling random semantic perturbations of a passage and evaluating the relative likelihoods under the model of interest. Since its introduction, DetectGPT has served as a canonical baseline for zero-shot LLM text detection and has influenced a large body of subsequent research targeting robustness, efficiency, and evasion resistance in AI-generated text detection frameworks.
1. Theoretical Foundations: Probability Curvature and the Core Algorithm
DetectGPT’s core hypothesis is that, for an LLM $p_\theta$, a sample $x \sim p_\theta$ (i.e., generated by the model) will tend to occupy a local maximum of $\log p_\theta$ in the neighborhood of semantically similar passages. Mathematically, the detection criterion is defined as the perturbation discrepancy:

$$\mathbf{d}(x, p_\theta, q) = \log p_\theta(x) - \mathbb{E}_{\tilde{x} \sim q(\cdot \mid x)}\big[\log p_\theta(\tilde{x})\big],$$

where $q(\cdot \mid x)$ is a perturbation distribution generating semantically similar variants of $x$ (e.g., by masked-LLM span filling). If $\mathbf{d}(x, p_\theta, q)$ is large, $x$ is considered likely to be machine-generated.
This discrepancy closely approximates the negative trace of the Hessian of $\log p_\theta$ at $x$ (i.e., the local curvature), a property the authors relate via Hutchinson’s trace estimator:

$$\operatorname{tr}(H) = \mathbb{E}_{z \sim \mathcal{N}(0, I)}\big[z^\top H z\big],$$

so that, under a Gaussian perturbation model, the expected discrepancy is proportional to $-\operatorname{tr}\!\big(\nabla^2_x \log p_\theta(x)\big)$. Thus, DetectGPT operationalizes the detection task as identifying whether a passage is situated at a local maximum (negative curvature) of the LLM’s log-likelihood landscape.
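Hutchinson's estimator itself is easy to verify numerically. The toy check below (not code from the paper; the matrix and probe count are illustrative, and Rademacher probes are used in place of Gaussians, which is equally valid since they satisfy $\mathbb{E}[zz^\top] = I$) confirms that averaging $z^\top H z$ over random probes recovers the trace:

```python
import numpy as np

# Hutchinson's estimator: tr(H) = E_z[z^T H z] for probes with E[z z^T] = I.
rng = np.random.default_rng(0)
H = np.array([[2.0, 0.5],
              [0.5, -1.0]])        # a symmetric "Hessian" with trace 1.0

z = rng.choice([-1.0, 1.0], size=(100_000, 2))   # Rademacher probe vectors
estimate = np.einsum("ni,ij,nj->n", z, H, z).mean()

print(estimate)   # concentrates around tr(H) = 1.0
```

In DetectGPT the role of the probes is played by the semantic perturbations $\tilde{x}$, which is why the perturbation discrepancy tracks the local curvature of the log-likelihood surface.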
2. Practical Workflow and Implementation
The practical DetectGPT workflow includes the following steps:
- Perturbation Sampling: Generate $k$ semantic perturbations $\tilde{x}_1, \dots, \tilde{x}_k$ of the candidate passage $x$ via random mask-and-fill operations (typically 2-word spans masked until about 15% of the passage is covered), utilizing models like T5-3B.
- Likelihood Evaluation: Compute the log-probability of $x$ and of each $\tilde{x}_i$ under the target LLM $p_\theta$.
- Score and Standardization: Calculate
  - the mean perturbation log-probability $\tilde{\mu} = \frac{1}{k}\sum_{i=1}^{k} \log p_\theta(\tilde{x}_i)$,
  - the sample standard deviation $\tilde{\sigma}$ of $\{\log p_\theta(\tilde{x}_i)\}_{i=1}^{k}$,
  - the normalized detection statistic $\hat{d}(x) = \big(\log p_\theta(x) - \tilde{\mu}\big) / \tilde{\sigma}$.
- Thresholding: Classify $x$ as model-generated if $\hat{d}(x) > \epsilon$ for a chosen threshold $\epsilon$ (typically determined by validation or ROC analysis).
DetectGPT can be implemented using publicly available masked LLMs (for perturbations) and LLMs with access to log-probabilities (for scoring). Approximate pseudocode:
```python
perturbations = [mask_and_fill(x) for _ in range(k)]    # e.g., T5 span filling
perturb_logprobs = [log_p(model, x_tilde) for x_tilde in perturbations]
mean_perturb = mean(perturb_logprobs)
std_perturb = std(perturb_logprobs)
score = (log_p(model, x) - mean_perturb) / std_perturb
return "Model-generated" if score > threshold else "Human-written"
```
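The mask-and-fill step can be made concrete as follows. This is an illustrative sketch of the masking setup described above (2-word spans, ~15% coverage), not the paper's code; `mask_spans` and its parameters are stand-ins, producing T5-style `<extra_id_n>` sentinel inputs that a masked LLM would then fill:

```python
import random

def mask_spans(text, span_len=2, mask_frac=0.15, seed=0):
    """Replace random 2-word spans with T5 sentinel tokens until roughly
    mask_frac of the words are masked (illustrative parameters)."""
    rng = random.Random(seed)
    words = text.split()
    target = max(1, int(len(words) * mask_frac))
    masked = sentinel = attempts = 0
    while masked < target and attempts < 10 * len(words):
        attempts += 1
        start = rng.randrange(0, len(words) - span_len + 1)
        span = words[start:start + span_len]
        if any(w is None or w.startswith("<extra_id_") for w in span):
            continue                  # avoid overlapping an existing mask
        words[start] = f"<extra_id_{sentinel}>"
        for j in range(start + 1, start + span_len):
            words[j] = None           # consumed by the span mask
        sentinel += 1
        masked += span_len
    return " ".join(w for w in words if w is not None)

masked_text = mask_spans(
    "the quick brown fox jumps over the lazy dog near the river bank at dawn")
```

Each call with a different seed yields a different masked input, so repeated fills by the perturbation model produce the sample of semantically similar variants that the discrepancy score averages over.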
3. Empirical Performance and Evaluation
DetectGPT was evaluated across a variety of public datasets (e.g., XSum news, SQuAD Wikipedia, WritingPrompts, PubMedQA, multilingual WMT16). It targets outputs from a range of LLMs (GPT-2, OPT-2.7B, GPT-Neo-2.7B, GPT-J, GPT-NeoX, GPT-3, Jurassic-2). Key findings:
- Superior AUROC: Notably outperformed prior zero-shot detectors and strong unsupervised baselines (e.g., average token log-probability, rank, entropy), e.g., improving GPT-NeoX detection on XSum from 0.81 to 0.95 AUROC.
- Domain Generalization: Maintained top performance across news, creative writing, question answering, and biomedical data, in contrast to supervised detectors that degrade under domain or LLM shifts.
- Minimal Impact from Decoding Strategy Variance: Robust to different LLM sampling methods (top-$k$, top-$p$), paraphrasing, and moderate textual editing (up to 25% content replaced).
Performance stabilizes with 100 perturbations per passage. DetectGPT is robust to domain, genre, or language shifts and does not require parallel corpora.
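AUROC, the metric used throughout these comparisons, has a simple probabilistic reading: the chance that a randomly chosen machine-generated passage receives a higher detection score than a randomly chosen human-written one. A minimal pairwise implementation (illustrative, not the paper's evaluation code) makes this concrete:

```python
def auroc(machine_scores, human_scores):
    """Exact AUROC by pairwise comparison; ties count as 0.5."""
    pairs = len(machine_scores) * len(human_scores)
    wins = sum((m > h) + 0.5 * (m == h)
               for m in machine_scores for h in human_scores)
    return wins / pairs

print(auroc([0.9, 0.8, 0.4], [0.5, 0.3, 0.1]))   # 8 of 9 pairs ranked correctly
```

An AUROC of 0.5 corresponds to chance-level ranking, which is why a drop from 0.95 toward 0.5–0.6 under model mismatch or evasion (Sections 4–6) indicates near-total loss of detection power.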
4. Strengths, Limitations, and Evasion Vulnerabilities
Strengths
- Zero-shot operation: No supervised training or LLM fine-tuning required.
- Generalizable: Consistent performance across domains and generator families, with no dependence on a training distribution.
- No labeled data or watermarking required: Uses only LLM probability outputs and public perturbation models.
Limitations
- Computational cost: Requires $k+1$ forward passes of the scoring LLM per passage (e.g., $k \approx 100$ perturbations for each evaluation).
- White-box dependency: Highest accuracy if the scoring model matches the generation model; degraded accuracy in black-box or model-mismatched scenarios.
- Likelihood access: Infeasible if log-probabilities are not exposed (e.g., restricted commercial APIs); limits utility for models like ChatGPT with black-box APIs.
- Perturbation model quality: Detection power depends on the ability of the perturbation model (e.g., T5) to produce semantically similar, fluent alternatives.
- Vulnerability to paraphrasing: Adversarial paraphrasing can reduce DetectGPT detection rates from over 70% to roughly 5% at a fixed 1% FPR while maintaining high semantic overlap (Krishna et al., 2023, Schneider et al., 10 Mar 2025).
- Homoglyph-based attacks: Substituting visually similar Unicode characters can drop accuracy to random chance with as little as 5% modified text due to tokenizer disruption (Creo et al., 17 Jun 2024).
- Limited code detection support: DetectGPT fails for code due to code’s rigid syntax and low variability. Adapted methods using code-specific perturbations and token localization outperform it (Yang et al., 2023, Shi et al., 12 Jan 2024).
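The homoglyph attack noted above is mechanically simple: swapping a few Latin characters for visually identical Cyrillic code points changes the byte sequence a tokenizer sees while leaving the rendered text unchanged. A minimal sketch (the substitution map and rate are illustrative, not those of Creo et al.):

```python
import random

HOMOGLYPHS = {"a": "\u0430", "e": "\u0435", "o": "\u043e"}  # Cyrillic look-alikes

def homoglyph_attack(text, rate=0.05, seed=0):
    """Replace a small fraction of substitutable characters with
    visually identical Unicode homoglyphs."""
    rng = random.Random(seed)
    chars = list(text)
    candidates = [i for i, c in enumerate(chars) if c in HOMOGLYPHS]
    n = max(1, int(len(chars) * rate))
    for i in rng.sample(candidates, min(n, len(candidates))):
        chars[i] = HOMOGLYPHS[chars[i]]
    return "".join(chars)

original = "the model generates fluent text"
attacked = homoglyph_attack(original)   # looks identical, tokenizes differently
```

Because DetectGPT's score is computed over token log-probabilities, even a 5% substitution rate fragments tokenization enough to erase the statistical signature the detector relies on.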
5. Subsequent Advances and Comparative Benchmarks
Efficient Variants and Model-Agnosticism
- Fast-DetectGPT introduces conditional probability curvature, achieving an approximately 340× computational speedup and a roughly 75% relative AUROC improvement by replacing full-passage perturbations with analytical token-level sampling (Bao et al., 2023).
- Bayesian surrogate models reduce LLM query budget with uncertainty-guided selection and interpolation, allowing DetectGPT-level AUROC with orders of magnitude fewer forward passes (Miao et al., 2023).
- Ensemble methods aggregate scores across multiple DetectGPT classifiers (differing in LLMs or scoring models), raising AUROC from ~0.61 (mismatched) to ~0.73 with summary statistics, and up to 0.94 with supervised ensembling, approaching original DetectGPT when base model is unknown (Ong et al., 18 Jun 2024).
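The conditional probability curvature idea can be sketched with numpy: instead of perturbing the text, the observed token log-likelihood is standardized against its analytic mean and variance under the model's own next-token distributions. The function below is a hedged sketch of this style of score, not Bao et al.'s implementation, and the logits matrix in the example is a hypothetical stand-in for real model outputs:

```python
import numpy as np

def conditional_curvature(logits, token_ids):
    """Sketch of a Fast-DetectGPT-style score: z-score of the observed
    log-likelihood against the analytic sampling distribution implied
    by the per-position next-token probabilities."""
    logits = np.asarray(logits, dtype=float)
    logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    probs = np.exp(logp)
    observed = logp[np.arange(len(token_ids)), token_ids].sum()
    mean_t = (probs * logp).sum(axis=-1)                  # E[log p] per position
    var_t = (probs * (logp - mean_t[:, None]) ** 2).sum(axis=-1)
    return (observed - mean_t.sum()) / np.sqrt(var_t.sum())

# Tokens matching the model's own preferences score above the analytic mean
score = conditional_curvature([[2.0, 0.0, 0.0], [0.0, 2.0, 0.0]], [0, 1])
```

Because the mean and variance are computed in closed form from a single forward pass, no perturbation sampling or extra model calls are needed, which is the source of the large speedup.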
Robustness and Latent-Space Detection
- Latent-space detectors targeting event transitions can outperform DetectGPT by up to 31% AUROC, particularly for long-form, narrative, and adversarially-generated contexts (Tian et al., 4 Oct 2024).
- UID-based detectors leverage information density variance and offer domain-agnostic, interpretable statistical frameworks, surpassing DetectGPT by >20% F1 in aggregate evaluations, with ~40% margins in some benchmarks (Venkatraman et al., 2023).
- Domain transfer and cross-domain detection: Supervised ranking models, such as RoBERTa-Ranker with lightweight domain tuning, outpace DetectGPT in cross-domain F1 by 10–20 points and achieve superior performance on both in-domain and out-of-domain LLM outputs (Zhou et al., 17 Oct 2024).
Code Detection Adaptations
- Adapted for code, DetectGPT is outperformed by methods using fill-in-the-middle code perturbation and surrogate code LMs focused on rightmost token probability, achieving up to 86% AUROC, whereas original DetectGPT approaches random performance (Yang et al., 2023).
- Stylized perturbation approaches (e.g., random insertion of spaces/newlines) aligned with syntactic diversity, outperform conventional DetectGPT by leveraging the primary divergence in formatting style between human and machine code (Shi et al., 12 Jan 2024).
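The formatting-style divergence exploited by the stylized approaches can be probed with a trivial perturbation: randomly duplicating existing whitespace leaves code semantics in most languages intact while shifting its surface statistics. A sketch of this idea (the function and its rate parameter are illustrative, not the exact procedure of Shi et al.):

```python
import random

def perturb_whitespace(code, rate=0.2, seed=0):
    """Randomly double existing spaces/newlines. Machine-generated code
    tends to be more uniformly formatted than human-written code, so its
    likelihood under a code LM reacts differently to such edits."""
    rng = random.Random(seed)
    out = []
    for ch in code:
        out.append(ch)
        if ch in " \n" and rng.random() < rate:
            out.append(ch)
    return "".join(out)

perturbed = perturb_whitespace("def add(a, b):\n    return a + b\n")
```

Scoring the original against such formatting-only variants targets exactly the dimension where human and machine code diverge most, which is why these perturbations outperform T5-style semantic masking on code.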
6. Adversarial Countermeasures and Societal Implications
- Post-hoc paraphrasing using custom or strong paraphrasers can reduce DetectGPT’s successful detection rate to below 10%, all while preserving semantic information (Krishna et al., 2023, Schneider et al., 10 Mar 2025).
- RL-based fine-tuning of LLMs targeting classifier evasion can drop detection rates of transformer-based supervised models from >90% to single digits.
- Simple manipulation of generation hyperparameters (e.g., moving to higher temperature) can undermine shallow detector performance.
- Homoglyph and tokenization attacks systematically defeat DetectGPT by disrupting the underlying statistical signature of the text.
Societal consequences include the diminished reliability of standalone detection and increased complexity in tracking AI-originated misinformation, plagiarism, and regulatory compliance. Several works advocate for watermarking, large-scale retrieval systems, or intrinsic model provenance signals to supplement or supersede statistical detection methods (Krishna et al., 2023, Schneider et al., 10 Mar 2025).
7. Summary Table: DetectGPT Properties and Key Benchmarks
| Property | DetectGPT (Original) | Notable Variants | Observed Limitation |
|---|---|---|---|
| Detection Principle | Probability curvature | Conditional curvature, UID | Evaluable only with model log-probabilities |
| AUROC (typical in-domain) | 0.95–0.99 (matched) | Fast-DetectGPT: 0.99+ | 0.6–0.7 if model mismatch, lower for code |
| Query cost per passage | 100 (scoring model) | 1 (Fast-DetectGPT) | Impractical for commercial LLM APIs |
| Black-box detection support | Limited (low AUROC) | Ensembling/Aggregation | Fails for code, limited cross-domain |
| Evasion vulnerability | Paraphrasing: >90% evasion | Retrieval-based defenses more robust | Homoglyphs, RL fine-tuning, paraphrasing |
| Interpretability | Medium | High (UID, latent) | NA |
| Domain transferability | Moderate (text) | High (with fine-tuning) | Low for code, out-of-domain text |
References
- DetectGPT: Zero-Shot Machine-Generated Text Detection using Probability Curvature (Mitchell et al., 2023)
- Fast-DetectGPT: Efficient Zero-Shot Detection of Machine-Generated Text via Conditional Probability Curvature (Bao et al., 2023)
- Does DetectGPT Fully Utilize Perturbation? Bridging Selective Perturbation to Fine-tuned Contrastive Learning Detector would be Better (Liu et al., 1 Feb 2024)
- Efficient Detection of LLM-generated Texts with a Bayesian Surrogate Model (Miao et al., 2023)
- Applying Ensemble Methods to Model-Agnostic Machine-Generated Text Detection (Ong et al., 18 Jun 2024)
- Detecting AI-Generated Texts in Cross-Domains (Zhou et al., 17 Oct 2024)
- GPT-who: An Information Density-based Machine-Generated Text Detector (Venkatraman et al., 2023)
- DetectGPT-SC: Improving Detection of Text Generated by LLMs through Self-Consistency with Masked Predictions (Wang et al., 2023)
- Paraphrasing evades detectors of AI-generated text, but retrieval is an effective defense (Krishna et al., 2023)
- Detection Avoidance Techniques for LLMs (Schneider et al., 10 Mar 2025)
- Detecting Machine-Generated Long-Form Content with Latent-Space Variables (Tian et al., 4 Oct 2024)
- Zero-Shot Detection of Machine-Generated Codes (Yang et al., 2023)
- Between Lines of Code: Unraveling the Distinct Patterns of Machine and Human Programmers (Shi et al., 12 Jan 2024)
- SilverSpeak: Evading AI-Generated Text Detectors using Homoglyphs (Creo et al., 17 Jun 2024)