
Real or Fake? Learning to Discriminate Machine from Human Generated Text (1906.03351v2)

Published 7 Jun 2019 in cs.LG, cs.CL, and stat.ML

Abstract: Energy-based models (EBMs), a.k.a. un-normalized models, have had recent successes in continuous spaces. However, they have not been successfully applied to model text sequences. While decreasing the energy at training samples is straightforward, mining (negative) samples where the energy should be increased is difficult. In part, this is because standard gradient-based methods are not readily applicable when the input is high-dimensional and discrete. Here, we side-step this issue by generating negatives using pre-trained auto-regressive language models. The EBM then works in the residual of the language model, and is trained to discriminate real text from text generated by the auto-regressive models. We investigate the generalization ability of residual EBMs, a prerequisite for using them in other applications. We extensively analyze generalization for the task of classifying whether an input is machine or human generated, a natural task given the training loss and how we mine negatives. Overall, we observe that EBMs can generalize remarkably well to changes in the architecture of the generators producing negatives. However, EBMs exhibit more sensitivity to the training set used by such generators.
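
To make the training scheme concrete, below is a minimal PyTorch sketch of training a residual EBM as a real-vs-fake discriminator: negatives are sampled from a frozen pre-trained autoregressive model, and an energy network learns to score human text lower than machine continuations. The choice of GPT-2 (via HuggingFace transformers) as the generator, the toy LSTM energy network, and the logistic loss are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of residual-EBM training for real-vs-fake text discrimination.
# Assumptions (not from the paper): GPT-2 via HuggingFace `transformers` as the
# negative generator, a toy LSTM energy network, and a logistic loss in energy form.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token           # GPT-2 has no pad token by default
generator = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # frozen; only mines negatives

class EnergyNet(nn.Module):
    """Toy energy function E(x): lower energy = more 'human-like'."""
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.out = nn.Linear(dim, 1)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        h, _ = self.rnn(self.emb(tokens))
        return self.out(h.mean(dim=1)).squeeze(-1)  # one scalar energy per sequence

energy = EnergyNet(tokenizer.vocab_size)
opt = torch.optim.Adam(energy.parameters(), lr=1e-4)

def training_step(human_texts, prefix_len=10, gen_len=40):
    # Positives: human text. Negatives: machine continuations of the same prefixes.
    human = tokenizer(human_texts, return_tensors="pt", padding=True).input_ids
    with torch.no_grad():
        fake = generator.generate(human[:, :prefix_len], max_length=gen_len,
                                  do_sample=True, top_k=50,
                                  pad_token_id=tokenizer.eos_token_id)
    e_real, e_fake = energy(human), energy(fake)
    # Logistic loss with logit = -E(x): pushes real energies down, fake energies up.
    loss = F.softplus(e_real).mean() + F.softplus(-e_fake).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

print(training_step(["The quick brown fox jumps over the lazy dog every single morning."]))
```

Note that only the energy network's parameters are updated while the generator stays frozen, which is what makes the EBM a residual on top of the language model.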

Authors (6)
  1. Anton Bakhtin (16 papers)
  2. Sam Gross (9 papers)
  3. Myle Ott (33 papers)
  4. Yuntian Deng (44 papers)
  5. Marc'Aurelio Ranzato (53 papers)
  6. Arthur Szlam (86 papers)
Citations (152)
