Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection (2401.12326v1)

Published 22 Jan 2024 in cs.CL and cs.AI

Abstract: SemEval-2024 Task 8 introduces the challenge of identifying machine-generated texts from diverse LLMs across various languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. To tackle the task, two methods are applied: 1) traditional machine learning (ML) with natural language processing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, outperform traditional ML methods, with majority voting proving especially effective for identifying machine-generated texts in multilingual settings.
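
The paper itself provides no code here; the snippet below is a minimal sketch of the two ingredients named in the abstract, assuming the Hugging Face transformers and peft libraries: attaching low-rank (LoRA) adapters to a RoBERTa sequence classifier, and combining several models' predictions by majority vote. The rank, alpha, and target modules are illustrative placeholders, not the paper's reported configuration.

```python
# Minimal sketch (not the authors' code): LoRA adapters on RoBERTa for
# binary human-vs-machine text classification, plus majority voting.
from collections import Counter

import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2  # e.g. 0 = human, 1 = machine (Subtask A)
)

# Illustrative LoRA hyperparameters; the paper's exact settings may differ.
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=8,                      # low-rank dimension
    lora_alpha=16,            # scaling factor
    lora_dropout=0.1,
    target_modules=["query", "value"],  # RoBERTa attention projections
)
model = get_peft_model(model, lora_config)  # only adapter weights train

def predict(model, texts):
    """Label a batch of texts with one fine-tuned model."""
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        logits = model(**batch).logits
    return logits.argmax(dim=-1).tolist()

def majority_vote(per_model_preds):
    """Combine several models' label lists by per-example majority vote."""
    return [Counter(votes).most_common(1)[0][0]
            for votes in zip(*per_model_preds)]
```

Training then proceeds as usual (e.g., with the transformers Trainer); only the adapter parameters, a small fraction of the full model, are updated, which is what makes LoRA fine-tuning practical across the task's many languages and domains.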

Authors (6)
  1. Feng Xiong (43 papers)
  2. Thanet Markchom (4 papers)
  3. Ziwei Zheng (10 papers)
  4. Subin Jung (1 paper)
  5. Varun Ojha (21 papers)
  6. Huizhi Liang (14 papers)
Citations (2)