Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection (2401.12326v1)
Abstract: SemEval-2024 Task 8 poses the challenge of identifying machine-generated texts produced by diverse LLMs across multiple languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. Two methods are applied to tackle the task: 1) traditional machine learning (ML) with natural language processing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, outperform traditional ML methods, with majority voting proving especially effective for identifying machine-generated texts in multilingual settings.
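The majority-voting step mentioned in the abstract can be sketched as a hard vote over aligned label predictions. This is a minimal illustration, not the paper's actual pipeline; `model_a`, `model_b`, and `model_c` are hypothetical detector outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by hard majority vote.

    predictions: a list of lists, one inner list of labels per model,
    aligned by example index. Returns one fused label per example.
    """
    fused = []
    for labels in zip(*predictions):
        # Counter.most_common(1) yields the label with the highest count.
        winner, _ = Counter(labels).most_common(1)[0]
        fused.append(winner)
    return fused

# Hypothetical outputs of three detectors on four texts
# (0 = human-written, 1 = machine-generated).
model_a = [1, 0, 1, 1]
model_b = [1, 1, 1, 0]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # → [1, 0, 1, 1]
```

With an odd number of detectors, ties cannot occur in binary classification, which is one reason ensembles of three models are a common choice for this kind of voting.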
- Tom B. Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
- Alexis Conneau et al. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- Mike Conover et al. 2023. Free Dolly: Introducing the world's first truly open instruction-tuned LLM.
- Liam Dugan et al. 2020. RoFT: A tool for evaluating human detection of machine-generated text. arXiv preprint arXiv:2010.03070.
- Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043.
- Biyang Guo et al. 2023. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- Edward J. Hu et al. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- J. Peter Kincaid et al. 1975. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel.
- John Kirchenbauer et al. 2023. A watermark for large language models. arXiv preprint arXiv:2301.10226.
- Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Eric Mitchell et al. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305.
- Niklas Muennighoff et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Brian Scott. 2023. The Gunning's Fog Index (or FOG) readability formula.
- Rexhep Shijaku and Ercan Canhasi. 2023. ChatGPT generated text detection. Unpublished.
- Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. 2023. The science of detecting LLM-generated texts. arXiv preprint arXiv:2303.07205.
- Eduard Tulchinskii et al. 2023. Intrinsic dimension estimation for robust detection of AI-generated texts. arXiv preprint arXiv:2306.04723.
- Yuxia Wang et al. 2023. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. arXiv preprint arXiv:2305.14902.
- Wataru Zaitsu and Mingzhe Jin. 2023. Distinguishing ChatGPT (-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. arXiv preprint arXiv:2304.05534.
- Rowan Zellers et al. 2019. Defending against neural fake news. Advances in Neural Information Processing Systems, 32.
- Semi-supervised URL segmentation with recurrent neural networks pre-trained on knowledge graph entities. arXiv preprint arXiv:2011.03138.
- Feng Xiong
- Thanet Markchom
- Ziwei Zheng
- Subin Jung
- Varun Ojha
- Huizhi Liang