Fine-tuning Large Language Models for Multigenerator, Multidomain, and Multilingual Machine-Generated Text Detection (2401.12326v1)
Abstract: SemEval-2024 Task 8 poses the challenge of identifying machine-generated texts produced by diverse LLMs across multiple languages and domains. The task comprises three subtasks: binary classification in monolingual and multilingual settings (Subtask A), multi-class classification (Subtask B), and mixed text detection (Subtask C). This paper focuses on Subtasks A and B. Each subtask is supported by three datasets for training, development, and testing. Two methods are applied to tackle the task: 1) traditional machine learning (ML) with natural language processing (NLP) for feature extraction, and 2) fine-tuning LLMs for text classification. The results show that transformer models, particularly LoRA-RoBERTa, outperform traditional ML methods, with majority voting proving especially effective for identifying machine-generated texts in multilingual settings.
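The majority-voting step mentioned in the abstract can be sketched as a hard vote over aligned label predictions. This is a minimal illustration, not the paper's actual pipeline; `model_a`, `model_b`, and `model_c` are hypothetical detector outputs:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine per-model label predictions by hard majority vote.

    predictions: a list of lists, one inner list of labels per model,
    aligned by example index. Returns one fused label per example.
    """
    fused = []
    for labels in zip(*predictions):
        # Counter.most_common(1) yields the label with the highest count.
        winner, _ = Counter(labels).most_common(1)[0]
        fused.append(winner)
    return fused

# Hypothetical outputs of three detectors on four texts
# (0 = human-written, 1 = machine-generated).
model_a = [1, 0, 1, 1]
model_b = [1, 1, 1, 0]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # → [1, 0, 1, 1]
```

With an odd number of detectors, ties cannot occur in binary classification, which is one reason ensembles of three models are a common choice for this kind of voting.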
- Tom B. Brown et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
- Tianqi Chen and Carlos Guestrin. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794.
- Alexis Conneau et al. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.
- Mike Conover et al. 2023. Free Dolly: Introducing the world's first truly open instruction-tuned LLM.
- Liam Dugan et al. 2020. RoFT: A tool for evaluating human detection of machine-generated text. arXiv preprint arXiv:2010.03070.
- Sebastian Gehrmann, Hendrik Strobelt, and Alexander M. Rush. 2019. GLTR: Statistical detection and visualization of generated text. arXiv preprint arXiv:1906.04043.
- Biyang Guo et al. 2023. How close is ChatGPT to human experts? Comparison corpus, evaluation, and detection. arXiv preprint arXiv:2301.07597.
- Edward J. Hu et al. 2021. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685.
- J. Peter Kincaid et al. 1975. Derivation of new readability formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) for Navy enlisted personnel.
- John Kirchenbauer et al. 2023. A watermark for large language models. arXiv preprint arXiv:2301.10226.
- Yinhan Liu et al. 2019. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692.
- Eric Mitchell et al. 2023. DetectGPT: Zero-shot machine-generated text detection using probability curvature. arXiv preprint arXiv:2301.11305.
- Niklas Muennighoff et al. 2022. Crosslingual generalization through multitask finetuning. arXiv preprint arXiv:2211.01786.
- Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
- Brian Scott. 2023. The Gunning's Fog Index (or FOG) readability formula.
- Rexhep Shijaku and Ercan Canhasi. 2023. ChatGPT generated text detection. Unpublished.
- Ruixiang Tang, Yu-Neng Chuang, and Xia Hu. 2023. The science of detecting LLM-generated texts. arXiv preprint arXiv:2303.07205.
- Eduard Tulchinskii et al. 2023. Intrinsic dimension estimation for robust detection of AI-generated texts. arXiv preprint arXiv:2306.04723.
- Yuxia Wang et al. 2023. M4: Multi-generator, multi-domain, and multi-lingual black-box machine-generated text detection. arXiv preprint arXiv:2305.14902.
- Wataru Zaitsu and Mingzhe Jin. 2023. Distinguishing ChatGPT (-3.5, -4)-generated and human-written papers through Japanese stylometric analysis. arXiv preprint arXiv:2304.05534.
- Rowan Zellers et al. 2019. Defending against neural fake news. Advances in Neural Information Processing Systems, 32.
- Semi-supervised URL segmentation with recurrent neural networks pre-trained on knowledge graph entities. arXiv preprint arXiv:2011.03138.
- Feng Xiong
- Thanet Markchom
- Ziwei Zheng
- Subin Jung
- Varun Ojha
- Huizhi Liang