How Well Can LLMs Detect and Explain Generated Texts?
- The paper shows that a ternary classification approach improves the LLMs' detection accuracy by an average of 5.6%.
- The paper reveals that explanation quality is undermined by inaccurate features, hallucinations, and flawed reasoning.
- The paper emphasizes the need for improved LLM architectures and fine-tuning to achieve more reliable self-detection and explanation.
Introduction
The research paper explores the ability of LLMs to detect and explain their own generated texts under two classification settings: binary and ternary. It provides comprehensive qualitative and quantitative analyses of the challenges in distinguishing human-generated texts (HGTs) from LLM-generated texts (LGTs). Experiments on six open-source and proprietary LLMs of varying sizes show that self-detection (a model judging texts it generated itself) consistently outperforms cross-detection (a model judging texts generated by other models) across all models evaluated.
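To make the self- versus cross-detection distinction concrete, the sketch below organizes an evaluation loop around it. The data layout, model names, and the `classify` stub are assumptions for illustration, not the paper's implementation; in a real run, `classify` would prompt the detector model.

```python
from collections import defaultdict

# Illustrative corpus: each record pairs a text with the model (or "human") that wrote it.
samples = [
    {"text": "An example passage...", "author": "gpt-4o"},
    {"text": "Another passage...", "author": "human"},
    {"text": "A third passage...", "author": "llama-3-8b"},
]

def classify(detector: str, text: str) -> str:
    """Stub for an LLM call that returns 'LGT' or 'HGT'.

    A real implementation would prompt the detector model; this placeholder
    just keeps the evaluation loop runnable.
    """
    return "LGT"

def evaluate(detector: str) -> dict:
    """Split detection accuracy into self-detection (the detector judging its
    own outputs) and cross-detection (judging other authors' outputs)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        truth = "HGT" if s["author"] == "human" else "LGT"
        mode = "self" if s["author"] == detector else "cross"
        total[mode] += 1
        if classify(detector, s["text"]) == truth:
            correct[mode] += 1
    return {m: correct[m] / total[m] for m in total if total[m]}

print(evaluate("gpt-4o"))
```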
Binary vs. Ternary Classification
The paper delineates two detection tasks: binary classification, in which the model must label each text as either an HGT or an LGT, and ternary classification, which adds an "Undecided" category. Introducing the "Undecided" category yields a statistically significant improvement in both detection accuracy and explanation quality for LGTs: the ternary setting helps the models handle nuanced cases that sit between clearly human and clearly machine-generated text, raising detection performance by an average of 5.6%.
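As a rough illustration of how the two task settings might differ at the prompt level, consider the templates below; their wording is assumed for this summary and is not the paper's exact prompt.

```python
BINARY_PROMPT = (
    "Decide whether the following text was written by a human or by a "
    "large language model. Answer with exactly one label: HGT or LGT.\n\n"
    "Text:\n{text}"
)

# The ternary variant adds an explicit escape hatch for ambiguous cases,
# which is the change the paper credits with the accuracy gain.
TERNARY_PROMPT = (
    "Decide whether the following text was written by a human or by a "
    "large language model. Answer with exactly one label: HGT, LGT, or "
    "Undecided, and briefly explain which textual features support your "
    "choice.\n\n"
    "Text:\n{text}"
)

def build_prompt(text: str, ternary: bool = True) -> str:
    template = TERNARY_PROMPT if ternary else BINARY_PROMPT
    return template.format(text=text)

print(build_prompt("The quick brown fox jumps over the lazy dog."))
```

The only structural change is the extra label, which gives the model a sanctioned way to defer on ambiguous texts instead of forcing a binary call.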
Challenges in Explanation
Despite the noticeable gains on the classification tasks, the paper identifies several challenges in generating accurate explanations for detected texts. LLMs often fail to provide reliable explanations because they rely on inaccurate features, hallucinate, or reason incorrectly, and these issues are especially prevalent in self-detection scenarios.
- Inaccurate Features: LLMs frequently treated unreliable textual features as decisive evidence of authorship, leading to incorrect classifications. This was particularly evident when models labeled complex emotional or logical constructs as inherently human, underestimating the ability of present-day LLMs to mimic such depth.
- Hallucinations: The models sometimes cited non-existent characteristics or misrepresented features of the text, further undermining the reliability of their explanations.
- Incorrect Reasoning: Even when text features were correctly identified, LLMs followed flawed reasoning processes and reached incorrect judgments about a text's origin.
The research also contributes a benchmark dataset with human annotations for evaluating explanation accuracy, in which student annotators manually assessed whether each explanation was correct. Models such as GPT-4o achieved the strongest classification performance yet still hallucinated frequently in their explanations. These results highlight the need to improve the interpretability and reasoning transparency of LLM-based detectors.
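A small sketch of how detection accuracy and explanation correctness might be aggregated from such annotations is given below; the record fields and example values are hypothetical rather than the benchmark's actual schema.

```python
# Each record: the gold label, the model's predicted label, and the
# annotators' verdict on whether the accompanying explanation was sound.
annotations = [
    {"gold": "LGT", "pred": "LGT", "explanation_ok": True},
    {"gold": "HGT", "pred": "LGT", "explanation_ok": False},  # e.g. a hallucinated feature
    {"gold": "LGT", "pred": "Undecided", "explanation_ok": True},
]

detection_acc = sum(r["gold"] == r["pred"] for r in annotations) / len(annotations)

# Assumed convention for this sketch: explanation quality is scored only on
# cases the model actually labeled, since "Undecided" defers rather than
# asserts an origin.
decided = [r for r in annotations if r["pred"] != "Undecided"]
explanation_acc = sum(r["explanation_ok"] for r in decided) / len(decided)

print(f"detection accuracy:   {detection_acc:.2f}")
print(f"explanation accuracy: {explanation_acc:.2f}")
```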
Implications and Future Work
The results underscore the case for ternary rather than binary classification as a way to handle ambiguous texts. The fundamental limitations identified in LLM explainability point to a research path focused on improving explanation reliability, involving improvements not only to LLM architectures but also to fine-tuning techniques. The paper also hints at collaborative LLM systems in which multiple models pool their reasoning capabilities to reduce explanation errors across diverse datasets.
Conclusion
Improvements to LLM-based detectors, particularly the integration of ternary classification and attention to explanation reliability, show promise for practical applications in automated content moderation, academic integrity, and misinformation detection. Future work should prioritize refining explainability and cross-detection capabilities so that the models remain trustworthy and interpretable to their users.