Can tweets predict article retractions? A comparison between human and LLM labelling (2403.16851v2)

Published 25 Mar 2024 in cs.DL, cs.AI, cs.CL, and cs.LG

Abstract: Quickly detecting problematic research articles is crucial to safeguarding the integrity of scientific research. This study explores whether Twitter mentions of retracted articles can signal potential problems with the articles prior to their retraction, potentially serving as an early warning system for scholars. To investigate this, we analysed a dataset of 4,354 Twitter mentions associated with 504 retracted articles. The effectiveness of Twitter mentions in predicting article retractions was evaluated by both manual and LLM labelling. Manual labelling results indicated that 25.7% of tweets signalled problems before retraction. Using the manual labelling results as the baseline, we found that LLMs (GPT-4o-mini, Gemini 1.5 Flash, and Claude-3.5-Haiku) outperformed lexicon-based sentiment analysis tools (e.g., TextBlob) in detecting potential problems, suggesting that automatic detection of problematic articles from social media using LLMs is technically feasible. Nevertheless, since only a small proportion of retracted articles (11.1%) were criticised on Twitter prior to retraction, such automatic systems would detect only a minority of problematic articles. Overall, this study offers insights into how social media data, coupled with emerging generative AI techniques, can support research integrity.

Summary

  • The paper demonstrates that ChatGPT, particularly GPT-4, aligns closely with human judgment in predicting article retractions based on Twitter data.
  • The paper employs a balanced dataset and Coarsened Exact Matching to compare methods including manual labeling, keyword analysis, and classical machine learning models.
  • The paper highlights that while Twitter mentions alone offer limited predictive power, ChatGPT provides a promising tool for enhancing research integrity.

Exploring the Predictive Power of ChatGPT and Twitter Mentions for Article Retraction

Introduction

The landscape of scholarly communication is shifting, with social media playing an increasingly significant role in disseminating and discussing scientific research. Among these platforms, Twitter has emerged as a pivotal channel for scholarly communication, enabling the rapid spread of research findings and fostering discussion among scientists and the broader public. This shift toward social media opens a novel avenue for identifying problematic research articles that may warrant retraction.

Altmetric Research on Retracted Articles

The field of altmetrics, which measures the impact of research beyond traditional citation metrics, has highlighted the potential of Twitter mentions to serve as an early indicator of article retractions. Prior research has shown that retracted articles often attract significant attention on social media, with Twitter a primary venue for such discussions. This attention spans the pre-retraction phase and spikes markedly after retraction announcements, suggesting a link between social media discourse and the visibility of problematic research.

Utilizing ChatGPT for Prediction

Given the sheer volume of social media data, manually analyzing Twitter mentions to predict potential article retractions is impractical. ChatGPT, a state-of-the-art LLM from OpenAI with strong natural language understanding, offers an alternative. This paper investigates ChatGPT's utility in analyzing Twitter mentions of scholarly articles to predict potential retractions. Drawing on a dataset of both retracted and non-retracted articles, the paper evaluates ChatGPT against manual human labeling, keyword identification, and classical machine learning models. A minimal sketch of this kind of LLM-based labelling follows.
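The snippet below illustrates the general idea of prompting an LLM to label a tweet; the prompt wording, label scheme, and model name are illustrative assumptions, not the paper's exact protocol. It assumes the OpenAI Python SDK with an API key available in the OPENAI_API_KEY environment variable.

```python
# Illustrative sketch of LLM-based tweet labelling; the prompt and label
# scheme are assumptions, not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You will be shown a tweet that mentions a scientific article. "
    "Reply with exactly 'problem' if the tweet signals a potential problem "
    "with the article (e.g., flawed methods, suspicious data, misconduct), "
    "and 'no_problem' otherwise."
)

def label_tweet(tweet_text: str, model: str = "gpt-4") -> str:
    """Classify one tweet as 'problem' or 'no_problem'."""
    response = client.chat.completions.create(
        model=model,
        temperature=0,  # deterministic output for repeatable labelling
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": tweet_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()

if __name__ == "__main__":
    print(label_tweet("The statistics in this paper look impossible to reproduce."))
```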

Methodological Framework

Using the Coarsened Exact Matching method, the paper constructed a balanced dataset of retracted and non-retracted articles to ensure comparability. Twitter mentions were filtered for relevance and content richness. Four prediction methods were then compared: manual labeling by human coders, keyword identification based on term frequency-inverse document frequency (TF-IDF) analysis, classical machine learning models (Naive Bayes, Random Forest, Support Vector Machines, and Logistic Regression), and predictions generated by ChatGPT versions 3.5 and 4. Through these methods, the paper assessed the extent to which Twitter mentions can reliably predict article retraction; a sketch of the classical baseline follows.
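The following sketch shows how a TF-IDF-plus-classifier baseline of the kind described above is typically assembled with scikit-learn; the toy data, feature settings, and hyperparameters are assumptions rather than the paper's actual configuration.

```python
# Sketch of the classical baseline: TF-IDF features feeding the four
# classifier families named above. Toy data and settings are assumptions.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data: tweet texts and whether the mentioned article was later retracted.
tweets = [
    "These results look fabricated, the error bars are impossible",
    "Nice overview of the field, recommended read",
    "Figure 2 appears duplicated from an earlier paper",
    "Congrats to the authors on this publication",
]
retracted = [1, 0, 1, 0]  # 1 = retracted article, 0 = control article

models = {
    "naive_bayes": MultinomialNB(),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "svm": LinearSVC(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, clf in models.items():
    # TF-IDF turns each tweet into a weighted bag-of-words vector.
    pipeline = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    pipeline.fit(tweets, retracted)
    print(name, pipeline.predict(["the data in this study seems manipulated"]))
```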

Findings and Implications

The paper's findings illuminate several key points:

  • Limited Predictive Ability of Twitter Mentions: Only a small fraction of retracted articles had Twitter mentions that clearly signaled impending retraction, underscoring the limits of relying solely on Twitter data to predict article retractions.
  • Superior Performance of ChatGPT: Among the evaluated methods, ChatGPT, especially GPT-4, aligned most closely with human judgment, outperforming both keyword identification and the classical machine learning models in predicting article retractions from Twitter mentions (a sketch of how such agreement can be quantified follows this list).
  • Potential Applications for Research Integrity: ChatGPT's ability to produce contextually rich predictions that resonate with human evaluators highlights its potential for early warning systems that flag problematic research articles.
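One common way to quantify how closely model labels track human labels is chance-corrected agreement. The sketch below uses scikit-learn's accuracy and Cohen's kappa on hypothetical label vectors; these are illustrative values, not the paper's data or its exact evaluation metrics.

```python
# Measuring LLM-human agreement on tweet labels (hypothetical values).
from sklearn.metrics import accuracy_score, cohen_kappa_score

# 1 = tweet signals a problem with the article, 0 = it does not.
human_labels = [1, 0, 0, 1, 1, 0, 0, 1]
llm_labels   = [1, 0, 1, 1, 1, 0, 0, 0]

print("accuracy:", accuracy_score(human_labels, llm_labels))
# Cohen's kappa corrects raw agreement for chance, which matters when one
# class (non-problematic tweets) dominates the data.
print("kappa:   ", cohen_kappa_score(human_labels, llm_labels))
```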

Future Directions

While the paper presents a promising avenue for leveraging generative AI in promoting research integrity, it also identifies areas for future exploration. Incorporating a wider array of social media data, spanning platforms like Facebook and Reddit, could enrich the predictive model. Furthermore, comparing ChatGPT's performance with other LLMs may yield insights into optimizing prediction accuracy and reliability.

Conclusions

The integration of generative AI, exemplified by ChatGPT, into the analysis of social media discourse surrounding scholarly articles offers a novel approach to early detection of problematic research. The paper's findings advocate for the adoption of these advanced tools in monitoring research integrity, albeit with an awareness of their limitations and potential for refinement through broader data sources and comparative model analyses.