Detecting fake review buyers using network structure: Direct evidence from Amazon (2410.17507v1)

Published 23 Oct 2024 in cs.SI, econ.GN, physics.soc-ph, and q-fin.EC

Abstract: Online reviews significantly impact consumers' decision-making process and firms' economic outcomes and are widely seen as crucial to the success of online markets. Firms, therefore, have a strong incentive to manipulate ratings using fake reviews. This presents a problem that academic researchers have tried to solve over two decades and on which platforms expend a large amount of resources. Nevertheless, the prevalence of fake reviews is arguably higher than ever. To combat this, we collect a dataset of reviews for thousands of Amazon products and develop a general and highly accurate method for detecting fake reviews. A unique difference between previous datasets and ours is that we directly observe which sellers buy fake reviews. Thus, while prior research has trained models using lab-generated reviews or proxies for fake reviews, we are able to train a model using actual fake reviews. We show that products that buy fake reviews are highly clustered in the product-reviewer network. Therefore, features constructed from this network are highly predictive of which products buy fake reviews. We show that our network-based approach is also successful at detecting fake reviews even without ground truth data, as unsupervised clustering methods can accurately identify fake review buyers by identifying clusters of products that are closely connected in the network. While text or metadata can be manipulated to evade detection, network-based features are more costly to manipulate because these features result directly from the inherent limitations of buying reviews from online review marketplaces, making our detection approach more robust to manipulation.

Summary

The paper demonstrates that network metrics such as degree, eigenvector centrality, and PageRank effectively flag fake review buyers on Amazon.
It employs a random forest classifier as the supervised model, cross-checked against logistic regression, SVM, and XGBoost, and reports high AUC, accuracy, and F1 scores.
Unsupervised K-means clustering on a 65,000-node product network uncovers distinct clusters of suspicious review activity, suggesting the approach can work at scale without labeled data.

Detecting Fake Review Buyers Using Network Structure: Evidence from Amazon

The paper by Sherry He, Brett Hollenbeck, Gijs Overgoor, Davide Proserpio, and Ali Tosyali addresses the identification of fake review buyers on Amazon utilizing network structure analysis. By leveraging a product network formed through shared reviewers, the research introduces a novel approach to discerning fraudulent activity through both supervised and unsupervised learning methodologies.
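
As a minimal sketch of how a product network can be formed from shared reviewers, the snippet below links two products with a weight equal to the number of reviewers they have in common. The toy review pairs and the helper name `build_product_network` are illustrative assumptions, not the paper's data or code.

```python
from collections import defaultdict
from itertools import combinations


def build_product_network(reviews):
    """Return {(product_a, product_b): shared_reviewer_count}.

    `reviews` is a list of (reviewer_id, product_id) pairs.
    Hypothetical helper, written only for illustration.
    """
    products_by_reviewer = defaultdict(set)
    for reviewer, product in reviews:
        products_by_reviewer[reviewer].add(product)

    edge_weights = defaultdict(int)
    for products in products_by_reviewer.values():
        # Each pair of products reviewed by the same person gains one unit of weight.
        for a, b in combinations(sorted(products), 2):
            edge_weights[(a, b)] += 1
    return dict(edge_weights)


# Made-up toy data: reviewers r1 and r2 both reviewed P1 and P2.
reviews = [
    ("r1", "P1"), ("r1", "P2"),
    ("r2", "P1"), ("r2", "P2"), ("r2", "P3"),
    ("r3", "P4"),
]
edges = build_product_network(reviews)
```

In this toy example P1 and P2 end up joined by an edge of weight 2, while P4 has no edges because its only reviewer reviewed nothing else.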

Network Features and their Computation

The research develops several network features that model the relationships between products based on shared reviewers: degree, eigenvector centrality, PageRank, and clustering coefficient. The degree of a product is the total number of reviewers it shares with other products. Its eigenvector centrality is its entry in the eigenvector associated with the largest eigenvalue of the adjacency matrix, so a product scores highly when it is connected to other well-connected products. PageRank modifies eigenvector centrality by normalizing the importance that each neighboring product passes on. The clustering coefficient measures how often a product's neighbors are also connected to one another.
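
The four features can be sketched in plain Python on a small weighted product network. Everything here is an assumption made for illustration: the toy adjacency dictionary, the function names, the damping factor of 0.85, the use of power iteration, and the unweighted form of the clustering coefficient. The summary does not state which exact variants the paper uses.

```python
import math
from itertools import combinations

# Toy symmetric network (made up): weights are shared-reviewer counts.
adj = {
    "P1": {"P2": 2, "P3": 1},
    "P2": {"P1": 2, "P3": 1},
    "P3": {"P1": 1, "P2": 1, "P4": 1},
    "P4": {"P3": 1},
}


def weighted_degree(adj):
    # Sum of shared reviewers with all neighboring products.
    return {node: sum(nbrs.values()) for node, nbrs in adj.items()}


def eigenvector_centrality(adj, iterations=200):
    # Power iteration on (A + I): same leading eigenvector as A,
    # but the shift avoids oscillation on bipartite-like graphs.
    x = {node: 1.0 for node in adj}
    for _ in range(iterations):
        nxt = {i: x[i] + sum(w * x[j] for j, w in nbrs.items())
               for i, nbrs in adj.items()}
        norm = math.sqrt(sum(v * v for v in nxt.values()))
        x = {i: v / norm for i, v in nxt.items()}
    return x


def pagerank(adj, damping=0.85, iterations=200):
    n = len(adj)
    strength = weighted_degree(adj)
    rank = {node: 1.0 / n for node in adj}
    for _ in range(iterations):
        # Isolated nodes spread their rank uniformly over all nodes.
        dangling = sum(rank[j] for j in adj if strength[j] == 0)
        nxt = {}
        for i, nbrs in adj.items():
            # Each neighbor's rank is divided by that neighbor's total weight.
            incoming = sum(rank[j] * w / strength[j] for j, w in nbrs.items())
            nxt[i] = (1 - damping) / n + damping * (incoming + dangling / n)
        rank = nxt
    return rank


def clustering_coefficient(adj):
    # Unweighted local clustering: share of neighbor pairs that are linked.
    result = {}
    for node, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            result[node] = 0.0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        result[node] = 2.0 * links / (k * (k - 1))
    return result


degree = weighted_degree(adj)
eig = eigenvector_centrality(adj)
pr = pagerank(adj)
clust = clustering_coefficient(adj)
```

In practice a graph library would normally be used for networks of realistic size; the hand-written versions are meant only to make each definition concrete.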

Feature Analysis and Classification Models

TF-IDF features are computed from the text, and image similarity features are extracted with ResNet-152, a convolutional neural network. A random forest classifier serves as the primary supervised learning model, with its hyperparameters tuned through random search and grid search with cross-validation.
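
To make the TF-IDF step concrete, here is a minimal sketch of one common variant: term frequency as a share of the document's tokens, multiplied by the natural log of the inverse document frequency. The whitespace tokenizer, the weighting variant, and the example sentences are assumptions for illustration; the summary does not say which TF-IDF configuration the paper uses.

```python
import math
from collections import Counter


def tfidf(documents):
    """Return one {term: weight} dict per document (illustrative variant)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Made-up review snippets.
texts = ["great product great price", "great value", "arrived broken"]
vectors = tfidf(texts)
```

A term such as "great", which appears in most of the toy documents, receives a lower weight than a rarer term such as "broken", which is the behavior TF-IDF is designed to produce.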

Evaluated on an 80/20 train-test split, the classifier achieves high AUC, accuracy, and F1 scores. The area under the ROC curve (AUC) is highlighted as the key metric because it summarizes the trade-off between true-positive and false-positive rates across classification thresholds.
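
The three metrics can be computed by hand on a tiny example. This sketch uses the rank-based definition of AUC (the probability that a randomly chosen positive is scored above a randomly chosen negative, with ties counted as one half) and a 0.5 decision threshold for accuracy and F1. The labels, scores, threshold, and function names are made up for illustration and are not the paper's results.

```python
def roc_auc(labels, scores):
    # Compare every positive score against every negative score.
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def accuracy_and_f1(labels, scores, threshold=0.5):
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    accuracy = sum(1 for y, p in zip(labels, preds) if y == p) / len(labels)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Made-up example: 1 = buys fake reviews, 0 = does not.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]

auc = roc_auc(labels, scores)            # 8 of 9 positive-negative pairs ranked correctly
accuracy, f1 = accuracy_and_f1(labels, scores)
```

Unlike accuracy and F1, the AUC value does not depend on the chosen threshold, which is one reason it is often preferred when comparing classifiers.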

As a robustness check, logistic regression, support vector machine (SVM), and XGBoost classifiers are trained for comparison. Their performance is consistent across the different feature sets, which reinforces the robustness of the detection framework.

Unsupervised Learning Approach

The paper then turns to an unsupervised approach based on K-means clustering. A larger product network of 65,000 nodes is constructed to identify clusters of products with shared review patterns. The clustering algorithm partitions the products into 20 distinct groups, and comparing the groups on standardized feature values reveals clear patterns.
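
The clustering step can be sketched as z-score standardization followed by Lloyd's K-means algorithm. The feature values, the choice of two clusters, the farthest-first initialization, and the function names are all assumptions made to keep the example small and deterministic; the paper's network has 65,000 nodes and 20 clusters.

```python
import math


def standardize(rows):
    # Convert each feature column to z-scores (mean 0, standard deviation 1).
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    stds = [math.sqrt(sum((v - m) ** 2 for v in c) / len(c)) or 1.0
            for c, m in zip(cols, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iterations=100):
    # Farthest-first initialization (an assumption, chosen for determinism).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))

    labels = None
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels, centroids


# Made-up [degree, PageRank] rows: three ordinary products, three dense ones.
features = [
    [2.0, 0.010], [3.0, 0.012], [2.5, 0.011],
    [40.0, 0.090], [42.0, 0.100], [38.0, 0.095],
]
points = standardize(features)
labels, centroids = kmeans(points, k=2)
```

Because the inputs are standardized, each centroid coordinate reads as the number of standard deviations by which a cluster sits above or below the overall mean, which is how clusters with unusually high degree or PageRank stand out.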

The network features, particularly degree, PageRank, and eigenvector centrality, take distinct values in the clusters suspected of containing fake review buyers, which underscores their usefulness for fraud detection.

Implications and Future Directions

For e-commerce platforms, the research offers a practical route to automated detection of fraudulent activity. Combining network and content signals gives a more comprehensive approach to policing digital marketplaces.

On the theoretical side, the work invites further exploration of more complex network structures and of hybrid models that integrate deep learning techniques. Future work may extend the methodology to other e-commerce platforms or adapt it to detect evolving fraudulent tactics.

In summary, this paper contributes significant empirical evidence and methodological innovations to the detection of fake review buyers. By embedding network-based features into machine learning models, it offers a rigorous framework that could be pivotal in advancing automated fraud detection mechanisms within online marketplaces.
