IDTraffickers: An Authorship Attribution Dataset to link and connect Potential Human-Trafficking Operations on Text Escort Advertisements (2310.05484v1)

Published 9 Oct 2023 in cs.CL, cs.CY, and cs.LG

Abstract: Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators.

Summary

The paper introduces IDTraffickers, a new authorship attribution dataset of 87,595 text escort ads and 5,244 vendor labels from Backpage.
It establishes an authorship identification benchmark where a DeCLUTR-small model achieved a macro-F1 score of 0.8656.
An authorship verification benchmark was also established, with the DeCLUTR-small model achieving a mean r-precision score of 0.8852.

The paper introduces IDTraffickers, a new authorship attribution dataset designed to help identify and connect potential human trafficking (HT) vendors through online text escort advertisements. The dataset comprises 87,595 text ads and 5,244 vendor labels collected from the Backpage escort market between December 2015 and April 2016 in the United States.

Key aspects of the dataset and the accompanying experiments include:

Dataset Creation and Preprocessing: The dataset was created by merging the title and description of text ads using the "[SEP]" token. Vendor labels were generated by extracting phone numbers from the ads using the TJBatchExtractor and a CNN-LSTM-CRF classifier. NetworkX was used to create vendor communities based on these phone numbers, with each community assigned a unique label ID. To protect privacy, sensitive information such as phone numbers, email addresses, age details, post IDs, dates, and links were masked.
Authorship Identification Task: The paper establishes an authorship identification benchmark in a closed-set classification environment. A DeCLUTR-small model achieved a macro-F1 score of 0.8656, outperforming other transformer-based classifiers such as DistilBERT, RoBERTa, and GPT-2. The paper attributes DeCLUTR's success to its ability to capture stylometric patterns effectively.
Authorship Verification Task: The research also establishes an authorship verification benchmark using an open-set ranking environment. Style representations extracted from the trained classifier were used to compute cosine similarity between ads. The DeCLUTR-small model achieved a mean r-precision score of 0.8852 in verifying the authorship of escort ads.
Dataset Statistics and Analysis: Compared with existing authorship datasets such as PAN2023 and Reddit-Conversations, the IDTraffickers ads contain punctuation marks, emojis, white space, proper nouns, and numbers at a higher frequency. Wikification analysis also shows a higher proportion of entities related to locations, escort names, and organizations.
Qualitative Analysis: Local word attributions were computed using the DeCLUTR-small classifier to interpret the contribution of each word to its respective vendor prediction. In true positive predictions, similar writing patterns and word attributions supported the conclusion that the ads came from the same vendor. False positive predictions were attributed to significant content and writing style similarities between vendors. Global feature attributions were used to examine the discrepancies in writing styles between different vendors.
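
The dataset creation step above can be illustrated with a minimal sketch. It assumes phone numbers have already been extracted from each ad (the TJBatchExtractor and CNN-LSTM-CRF stages are not reproduced), and it uses a small union-find in plain Python as a stand-in for the connected-components grouping that the paper performs with NetworkX. The function names `merge_ad` and `assign_vendor_labels` are hypothetical and chosen only for illustration.

```python
def merge_ad(title, description):
    """Join title and description with the "[SEP]" token, as described above."""
    return f"{title} [SEP] {description}"


def assign_vendor_labels(ad_phones):
    """Give ads that share phone numbers (directly or transitively) one label.

    ad_phones: list of sets of phone-number strings, one set per ad.
    Returns a list of integer label IDs (None for ads with no phone number).
    Stand-in for a graph connected-components step; not the authors' code.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Phone numbers appearing in the same ad belong to the same community.
    for phones in ad_phones:
        ordered = sorted(phones)
        if ordered:
            find(ordered[0])  # register single-number ads too
        for phone in ordered[1:]:
            union(ordered[0], phone)

    # Map each community root to a compact label ID.
    label_of_root = {}
    labels = []
    for phones in ad_phones:
        if not phones:
            labels.append(None)
            continue
        root = find(min(phones))
        labels.append(label_of_root.setdefault(root, len(label_of_root)))
    return labels
```

In this sketch, two ads that never share a number directly still receive the same label if a third ad links their numbers, which is the behaviour one would expect from grouping by connected components.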

The paper also discusses the limitations, broader impacts, and ethical considerations related to the dataset. It suggests that future research could explore larger architectures, supervised contrastive finetuning, and more dependable explainability approaches to enhance model performance and trustworthiness.
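
To make the verification metric concrete, the following is a minimal sketch of mean r-precision computed over cosine similarities. It assumes every ad serves as a query against all other ads in a single pool; the paper's actual query and index split may differ. The embeddings here are toy vectors, not style representations from the trained classifier, and `mean_r_precision` is a hypothetical helper name.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def mean_r_precision(embeddings, labels):
    """Average r-precision over all queries with at least one relevant item.

    For a query with R same-vendor ads in the pool, r-precision is the
    fraction of the top-R ranked ads that share the query's vendor label.
    """
    scores = []
    for q in range(len(embeddings)):
        others = [i for i in range(len(embeddings)) if i != q]
        r = sum(1 for i in others if labels[i] == labels[q])
        if r == 0:
            continue  # no other ad from this vendor, so the query is skipped
        ranked = sorted(
            others,
            key=lambda i: cosine(embeddings[q], embeddings[i]),
            reverse=True,
        )
        hits = sum(1 for i in ranked[:r] if labels[i] == labels[q])
        scores.append(hits / r)
    return sum(scores) / len(scores) if scores else 0.0
```

Because the cutoff R adapts to each vendor's number of ads, the metric does not penalise vendors with few ads the way a fixed top-k cutoff would.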
