
IDTraffickers: An Authorship Attribution Dataset to link and connect Potential Human-Trafficking Operations on Text Escort Advertisements (2310.05484v1)

Published 9 Oct 2023 in cs.CL, cs.CY, and cs.LG

Abstract: Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators.


Summary

  • The paper introduces IDTraffickers, a new authorship attribution dataset of 87,595 text escort ads and 5,244 vendor labels from Backpage.
  • It establishes an authorship identification benchmark where a DeCLUTR-small model achieved a macro-F1 score of 0.8656.
  • An authorship verification benchmark was also established, with the DeCLUTR-small model achieving a mean r-precision score of 0.8852.

The paper introduces IDTraffickers, a new authorship attribution dataset designed to help identify and connect potential human trafficking (HT) vendors through online text escort advertisements. The dataset comprises 87,595 text ads and 5,244 vendor labels collected from the Backpage escort market between December 2015 and April 2016 in the United States.

Key aspects of the dataset and the accompanying experiments include:

  • Dataset Creation and Preprocessing: The dataset was created by merging the title and description of text ads using the "[SEP]" token. Vendor labels were generated by extracting phone numbers from the ads using the TJBatchExtractor and a CNN-LSTM-CRF classifier. NetworkX was used to create vendor communities based on these phone numbers, with each community assigned a unique label ID. To protect privacy, sensitive information such as phone numbers, email addresses, age details, post IDs, dates, and links were masked.
  • Authorship Identification Task: The paper establishes an authorship identification benchmark using a closed-set classification environment. A DeCLUTR-small model achieved a macro-F1 score of 0.8656, outperforming other transformer-based classifiers such as DistilBERT, RoBERTa, and GPT-2. The paper attributes the DeCLUTR model's success to its ability to capture stylometric patterns.
  • Authorship Verification Task: The research also establishes an authorship verification benchmark using an open-set ranking environment. Style representations extracted from the trained classifier were used to compute cosine similarity between ads. The DeCLUTR-small model achieved a mean r-precision score of 0.8852 in verifying the authorship of escort ads.
  • Dataset Statistics and Analysis: The analysis of the IDTraffickers dataset reveals a higher frequency of punctuation marks, emojis, white spaces, proper nouns, and numbers than in existing authorship datasets such as PAN2023 and Reddit-Conversations. Wikification analysis shows a higher proportion of entities related to locations, escort names, and organizations.
  • Qualitative Analysis: Local word attributions were computed using the DeCLUTR-small classifier to interpret the contribution of each word to its respective vendor prediction. In true positive predictions, the ads showed similar writing patterns and word attributions, consistent with their coming from the same vendor. False positive predictions were attributed to significant content and writing style similarities between vendors. Global feature attributions were used to examine the discrepancies in writing styles between different vendors.
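
The label-generation step described in the dataset bullet above (grouping ads into vendor communities through shared phone numbers) can be illustrated with a minimal sketch. The paper reports using NetworkX for this; the version below swaps in a plain union-find so that it runs with the standard library only, and it should give the same grouping as connected components on an ad-phone graph. The ad IDs, phone numbers, function names, the spaces around "[SEP]", and the handling of ads without any phone number are assumptions made for illustration, not details taken from the paper.

```python
from __future__ import annotations


def merge_ad(title: str, description: str) -> str:
    """Join title and description with the "[SEP]" token (spacing is an assumption)."""
    return f"{title} [SEP] {description}"


def assign_vendor_labels(ad_phones: dict[str, set[str]]) -> dict[str, int]:
    """Give ads that are linked by shared phone numbers the same vendor ID.

    ad_phones maps a (hypothetical) ad ID to the phone numbers extracted from it.
    Ads and phones are nodes; an ad is connected to every phone it mentions.
    Each connected component becomes one vendor label.
    """
    parent: dict[tuple[str, str], tuple[str, str]] = {}

    def find(node):
        # Register unseen nodes as their own root, then walk up to the root.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for ad_id, phones in ad_phones.items():
        ad_node = ("ad", ad_id)
        find(ad_node)  # assumption: an ad with no phone forms its own community
        for phone in phones:
            union(ad_node, ("phone", phone))

    # Number the components in order of first appearance among the ads.
    component_ids: dict[tuple[str, str], int] = {}
    labels: dict[str, int] = {}
    for ad_id in ad_phones:
        root = find(("ad", ad_id))
        if root not in component_ids:
            component_ids[root] = len(component_ids)
        labels[ad_id] = component_ids[root]
    return labels


# Made-up example: a1 and a3 share no number directly but are linked through a2.
example_ads = {
    "a1": {"555-0100"},
    "a2": {"555-0100", "555-0101"},
    "a3": {"555-0101"},
    "a4": {"555-0199"},
}
example_labels = assign_vendor_labels(example_ads)
```

In this toy input, `a1`, `a2`, and `a3` end up in one community because the linkage is transitive, while `a4` gets a label of its own.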

The paper also discusses the limitations, broader impacts, and ethical considerations related to the dataset. It suggests that future research could explore larger architectures, supervised contrastive finetuning, and more dependable explainability approaches to enhance model performance and trustworthiness.
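
For readers unfamiliar with the verification metric, the following is a minimal sketch of R-precision computed over a cosine-similarity ranking. R-precision for a query is the fraction of relevant items among the top R results, where R is the number of relevant items available for that query. The vectors, labels, and function names are hypothetical, and the sketch does not reproduce the paper's actual retrieval pipeline or embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def r_precision(query_vec, query_label, candidates):
    """R-precision for one query against a list of (vector, label) candidates.

    Returns None when no candidate shares the query's label, since R would be 0.
    """
    relevant = sum(1 for _, label in candidates if label == query_label)
    if relevant == 0:
        return None
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c[0]), reverse=True)
    hits = sum(1 for _, label in ranked[:relevant] if label == query_label)
    return hits / relevant


def mean_r_precision(queries, candidates):
    """Average R-precision over all queries that have at least one relevant candidate."""
    scores = [r_precision(vec, label, candidates) for vec, label in queries]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


# Toy 2-d "style embeddings" with made-up vendor labels.
toy_candidates = [
    ([1.0, 0.0], "A"),
    ([0.9, 0.1], "A"),
    ([0.0, 1.0], "B"),
    ([0.1, 0.9], "B"),
]
toy_queries = [
    ([1.0, 0.05], "A"),  # both top-2 neighbours are vendor A
    ([0.6, 0.8], "A"),   # both top-2 neighbours are vendor B
]
toy_score = mean_r_precision(toy_queries, toy_candidates)
```

The first toy query retrieves only same-vendor ads in its top 2 and the second retrieves none, so the mean over the two queries is 0.5.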