IDTraffickers: An Authorship Attribution Dataset to link and connect Potential Human-Trafficking Operations on Text Escort Advertisements (2310.05484v1)
Abstract: Human trafficking (HT) is a pervasive global issue affecting vulnerable individuals, violating their fundamental human rights. Investigations reveal that a significant number of HT cases are associated with online advertisements (ads), particularly in escort markets. Consequently, identifying and connecting HT vendors has become increasingly challenging for Law Enforcement Agencies (LEAs). To address this issue, we introduce IDTraffickers, an extensive dataset consisting of 87,595 text ads and 5,244 vendor labels to enable the verification and identification of potential HT vendors on online escort markets. To establish a benchmark for authorship identification, we train a DeCLUTR-small model, achieving a macro-F1 score of 0.8656 in a closed-set classification environment. Next, we leverage the style representations extracted from the trained classifier to conduct authorship verification, resulting in a mean r-precision score of 0.8852 in an open-set ranking environment. Finally, to encourage further research and ensure responsible data sharing, we plan to release IDTraffickers for the authorship attribution task to researchers under specific conditions, considering the sensitive nature of the data. We believe that the availability of our dataset and benchmarks will empower future researchers to utilize our findings, thereby facilitating the effective linkage of escort ads and the development of more robust approaches for identifying HT indicators.
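The abstract reports two metrics: macro-F1 for closed-set vendor classification and mean R-precision for open-set ranking over style representations. The sketch below shows how each is commonly computed. It is not the paper's evaluation code: the function names, the toy labels, the two-dimensional "embeddings", and the use of cosine similarity for ranking are all assumptions made for illustration.

```python
from math import sqrt


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores over all observed classes."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def cosine(u, v):
    """Cosine similarity between two vectors (assumed ranking function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_r_precision(embeddings, vendors):
    """Mean R-precision with every ad used as a query against all other ads.

    For a query, R is the number of other ads by the same vendor, and
    R-precision is the fraction of the top-R ranked ads sharing that vendor.
    Queries whose vendor has no other ad (R = 0) are skipped.
    """
    scores = []
    for i, (query, vendor) in enumerate(zip(embeddings, vendors)):
        others = [j for j in range(len(embeddings)) if j != i]
        r = sum(1 for j in others if vendors[j] == vendor)
        if r == 0:
            continue
        ranked = sorted(others, key=lambda j: cosine(query, embeddings[j]),
                        reverse=True)
        hits = sum(1 for j in ranked[:r] if vendors[j] == vendor)
        scores.append(hits / r)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: five ads, three vendors.
toy_vendors = ["A", "A", "B", "B", "C"]
toy_predictions = ["A", "B", "B", "B", "C"]
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (0.7, 0.7)]

closed_set_score = macro_f1(toy_vendors, toy_predictions)
open_set_score = mean_r_precision(toy_embeddings, toy_vendors)
```

Macro averaging weights every vendor equally regardless of how many ads they posted, which matters when the vendor distribution is long-tailed. R-precision adapts the cutoff to each vendor's ad count, so vendors with many ads and vendors with few are scored on the same scale.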