Model Attribution in LLM-Generated Disinformation: A Domain Generalization Approach with Supervised Contrastive Learning (2407.21264v2)

Published 31 Jul 2024 in cs.CL

Abstract: Model attribution for LLM-generated disinformation poses a significant challenge in understanding its origins and mitigating its spread. This task is especially challenging because modern LLMs produce disinformation with human-like quality. Additionally, the diversity in prompting methods used to generate disinformation complicates accurate source attribution. These methods introduce domain-specific features that can mask the fundamental characteristics of the models. In this paper, we introduce the concept of model attribution as a domain generalization problem, where each prompting method represents a unique domain. We argue that an effective attribution model must be invariant to these domain-specific features. It should also be proficient in identifying the originating models across all scenarios, reflecting real-world detection challenges. To address this, we introduce a novel approach based on Supervised Contrastive Learning. This method is designed to enhance the model's robustness to variations in prompts and focuses on distinguishing between different source LLMs. We evaluate our model through rigorous experiments involving three common prompting methods: "open-ended", "rewriting", and "paraphrasing", and three advanced LLMs: "Llama 2", "ChatGPT", and "Vicuna". Our results demonstrate the effectiveness of our approach in model attribution tasks, achieving state-of-the-art performance across diverse and unseen datasets.
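To make the training objective concrete, the sketch below shows a supervised contrastive loss of the kind the abstract refers to (Khosla et al., 2020), applied to model attribution: embeddings of generated texts are pulled together when they share a source-LLM label, regardless of which prompting method (domain) produced them. This is a minimal illustrative sketch in PyTorch, not the authors' released code; the `supervised_contrastive_loss` helper, the temperature value, and the toy batch are assumptions for illustration.

```python
# Minimal sketch (assumption: not the authors' implementation) of a supervised
# contrastive loss for model attribution. Labels are source LLMs; each batch
# is expected to mix prompting methods ("domains").
import torch
import torch.nn.functional as F


def supervised_contrastive_loss(embeddings: torch.Tensor,
                                labels: torch.Tensor,
                                temperature: float = 0.1) -> torch.Tensor:
    """SupCon-style loss: anchors are attracted to all other samples with the
    same source-LLM label and repelled from the rest."""
    z = F.normalize(embeddings, dim=1)                  # (N, d) unit vectors
    sim = (z @ z.T) / temperature                       # pairwise similarities
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))     # exclude self-pairs
    # Positives: other samples generated by the same source LLM.
    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - torch.logsumexp(sim, dim=1, keepdim=True)
    # Average log-probability over the positives of each anchor that has any.
    sum_log_prob_pos = log_prob.masked_fill(~pos_mask, 0.0).sum(1)
    pos_counts = pos_mask.sum(1)
    valid = pos_counts > 0
    mean_log_prob_pos = sum_log_prob_pos[valid] / pos_counts[valid]
    return -mean_log_prob_pos.mean()


# Toy usage: 8 texts from 3 hypothetical source LLMs, 16-dim embeddings.
if __name__ == "__main__":
    emb = torch.randn(8, 16, requires_grad=True)
    src = torch.tensor([0, 1, 2, 0, 1, 2, 0, 1])        # e.g. Llama 2 / ChatGPT / Vicuna
    loss = supervised_contrastive_loss(emb, src)
    loss.backward()
    print(float(loss))
```

In practice the embeddings would come from a text encoder fine-tuned jointly with a classification head, so that representations cluster by source model rather than by prompting method.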

Authors (6)
  1. Alimohammad Beigi (6 papers)
  2. Zhen Tan (68 papers)
  3. Nivedh Mudiam (2 papers)
  4. Canyu Chen (26 papers)
  5. Kai Shu (88 papers)
  6. Huan Liu (283 papers)
