A Survey on Responsible Generative AI: What to Generate and What Not (2404.05783v2)

Published 8 Apr 2024 in cs.CY, cs.AI, cs.CL, and cs.CV

Abstract: In recent years, generative AI (GenAI), such as LLMs and text-to-image models, has received significant attention across various domains. However, ensuring that these models generate content responsibly is crucial for their real-world applicability. This raises an interesting question: what should responsible GenAI generate, and what should it not? To answer this question, this paper investigates the practical responsibility requirements of both textual and visual generative models, outlining five key considerations: generating truthful content, avoiding toxic content, refusing harmful instructions, leaking no training-data-related content, and ensuring that generated content is identifiable. Specifically, we review recent advancements and challenges in addressing these requirements. In addition, we discuss and emphasize the importance of responsible GenAI across the healthcare, education, finance, and artificial general intelligence domains. Through a unified perspective on both textual and visual generative models, this paper aims to provide insight into practical safety-related issues and to help the community build responsible GenAI.

  424. Towards the detection of diffusion model deepfakes. arXiv preprint arXiv:2210.14571, 2022.
  425. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  10684–10695, 2022.
  426. Faceforensics++: Learning to detect manipulated facial images. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  1–11, 2019.
  427. Sohini Roychowdhury. Journey of hallucination-minimized generative ai solutions for financial decision makers. In Proceedings of the 17th ACM International Conference on Web Search and Data Mining, pp.  1180–1181, 2024.
  428. Disrupting deepfakes: Adversarial attacks against conditional image translation networks and facial manipulation systems. In Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August 23–28, 2020, Proceedings, Part IV 16, pp.  236–251. Springer, 2020.
  429. Dreambooth: Fine tuning text-to-image diffusion models for subject-driven generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22500–22510, 2023.
  430. Can ai-generated text be reliably detected? arXiv preprint arXiv:2303.11156, 2023.
  431. Hidden trigger backdoor attacks. In Proceedings of the AAAI conference on artificial intelligence, volume 34, pp.  11957–11965, 2020.
  432. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.
  433. Generative ai for transformative healthcare: A comprehensive study of emerging models, applications, case studies and limitations. IEEE Access, 2024.
  434. Ml-leaks: Model and data independent membership inference attacks and defenses on machine learning models. arXiv preprint arXiv:1806.01246, 2018.
  435. Improved techniques for training gans. Advances in neural information processing systems, 29, 2016.
  436. Raising the cost of malicious ai-powered image editing. arXiv preprint arXiv:2302.06588, 2023.
  437. Rome was built in 1776: A case study on factual correctness in knowledge-grounded response generation. arXiv preprint arXiv:2110.05456, 2021.
  438. Social bias frames: Reasoning about social and power implications of language. arXiv preprint arXiv:1911.03891, 2019.
  439. Autextification: automatic text identification. Procesamiento del Lenguaje Natural. Jaén, Spain, 2023.
  440. Self-critiquing models for assisting human evaluators. arXiv preprint arXiv:2206.05802, 2022.
  441. Florian Schmidt. Generalization in generation: A closer look at exposure bias. arXiv preprint arXiv:1910.00292, 2019.
  442. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp.  1350–1361, 2022.
  443. Safe latent diffusion: Mitigating inappropriate degeneration in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  22522–22531, 2023.
  444. De-fake: Detection and attribution of fake images generated by text-to-image generation models. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp.  3418–3432, 2023.
  445. Generative artificial intelligence in finance. FinTech Notes, 2023(006), 2023.
  446. Scalable and transferable black-box jailbreaks for language models via persona modulation. arXiv preprint arXiv:2311.03348, 2023.
  447. Glaze: Protecting artists from style mimicry by {{\{{Text-to-Image}}\}} models. In 32nd USENIX Security Symposium (USENIX Security 23), pp.  2187–2204, 2023.
  448. Towards understanding sycophancy in language models. arXiv preprint arXiv:2310.13548, 2023.
  449. Jailbreak in pieces: Compositional adversarial attacks on multi-modal language models. arXiv preprint arXiv:2307.14539, 2023.
  450. Membership inference attacks against nlp classification models. In NeurIPS 2021 Workshop Privacy in Machine Learning, 2021.
  451. Prompt stealing attacks against text-to-image generation models. arXiv preprint arXiv:2302.09923, 2023.
  452. In-context pretraining: Language modeling beyond document boundaries. arXiv preprint arXiv:2310.10638, 2023.
  453. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. arXiv preprint arXiv:2010.15980, 2020.
  454. Membership inference attacks against machine learning models. In 2017 IEEE symposium on security and privacy (SP), pp.  3–18. IEEE, 2017.
  455. A comprehensive review of generative ai in healthcare. arXiv preprint arXiv:2310.00795, 2023.
  456. On the exploitability of instruction tuning. arXiv preprint arXiv:2306.17194, 2023.
  457. Sponge examples: Energy-latency attacks on neural networks. In 2021 IEEE European symposium on security and privacy (EuroS&P), pp.  212–231. IEEE, 2021.
  458. Retrieval augmentation reduces hallucination in conversation. arXiv preprint arXiv:2104.07567, 2021.
  459. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484–489, 2016.
  460. Aaditya K Singh and DJ Strouse. Tokenization counts: the impact of tokenization on arithmetic in frontier llms. arXiv preprint arXiv:2402.14903, 2024.
  461. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023.
  462. Break it, imitate it, fix it: Robustness by generating human-like attacks. arXiv preprint arXiv:2310.16955, 2023.
  463. Healthprompt: A zero-shot learning paradigm for clinical natural language processing. In AMIA Annual Symposium Proceedings, volume 2022, pp.  972. American Medical Informatics Association, 2022.
  464. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, pp.  2256–2265. PMLR, 2015.
  465. Release strategies and the social impacts of language models. arXiv preprint arXiv:1908.09203, 2019.
  466. Diffusion art or digital forgery? investigating data replication in diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  6048–6058, 2023.
  467. Information leakage in embedding models. In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pp.  377–390, 2020.
  468. Generalized people diversity: Learning a human perception-aligned diversity representation for people images. arXiv preprint arXiv:2401.14322, 2024.
  469. Beyond memorization: Violating privacy via inference with large language models. arXiv preprint arXiv:2310.07298, 2023.
  470. On nmt search errors and model errors: Cat got your tongue? arXiv preprint arXiv:1908.10090, 2019.
  471. Rickrolling the artist: Injecting backdoors into text encoders for text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  4584–4596, 2023.
  472. Detectllm: Leveraging log rank information for zero-shot detection of machine-generated text. arXiv preprint arXiv:2306.05540, 2023a.
  473. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355, 2023b.
  474. Extracting latent steering vectors from pretrained language models. arXiv preprint arXiv:2205.05124, 2022.
  475. Noun-verb based technique of text watermarking using recursive decent semantic net parsers. In International Conference on Natural Computation, pp.  968–971. Springer, 2005.
  476. Vipergpt: Visual inference via python execution for reasoning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  11888–11898, 2023.
  477. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.
  478. Task ambiguity in humans and language models. arXiv preprint arXiv:2212.10711, 2022.
  479. Google Gemini Team. Gemini: A family of highly capable multimodal models, 2023.
  480. The hiding virtues of ambiguity: quantifiably resilient watermarking of natural language text through synonym substitutions. In Proceedings of the 8th workshop on Multimedia and security, pp.  164–174, 2006.
  481. Hugo et al. Touvron. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.
  482. Truth serum: Poisoning machine learning models to reveal their secrets. In Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security, pp.  2779–2792, 2022.
  483. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022.
  484. Systematic evaluation of backdoor data poisoning attacks on image classifiers. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition workshops, pp.  788–789, 2020.
  485. Ring-a-bell! how reliable are concept removal methods for diffusion models? arXiv preprint arXiv:2310.10012, 2023.
  486. Clean-label backdoor attacks. 2018.
  487. Digital watermarking techniques for security applications. In 2016 International Conference on Emerging Trends in Electrical Electronics & Sustainable Energy Systems (ICETEESES), pp.  379–382. IEEE, 2016.
  488. On the state of the art in authorship attribution and authorship verification. arXiv preprint arXiv:2209.06869, 2022.
  489. Authorship attribution for neural text generation. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP), pp.  8384–8395, 2020.
  490. Turingbench: A benchmark environment for turing test in the age of neural text generation. arXiv preprint arXiv:2109.13296, 2021.
  491. Embedding watermarks into deep neural networks. In Proceedings of the 2017 ACM on international conference on multimedia retrieval, pp.  269–277, 2017.
  492. Frustratingly easy edit-based linguistic steganography with a masked language model. arXiv preprint arXiv:2104.09833, 2021.
  493. Med-halt: Medical domain hallucination test for large language models. arXiv preprint arXiv:2307.15343, 2023.
  494. Nur Alya Afikah Usop and Syifak Izhar Hisham. A review of digital watermarking techniques, characteristics and attacks in text documents. Advances in Robotics, Automation and Data Analytics: Selected Papers from iCITES 2020, pp.  256–271, 2021.
  495. Anti-dreambooth: Protecting users from personalized text-to-image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  2116–2127, 2023.
  496. Chatgpt: The transformative influence of generative ai on science and healthcare. Journal of Hepatology, 2023.
  497. A stitch in time saves nine: Detecting and mitigating hallucinations of llms by validating low-confidence generation. arXiv preprint arXiv:2307.03987, 2023.
  498. Yukti Varshney. Attacks on digital watermarks: classification, implications, benchmarks. Int J Emerg Technol (Special Issue NCETST-2017), 8(1):229–235, 2017.
  499. Using artificial intelligence in craft education: crafting with text-to-image generative models. Digital Creativity, 34(1):1–21, 2023.
  500. Attention is all you need. Advances in neural information processing systems, 30, 2017.
  501. Bagm: A backdoor attack for manipulating text-to-image generative models. arXiv preprint arXiv:2307.16489, 2023.
  502. Fairpy: A toolkit for evaluation of social biases and their mitigation in large language models. arXiv preprint arXiv:2302.05508, 2023.
  503. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214, 2023.
  504. Concealed data poisoning attacks on nlp models. arXiv preprint arXiv:2010.12563, 2020.
  505. Poisoning language models during instruction tuning. arXiv preprint arXiv:2305.00944, 2023.
  506. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP), pp.  707–723. IEEE, 2019.
  507. Decodingtrust: A comprehensive assessment of trustworthiness in gpt models. arXiv preprint arXiv:2306.11698, 2023a.
  508. On exposure bias, hallucination and domain shift in neural machine translation. arXiv preprint arXiv:2005.03642, 2020.
  509. Progressive translation: Improving domain robustness of neural machine translation with intermediate sequences. arXiv preprint arXiv:2305.09154, 2023b.
  510. Semantic adversarial attacks via diffusion models. arXiv preprint arXiv:2309.07398, 2023c.
  511. Simac: A simple anti-customization method against text-to-image synthesis of diffusion models. arXiv preprint arXiv:2312.07865, 2023d.
  512. T2iat: Measuring valence and stereotypical biases in text-to-image generation. arXiv preprint arXiv:2306.00905, 2023e.
  513. On the robustness of chatgpt: An adversarial and out-of-distribution perspective. arXiv preprint arXiv:2302.12095, 2023f.
  514. Acquisition of a lexicon for family history information: Bidirectional encoder representations from transformers–assisted sublanguage analysis. JMIR Medical Informatics, 11:e48072, 2023g.
  515. Practical detection of trojan neural networks: Data-limited and data-free cases. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXIII 16, pp.  222–238. Springer, 2020a.
  516. Anti-forgery: Towards a stealthy and robust deepfake disruption attack via adversarial perceptual-aware perturbations. arXiv preprint arXiv:2206.00477, 2022.
  517. Cnn-generated images are surprisingly easy to spot… for now. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  8695–8704, 2020b.
  518. Evaluating data attribution for text-to-image models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  7192–7203, 2023h.
  519. Riga: Covert and robust white-box watermarking of deep neural networks. In Proceedings of the Web Conference 2021, pp.  993–1004, 2021.
  520. Yanqing Wang. Generative ai in operational risk management: Harnessing the future of finance. Operational Risk Management: Harnessing the Future of Finance (May 17, 2023), 2023.
  521. Defending llms against jailbreaking attacks via backtranslation. arXiv preprint arXiv:2402.16459, 2024a.
  522. Do-not-answer: A dataset for evaluating safeguards in llms. arXiv preprint arXiv:2308.13387, 2023i.
  523. Stop reasoning! when multimodal llms with chain-of-thought reasoning meets adversarial images. arXiv preprint arXiv:2402.14899, 2024b.
  524. Dire for diffusion-generated image detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  22445–22455, 2023j.
  525. Alteration-free and model-agnostic origin attribution of generated images. arXiv preprint arXiv:2305.18439, 2023k.
  526. Ryan Webster. A reproducible extraction of training images from diffusion models. arXiv preprint arXiv:2305.08694, 2023.
  527. This person (probably) exists. identity membership attacks against gan generated faces. arXiv preprint arXiv:2107.06018, 2021.
  528. Jailbroken: How does llm safety training fail? arXiv preprint arXiv:2307.02483, 2023a.
  529. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837, 2022.
  530. Simple synthetic data reduces sycophancy in large language models. arXiv preprint arXiv:2308.03958, 2023b.
  531. Jailbreak and guard aligned language models with only few in-context demonstrations. arXiv preprint arXiv:2310.06387, 2023c.
  532. Max Weiss. Deepfake bot submissions to federal public comment websites cannot be distinguished from human submissions. Technology Science, 2019121801, 2019.
  533. Hard prompts made easy: Gradient-based discrete optimization for prompt tuning and discovery. arXiv preprint arXiv:2302.03668, 2023a.
  534. Tree-ring watermarks: Fingerprints for diffusion images that are invisible and robust. arXiv preprint arXiv:2305.20030, 2023b.
  535. Gradient-based language model red teaming. arXiv preprint arXiv:2401.16656, 2024.
  536. Attacking adversarial attacks as a defense. arXiv preprint arXiv:2106.04938, 2021.
  537. Towards efficient adversarial training on vision transformers. In European Conference on Computer Vision, pp.  307–325. Springer, 2022a.
  538. Llmdet: A third party large language models generated text detection tool. In Findings of the Association for Computational Linguistics: EMNLP 2023, pp.  2113–2133, 2023a.
  539. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519, 2023b.
  540. Human preference score: Better aligning text-to-image models with human preference. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  2096–2105, 2023c.
  541. Membership inference attacks against text-to-image generation models. 2022b.
  542. Ask again, then fail: Large language models’ vacillations in judgement. arXiv preprint arXiv:2310.02174, 2023a.
  543. Defending chatgpt against jailbreak attack via self-reminders. Nature Machine Intelligence, pp.  1–11, 2023b.
  544. Measurement-conditioned denoising diffusion probabilistic model for under-sampled medical image reconstruction. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  655–664. Springer, 2022.
  545. Imagereward: Learning and evaluating human preferences for text-to-image generation. Advances in Neural Information Processing Systems, 36, 2024.
  546. Toward effective protection against diffusion-based mimicry through score distillation. In The Twelfth International Conference on Learning Representations, 2023.
  547. Minimalism is king! high-frequency energy-based screening for data-efficient backdoor attacks. IEEE Transactions on Information Forensics and Security, 2024.
  548. Bite: Textual backdoor attacks with iterative trigger injection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.  12951–12968, 2023a.
  549. Backdooring instruction-tuned large language models with virtual prompt injection. In NeurIPS 2023 Workshop on Backdoors in Deep Learning-The Good, the Bad, and the Ugly, 2023b.
  550. Defending against gan-based deepfake attacks via transformation-aware adversarial faces. In 2021 international joint conference on neural networks (IJCNN), pp.  1–8. IEEE, 2021a.
  551. Artfusion: A diffusion model-based style synthesis framework for portraits. Electronics, 13(3):509, 2024a.
  552. Using human feedback to fine-tune diffusion models without any reward model. arXiv preprint arXiv:2311.13231, 2023a.
  553. Fudge: Controlled text generation with future discriminators. arXiv preprint arXiv:2104.05218, 2021.
  554. A new benchmark and reverse validation method for passage-level hallucination detection. arXiv preprint arXiv:2310.06498, 2023b.
  555. Rethinking stealthiness of backdoor attack against nlp models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp.  5543–5557, 2021b.
  556. Dna-gpt: Divergent n-gram analysis for training-free detection of gpt-generated text. arXiv preprint arXiv:2305.17359, 2023c.
  557. Diffmic: Dual-guidance diffusion network for medical image classification. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pp.  95–105. Springer, 2023d.
  558. Mma-diffusion: Multimodal attack on diffusion models. arXiv preprint arXiv:2311.17516, 2023e.
  559. Sneakyprompt: Jailbreaking text-to-image generative models. In 2024 IEEE Symposium on Security and Privacy (SP), pp.  123–123. IEEE Computer Society, 2024b.
  560. Disrupting image-translation-based deepfake algorithms with adversarial attacks. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision Workshops, pp.  53–62, 2020.
  561. Privacy risk in machine learning: Analyzing the connection to overfitting. In 2018 IEEE 31st computer security foundations symposium (CSF), pp.  268–282. IEEE, 2018.
  562. Woodpecker: Hallucination correction for multimodal large language models. arXiv preprint arXiv:2310.16045, 2023a.
  563. Do large language models know what they don’t know? arXiv preprint arXiv:2305.18153, 2023b.
  564. A zero-watermarking scheme for prose writings. In 2017 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery (CyberC), pp.  276–282. IEEE, 2017.
  565. Transformers for multi-label classification of medical text: an empirical comparison. In International Conference on Artificial Intelligence in Medicine, pp.  114–123. Springer, 2021.
  566. Attributing fake images to gans: Learning and analyzing gan fingerprints. In Proceedings of the IEEE/CVF international conference on computer vision, pp.  7556–7566, 2019.
  567. Responsible disclosure of generative models using scalable fingerprinting. arXiv preprint arXiv:2012.08726, 2020.
  568. Artificial fingerprinting for generative models: Rooting deepfake attribution in training data. In Proceedings of the IEEE/CVF International conference on computer vision, pp.  14448–14457, 2021a.
  569. A survey on deepfake video detection. Iet Biometrics, 10(6):607–624, 2021b.
  570. Hallucidoctor: Mitigating hallucinatory toxicity in visual instruction data. arXiv preprint arXiv:2311.13614, 2023a.
  571. Improving language models via plug-and-play retrieval feedback. arXiv preprint arXiv:2305.14002, 2023b.
  572. Reliable evaluation of adversarial transferability. arXiv preprint arXiv:2306.08565, 2023c.
  573. Gpt-4 is too smart to be safe: Stealthy chat with llms via cipher. arXiv preprint arXiv:2308.06463, 2023.
  574. Rigorllm: Resilient guardrails for large language models against undesired content. arXiv preprint arXiv:2403.13031, 2024.
  575. Analyzing information leakage of updates to natural language models. In Proceedings of the 2020 ACM SIGSAC conference on computer and communications security, pp.  363–375, 2020.
  576. Narcissus: A practical clean-label backdoor attack with limited information. In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp.  771–785, 2023.
  577. Autodefense: Multi-agent llm defense against jailbreak attacks. arXiv preprint arXiv:2403.04783, 2024.
  578. Text-to-image diffusion models can be easily backdoored through multimodal data poisoning. In Proceedings of the 31st ACM International Conference on Multimedia, pp.  1577–1587, 2023.
  579. One small step for generative ai, one giant leap for agi: A complete survey on chatgpt in aigc era. arXiv preprint arXiv:2304.06488, 2023a.
  580. Counterfactual memorization in neural language models. arXiv preprint arXiv:2112.12938, 2021.
  581. Protecting intellectual property of deep neural networks with watermarking. In Proceedings of the 2018 on Asia conference on computer and communications security, pp.  159–172, 2018.
  582. On the robustness of latent diffusion models. arXiv preprint arXiv:2306.08257, 2023b.
  583. Adding conditional control to text-to-image diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.  3836–3847, 2023c.
  584. How language model hallucinations can snowball. arXiv preprint arXiv:2305.13534, 2023d.
  585. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022.
  586. Adversarial attacks on deep-learning models in natural language processing: A survey. ACM Transactions on Intelligent Systems and Technology (TIST), 11(3):1–41, 2020.
  587. A mutation-based method for multi-modal jailbreaking attack detection. arXiv preprint arXiv:2312.10766, 2023e.
  588. Siren’s song in the ai ocean: A survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023f.
  589. Explainability for large language models: A survey. ACM Transactions on Intelligent Systems and Technology, 15(2):1–38, 2024.
  590. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. arXiv preprint arXiv:1707.09457, 2017.
  591. Verify-and-edit: A knowledge-enhanced chain-of-thought framework. arXiv preprint arXiv:2305.03268, 2023a.
  592. Clean-label backdoor attacks on video recognition models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp.  14443–14452, 2020.
  593. Prompt as triggers for backdoor attack: Examining the vulnerability in language models. arXiv preprint arXiv:2305.01219, 2023b.
  594. Calibrating sequence likelihood improves conditional language generation. In The Eleventh International Conference on Learning Representations, 2022.
  595. A recipe for watermarking diffusion models. arXiv preprint arXiv:2303.10137, 2023c.
  596. On evaluating adversarial robustness of large vision-language models. arXiv preprint arXiv:2305.16934, 2023d.
  597. Unlearnable examples for diffusion models: Protect data from unauthorized exploitation. arXiv preprint arXiv:2306.01902, 2023e.
  598. Can protective perturbation safeguard personal data from being exploited by stable diffusion? arXiv preprint arXiv:2312.00084, 2023f.
  599. Intriguing properties of data attribution on diffusion models. arXiv preprint arXiv:2311.00500, 2023.
  600. Detecting hallucinated content in conditional neural sequence generation. arXiv preprint arXiv:2011.02593, 2020.
  601. A transformer-based representation-learning model with unified processing of multimodal input for clinical diagnostics. Nature Biomedical Engineering, 7(6):743–755, 2023.
  602. Two-stream neural networks for tampered face detection. In 2017 IEEE conference on computer vision and pattern recognition workshops (CVPRW), pp.  1831–1839. IEEE, 2017.
  603. Duwak: Dual watermarks in large language models. arXiv preprint arXiv:2403.13000, 2024.
  604. Hidden: Hiding data with deep networks. In Proceedings of the European conference on computer vision (ECCV), pp.  657–672, 2018.
  605. Promptbench: Towards evaluating the robustness of large language models on adversarial prompts. arXiv preprint arXiv:2306.04528, 2023.
  606. A pilot study of query-free adversarial attack against stable diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.  2384–2391, 2023.
  607. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023.
Authors (1)
  1. Jindong Gu (101 papers)
Citations (7)