Controlling Language and Diffusion Models by Transporting Activations (2410.23054v2)
Abstract: The increasing capabilities of large generative models and their ever wider deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed controlling model generation by steering model activations to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper, we introduce Activation Transport (AcT), a general framework for steering activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally demonstrate the effectiveness and versatility of our approach on key challenges in large language models (LLMs) and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase truthfulness. For T2Is, we show how AcT enables fine-grained style control and concept negation.
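To make the idea concrete, below is a minimal sketch of transport-based activation steering. It assumes each activation unit is transported independently using the closed-form 1D optimal transport map between Gaussian approximations of the source and target activation distributions (for 1D Gaussians, that map is affine), blended with a strength parameter. The function names, the `lam` parameter, and the Gaussian assumption are illustrative choices for this sketch, not the paper's API or its exact estimator.

```python
# Sketch: per-unit linear activation transport with interpolation strength lam.
import numpy as np

def estimate_transport(src_acts: np.ndarray, tgt_acts: np.ndarray, eps: float = 1e-6):
    """Fit a per-unit affine map T(a) = mu_t + (sigma_t / sigma_s) * (a - mu_s).

    Between two 1D Gaussians, this affine map is the optimal transport map,
    so fitting it per unit gives a cheap, closed-form transport estimate.
    src_acts, tgt_acts: (n_samples, n_units) activations collected at one
    layer from source (e.g., toxic) and target (e.g., non-toxic) prompts.
    """
    mu_s, sigma_s = src_acts.mean(0), src_acts.std(0) + eps
    mu_t, sigma_t = tgt_acts.mean(0), tgt_acts.std(0)
    scale = sigma_t / sigma_s
    shift = mu_t - scale * mu_s
    return scale, shift  # T(a) = scale * a + shift

def apply_transport(acts: np.ndarray, scale: np.ndarray, shift: np.ndarray, lam: float = 1.0):
    """Steer activations at inference time; lam = 0 is the unmodified model,
    lam = 1 applies the full transport map."""
    return (1.0 - lam) * acts + lam * (scale * acts + shift)

# Toy usage: move 4 synthetic units from a "source" toward a "target" distribution.
rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(512, 4))   # stand-in for source activations
tgt = rng.normal(2.0, 0.5, size=(512, 4))   # stand-in for target activations
scale, shift = estimate_transport(src, tgt)
steered = apply_transport(src, scale, shift, lam=0.8)
print(steered.mean(0), steered.std(0))      # approaches (2.0, 0.5) as lam -> 1
```

Because the map is a per-unit affine transform, applying it at inference adds only a multiply-add per activation, which is consistent with the abstract's claim of negligible computational overhead and a single fine-grained control knob (the interpolation strength).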