
Controlling Language and Diffusion Models by Transporting Activations (2410.23054v2)

Published 30 Oct 2024 in cs.LG, cs.AI, cs.CL, and cs.CV

Abstract: The increasing capabilities of large generative models and their ever more widespread deployment have raised concerns about their reliability, safety, and potential misuse. To address these issues, recent works have proposed to control model generation by steering model activations in order to effectively induce or prevent the emergence of concepts or behaviors in the generated output. In this paper we introduce Activation Transport (AcT), a general framework to steer activations guided by optimal transport theory that generalizes many previous activation-steering works. AcT is modality-agnostic and provides fine-grained control over the model behavior with negligible computational overhead, while minimally impacting model abilities. We experimentally show the effectiveness and versatility of our approach by addressing key challenges in LLMs and text-to-image diffusion models (T2Is). For LLMs, we show that AcT can effectively mitigate toxicity, induce arbitrary concepts, and increase their truthfulness. In T2Is, we show how AcT enables fine-grained style control and concept negation.


Summary

  • The paper presents the Activation Transport (AcT) framework to control model activations, reducing toxicity by up to 7.5 times and enhancing truthfulness.
  • It leverages optimal transport maps to minimally adjust activations, preserving inherent model performance across language and image tasks.
  • The method offers a unified, computationally efficient approach with implications for advanced safety, content moderation, and AI alignment in regulated industries.

Overview of "Controlling Language and Diffusion Models by Transporting Activations"

The paper "Controlling Language and Diffusion Models by Transporting Activations" addresses the pertinent issue of controlling generative models to enhance their reliability and prevent misuse. As generative models continue to grow in scale and application, there is significant concern regarding their alignment and control mechanisms, especially given the computational and memory challenges associated with traditional fine-tuning methods. The proposed solution, Activation Transport (AcT), leverages optimal transport theory to provide fine-grained control over model behaviors in a computationally efficient manner.

Activation Transport (AcT) Framework

AcT is a general framework for steering model activations across modalities. By applying optimal transport maps, AcT is designed to interfere minimally with the model's inherent capabilities. The key innovation is treating activation steering as a transport problem: a modality-agnostic strategy that keeps steered activations within the distribution observed during training, thus maintaining model robustness and performance across tasks.
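The following minimal sketch illustrates the idea under a simplifying assumption: each neuron's source and target activation distributions are treated as Gaussian, for which the one-dimensional optimal transport map is affine. The paper estimates its maps from empirical activation samples; the function names and calibration setup here are illustrative, not the authors' implementation.

```python
import torch

def fit_linear_transport(source_acts: torch.Tensor, target_acts: torch.Tensor, eps: float = 1e-6):
    """Fit a per-neuron affine map a -> w*a + b transporting source activations
    onto target activations.

    Sketch only: each neuron is approximated as Gaussian, for which the 1-D
    optimal transport map is affine with w = sigma_t / sigma_s and
    b = mu_t - w * mu_s. Inputs have shape (num_samples, num_neurons).
    """
    mu_s, sigma_s = source_acts.mean(0), source_acts.std(0)
    mu_t, sigma_t = target_acts.mean(0), target_acts.std(0)
    w = sigma_t / (sigma_s + eps)
    b = mu_t - w * mu_s
    return w, b

def transport(acts: torch.Tensor, w: torch.Tensor, b: torch.Tensor, strength: float = 1.0):
    """Interpolate between the original and transported activations.

    strength=0 leaves the model untouched; strength=1 applies the full map.
    """
    return (1.0 - strength) * acts + strength * (w * acts + b)
```

The single `strength` parameter is what gives the method its fine-grained, inference-time control: the same fitted map can be applied more or less aggressively without retraining anything.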

Evaluation and Results

The paper evaluates AcT on key challenges in LLMs and text-to-image diffusion models (T2Is). For LLMs, AcT effectively reduces toxicity, induces arbitrary concepts, and enhances truthfulness. The experiments show that AcT consistently achieves the intended control objectives, outperforming several prior methods: applying AcT to LLMs reduced toxicity by up to 7.5 times, and the method was similarly successful at inducing concepts and at improving performance on the TruthfulQA benchmark.

In the T2I domain, the approach demonstrated fine-grained style control and enabled concept negation. AcT allows direct manipulation of style conditioning in text-to-image generation and outperformed existing methods such as ITI-c by achieving the desired conditioning at consistent transport strengths.
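The toy example below shows how such a transport map might be applied at inference time through a forward hook, with the transport strength acting as the conditioning knob. The steered layer, the identity map used here, and the hook placement are placeholders for illustration, not the paper's actual configuration.

```python
import torch

# Toy demonstration of inference-time steering via a forward hook; a real setup
# would hook selected blocks of an LLM or a diffusion model's denoiser and use
# maps fitted as in the sketch above.
layer = torch.nn.Linear(16, 16)
w, b = torch.ones(16), torch.zeros(16)   # placeholder (identity) transport map
strength = 0.5                            # transport strength (lambda)

def hook(_module, _inputs, output):
    # Interpolate between the original output and its transported counterpart.
    return (1.0 - strength) * output + strength * (w * output + b)

handle = layer.register_forward_hook(hook)
out = layer(torch.randn(2, 16))           # activations are transported on the fly
handle.remove()                           # restore the unmodified model
```

Sweeping the strength from 0 to 1 would, under this scheme, trade off how strongly the target concept or style is expressed against fidelity to the original generation.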

Implications and Future Directions

The implications of AcT are twofold: technical and application-focused. From a technical standpoint, AcT offers a unifying solution that achieves robust model alignment with negligible inference-time overhead. Grounding the method in optimal transport theory adds mathematical rigor, offering more interpretable and reliable control across different models and tasks.
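As one illustration of that rigor, when a neuron's source and target activation distributions are approximated as one-dimensional Gaussians, the optimal transport map has a standard closed form and can be applied with a transport strength λ. The Gaussian approximation is a simplifying assumption for exposition; the paper fits its maps from empirical activation samples.

```latex
% 1-D optimal transport map between Gaussian approximations of a neuron's
% source (mu_s, sigma_s) and target (mu_t, sigma_t) activation distributions,
% applied with transport strength lambda.
T(a) = \mu_t + \frac{\sigma_t}{\sigma_s}\,(a - \mu_s),
\qquad
\tilde{a} = (1 - \lambda)\, a + \lambda\, T(a), \quad \lambda \in [0, 1].
```

Setting λ = 0 recovers the unmodified model, while λ = 1 applies the full map; intermediate values give the fine-grained control described above.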

Practically, the method's applicability across modalities (language and vision models alike) is a significant advantage for developing safety measures and customization features in AI systems. This cross-modal applicability could pave the way for commercial applications such as content moderation, personalized content generation, and AI model alignment in regulated industries.

Looking forward, the paper opens avenues for exploring non-linear transport maps, which could offer even greater control and fidelity in model alignment tasks. Improving sample efficiency and supporting non-linear maps would further refine activation-control methods, making them more adaptable to complex patterns in high-dimensional activation spaces.

Conclusion

In summary, "Controlling Language and Diffusion Models by Transporting Activations" presents a coherent and effective strategy for model control, setting the stage for more precise and computationally efficient steering of generative models. The use of optimal transport maps brings a novel, principled perspective to AI alignment techniques, making a substantial contribution to the field's ongoing discourse on safety and reliability in AI deployments.
