
Focus On This, Not That! Steering LLMs with Adaptive Feature Specification (2410.22944v4)

Published 30 Oct 2024 in cs.LG, cs.AI, and cs.CL

Abstract: Despite the success of Instruction Tuning (IT) in training LLMs, such models often leverage spurious or biased features learnt from their training data and can become misaligned, leading to undesired behaviours. While existing techniques can steer model behaviour at inference-time, they are often post-hoc and do not embed steering as an intrinsic model feature. In this work, we introduce Focus Instruction Tuning (FIT), which trains LLMs to condition their responses by focusing on specific features whilst ignoring others, leading to different behaviours based on what features are specified. Across diverse benchmarks, we demonstrate that FIT: (i) successfully steers behaviour at inference time; (ii) increases robustness by amplifying core task signals and down-weighting spurious cues; (iii) mitigates social bias by suppressing demographic attributes; and (iv) generalises under distribution shifts and to previously unseen focus features. FIT therefore offers a lightweight, intrinsic mechanism for building more robust, fair, and easily controllable LLMs.
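The focus-conditioning idea described in the abstract can be illustrated with a minimal sketch of how a focus-instruction training example might be assembled. The function name and template wording below are hypothetical and not taken from the paper; the only assumption is that each example pairs a task instruction with an explicit focus/ignore specification and a target that is consistent with that specification.

```python
# Minimal sketch (hypothetical template, not the paper's exact format):
# build an instruction-tuning example whose prompt names a feature to
# focus on and a feature to ignore, so the model learns to condition
# its answer on the stated feature specification.

def build_focus_example(instruction: str,
                        text: str,
                        focus_feature: str,
                        ignore_feature: str,
                        target: str) -> dict:
    prompt = (
        f"{instruction}\n\n"
        f"Input: {text}\n"
        f"Focus on: {focus_feature}. Ignore: {ignore_feature}.\n"
        f"Answer:"
    )
    # The completion is whatever label is correct *given* the focus
    # specification, e.g. the core-task label when a spurious cue is ignored.
    return {"prompt": prompt, "completion": f" {target}"}


if __name__ == "__main__":
    example = build_focus_example(
        instruction="Classify the sentiment of the review.",
        text="The plot was predictable, but the acting was superb.",
        focus_feature="the sentiment expressed about the acting",
        ignore_feature="the reviewer's opinion of the plot",
        target="positive",
    )
    print(example["prompt"])
```

At inference time, the same prompt template would let a user steer the model's behaviour simply by changing which features are named in the focus/ignore slots.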
