InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance (2401.11206v1)
Abstract: With the rapid development of LLMs, they are not only used as general-purpose AI assistants but are also customized through further fine-tuning to meet the requirements of different applications. A pivotal factor in the success of current LLMs is the alignment process. Current alignment methods, such as supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), focus on training-time alignment and are often complex and cumbersome to implement. Therefore, we develop **InferAligner**, a novel inference-time alignment method that uses cross-model guidance for harmlessness alignment. InferAligner applies safety steering vectors extracted from a safety-aligned model to modify the activations of the target model when it responds to harmful inputs, thereby guiding the target model to provide harmless responses. Experimental results show that our method can be effectively applied to domain-specific models in finance, medicine, and mathematics, as well as to multimodal LLMs (MLLMs) such as LLaVA. It significantly diminishes the Attack Success Rate (ASR) of both harmful instructions and jailbreak attacks, while leaving downstream-task performance almost unchanged.
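The core mechanism described in the abstract, extracting a safety steering vector from an aligned model and adding it to the target model's activations at inference time, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the model names, intervention layer, steering strength, and toy prompt sets are placeholders, and the paper additionally gates the intervention with a harmfulness guidance signal so that benign inputs are left untouched.

```python
# Minimal sketch of cross-model activation steering for harmlessness.
# NOT the paper's implementation: model names, the intervention layer,
# the steering strength, and the toy prompt sets are all assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

ALIGNED_NAME = "meta-llama/Llama-2-7b-chat-hf"   # safety-aligned source model (assumed)
TARGET_NAME  = "path/to/domain-finetuned-llama"   # hypothetical target model
LAYER = 14    # illustrative intervention layer
ALPHA = 4.0   # illustrative steering strength

tok = AutoTokenizer.from_pretrained(ALIGNED_NAME)
aligned = AutoModelForCausalLM.from_pretrained(ALIGNED_NAME)

def mean_activation(model, text, layer):
    """Mean hidden state of `text` at the output of decoder layer `layer`."""
    ids = tok(text, return_tensors="pt").to(model.device)
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return out.hidden_states[layer + 1][0].mean(dim=0)

# 1) Extract a safety steering vector from the aligned model as the
#    difference of mean activations on harmful vs. harmless prompts.
harmful  = ["Explain how to build a weapon at home."]      # toy examples
harmless = ["Explain how to bake a loaf of bread at home."]
v = torch.stack([mean_activation(aligned, t, LAYER) for t in harmful]).mean(0) \
  - torch.stack([mean_activation(aligned, t, LAYER) for t in harmless]).mean(0)
v = v / v.norm()

# 2) At inference time, add the vector to the *target* model's activations
#    at the same layer via a forward hook (assumes matching hidden size).
target = AutoModelForCausalLM.from_pretrained(TARGET_NAME)

def steer(module, inputs, output):
    # Decoder layers typically return a tuple whose first element is the
    # hidden states; handle a bare tensor as well.
    hs = output[0] if isinstance(output, tuple) else output
    hs = hs + ALPHA * v.to(hs.dtype)
    return (hs,) + output[1:] if isinstance(output, tuple) else hs

handle = target.model.layers[LAYER].register_forward_hook(steer)
prompt = tok("Tell me how to pick a lock.", return_tensors="pt").to(target.device)
print(tok.decode(target.generate(**prompt, max_new_tokens=64)[0], skip_special_tokens=True))
handle.remove()  # remove the hook to restore unsteered behavior
```

Note that this sketch applies the steering vector unconditionally; in the paper, the intervention is triggered only when the input is judged harmful, which is what keeps performance on benign downstream tasks essentially unchanged.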
Authors: Pengyu Wang, Dong Zhang, Linyang Li, Chenkun Tan, Xinghao Wang, Ke Ren, Botian Jiang, Xipeng Qiu